On the future of HPC on Windows
Just a few weeks ago during SC11 Microsoft released two new or updated HPC products, namely Windows Azure HPC Scheduler and Windows HPC Server 2008 R2 SP3. However, what I saw and heard during the last few months as well as during SC11 did not give me the best feeling for the future of Microsoft’s HPC Server product. This post is on my impressions and thoughts not only on the product, but also on doing HPC on the Windows platform in general.
What disturbed me a little was the absence of any roadmap presentation. Well, over the last few years Windows HPC Server clearly has become mature enough to not lack any significant feature necessary for deployment and use on a medium-sized HPC installation. However, Microsoft publicly outlining a product roadmap with several key features always felt right, and its absence at SC11 has been noted by the community. Furthermore, they quietly killed their Dryad project (including LINQ to HPC), which was prominently displayed at SC10, now betting on a yet-to-be-released distribution of Apache Hadoop for Windows HPC Server and Azure. Finally, there have been several business restructuring activities inside Microsoft. For example, here in Germany Microsoft apparently shut down the HPC group and moved (some of) the people under the Azure umbrella. From what I heard, all these activities caused some confusion in the community on how Microsoft sees the future of the Windows HPC Server product and how much support and innovation may be expected from the company in this regard.
What Microsoft now talks a lot about is the Azure integration. If you followed the development of Windows HPC Server up to release R2 SP3, you could clearly see this coming. From a technology point of view, I am impressed. However, I am not convinced yet, for several reasons – the most important one being that the offering is much too expensive for our application needs. Of course we are following what is going on regarding Clouds and HPC, and in fact in one project we are extending one application to make use of both on-premise and off-premise compute power based on availability (and maybe even price). But for the time being, our local clusters, including the one running Windows, will clearly dominate (or, as we Germans say, set the tone).
Finally, I am missing a clear picture of HPC-related improvements in the Windows Server roadmap. Just recently we added a frontend system with 160 (logical) cores, that is 8 sockets, and 512 GB of memory. Windows just works on such a machine – but it could do better. It could serve HPC applications better. And given that next-gen ordinary (HPC) systems will probably have a similar core count, Windows really has to serve applications better on such machines in order to stay competitive. Furthermore, smooth and stable integration of accelerators – be it GPGPUs, or something different but similar in spirit – will be at least as important.
I will stop here. Our user base is clearly showing a demand for Windows HPC Server-based clusters, and in fact the demand is growing. Combining my personal opinion with the feedback and opinions I got from the (German) community, my conclusion is that Microsoft has to improve its communication regarding Windows HPC Server. It is time for a clear statement regarding the future of the product and the direction it will take.
OpenMP and OpenACC
If you attended SC11, you might have noticed some buzz around OpenACC. Well, at least I did. For example, today’s OpenMP BOF had some information on this. I want to use this blog post to add some general comments and insights on the developments and direction of the OpenMP language committee as well as on what has led to OpenACC. As always you have to understand that these statements are mine only; on this blog I do not speak in any official role.
For quite a while now, OpenMP has been moving into the accelerator space, with the work done by the OpenMP for Accelerators subcommittee of the OpenMP Language Committee. That subcommittee publicly presented the status of their work at the last IWOMP, where James Beyer et al. had a paper on that particular topic (PDF of their presentation). They have invested a lot of effort and made good progress since then. In order to make support for accelerators happen in OpenMP, they have to achieve three goals: (i) provide support for Slicing and Shaping expressions, (ii) provide support for data management constructs and clauses, and finally (iii) provide support to denote kernels and constructs for execution on the accelerator. For all three items the subcommittee looked at existing proposals, particularly from PGI, BSC and CAPS, but also from others. There are good proposals underway for (i) and (ii) which probably are backed by a majority in the language committee, since this functionality may turn out to be very handy to drive other features and proposals as well. Just as an example, we are aiming for improved support for Affinity of threads and data, which requires Slicing and Shaping of array expressions.
However, support for (iii) is really tough if one wants to integrate well with the rest of OpenMP and allow for future extensions. An important design goal is that OpenMP will support not just one particular type of accelerator, but rather be widely applicable to different kinds of devices from different vendors. These are the reasons why OpenMP is progressing as slowly as it is. We are planning for a public draft of OpenMP 4.0 for SC12, one year from now.
In order to allow for faster development, ignoring the OpenMP integration just for a moment, the OpenACC standard initiative was formed; it basically is a spin-off of the OpenMP Language Committee. Personally, I see this as a beta of OpenMP for Accelerators, and I hope that this initiative will help to collect valuable feedback on what pragma-based accelerator programming has to look like. Cray, PGI and CAPS have all announced that they will implement the specification as it currently stands. When it comes to getting the resources for that, it is much easier to implement this spin-off spec than to implement an incomplete proposal draft. This is what I like the OpenACC effort for. And by the way, it was prominently promoted during the NVIDIA keynote at SC11 on Tuesday morning.
However, what I do not like is how it was marketed. People did not get the relation to OpenMP. The way it was published, it was not clear that effort from other parties was involved in the development as well, not just the ones mentioned on the website. In fact, many people who visited the booth thought that OpenACC is about to become a competitor for OpenMP in the accelerator domain. This is not true; the intent clearly is to feed the OpenACC development back into the next OpenMP specification. We clearly hope to release a draft in the SC12 time frame, but until then we have several technical problems to solve.
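To make the three ingredients listed above a bit more tangible, here is a minimal sketch of a pragma-annotated loop, written against my understanding of the OpenACC 1.0 syntax. The function name saxpy and its signature are my own choice for illustration, and whatever ends up in OpenMP may look different. The array sections xp[0:n] and yp[0:n] describe the shape of the data (i), the copyin and copy clauses manage the data movement (ii), and the parallel loop directive marks the kernel (iii). A compiler without OpenACC support should simply ignore the pragma and run the loop on the host.

```cpp
#include <cassert>
#include <vector>

// Computes y = a * x + y. The pragma asks an OpenACC compiler to offload the
// loop; without OpenACC support it is ignored and the loop runs serially.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());  // both vectors must have the same length
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();

    // copyin: xp is only read on the device; copy: yp is read and written back.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Calling saxpy(2.0f, x, y) updates y in place, with or without an accelerator being present.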
Dan Reed on Technical (Cloud) Computing with Microsoft: Vision
During ISC 2011 in Hamburg I got the opportunity to talk to Microsoft’s Dan Reed, Corporate Vice President, Technology Policy and Extreme Computing Group. It was a very nice discussion that soon turned towards HPC in the Cloud, touching the topics of Microsoft’s Vision, Standards, and Education. Karsten Reineck from the Fraunhofer SCAI was also present; he has already put an excerpt of the interview on his blog (in German). The following is my recapitulation of the discussion, pointing out his most important statements – part 1 of 2.
Being the person I am, I started the talk with a nasty question on the pricing scheme of Azure (and similar commercial offerings), claiming that it is pretty expensive both per CPU hour and per byte of I/O. Just recently we did a full cost accounting to calculate our price per CPU hour for our HPC service, and we found ourselves to be cheaper by a notable factor.
Dan Reed: Academic sites of reasonable size, such as yours, can do HPC cheaper because they are utilizing the hardware on a 24×7 basis. Traditionally, they do not offer service-level agreements on how fast any job starts, they just queue the jobs. Azure is different, and it has to be: one can get the resources within a guaranteed time frame. As of today, HPC in the Cloud is interesting for burst scenarios where the on-premise resources are not sufficient, or for people for whom traditional HPC is too complex (regardless of Windows vs. Linux, just maintaining an on-premise cluster versus buying HPC time when it is needed).
I am completely in line with that. I expressed my belief that we will need (and have!) academic HPC centers for the foreseeable future. Basically, we are just a (local) HPC cloud service provider for our users – which of course we call customers, internally. To conclude this topic, he said something very interesting:
Dan Reed: In industry, the cost is not the main constraint, the skill is.
Ok, since we are offering HPC services on Linux and Windows, and since there was quite some buzz around the future of the Windows HPC Server product during ISC, I asked where the Windows HPC Server product is heading.
Dan Reed: The foremost goal is to better integrate and support cloud issues. For example, currently, there are two schedulers, the Azure scheduler and the traditional Windows HPC Server scheduler. Basically, that is one scheduler too many. Regarding improvements in Azure, we will see support for high-speed interconnects soon.
Azure support for MPI programs has just been introduced with Windows HPC Server 2008 R2 SP2 (a long product name, hm?). By the way, he assumes that future x GigaBit Ethernet will be favoured over InfiniBand.
For us it is clearly interesting to see where Azure and other similar offerings are heading, and we can learn something from that for our own HPC service. For example, we already offer service-level agreements for some customers under some circumstances. However, on-premise resources will play the dominating role for academic HPC in the foreseeable future. Thus I asked specifically about the future of the Windows HPC Server product.
Dan Reed: Microsoft, as a company, is strongly committed to a service-based business model. This has to be understood in order to realize what is driving some of the shifts we are seeing right now, both in the products and the organization itself. The focus on Cloud Computing elevated the HPC Server team; the Technical Computing division is now part of the Azure organization. The emphasis of the future product development thus is clearly shifting towards cloud computing, that is true, although the product will continue to be improved and features will be added for a few releases (already in planning).
Well, as an MVP for Windows HPC Server, and a member of the Customer Advisory Board, I know something about the planning of upcoming product releases, so I believe Microsoft is still committed to the product (as opposed to some statements made by other people during ISC). However, I do not see Windows Server itself moving in the right direction for HPC. Obviously HPC is just a niche market for Microsoft, but better support for multi- and many-core processors and hierarchical memory architectures (NUMA!) would be desirable. Asking (again) about that, I got the following answer:
Dan Reed: Windows HPC Server is derived from Windows Server, which itself is derived from Windows. So, if you want to know where Windows HPC Server is going with regard to its base technologies, you have to see (and understand) where Windows itself is going.
Uhm, ok, so we had better take a close look at Windows 8.
Regarding Microsoft’s way towards Cloud Computing, I will write a second blog post later to cover more of our discussion on the topics of Standards and Education. Since this blog post is on the Vision, I just want to share a brief discussion we had when heading back to the ISC show floor. I asked him for his personal (!) opinion on the race towards Exascale. Will we get an Exascale system by (the end of) 2019?
Dan Reed: Given the political will and money, we will overcome the technical issues we are facing today.
Ok. Given that someone has that will and the money, would such a system be usable? Do you see any single application for such a system?
Dan Reed: Big question mark. I would rather see money being invested in solving the software issues. If we get such powerful systems, we have to be able to make use of them for more than just a single project.
Again, I am pretty much in line with that. By no means am I claiming to fully understand all challenges and opportunities of Exascale systems, but what I do see are the challenges in making use of today’s Petaflop systems with applications other than LINPACK, especially from the domain of Computational Engineering. Taking the opportunity, my last question was: Who do you guess would have the political will and the money to build an Exascale system first, the US, or Europe, or rather Asia?
Dan Reed: Uhm. If I would have to bet, I would bet on Asia. And if such a system comes from Asia, all critical system components will be designed and manufactured in Asia.
Interesting. And clearly a challenge.
An Update on Building and Using BOOST.MPI on Windows HPC Server 2008 R2
My 2008 blog post on Building and Using BOOST.MPI on Windows HPC Server 2008 still generates quite some traffic. Since some things have changed since then, I thought it could help those visitors to provide an updated howto. Again, this post puts the focus on building boost.mpi with various versions of MS-MPI, and does not cover all aspects of building boost on Windows (go to Getting Started on Windows for that).
The problem that still remains is that the MPI auto-configuration only looks for MS-MPI v1, which came with the Compute Cluster Pack and was typically installed to the directory C:\Program Files\Microsoft Compute Cluster Pack. MS-MPI v2, which comes with the Microsoft HPC Pack 2008 [R2], is typically installed to the directory C:\Program Files\Microsoft HPC Pack 2008 [R2] SDK, but the auto-configuration does not examine these directories. In the old post I explained where to change the path the auto-configurator is looking at. Of course, this is not what one expects from an “auto”-configuration tool. Extending the mpi.jam file to search all standard directories where MS-MPI might be installed turned out to be pretty simple. You can download my modified mpi.jam for boost 1.46.1 supporting MS-MPI v1 and v2 and replace the mpi.jam file that comes with the boost package. As a summary, below are the basic steps to build boost with boost.mpi on Windows (HPC) Server 2008 using Visual Studio and MS-MPI.
- Download boost 1.46.1 (82 MB), which is the most current version at the time of this writing (May 13th, 2011).
- Extract the archive. For the rest of the instructions I will assume X:\src.boost_1_46_1 as the directory the archive has been extracted into.
- Open a Visual Studio command prompt from the Visual Studio Tools submenu. Depending on what you intend to build, you have to use the 32-bit or 64-bit compiler environment. Execute all commands listed in the rest of the instructions from within this command prompt.
- Run bootstrap.bat. This will build bjam.exe.
- Modify the mpi.jam file located in the tools\build\v2\tools subdirectory to search for MS-MPI in the right place, or use my modified mpi.jam for boost 1.46.1 supporting MS-MPI v1 and v2 instead.
- Edit the user-config.jam file located in the tools\build\v2 subdirectory to contain the following line: using mpi ;.
- Execute the following command to start the build and installation process: bjam.exe --build-dir=x:\src.boost_1_46_1\build\vs90-64 --prefix=x:\boost_1_46_1\vs90-64 install. Please note that I use different directories in the --build-dir and --prefix options, since I intend to remove the X:\src.boost_1_46_1 directory once boost is installed. Especially a debug build may use a significant amount of disk storage.
- Wait…
- There are several other options that you might want to explore, but in many cases the default does just fine. Using the command line from above, on Windows you will get static multi-threaded libraries in debug and release mode using the shared runtime. On Windows, the default toolset is msvc, which is the Visual Studio compiler. You can change that via the toolset=xxx option, for example insert toolset=intel into the command line above just before install if you want to build using the Intel compilers.
Since it is uncomfortable to change mpi.jam whenever you are going to build a new version of boost, I filed a bug report on this and proposed to extend the search path to include MS-MPI v2 locations as well.
In order to use this build of boost, in your projects you have to add X:\boost_1_46_1\vs90-32\include\boost-1_46_1 to the list of include directories, and X:\boost_1_46_1\vs90-32\lib to the list of library directories (all according to the directory scheme I used above, here for a 32-bit build installed to vs90-32). In your code you do #include <boost/mpi.hpp>. The boost header files contain directives to link the correct boost libraries automatically, but of course you have to link with the MS-MPI library you used to build boost with.
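To check that such a build actually works, a small test program can help. The following is a minimal sketch, assuming the include and library directories have been set up as described above and the project links against the MS-MPI library; the variable names and the idea of summing up the ranks are just my choice for illustration. It uses only the environment and communicator classes and the reduce function from boost.mpi.

```cpp
#include <boost/mpi.hpp>
#include <functional>
#include <iostream>

int main(int argc, char* argv[]) {
    // Initializes MPI on construction and finalizes it on destruction.
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;  // corresponds to MPI_COMM_WORLD

    std::cout << "Hello from rank " << world.rank()
              << " of " << world.size() << std::endl;

    // Sum up all ranks on rank 0 as a simple end-to-end communication check.
    int sum = 0;
    boost::mpi::reduce(world, world.rank(), sum, std::plus<int>(), 0);
    if (world.rank() == 0) {
        std::cout << "Sum of all ranks: " << sum << std::endl;
    }
    return 0;
}
```

Started via mpiexec with four processes, rank 0 should report a sum of 0 + 1 + 2 + 3 = 6; the order of the hello lines is not deterministic.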
HPC Server 2008 R2 Failover Cluster deployment guide
I just learned about a document on the deployment of an HPC Server 2008 R2 failover cluster, including a detailed step-by-step guide on how to configure head node failover and remote database installation to employ a SQL Server failover cluster. This document has been provided by Microsoft, enjoy.
Recap of the 4th Meeting of the German Windows-HPC User Group
The 4th Meeting of the German Windows-HPC User Group took place on March 31st and April 1st in Karlsruhe, hosted by the Karlsruhe Institute of Technology (KIT). The event was attended by over 70 participants from Industry and Academia. This event has been sponsored by Bull, COMSOL, EMCL @ KIT, Intel, Microsoft and NVIDIA.
After a brief welcome address by the organizers (Wolfgang Dreyer from Microsoft and myself), Rudolf Lohner (KIT) gave an overview of the Steinbuch Centre for Computing (SCC) at the KIT. He was followed by the keynote talk from Microsoft, given by Xavier Pillons (Microsoft Corporation) on Windows HPC Server 2008 R2 and Azure as well as Dryad/DryadLinq. We specifically asked for these two topics, and it turned out that Cloud Computing as well as Data-intensive Computing was the subject of many discussions during this event. After that, Axel Köhler (now NVIDIA) gave a glimpse into the current HPC developments at NVIDIA, including what a pure accelerator-driven supercomputer might look like. He was followed by Dagmar Kremer (BCC), who presented their solution for real-time super-computing on the desktop using Excel. This topic was also on the agenda by popular demand, and apparently the combination of the two keywords “Excel” and “HPC” makes many people interested. The first day was closed by Achim Streit (KIT), who gave his vision on HPC and the Cloud, outlining current projects around HPC as a Service (HPCaaS) for technical computing.
The evening event took place in the ZetKaeM restaurant, after touring the Media Museum, the world’s first and only museum for interactive art. We all experienced some funny exhibits. Such an evening event serves well the role of a user group – leading to discussions and thought exchange over a good glass of wine.
The second day started with a keynote address from Vincent Heuveline (KIT) on HPC and hardware-aware computing at the EMCL @ KIT. He was followed by Joachim Redmber (Bull), presenting the Bull way of Supercomputing. Representing a Windows-HPC user, Shiqing Fan (HLRS) outlined their work on implementing and integrating OpenMPI with Windows HPC environments, and apparently they can outperform Microsoft MPI in some benchmarks. Horst Schwichtenberg (Fraunhofer SCAI) gave an example of Excel HPC integration via WCF. As another user contribution, Stefan Truthähn and Martin Steinert (both hhpberlin Ingenieure für Brandschutz GmbH) gave a vivid talk on how they came to use Windows-HPC and HPC in general (more by accident than by master plan) and how they see the future of their CFD computations on on-premise as well as Cloud HPC offerings. They were followed by Michael Klemm (Intel), giving an overview of Intel Technology for HPC on Windows. Henrik Nordborg (University of Applied Sciences in Rapperswil) from the Microsoft Technical Computing Innovation Center (MICTC) outlined where he sees an increasing demand for expertise in technical computing (and why) and gave a report on the first activities of the MICTC. The second and final day of the meeting was closed by a talk given by Michael Wirtz and myself on our experience and setting for Windows-HPC for 1000+ users.
All in all, I think this meeting was successful and so far we got positive feedback from the attendees. We plan to have the next meeting in March or April 2012 at a yet-to-be-decided-on location.
Upcoming Events in March 2011
Let me point you to some HPC events in March 2011.
3rd Parallel Programming in Computational Engineering and Science (PPCES) Workshop. This event will continue the tradition of previous annual week-long events taking place in Aachen every spring since 2001, this year from March 21st to March 25th. This year, the agenda is – as always – a little different from the previous one. Beginning with a series of overview presentations on Monday afternoon, we are very happy to announce the upcoming RWTH Compute Cluster to be delivered by Bull. Throughout the week, we will cover serial and parallel programming using OpenMP and MPI in Fortran and C / C++ as well as performance tuning, addressing both Linux and Windows platforms. Due to the positive experience of last year, we are happy to present a renowned speaker to give an introduction into GPGPU architectures and programming on Friday: Michael Wolfe from PGI. All further information can be found at the event website: http://www.rz.rwth-aachen.de/ppces.
4th Meeting of the German Windows-HPC User Group. The fourth meeting of the German Windows HPC User Group will take place in Karlsruhe on March 31st and April 1st, kindly hosted by the KIT. As in the previous years, we will learn about and discuss Microsoft’s current and future products, and users will present their (good and not so good) experiences in doing HPC on Windows. This year, we will have an Expert Discussion Panel for which the audience is invited to ask (tough) questions to fire up the discussion.
RWTH Aachen gets a new 300 Teraflops HPC system from Bull
While I usually do not repeat press releases in my blog, this one I do since we all are a little proud of the achievement: RWTH Aachen University orders Bull supercomputer to support its scientific, industrial and environmental research. Getting this system was a lot of work, and preparing for it still is. The compute power of that machine totals 300 Teraflops. The focus of our center is not just running this machine, but providing HPC-specific support and ensuring efficient operation. We are confident that in Bull we found a competent partner to investigate these and other topics in close collaboration.
OpenMP 3.1 spec published as Draft for Public Comment
You might have heard it already: The next incarnation of the OpenMP specification, which is targeted to be released as version 3.1 around June in time for IWOMP 2011 in Chicago, has been published as a Draft for Public Comment. You may think of it as beta.
Back in October 2009, I already commented on some of the goals for versions 3.1 and 4.0. OpenMP 3.1 addresses some issues found in the 3.0 specification and brings only minor functional improvements; still, it will be released with a delay of almost one year relative to our initially planned schedule. However, work on version 4.0 has already made some significant progress, including support for accelerators (GPUs), further enhancements to the tasking model, and support for error handling. Taking the outline of my previous post on the development of OpenMP, this is the list of updates to be found in OpenMP 3.1 and the status of the development towards OpenMP 4.0 (expressed in my own words and stating my own beliefs and opinions):
1: Development of an OpenMP Error Model. There is nothing new on this topic in OpenMP 3.1. However, with respect to OpenMP 4.0, the so-called done directive has been discussed for quite some time already. It can be used to terminate the execution of a Parallel Region, or a Worksharing construct, or a Task construct, and it is a prominent candidate for the next OpenMP spec. It would provide necessary functionality towards full-featured error handling capabilities, for which there is no good proposal that could be agreed upon yet.
2: Interoperability and Composability. There is nothing new on this topic in OpenMP 3.1. We made several experiments, gained some insights, and the goal is to come up with a set of reliable expectations and assertions in the OpenMP 4.0 timeframe.
3: Incorporating Tools Support into the OpenMP Specification. There is currently no activity on this topic in the OpenMP Language Committee in general.
4: Associating Computation or Memory across Workshares. There is little progress in this direction to be found in OpenMP 3.1. The environment variable OMP_PROC_BIND has been added to control the binding of threads to processors; it accepts a boolean value. If enabled, the OpenMP runtime is instructed to not move OpenMP threads between processors. The mapping of threads to processors is unspecified and thus depends on the implementation. In general, introducing this variable that controls program-wide behavior was intended to standardize behavior found in almost all current OpenMP implementations.
5: Accelerators, GPUs and More. While there is nothing new on this topic in OpenMP 3.1, the Accelerator subcommittee put a lot of effort into coming up with a first (preliminary!) proposal. This is clearly interesting. From my personal point of view, OpenMP 4.0 might provide basic support for programming accelerators such as GPUs, thus delivering a vendor-neutral standard. Do not expect anything full-featured similar to CUDA, the current proposal is rather similar in spirit to the PGI Accelerator approach (which I do like). However, this is still far from being done, and may (or may not) change directions completely. The crucial aspects are to integrate well with the rest of OpenMP, and to provide an easy to use but still powerful approach to allow for bringing certain important code patterns to accelerator devices.
6: Transactional Memory and Thread Level Speculation. There is in general no activity on this topic in the OpenMP Language Committee and apparently it dropped from the set of important topics. Personally, I (now) do not think TM should be a target for OpenMP in the foreseeable future.
7: Refinements to the OpenMP Tasking Model. There have been some improvements to the Tasking model, with some more on the roadmap for OpenMP 4.0.
- The taskyield directive has been added to allow for user-defined task scheduling points (tsp). A tsp is a point in the execution of a task at which it can be suspended to be resumed later; or the event of task completion, after which the executing thread may switch to a different task.
- The mergeable clause has been added to the list of possible task clauses, indicating that the task may have the same data region as the generating task region.
- The final clause has been added to the list of possible task clauses, denoting the execution of all descending tasks sequentially in the same region. This implies immediate execution of final tasks, and ignoring any untied task clauses. An optional scalar expression allows for conditioning the application of the final clause.
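To illustrate how the final and mergeable clauses might be used in practice, here is a minimal sketch of a recursive Fibonacci computation with tasks. The function names fib_task and fib as well as the cutoff parameter are my own choice for illustration; this is not an example taken from the specification. Once n drops to the cutoff, the generated task and all its descendants are final and thus executed immediately, which avoids creating huge numbers of tiny tasks. Compiled without OpenMP support the pragmas are ignored and the code simply runs serially.

```cpp
#include <cassert>

// Recursive Fibonacci with OpenMP tasks. For n <= cutoff the tasks become
// final: they and all their descendants run immediately in the same thread.
long fib_task(int n, int cutoff) {
    assert(n >= 0);
    if (n < 2) return n;
    long a = 0, b = 0;

    // mergeable is safe here: the tasks only write the shared variables a, b.
    #pragma omp task shared(a) final(n <= cutoff) mergeable
    a = fib_task(n - 1, cutoff);
    #pragma omp task shared(b) final(n <= cutoff) mergeable
    b = fib_task(n - 2, cutoff);

    // Wait for both child tasks before combining their results.
    #pragma omp taskwait
    return a + b;
}

// Opens a parallel region and lets a single thread generate the root task.
long fib(int n, int cutoff) {
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n, cutoff);
    return result;
}
```

A call such as fib(20, 10) creates deferred tasks only for the upper levels of the recursion tree; a good value for the cutoff depends on the machine and the runtime and has to be found by experiment.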
8: Extending OpenMP to C++0x and FORTRAN 2003. There is nothing new on this topic in OpenMP 3.1. We closely follow the development of the base languages and it remains to be seen what can (or has to) be done for OpenMP 4.0. Anyhow, the fact that base languages are introducing threading and a thread-aware memory model leads to some simplifications on the one hand, but also could lead to potential conflicts on the other hand. We are not aware of any such conflict, but digging through the details and implications of a base language such as C++ as well as OpenMP is a pretty complex task.
9: Extending OpenMP to Additional Languages. There is nothing new on this topic in OpenMP 3.1, and currently there is no intention of doing so inside the OpenMP Language Committee. Personally, I would like to see an OpenMP binding for Java, since it would really help teaching parallel programming, but I do not see this happen.
10: Clarifications to the Existing Specifications. There have been plenty of clarifications, corrections, and micro-updates. Most notably the examples and descriptions in the appendix have been corrected, clarified, and expanded.
11: Miscellaneous Extensions. A couple of miscellaneous extensions made it into OpenMP 3.1:
- The atomic construct has been extended to accept the following new clauses: read, write, update and capture. If none is given, it defaults to update. Specifying an atomic region allows to atomically read / write / update the value of the variable affected by the construct. Note that not everything inside an atomic region is performed atomically, i.e. the evaluation of “other” variables is not. For example in an atomic write construct, only the left-hand variable (the one that is written to) is written atomically.
- The firstprivate clause now accepts const-qualified types in C/C++ as well as intent(in) in Fortran. This is just a reaction to annoyances reported by some users.
- The reduction clause has been extended to allow for min and max reductions for built-in datatypes in C/C++. This still excludes aggregate types (including arrays) as well as pointer and reference types from being used in an OpenMP reduction. We had a proposal for powerful user-defined reductions (UDRs) on the table for a long time, it was discussed heavily, but did not make it into OpenMP 3.1. That would have made this release of the spec much stronger. Adding UDRs is a high priority for OpenMP 4.0 for many OpenMP Language Committee members, though.
- omp_in_final() is a new API routine to determine whether it is called from within a final (aka included) task region.
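Two of the extensions listed above can be shown in a few lines. The following is a minimal sketch, with the function names parallel_max and hand_out_tickets made up by me for illustration: the first function uses the new max reduction on a built-in type, the second uses atomic capture to read the old value of a shared counter and increment it in one atomic step. Compiled without OpenMP support both functions run serially and give the same results.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Maximum of a non-empty vector using the max reduction new in OpenMP 3.1.
int parallel_max(const std::vector<int>& v) {
    assert(!v.empty());
    int best = std::numeric_limits<int>::lowest();
    const int n = static_cast<int>(v.size());
    #pragma omp parallel for reduction(max : best)
    for (int i = 0; i < n; ++i) {
        if (v[i] > best) best = v[i];
    }
    return best;
}

// Hands out the ticket numbers 0 .. n-1, one per loop iteration. Which
// iteration gets which ticket is not deterministic, but no number is
// handed out twice.
std::vector<int> hand_out_tickets(int n) {
    std::vector<int> ticket(n, -1);
    int next = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int mine;
        // Read the old value of next and increment it as one atomic operation.
        #pragma omp atomic capture
        mine = next++;
        ticket[i] = mine;
    }
    return ticket;
}
```

Before OpenMP 3.1, the ticket counter would typically have needed a critical region, and the maximum would have been computed with per-thread variables combined by hand.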
12: Additional Task / Threads Synchronization Mechanisms. There is nothing new on this topic in OpenMP 3.1, and not much interest in the OpenMP Language Committee that I have noticed. However, we are thinking of task dependencies and task reductions for OpenMP 4.0, and both features would probably fall into this category (and then there would be a high interest).
Examining the NUMA architecture of an 8-socket Nehalem-EX
I have been rather quiet on this blog for some while now, which is opposite to my intent – I plan to write more regularly again! And I will just continue with one of the topics I like most: NUMA architectures. Some while ago I talked about how different two systems equipped with exactly the same processors may look and how this can influence the application performance. This blog post is about exploring the NUMA architecture of a very recent system in more detail.
Some days ago we got remote access to a very recent eight-way (meaning 8-socket) system equipped with Nehalem-EX processors. This makes 64 physical or 128 logical (hyper-threaded) cores per system! The system was kindly provided by Fujitsu. Since we soon will get plenty of those (not necessarily from Fujitsu, we really do not know yet), we took a close look at how it behaves. In particular, my colleague Dirk Schmidl performed many of the benchmarks with the help of some student workers.
In the aforementioned previous blog post I pointed to the so-called System Locality Information Table (SLIT) provided by the BIOS. Does it help to understand how the eight sockets found in this server are related to each other? Taking a look at it, the answer is simple: No. It just knows about two levels: Same socket (the diagonal: 10) and other socket (the rest: 12).
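As a small illustration of why such a table is of little help, the following sketch rebuilds the table as described above (10 on the diagonal, 12 everywhere else) and counts how many distinct distance levels it contains. The names Slit, two_level_slit and distinct_levels are hypothetical helpers of mine, and the table is constructed by hand rather than read from the BIOS.

```cpp
#include <array>
#include <cassert>
#include <set>

constexpr int kNodes = 8;  // one NUMA node per socket
using Slit = std::array<std::array<int, kNodes>, kNodes>;

// Builds a table that only distinguishes "same socket" from "other socket".
Slit two_level_slit(int local, int remote) {
    assert(local < remote);  // a local access is reported as the cheapest one
    Slit slit{};
    for (int i = 0; i < kNodes; ++i) {
        for (int j = 0; j < kNodes; ++j) {
            slit[i][j] = (i == j) ? local : remote;
        }
    }
    return slit;
}

// Counts the number of distinct distance values found in the table.
int distinct_levels(const Slit& slit) {
    std::set<int> levels;
    for (const auto& row : slit) {
        levels.insert(row.begin(), row.end());
    }
    return static_cast<int>(levels.size());
}
```

For the table built by two_level_slit(10, 12) this yields just two levels, so any deeper hierarchy between the remote sockets has to be found by measurement.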
Our goal was to examine how the eight sockets are related to each other and how “deep” the NUMA architecture of that machine really is. Of course you can get that information from the system specification documentation, but in order to get a feeling of the performance characteristics of a machine it is good practice to examine it first on your own and then check whether your conclusions match what is described.
We used a simple benchmark: We placed eight threads (each processor has eight physical cores) on one selected socket and made all of them access memory at another socket (well, one thread accesses the local socket). We then measured the achieved memory bandwidth [MB/s]. This resulted in the following performance matrix:
By measuring the memory bandwidth in this particular way we do not get the optimal aggregated memory bandwidth the system could deliver, since all sockets are busy and there is also some cache coherency traffic. Instead, our benchmark results are closer to what the system delivers when it is fully loaded using a rather bad memory access behavior.
Our measurements revealed three significantly different performance levels, of which one can further be split into two separate ones. The different levels are colored accordingly in the figure below. Depending on which socket you label as “0”, you can come up with the following architectural plot (my colleague Dieter an Mey did this particular one):
One can see that we have two pairs of four sockets each that are connected by apparently slightly slower links. I do not yet know what is causing this. Looking at the number of hops you get this matrix:
The maximum number of hops to get from one socket to another socket is two. Since the Intel QuickPath interconnect allows using (up to) three connectors to build a multi-socket system, each socket has three neighbors that can be reached with just one hop.
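That eight sockets with three links each can indeed be wired such that no socket is more than two hops away can be checked with a few lines of code. The sketch below computes the hop matrix of a given wiring by a breadth-first search from every socket. Please note that the wiring in example_links (a ring plus a link to the opposite socket) is purely hypothetical and chosen by me for illustration; it is not the wiring of the machine we measured, and all names are made up.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <queue>

constexpr int kSockets = 8;
constexpr int kLinksPerSocket = 3;
using Links = std::array<std::array<int, kLinksPerSocket>, kSockets>;
using HopMatrix = std::array<std::array<int, kSockets>, kSockets>;

// Hypothetical wiring: both ring neighbors plus the opposite socket.
Links example_links() {
    Links links{};
    for (int s = 0; s < kSockets; ++s) {
        links[s][0] = (s + 1) % kSockets;
        links[s][1] = (s + kSockets - 1) % kSockets;
        links[s][2] = (s + kSockets / 2) % kSockets;
    }
    return links;
}

// Breadth-first search from every socket; hops[a][b] is the minimal number
// of links to traverse from socket a to socket b (-1 if unreachable).
HopMatrix hop_matrix(const Links& links) {
    HopMatrix hops{};
    for (int src = 0; src < kSockets; ++src) {
        hops[src].fill(-1);
        hops[src][src] = 0;
        std::queue<int> frontier;
        frontier.push(src);
        while (!frontier.empty()) {
            const int cur = frontier.front();
            frontier.pop();
            for (int next : links[cur]) {
                if (hops[src][next] < 0) {
                    hops[src][next] = hops[src][cur] + 1;
                    frontier.push(next);
                }
            }
        }
    }
    return hops;
}

// Largest entry of the hop matrix, i.e. the worst-case distance.
int max_hops(const HopMatrix& hops) {
    int worst = 0;
    for (const auto& row : hops) {
        for (int h : row) {
            assert(h >= 0);  // every socket must be reachable
            worst = std::max(worst, h);
        }
    }
    return worst;
}
```

For this example wiring, max_hops(hop_matrix(example_links())) evaluates to two. A cube-like wiring of the same eight sockets, in contrast, would need three hops between opposite corners.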
Well, an aggregated memory bandwidth of nearly 90 GB/s with this bad memory access pattern is pretty ok. But it is not a factor of two over a system of four sockets. It is well-suited for shared memory parallel programs that can make use of that many cores (and a large total memory), but of course it does not offer a price-performance sweet spot (the price trend of adding sockets is clearly super-linear). And last but not least, although the memory bandwidth is really important for most HPC applications, there are also other factors that play an important role in an application’s performance on a given architecture. We did many more benchmarks to evaluate this system, of which I do not want to speak here and now, but by doing some memory bandwidth benchmarks we figured out what the system architecture looks like and how the eight sockets are related to each other.




