Archive for the ‘NUMA’ Category
Examining the NUMA architecture of an 8-socket Nehalem-EX
I have been rather quiet on this blog for a while now, which is the opposite of what I intended, and I plan to write more regularly again! I will continue with one of the topics I like most: NUMA architectures. Some time ago I wrote about how different two systems equipped with exactly the same processors can look, and how this can influence application performance. This blog post explores the NUMA architecture of a very recent system in more detail.
A few days ago we got remote access to a very recent eight-way (meaning 8-socket) system equipped with Nehalem-EX processors. This makes 64 physical or 128 logical (hyper-threaded) cores per system! The system was kindly provided by Fujitsu. Since we will soon get plenty of those (not necessarily from Fujitsu, we really do not know yet), we took a close look at how it behaves. My colleague Dirk Schmidl in particular ran many of the benchmarks, with the help of some student workers.
In the aforementioned previous blog post I pointed to the so-called System Locality Distance Information Table (SLIT) provided by the BIOS. Does it help to understand how the eight sockets found in this server are related to each other? Taking a look at it, the answer is simple: No. It just knows about two levels: same socket (the diagonal: 10) and other socket (the rest: 12).
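To make the "two levels" observation concrete, here is a minimal sketch of how such a table can be inspected. It assumes a Linux system, where the kernel exposes the node distances under /sys/devices/system/node/nodeN/distance; that path and the helper names are my additions, not something used for the measurements in this post. The example matrix is simply the one described above (10 on the diagonal, 12 everywhere else).

```python
from pathlib import Path


def parse_distance_rows(rows):
    """Turn text rows such as '10 12 12' into a list of integer lists."""
    return [[int(value) for value in row.split()] for row in rows]


def distinct_levels(matrix):
    """Return the sorted distinct distance values found in the matrix."""
    return sorted({value for row in matrix for value in row})


def read_slit(base="/sys/devices/system/node"):
    """Read the node distance matrix from Linux sysfs.

    Assumption: Linux sysfs layout. Returns [] if the directory is missing.
    """
    root = Path(base)
    if not root.is_dir():
        return []
    nodes = sorted(root.glob("node[0-9]*"), key=lambda p: int(p.name[4:]))
    files = [node / "distance" for node in nodes]
    return parse_distance_rows(f.read_text() for f in files if f.is_file())


# The table as described in the text: 10 on the diagonal, 12 elsewhere.
EXAMPLE = [[10 if i == j else 12 for j in range(8)] for i in range(8)]

if __name__ == "__main__":
    print("levels in the described table:", distinct_levels(EXAMPLE))
    print("levels on this machine:", distinct_levels(read_slit()))
```

A table with only two distinct values cannot distinguish a one-hop neighbor from a two-hop neighbor, which is why measuring is necessary here.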
Our goal was to examine how the eight sockets are related to each other and how “deep” the NUMA architecture of that machine really is. Of course you can get that information from the system specification documentation, but to get a feeling for the performance characteristics of a machine it is good practice to examine it on your own first and then check whether your conclusions match what is described.
We used a simple benchmark: We placed eight threads (each processor has eight physical cores) on one selected socket and made all of them access memory on another socket (well, one thread accesses the local socket). We then measured the achieved memory bandwidth in MB/s. This resulted in the following performance matrix:
By measuring the memory bandwidth in this particular way we do not get the optimal aggregated memory bandwidth the system could deliver, since all sockets are busy and there is also some cache coherency traffic. Instead, our benchmark results are closer to what the system delivers when it is fully loaded with a rather bad memory access pattern.
Our measurements revealed three significantly different performance levels, one of which can be further split into two separate ones. The different levels are colored accordingly in the figure below. Depending on which socket you label as “0”, you can come up with the following architectural plot (my colleague Dieter an Mey did this particular one):
One can see that we have two groups of four sockets each that are connected by apparently slightly slower links. I do not yet know what is causing this. Looking at the number of hops, you get this matrix:
The maximum number of hops to get from one socket to another socket is two. Since the Intel QuickPath Interconnect allows (up to) three connectors per socket to be used to build a multi-socket system, each socket has three neighbors that can be reached with just one hop.
Well, an aggregated memory bandwidth of nearly 90 GB/s with this bad memory access pattern is pretty OK. But it is not a factor of two over a system with four sockets. The machine is well suited for shared-memory parallel programs that can make use of that many cores (and a large total memory), but of course it does not offer a price-performance sweet spot (the price grows clearly faster than linearly with the number of sockets). And last but not least, although the memory bandwidth is really important for most HPC applications, there are other factors that play an important role in an application's performance on a given architecture. We did many more benchmarks to evaluate this system, which I do not want to discuss here and now, but with a simple memory bandwidth benchmark we figured out what the system architecture looks like and how the eight sockets are related to each other.
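To illustrate why three links per socket are enough for a maximum of two hops among eight sockets, here is a small sketch that derives a hop matrix with a breadth-first search. The wiring in it is purely hypothetical (a ring of eight sockets plus one link to the opposite socket) and is not claimed to be the wiring of the measured machine; it only shows that such a topology exists.

```python
from collections import deque


def hop_matrix(links):
    """Breadth-first search from every socket.

    links maps a socket number to the list of its directly connected sockets.
    """
    n = len(links)
    hops = [[None] * n for _ in range(n)]
    for start in range(n):
        hops[start][start] = 0
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in links[current]:
                if hops[start][neighbor] is None:
                    hops[start][neighbor] = hops[start][current] + 1
                    queue.append(neighbor)
    return hops


# HYPOTHETICAL wiring for illustration only: ring neighbors plus the
# opposite socket, which gives every socket exactly three links.
links = {s: [(s + 1) % 8, (s - 1) % 8, (s + 4) % 8] for s in range(8)}
hops = hop_matrix(links)

if __name__ == "__main__":
    for row in hops:
        print(" ".join(str(h) for h in row))
    print("maximum hops:", max(max(row) for row in hops))
```

In this hypothetical layout every socket sees itself at zero hops, three sockets at one hop and the remaining four at two hops.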
Daily cc-NUMA Craziness
Since cc-NUMA architectures have become ubiquitous in the x86 server world, it is very important to optimize memory and thread or process placement, especially for shared-memory parallelization. I have been pretty successful in optimizing several of our user codes for cc-NUMA architectures by introducing manual binding strategies. I like the cpuinfo tool that comes with Intel MPI 3.x a lot; it can be used to query how all the cores are related (e.g. which cores share a cache). Based on that output I used to work out my strategies for every architecture that we have in our center or that I have access to elsewhere. However, during the last couple of days I observed some benchmark results that did not make much sense to me, and today I stumbled upon the cause, something I just did not expect. I will tell you in a second, but my statement is: manual binding can be a bad thing. Although one can achieve a nice speedup by doing it right, even experts can easily be fooled. Therefore it is high time to get access to a standardized interface to communicate with the threading runtime and the OS!
We have dual-socket Intel Nehalem-EP systems from two different vendors: Sun and HP. The Sun systems are intended for HPC and are equipped with Xeon X5570 (2.93 GHz) CPUs; the HP systems are intended for infrastructure services and are equipped with Xeon E5540 (2.53 GHz) CPUs. Anyhow, I got hold of both, put some jobs on the boxes and was really disappointed by the speedup measurements on the HP system. While investigating the reason I found out that the numbering of the logical cores differs between the two systems. Oh dear: two dual-socket systems with Intel Nehalem-EP processors, and in one system the cores 0 and 1 are on the same socket, but in the other system they are on different sockets. Let's take a look at the output of cpuinfo on the Sun system:
Sun Nehalem-EP (linux)

Processor composition
Processors(CPU)   : 16
Packages(sockets) : 2
Cores per package : 4
Threads per core  : 2

Processor identification
Processor  Thread Id.  Core Id.  Package Id.
0          0           0         0
1          0           1         0
2          0           2         0
3          0           3         0
4          0           0         1
5          0           1         1
6          0           2         1
7          0           3         1
8          1           0         0
9          1           1         0
10         1           2         0
11         1           3         0
12         1           0         1
13         1           1         1
14         1           2         1
15         1           3         1

Placement on packages
Package Id.  Core Id.  Processors
0            0,1,2,3   (0,8)(1,9)(2,10)(3,11)
1            0,1,2,3   (4,12)(5,13)(6,14)(7,15)

Cache sharing
Cache  Size    Processors
L1     32 KB   (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L2     256 KB  (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L3     8 MB    (0,1,2,3,8,9,10,11)(4,5,6,7,12,13,14,15)

And this is the output on the HP system:
HP Nehalem-EP (linux)

Processor composition
Processors(CPU)   : 16
Packages(sockets) : 2
Cores per package : 4
Threads per core  : 2

Processor identification
Processor  Thread Id.  Core Id.  Package Id.
0          0           0         1
1          0           0         0
2          0           2         1
3          0           2         0
4          0           1         1
5          0           1         0
6          0           3         1
7          0           3         0
8          1           0         1
9          1           0         0
10         1           2         1
11         1           2         0
12         1           1         1
13         1           1         0
14         1           3         1
15         1           3         0

Placement on packages
Package Id.  Core Id.  Processors
1            0,2,1,3   (0,8)(2,10)(4,12)(6,14)
0            0,2,1,3   (1,9)(3,11)(5,13)(7,15)

Cache sharing
Cache  Size    Processors
L1     32 KB   (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L2     256 KB  (0,8)(1,9)(2,10)(3,11)(4,12)(5,13)(6,14)(7,15)
L3     8 MB    (0,2,4,6,8,10,12,14)(1,3,5,7,9,11,13,15)

Let's take a closer look at these tables. Wherever you find the term ‘processor’, it refers to a logical core as visible to the operating system. A ‘package’ is a socket, and we have two ‘(hyper-)threads’ per core. On the Sun system, the logical cores 0 and 1 are located on the same socket, and the cores 0 to 7 refer to eight full cores on two sockets making use of all caches, numbered socket by socket. On the HP system, the logical cores 0 and 1 are located on two different sockets, because according to the listing above consecutive logical cores alternate between the two sockets. I am not saying that one of the two schemes is better. But if you use one machine to determine what works best for your application, put this into a start-up script and switch machines in between your measurements, then you will be surprised (and not in a good way).
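The difference is easy to check mechanically. The following sketch takes the two "Processor identification" listings from the cpuinfo output above (processor, thread id, core id, package id) and counts how many sockets and physical cores a given set of logical cores covers. The function names are my own and only serve the illustration.

```python
# Records copied from the cpuinfo output above, four numbers per record:
# processor, thread id, core id, package id.
SUN = ("0 0 0 0 1 0 1 0 2 0 2 0 3 0 3 0 4 0 0 1 5 0 1 1 6 0 2 1 7 0 3 1 "
       "8 1 0 0 9 1 1 0 10 1 2 0 11 1 3 0 12 1 0 1 13 1 1 1 14 1 2 1 15 1 3 1")
HP = ("0 0 0 1 1 0 0 0 2 0 2 1 3 0 2 0 4 0 1 1 5 0 1 0 6 0 3 1 7 0 3 0 "
      "8 1 0 1 9 1 0 0 10 1 2 1 11 1 2 0 12 1 1 1 13 1 1 0 14 1 3 1 15 1 3 0")


def parse(flat):
    """Map processor -> (thread id, core id, package id)."""
    nums = [int(x) for x in flat.split()]
    return {nums[i]: tuple(nums[i + 1:i + 4]) for i in range(0, len(nums), 4)}


def summarize(table, procs):
    """Return (number of sockets, number of physical cores) covered by procs."""
    cores = {(table[p][2], table[p][1]) for p in procs}  # (package, core)
    packages = {package for package, _core in cores}
    return len(packages), len(cores)


sun, hp = parse(SUN), parse(HP)

if __name__ == "__main__":
    for name, table in (("Sun", sun), ("HP", hp)):
        for procs in ((0, 1), (0, 1, 2, 3)):
            sockets, cores = summarize(table, procs)
            print(f"{name}: logical cores {procs} -> "
                  f"{sockets} socket(s), {cores} physical core(s)")
```

Binding four threads to the logical cores 0 to 3 therefore keeps them on one socket on the Sun system, but spreads them over both sockets on the HP system.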
How is the core numbering determined? Well, the short answer is “not by the OS, but by the BIOS”; the honest answer is “I don’t know exactly”. The BIOS has a lot of influence. For example, the Advanced Configuration and Power Interface Specification (ACPI: http://www.acpi.info/DOWNLOADS/ACPIspec40.pdf) describes in section 5.2.17 a System Locality Distance Information Table (SLIT) that lists the distance between hardware resources on different NUMA nodes. In theory the OS kernel can make use of that table, and it does, or it fills in constant values (10 = local, 20 = remote) in case the table is empty. But the ACPI specification does not specify how the table is generated; that is up to the BIOS implementation itself, and probably up to BIOS settings. The important take-away is that (i) BIOS settings influence the core numbering scheme, (ii) obviously BIOS settings are not the same across vendors, (iii) the numbering can change over time anyhow and other OSes (e.g. Windows) do it differently, and therefore (iv) you should not rely on the numbering scheme being static.
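One way to avoid relying on a fixed numbering is to query it at run time and compare it with what a binding script expects. The sketch below assumes a Linux system, where the kernel publishes the socket and core of each logical CPU under /sys/devices/system/cpu/cpuN/topology/; using sysfs here is my assumption (the listings in this post come from cpuinfo), and the function names are hypothetical.

```python
from pathlib import Path


def read_topology(base="/sys/devices/system/cpu"):
    """Map logical CPU number -> (package id, core id) using Linux sysfs.

    Assumption: Linux sysfs layout. CPUs without readable topology files
    (for example offline CPUs) are skipped.
    """
    topology = {}
    for cpu_dir in Path(base).glob("cpu[0-9]*"):
        topo = cpu_dir / "topology"
        try:
            cpu = int(cpu_dir.name[3:])
            package = int((topo / "physical_package_id").read_text())
            core = int((topo / "core_id").read_text())
        except (OSError, ValueError):
            continue
        topology[cpu] = (package, core)
    return topology


def numbering_matches(topology, expected):
    """True if every expected logical CPU sits on the expected (package, core)."""
    return all(topology.get(cpu) == place for cpu, place in expected.items())


if __name__ == "__main__":
    # Expectation of a script written for the Sun numbering shown above:
    # logical CPUs 0 and 1 are cores 0 and 1 of package 0.
    expected = {0: (0, 0), 1: (0, 1)}
    if numbering_matches(read_topology(), expected):
        print("core numbering is as expected")
    else:
        print("core numbering differs, do not reuse the old binding")
```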
What should you do instead? We do not have a standardized way to influence the thread / process binding. Tools such as numactl (Linux) or start /affinity (Windows) accept core ids as arguments, which is far from optimal. The same holds for explicit API calls to do the binding. Instead, the Intel compiler is following a good path: The environment variable KMP_AFFINITY can be used to define an explicit thread-to-core mapping, but it also accepts two strategies: scatter and compact. The idea of scatter is to bind the threads as far apart as possible (to use all the caches and to have all the memory bandwidth available); the idea of compact is to bind the threads as close together as possible (to profit from shared caches). Running a program with two threads using the scatter strategy on the Sun system results in binding thread 0 to the core set {0,8} and thread 1 to the core set {4,12} (that is, two sockets). The same experiment on the HP system results in binding thread 0 to the core set {1,9} and thread 1 to the core set {0,8} (two sockets, again). This abstracts from the hardware and system details and allows the user, who might not be an HPC expert, to concentrate on optimizing the application by choosing from just two strategies, while still getting “portable performance” on Intel CPUs.
A portable thread binding interface is under discussion for OpenMP 3.1 (see my previous blog post), and I am strongly in favor of allowing the user to choose between strategies. The one shortcoming of Intel’s current implementation occurs when you have multiple levels of shared-memory parallelization in one application and want to mix strategies, which might make sense. But this could easily be overcome. Let’s see what the future might bring. For now I just fixed my scripts to include a sanity check that the core numbering is indeed as expected…
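To show why strategies are more portable than core ids, here is a simplified sketch of the two ideas applied to the cpuinfo listings above. It is not Intel's implementation; the ordering rules are my own simplification (scatter varies the package fastest, compact varies the hardware thread fastest), and every thread is bound to the set of logical cores that form one physical core.

```python
# Records copied from the cpuinfo output above, four numbers per record:
# processor, thread id, core id, package id.
SUN = ("0 0 0 0 1 0 1 0 2 0 2 0 3 0 3 0 4 0 0 1 5 0 1 1 6 0 2 1 7 0 3 1 "
       "8 1 0 0 9 1 1 0 10 1 2 0 11 1 3 0 12 1 0 1 13 1 1 1 14 1 2 1 15 1 3 1")
HP = ("0 0 0 1 1 0 0 0 2 0 2 1 3 0 2 0 4 0 1 1 5 0 1 0 6 0 3 1 7 0 3 0 "
      "8 1 0 1 9 1 0 0 10 1 2 1 11 1 2 0 12 1 1 1 13 1 1 0 14 1 3 1 15 1 3 0")


def parse(flat):
    """Return a list of (processor, thread id, core id, package id) tuples."""
    nums = [int(x) for x in flat.split()]
    return [tuple(nums[i:i + 4]) for i in range(0, len(nums), 4)]


def bind(records, nthreads, strategy):
    """Return one core set (list of logical cores) per thread.

    Simplified illustration, not the Intel runtime's algorithm.
    """
    if strategy == "scatter":
        def order(r):
            return (r[1], r[2], r[3])  # thread, core, package: spread out
    elif strategy == "compact":
        def order(r):
            return (r[3], r[2], r[1])  # package, core, thread: stay close
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Logical cores that share a physical core, keyed by (package, core).
    core_sets = {}
    for proc, _thread, core, package in records:
        core_sets.setdefault((package, core), []).append(proc)
    ordered = sorted(records, key=order)
    return [sorted(core_sets[(r[3], r[2])]) for r in ordered[:nthreads]]


if __name__ == "__main__":
    for name, flat in (("Sun", SUN), ("HP", HP)):
        records = parse(flat)
        print(name, "scatter:", bind(records, 2, "scatter"))
        print(name, "compact:", bind(records, 2, "compact"))
```

With two threads, this sketch's scatter yields the core sets {0,8} and {4,12} on the Sun listing and {1,9} and {0,8} on the HP listing, the same sets as reported above, without the caller ever naming a core id.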



