Christian’s corner on HPC

A Blog on Parallel Programming – covering all OSes :-) – by Christian Terboven.

Archive for the ‘OpenMP’ Category

How OpenMP is moving towards version 3.1 / 4.0

with one comment

Not yet carved in stone, but the current plan of the OpenMP Language Committee (LC) is to publish a draft OpenMP 3.1 standard for public comment by IWOMP 2010 and to have the OpenMP 3.1 specification finished for SC 2010 – given that the Architecture Review Board (ARB) accepts the new version. Bronis R. de Supinski (LLNL) has taken on the duty of acting as the chair of the LC and since introduced some process changes. Besides weekly telephone conference calls, there are three face-to-face meetings per year and attendance is required for voting rights. The first face-to-face meeting was held on June 1st and 2nd in Dresden attached to IWOMP 2009, the second one was on September 22nd and 23rd in Chicago. This blog post is intended to report on this last meeting and to present an overview of what is going on with OpenMP right now, obviously from my personal point of view.

In the course of resuming work on OpenMP after the 3.0 specification was published, the LC voted on the priority of (small) extensions and clarifications for 3.1 as well as new topics for 4.0. We ended up with 12 major topics and 5 subcommittees, as outlined in Bronis talk during IWOMP 2009, which are still in use as identifiers of the different topics people are working on.

1: Development of an OpenMP Error Model. This is the feature the LC people think OpenMP is missing most desperately, but in contrast to that it did not receive too much effort yet. A subcommittee has been formed to be lead by Tim Mattson (Intel) and Michael Wong (IBM), and currently there are three proposals on the table for discussion: (i) an extension of the API routines and some constructs to return error codes or the introduction of a global error indication variable, (ii) an exception-based mechanism to catch errors, and (iii) a callback-based mechanism allowing to react on errors based on the severity and origin. The absence of an error model is clearly a reason for not using OpenMP in applications with certain requirements on reliability, but introducing the wrong error model could easily spoil OpenMP for that audience. It seems that most LC people do not like error codes too much (I don’t either), using exceptions is not suitable for C and FORTRAN, so the third approach seems most promising by allowing a program to react on errors depending on the severity and to still allow the compiler to ignore OpenMP if it is not enabled. In fact, this mechanism has been proposed back in 2006 by Alex Duran (BSC) and friends already. Since nothing has been decided yet, I guess the error model is targeted for OpenMP 4.0.

2: Interoperability and Composability. This subcommittee is lead by myself (RWTH) and Bronis R. de Supinski (LLNL) and is looking for ways of allowing OpenMP to coexist with other threading packages, maybe even with other OpenMP runtime environments in the same application. We are also looking into how to allow the creation of parallel software components that can safely be plugged together, which I consider prominently missing in virtually all threading paradigms. This is a very broad topic and there is no OpenMP version number I would assign this topic as target for being solved to, but with a little bit of luck we can make some progress even for version 3.1. We have some ideas on the table of how to specify some basic aspects of OpenMP interacting with the native threading packages (POSIX-Threads on Linux/Unix, Win32-Threads on Windows), driven by application observations and known deficiencies in current OpenMP implementations. We might also attack the problem of orphaned reductions. I am not so certain of solving the issue of allowing or detecting nested Worksharing constructs, respectively.

3: Incorporating Tools Support into the OpenMP Specification. This has been on the feature wishlist for OpenMP 3.0 already, but there is hardly any activity regarding this topic. Most vendors provide their own tools to analyze the performance (or correctness) of OpenMP programs by making their own runtime talk to their specific tool, but this situation is far from optimal for research / academia tools. As early as back in 2004 there were some proposal (i.e. POMP by Bernd Mohr and friends), but they did not made it into the specification or into actual implementations.

4: Associating Computation or Memory across Workshares. Today, the world of OpenMP is flat (memory), so this topic is mostly about supporting cc-NUMA architectures in OpenMP. There are two subcommittees working on this issue, the first is lead by Dieter an Mey (RWTH) and the goal is to standardize common practices (used in today’s applications) of dealing with cc-NUMA optimizations. If nothing comes in between, OpenMP 3.1 will allow the user to bind threads to cores by either specifying an explicit mapping, or by telling the runtime a strategy (like compact vs. scatter). Of course there are more ideas (and features needed), like influencing the memory allocation scheme or using page migration if supported by the operating system or interacting with resource management systems (batch queuing systems), but these are very hard to specify in a portable and extensible fashion. The other subcommittee is lead by Barbara Chapman (UH) and deals with thread team control. Using the Worksharing in OpenMP, it is very hard to dedicate a special task (i.e. I/O) to just one thread of the Parallel Region. There are applications asking for that, but I don’t see a proposal that the LC would agree on for 3.1. Nevertheless, they presented some interesting ideas at the last F2F based based on HPCS language capabilities, which hopefully have the potential to influence OpenMP 4.0.

5: Accelerators, GPUs and More. Of course we have to follow the trend / hype ;-) . But since no one knows for sure in which directions the hardware is evolving, there are so many different ideas on how to deal with this. Out of my head I can enumerate that PGI has some directives loosely based on OpenMP Worksharing (plus they have CUDA for FORTRAN), IBM has OpenMP for cell with several ideas on extensions, BSC has a proposal that is in principle based on their *SS concept, and CAPS Entreprise has the HMPP constructs + compiler. In summary: No clear direction yet, nothing for OpenMP in the scope of 3.1.

6: Transactional Memory and Thread Level Speculation. Some people thought that OpenMP might need something for Transactional Memory. To the best of my knowledge no one from the LC did any work on this regard.

7: Refinements to the OpenMP Tasking Model. There are two things that most people agree Tasks are missing: Dependencies and Reductions. With respect to the former, there were three proposals on the table from Grant Haab (Intel), Federico Massaioli (Caspur) and Alex Duran (BSC) and the BSC proposal looks most promising because it avoid deadlocks. It employs existing program variables to define the dependencies between tasks, i.e. the result of a computation can be the input of another task. With a good portion of luck, Task Dependencies could actually make it into OpenMP 3.1, I think. With respect to the latter thing, namely Task Reductions, there has been only little progress so far.

8: Extending OpenMP to C++0x and FORTRAN 2003. Since the C++0x standard dropped Concepts, the work that Michael Wong (IBM) and myself (RWTH) made so far became obsolete. To the best of my knowledge there has been no progress made with respect to investigate the opportunities or issues that could arise with FORTRAN 2003.

9: Extending OpenMP to Additional Languages. Well, there are Java and C#, and at least for Java there are some implementations of OpenMP available (incomplete, though). Anyhow, there was never any real attempt to write a formal specification of OpenMP for Java, nor for C#, and I don’t think there is one now.

10: Clarifications to the Existing Specifications. The LC already approved several minor corrections (i.e. mistakes in the examples, improvements in the wording, and the like) that will make their way into OpenMP 3.1. Nothing spectacular, though, but this is something that has to be done.

11: Miscellaneous Extensions. I might be wrong, but I think that User-defined Reductions (UDR) belong to this topic. Yes, there is a chance that UDRs will make it into OpenMP 3.1! This will bring obvious things like min and max for C and C++, but we are aiming higher: The goal is to enable the programmer to write any type of reduction operation for any type in the base language (including non-PODs) and this is achieved by introducing an OpenMP declare statement to define a reduction operation that can be specified in a reduction clause. There are two problems that are under discussion right now: (i) C++ templates and (ii) pointers / arrays. The first can be addressed by an extension of the current proposal and I got the feeling that most LC people like the new approach, but the second is a bit more complex. If you want to reduce an array that is described by a pointer, you need to know how much space to allocate for the thread private copy and how many elements the array consists of. There has been some discussion on this, but no strong agreement on how to solve this issue in general, as it also arises with the private, firstprivate, … clauses. We only agreed that we need a one-fits-all solution. With some good portion of luck we can solve this issue, otherwise we hopefully get UDRs with some limitations in OpenMP 3.1 and the full functionality in a later version of the specification.

12: Additional Task / Threads Synchronization Mechanisms. Again I might be wrong, but I think that the Atomic Extension proposal by Grant Haab (Intel) belongs in here. This is a feature you will also find in threading-aware languages (such as C++0x), but the current base languages of OpenMP are not of that kind. This will almost certainly make it into OpenMP 3.1 and will allow for a portable way to write atomic updates that capture a value and atomic writes. This is already supported by most machines and using an atomic operations can be so much more efficient than using a Critical Region.

If you are interested in more details, you are invited to stop by the OpenMP booth at SC 2009 in Portland and ask the nice guy on booth duty some good questions :-) .

Written by terboven

October 4, 2009 at 5:44 pm

HPCS 2009 Workshop material: OpenMP + Visual Studio

without comments

As announced in a previous post already, I was involved in two workshops attached to the HPCS 2009, hosted by the HPCVL in Kinston, ON, Canada. Being back in the office now I found some time to upload my slide sets. Obviously I can only make my own slides public.

Using OpenMP 3.0 for Parallel Programming on Multicore Systems [abstract]

Ruud van der Pas, Sun Microsystems; Dieter an Mey and Christian Terboven, RWTH Aachen University.

Parallel Programming in Visual Studio 2008 on Windows HPC Server 2008 [abstract]

Christian Terboven, RWTH Aachen University.

Written by terboven

July 1, 2009 at 12:40 pm

Upcoming Events in June 2009

with one comment

Let me point you to some HPC events in June 2009.

5th International Workshop on OpenMP (IWOMP 2009) in Dresden, Germany. The IWOMP workshop series focuses on the development and usage of OpenMP. This year’s conference is titled Evolving OpenMP in an Age of Extreme Parallelism – I think this phrase is a but funny, but nevertheless one can clearly observe a trend towards Shared-Memory parallelization on the node of even the extremely parallel machines. Attached to the conference is a two day meeting of the OpenMP language committee. The language committee is currently discussing a long list of possible items for a future OpenMP 3.1 or 4.0 specification, including but not limited to my favorites Composability (especially for C++) and Performance on cc-NUMA system. Bronis de Supinski, the recently appointed Chair of the OpenMP Language Committee, will give a talk on the current activities of the LC and how the future of OpenMP might look like – I hope the slides will be made public soon after the talk. Right before the conference there will also be a one day tutorial for all people interested in learning OpenMP (mainly given by Ruud van der Pas – strongly recommended).

High Performance Computing Symposium 2009 (HPCS) in Kingston, Canada. HPCS is a multidisciplinary conference that focuses on research involving High Performance Computing and this year it takes place in Kingston. I’ve never been at that conference series, so I am pretty curious how it will look like. Attached to the conference are a couple of workshops, including Using OpenMP 3.0 for Parallel Programming on Multicore Systems – run again by Ruud van der Pas and us, and Parallel Programming in Visual Studio 2008 on Windows HPC Server 2008 – organized by us as well. Here in Aachen, the interest in our Windows-HPC compute service is still growing fine and thus we have usually around 50 new participants in our bi-yearly training events. The HPCVL people asked explicitly to cover parallel programming on Windows in the OpenMP workshop, so we separated this aspect out without further ado to serve it well. The workshop program can be found here.

International Supercomputing Conference (ISC 2009) in Hamburg, Germany. ISC titles itself as Europe’s premier HPC event – while this is probably true it is of course smaller than the SC events in the US, but usually better organized. Without question you will find numerous interesting exhibits and can listen to several talks (mostly by invited speakers), so please excuse the self-marketing of me pointing to the Jülich Aachen Research Alliance (JARA) booth in the research space where we will show an interactive visualization of large-scale numerical simulation (damage of blood cells by a ventricular device – pretty cool) as well as give an overview of our research activities focused on Shared-Memory parallelization (we will distribute OpenMP syntax references again). If you are interested in HPC software development on Windows, feel invited to stop by at our demo station at the Microsoft booth where we will have many demos regarding HPC Application Development on Windows (Visual Studio, Allinea DDTlite and Vampir are confirmed, maybe more …). And if you are closely monitoring the HPC market, you have probably heard about ScaleMP already, the company aggregating multiple x86 system into a single (virtual) system over InfiniBand – obviously very interesting for Shared-Memory parallelization. If you are interested, you can hear about our experiences with this architecture for HPC.

If you want to meet up during any of these events just drop me an email.

Written by terboven

May 19, 2009 at 9:13 pm

A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1.5 of 2]

with 2 comments

Although it is high time to deliver the second part of this blog post series, I decided to squeeze in one additional post which I named part “1.5″, as it will cover some experiments with SMXV in C#. Since I am currently preparing a lecture named Multi-Threading for Desktop Systems (it will be held in German, though) in which C# plays an important role, we took a closer look into how parallelism has made it’s way into the .NET framework version 3.5 and 4.0. The final post will then cover some more tools and performance experiments (especially regarding cc-NUMA architectures) with the focus back on native coding.

First, let us briefly recap how the SMXV was implemented and examine how this can look like in C#. As explained in my previous post, the CRS format stores just the nonzero elements of the matrix in three vectors: The val-vector contains the values of all nonzero elements, the col-vector contains the column indices for each nonzero element and the row-vector points to the first nonzero element index (in val and col) for each matrix row. Having one class to represent a CRS matrix and using an array of doubles to represent a vector, the SMXV operation encapsulated by the operator* can be implemented like this, independent of whether you use managed or unmanaged arrays:

public static double[] operator *(matrix_crs lhs, double[] rhs)
{
   double[] result = new double[lhs.getNumRows()];
   for (long i = 0; i < lhs.getNumRows(); ++i)
   {
      double sum = 0;
      long rowbeg = lhs.row(i);
      long rowend = lhs.row(i + 1);
      for (long nz = rowbeg; nz < rowend; ++nz)
         sum += lhs.val(nz) * rhs[ lhs.col(nz) ];
      result[i] = sum;
   }
   return result;
}

We have several options to parallelize this code, which I wil present and briefly discuss in the rest of this post.

Threading. In this approach, the programmer is responsible for managing the threads and distributing the work onto the threads. It is not too hard to implement a static work-distribution for any given number of threads, but implementing a dynamic or adaptive work-distribution is a lot of work and also error-prone. In order to implement the static approach, we need an array of threads, have to compute the iteration chunk for each thread, put the threads to work and finally wait for the threads to finish their computation.

//Compute chunks of work:Thread[] threads = new Thread[lhs.NumThreads];
long chunkSize = lhs.getNumRows() / lhs.NumThreads;
//Start threads with respective chunks:
for (int t = 0; t < threads.Length; ++t)
{
   threads[t] = new Thread(delegate(object o)
   {
      int thread = (int)o;
      long firstRow = thread * chunkSize;
      long lastRow = (thread + 1) * chunkSize;
      if (thread == lhs.NumThreads - 1) lastRow = lhs.getNumRows();
      for (long i = firstRow; i < lastRow; ++i)
      { /* ... SMXV ... */ }
   });
   //Start the thread and pass the ID:
   threads[t].Start(t);
}
//Wait for all threads to complete:
for(int t = 0; t < threads.Length; ++t) threads[t].Join();
return result;

Instead of managing the threads on our own, we could use the thread pool of the runtime system. From a usage point of view, this is equivalent to the version shown above, so I will not discuss this any further.

Tasks. The problem of the approach discussed above is the static work-distribution that may lead to load imbalances, and implementing a dynamic work-distribution is error-prone and depending on the code it also may be a lot of work. The goal should be to distribute the workload into smaller packages, but doing this with threads is not optimal: Threads are quite costly in the sense that creating or destroying a thread takes quite a lot of time (in computer terms) since the OS is involved, and threads also need some amount of memory. A solution for this problem are Tasks. Well, tasks are quite “in” nowadays with many people thinking on how to program multicore systems and therefore there are many definitions of what a task really is. I have given mine in previous posts on OpenMP and repeat it here briefly: A task is a small package consisting of some code to execute and some private data (access to shared data is possible, of course) which the runtime schedules for execution by a team of threads. Actually it is pretty simple to parallelize the code from above using tasks: We have to manage a list of tasks and have to decide how much work a task should do (in terms of matrix lines), and of course we have to create and start the tasks and finally wait for them to finish. See below:

//Set the size of the tasks:
List<Task> taskList = new List<Task>();
int chunkSize = 1000;
//Create the tasks that calculate the parts of the result:
for (long i = 0; i < lhs.getNumRows(); i += chunkSize)
{
   taskList.Add(Task.Create(delegate(object o)
   {
      long chunkStart = (long)o;
      for(long index = (long)chunkStart;
      index < System.Math.Min(chunkStart + chunkSize, lhs.getNumRows()); index++)
      { /* ... SMXV ... */ }
   }, i));
}
//Wait for all tasks to finish:
Task.WaitAll(taskList.ToArray());
return result;

Using the TPL. The downside of the approach discussed so far is that we (= the programmer) has to distribute the work manually. In OpenMP, this is done by the compiler + runtime – at least when Worksharing constructs can be employed. In the case of for-loops, one would use Worksharing in OpenMP, With the upcoming .NET Framework version 4.0 there will be something similar (but not so powerful) available for C#: The Parallel class allows for the parallelization of for-loops, when certain conditions are fulfilled (always think about possible Data Races!). Using it is pretty simple thanks to support for delegates / lambda expressions in C#, as you can see below:

Parallel.For(0, (int)lhs.getNumRows(), delegate(int i)
{
   /* ... SMXV ... */
});
return result;

Nice? I certainly like this! It is very similar to Worksharing in the sense that you instrument your code with further knowledge to (incrementally) add parallelization, while it is also nicely integrated in the core language (which OpenMP isn’t). But you have to note that this Worksharing-like functionality is different from OpenMP in certain important aspects:

  • Tasks are used implicitly. There is a significant difference between using tasks underneath to implement this parallel for-loop, and Worksharing in OpenMP: Worksharing uses explicit threads that can be bound to cores / numa nodes, while tasks are scheduled onto threads on the behalf of the runtime system. Performance will be discussed in my next blog post, but tasks can easily be moved between numa nodes and that can spoil your performance really. OpenMP has no built-in support for affinity, but the tricks how to deal with Worksharing on cc-NUMA architectures are well-known.
  • Runtime system has full control. To my current knowledge, there is no reliably way of influencing how many threads will be used to execute the implicit tasks. Even more: I think this is by design. While it is probably nice for many users and applications when the runtime figures out how many threads should be used, this is bad for the well-educated programmer as he often has better knowledge of the application than the compiler + runtime could ever figure out (about data access pattern, for instance). If you want to fine-tune this parallelization, you have hardly any option (note: this is still beta and the options may change until .NET 4.0 will be released). In OpenMP, you can influence the work-distribution in many aspects.

PLINQ. LINQ stands for language-integrated query and allows for declarative data access. When I first heard about this technology, it was demonstrated in the context of data access and I found it interesting, but not closely related to the parallelism I am interested in. Well, it turned out that PLINQ (+ parallel) can be used to parallelize a SMXV code as well (the matrix_crs class has to implement the IEnumerable / IParallelEnumerable interface):

public static double[] operator *(matrix_crs_plinq lhs, double[] rhs)
{
   var res = from rowIndices in lhs.AsParallel().AsOrdered()
             select RowSum(rowIndices, lhs, rhs);
   double[] result = res.ToArray();
   return result;
}
public static double RowSum(long[] rowIndices, matrix_crs_plinq lhs, double[] rhs)
{
   double rowSum = 0;
   for (long i = rowIndices[0]; i < rowIndices[1]; i++)
   {
      rowSum += lhs.val(i) * rhs[lhs.col(i)];
   }
   return rowSum;
}

Did you recognized the AsParallel() in there? That is all you have to do, once the required interfaces have been implemented. Would I recommend using PLINQ for this type of code? No, it is meant to parallelize queries on object collections and more general data sources (think of databases). But (for me at least) it is certainly interesting to see this paradigm applied to a code snippet from the scientific-technical world. As PLINQ uses TPL internally, you will probably have the same issues regarding locality, although I did not look into this too closely yet.

Let me give credit to Ashwani Mehlem, who is one of the student workers in our group. He did some of the implementation work (especially the PLINQ version) and code maintenance of the experiment framework.

Written by terboven

February 18, 2009 at 5:01 pm

What is new in OpenMP 3.0 regarding C++

with 2 comments

This blog post has been hanging around in my draft folder for quite some time now. I have been talking about the changes a lot during the last couple of days and I just found a few minutes to finish it to publish my notes here. From a C++ programmer’s point of view, OpenMP 3.0 comes with a few improvements and since I was involved in getting those into the specification, this post is about explaining what we achieved and why we did not (yet) went further with some features…

1: The for-Worksharing has been enhanced to include RandomAccessIterators and both signed and unsigned integers and even C-style pointers.

  • RandomAccessIterators allow for constant-time increment, decrement, advance and distance computation, as they basically encapsulate pointer arithmetic. The 3.0 specification allows the use of loop variables of RandomAccessIterator type for loops used with for-Worksharing, as long as only the following relational operators are used inside the loop expression: <, <=, >, >=.
    While it is certainly nice to have this in, many programmers (including me) will find the != operator missing. The reason for the exclusion is that not all language committee members could be convinced that the number of loop iterations could be computed beforehand when this operator would be allowed (or: that the overflow-behavior would be equivalent to the integer case). By the time we decided to stop adding / extending features, we did not had answers for all questions and theoretical counter-examples, so != did not make it; nevertheless we hope to get it into the next specification.
  • The 2.5 specification only allowed signed integer loop variables for loops used with for-Worksharing. This was bad, as it was incompatible with size_t, which is used to query the number of elements in a STL container, for example. The 3.0 specification allows both signed and unsigned integer variables, which is a rather small but nice improvement.
  • C-style pointer loops have been allowed as well with the 3.0 specification. The same restrictions apply on the operator list as for the C++ RandomAccessIterator case.

2: It is now possible to threadprivatize static class member variables.

  • It is important to note that only *static* class member variables can be made threadprivate. There have been certain use cases for this, for example the implementation of a Singleton pattern and thread-specific allocators, and the changes to the specification were minimal. This was probably just overlooked in the 2.5 specification process.
    There were some requests to allow the privatization of class member variables in general, but this cannot be done by using the threadprivate clause since the address of general class member variables is not know at compile time.

3: We specified the lifetime and initialization of non-POD data types used in privatization clauses.

  • The lifetime and initialization of non-POD data types was kind of unclear in the 2.5 specification and because of that, the behavior changed between different compilers. It was important to get this right, so we made some updates to the semantics of private variables of non-POD data types. There are a few things that are important to note:
    • The order in which constructor calls and destructor calls for different threads happen is undefined. This is because we do not (want to) define the order in which threads are started, and you should never do any assumptions on that.
    • The default constructor and destructor have to be accessible. If, for example, the default constructor is private, the program is non-conforming if such an object occurs in a private clause.
  • We stated some things explicitly, e.g. that private objects have to be destructed at the end of a Parallel Region. This should force implementations to become consistent. Of course you should be aware that implementations are allowed to introduce additional objects of automatic storage duration if they “like”, this is granted by the C++ standard. The following lists a brief overview what happens with C++ non-POD data types for the different privatization clauses:
    • private: There has to be an accessible, unambiguous default constructor which is called for each object, and the object is destructed at the end of the Parallel Region via the accessible, unambigous destructor.
    • firstprivate: The private instances are copy-constructed, the argument for the copy constructor call is the original list item. Of course it is required for such a data type to have an accessible, unambiguous copy constructor.
    • lastprivate: The value is written back to the original list item by using the accessible, unambiguous copy assignment operator. A suitable constructor has to be available unless the data type is used in a firstprivate clause, then a suitable copy constructor is needed.
    • threadprivate: Here we have to differentiate three kinds of initialization: (i) no initialization, then the default constructor is called; (ii) direct initialization, then the constructor accepting the argument is called; (iii) copy initialization, then the copy constructor is called. In any case, the objects have to be constructed before the first reference, and have to be destructed after the last reference and before the program has been terminated.
    • threadprivate+copyin and threadprivate+copyprivate: Regarding the initialization the rules for threadprivate apply for the first encountered Parallel Region, at any following Parallel Region the copy assignment operator is invoked.

This is just a brief summary of the changes we made, I hope this is of interest for at least some person other than me :-) . From my point of view, there is still one “simple” thing missing: Allowing non-POD data types in reductions. By the time we decided to stop adding / extending features, we did not find a consensus on how the initialization for the reduction may occur. This is imporant, because with overloading you basically can implement user-defined reductions. We really hope to have that in the next specification update!

Written by terboven

November 26, 2008 at 7:41 am

Posted in OpenMP

Tagged with ,

Compiler Support for OpenMP 3.0

without comments

SC08 brought us some pretty good news regarding availability of (full) support for OpenMP 3.0:

  • Intel 11.0: Linux (x86), Windows (x86) and MacOS (x86)
  • Sun Studio Express 11/08: Linux (x86) and Solaris (SPARC + x86)
  • PGI 8.0: Linux (x86) and Windows (x86)
  • IBM 10.1: Linux (POWER) and AIX (POWER)

GCC 4.4 will have support for OpenMP 3.0 as well, it is currently in regression fixes and docs only mode (see http://gcc.gnu.org/ml/gcc/2008-11/msg00007.html). Great, hm?

Written by terboven

November 22, 2008 at 4:00 am

Posted in OpenMP

Tagged with ,

How to kill OpenMP by 2011 ?!

with 4 comments

When I was asked to give an answer to the question of How to kill OpenMP by 2011 during the OpenMP BoF panel discussion at SC08, I decided against listing the most prominent issues and challenges OpenMP is facing. It turned out that the first two speakers – Tim Mattson from Intel and Bronis de Supinski from LLNL – did exactly this very well. Instead, my claim is that OpenMP is doing quite well today and we “just” have to continue in riding on the multi-core momentum by outfitting OpenMP with a little more features. Our group is pretty involved in the OpenMP community and I got the feeling that since around early 2008 OpenMP is gaining moment and I tried to present this in an entertaining approach. This is a brief textual summary of my panel contribution (please do not take all things too seriously).

RWTH Aachen University is a member of the OpenMP ARB (= Architecture Review Board), as OpenMP is very important for many of our applications: All large codes (in terms of compute cycle consumption) are hybrid today, and in order to server some complex applications for which no MPI parallelization exists (so far) we offer the largest SPARC- and x86-based SMP systems one could buy. Obviously we would be very sad if OpenMP would disappear, but in order to find an answer for the question what an university could do to kill OpenMP 2011 it just needed a few domestic beers and a good chat with friends at one of the nice pubs in Austin, TX: Go teaching goto-based spaghetti style programming, as branching in and out of Parallel Regions is not allowed by OpenMP and as such this programming style is inherently incompatible with OpenMP.

By the next day this idea hat lost some of it’s fascination :-) , so I went off to evaluate OpenMP’s current momentum. In 2007, we have been invited to write a chapter for David Bader’s book on Petascale Computing. What we did just recently was to do a keyword search (with some manual postprocessing):

Algorithms and Applications, by David Bader.

Petascale Computing: Algorithms and Applications, by David Bader.

Keyword

Hits

MPI

612

OpenMP

150

Thread

109

Posix-Threads

2

UPC

30

C++

87

Fortran

69

Chapel

49

HPF

11

X10, Fortress, Titaium

< 10

This reveals at least the following interesting aspects:

  • MPI is clearly assessed to be the most important programming paradigm for Petascale Systems, but OpenMP is also well-recognized. Our own chapter on how to exploit SMP building blocks contributed for only 28 of the 150 hits on OpenMP.
  • The term Thread is often used in conjunction with OpenMP, but other threading models are virtually not touched at all.
  • C/C++ and Fortran are the programming languages considered to be used to program current and future Petascale systems.
  • There was one chapter on Chapel and because of that it had a comparably high number of hits, but otherwise the “new” parallel programming paradigms are not (yet ?) considered to be significant.

In order to take an even closer look at the recognition of OpenMP we asked our friend Google:

OpenMP versus Native Threading.

Google Trends: OpenMP versus Native Threading.

One can cleary see that the interest in OpenMP is increasing, as opposite to Posix-Threads and Win32-Threads. At the end of 2007 there is a peak when OpenMP 3.0 was announced and a draft standard for public comment was released. Since Q3/2008 we have compilers supporting OpenMP 3.0 which is accounting for increasing interest again. As there is quite some momentum in OpenMP it is hard for us, representing a University / the community it is hard of not impossible to kill OpenMP – which is actually quite nice.

But going back to finding an answer for the question posed on us, we found a suitable assassin: The trend of making Shared-Memory Systems more and more complex in terms of the architecture. For example all current x86-based systems (as announced this week at SC08) are cc-NUMA systems if you have more than one socket, and maybe we will eventually see NUMA (= non-uniform cache architecture) systems as well. So actually the hardware vendors have a chance to kill OpenMP by designing systems that are hard to exploit efficiently with multithreading. So the only chance to really kill OpenMP by 2011 is leaving it as it is and not equipping it with means to aid the programmer in squeezing performance out of such systems with an increasing depth of the memory hierarchy.  In terms of OpenMP, the world is still flat:

The World is still flat, no support for cc-NUMA (yet)!

OpenMP 3.0: The World is still flat, no support for cc-NUMA (yet)!

OpenMP is hardware agnostic, it has no notion of data locality.

The Affinity problem: How to maintain or improve the nearness of threads and their most frequently used data.

Or:

  • Where to run threads?

  • Where to place data?

Written by terboven

November 21, 2008 at 6:18 pm

Posted in OpenMP

Tagged with , , ,

A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1 of 2]

without comments

Since a while I am involved in several teaching activities on parallel programming and in my humble opinion this also includes talking about parallel computer architectures. As I am usually responsible for Shared-Memory parallel programming with OpenMP and TBB and the like, examples and exercises include learning about and tuning for the recent multi-core architectures we are using, namely Opteron-based and Xeon-based multi-socket systems. Well, understanding the perils of Shared-Memory parallel programming is not easy, but my impression is that several students are challenged when they are asked to carry the usual obstacles of parallel programming (e.g. load imbalance) forward to the context of different systems (e.g. UMA versus cc-NUMA). So this blog post has two goals: Examine and tune a sparse Matrix-Vector-Multiplication (SMXV) kernel on several architectures with (1) putting my oral explanations into text as a brief reference and (2) showing that one can do all the analysis and tuning work on Windows as well.

From school you probably know how to do a Matrix-Vector-Multiplication for dense matrices. In the field of high performance technical computing, you typically have to deal with sparse linear algebra (unless you do a LINPACK :-) benchmark). In my example, the matrix is stored in CRS format and has the following structure:

DROPS.

Matrix Structure Plot: DROPS.

The CRS format stores just the nonzero elements of the matrix in three vectors: The val-vector contains the values of all nonzero elements, the col-vector has the same dimension as the val-vector and contains the column indices for each nonzero element, the row-vector is of the same length as there are rows in the matrix (+1) and points to the first nonzero element index (in val and col) for each matrix row. While there a several different format to save sparse matrices, the CRS format is well-suited for matrices without special properties and allows for an efficient implementation of the Matrix-Vector-Multiplication kernel. The intuitive approach for a parallel SMXV kernel may look as shown below. Let Aval, Acol and Arow be the vector-based implementations of val, col and row:

01  #pragma omp parallel for private(sum, rowend, rowbeg, nz)
02     for (i = 0; i < num_rows; i++){
03        sum = 0.0;
04        rowend = Arow[i+1];
05        rowbeg = Arow[i];
06        for (nz=rowbeg; nz<rowend; ++nz)
07        {
08           sum += Aval[nz] * x[Acol[nz]];
09        }
10        y[i] = sum;
11     }

How good is this parallelization for the matrix as shown above? Lets take a look at a two-socket quad-core Intel Xeon E5450-based system (3.0 GHz), Below, I am plotting the performance in MFLOP/s for one to eight threads using just the plain Debug configuration of Visual Studio 2008 in which OpenMP has been enabled:


Intuitive Parallelization.

Performance plot of a parallel SMXV: Intuitive Parallelization.

The speedup for two threads (about 1.7) is not too bad, but the best speedup of just 2.1 is achieved with eight threads. It does not pay off significantly to use more than four threads. This is because the Frontside Bus has an insuperable limit of about eight GB/s in total and using dedicated memory bandwidth benchmarks (e.g. STREAM) one can see that this limit can already be reached with four threads (sometimes even using just two threads). Since we are working with a sparse matrix, most accesses are quasi-random and neither the hardware prefetcher nor the compiler inserting prefetch instructions can help us any more.

In many cases, thread binding can be of some help to improve the performance. The result of thread binding is also shown as Debug w/ “scatter” binding – using this approach the threads are distributed over the machine as far away from each other as possible. For example with two threads, each thread is running on a separate socket. This strategy has the advantage of using the maximal possible cache size, but does not improve the performance significantly for this application (or: Windows is already doing a similarly good job with respect to thread binding). Nevertheless, I will use the scattered thread binding strategy in all following measurements. Now, what can we do? Let’s try compiler optimization:

Compiler Optimization.

Performance plot of a parallel SMXV: Compiler Optimization.

Switching to the Release configuration does not require any work from the user, but results in a pretty nice performance improvement. I usually enabled architecture-specific optimization as well (e.g. SSE-support is enabled in the ReleaseOpt configuration), but that does not result in any further performance improvement for this memory-bound application / benchmark. Anyway, as the compiler has optimized our code for example with respect to cache utilization, this also increases the performance when using more than one thread!

In sequential execution (aka with one thread only) we get about 570 MFLOP/s. This is only a small fraction of the peak performance one core could deliver theoretically (1 core * 3 GHz * 4 instructions/sec = 12 GFLOP/s), but this is what you have to live with given the gap between CPU speed and memory speed. In order to improve the sequential performance, we would have to examine the matrix access pattern and re-arrange / optimize this with respect to the given cache hierarchy. But for now, I would rather like to think about the parallelization again: When you look at the matrix structure plot above, you will find that the density of nonzero elements is decreasing with the matrix rows counting. Our parallelization did not respect this, so we should expect to have a load imbalance limiting our parallelization. I used the Intel Thread Profiler (available on Windows as well as on Linux) to visualize this:

Load Imbalance with SMXV.

Intel Thread Profiler: Load Imbalance with SMXV.

The default for-loop scheduling in OpenMP is static (well, on all implementations I know), thus the iteration space is divided into as many chunks as we have threads, all of approximately equal size. So the first thread (T1 in the image above) gets the part of the matrix containing the more dense rows, thus it has more work to do than the other threads. Note: The reason why the Thread Profiler claims the threads two to four have “Barrier”-overhead instead of “Imbalance”-overhead is caused by my benchmark kernel, which looks slightly different than the code snippet above, but let’s ignore that differentiation here.

So, what can we do about it? Right, OpenMP allows for pretty easy and efficient ways of influencing the for-loop scheduling strategy. We just have to extend the line 01 of the code snippet above to look like this:

01 #pragma omp parallel for private(sum, rowend, rowbeg, nz) schedule(guided)

With guided scheduling, the initial chunks have an implementation-specific size which is decreased exponentially down to the chunksize specified, or 1 in our case. For the matrix with a structure as shown above, this results in a good load balance. So this is the performance we get including all optimization we discussed so far:

Load Balancing.

Performance plot of a parallel SMXV: Load Balancing.

We started with an non-optimized serial code delivering about 350 MFLOP/s and finished with a parallel code delivering about 1000 MFLOP/s! This is still far away from a linear scaling, but this is what you see in reality with complex (aka memory-bound) applications. Regarding these results, please note the following:

  • We did not apply any dataset-specific optimization. That means if the matrix structure changes (which it does over the time of a program run in the application I took this benchmark from) we will still do well and not run into any new load balance. This is clearly an advantage of OpenMP over manual threading!
  • We did not apply any architecture-specific optimization. This code will deliver a reasonable performance on most machines. But we did not yet take a look at cc-NUMA machines (e.g AMD Opteron-based or Intel Nehalem-based systems), this will be done in part 2. On a cc-NUMA system, there is a lot of performance to win or to loose, depending on if you are doing everything right or making a mistake.
  • Was anything in here OS-specific? No, it wasn’t. I did the experiments on Windows, but could have done everything on Linux in exactly the same way. More on this in the next post as well…

Written by terboven

November 2, 2008 at 9:40 pm

C++0x: OpenMP loop parallelization without pragmas?!

without comments

Some people are complaining that OpenMP’s approach of using pragmas to annotate a program is not very nice, as pragmas / the OpenMP directives are not well-integrated into the language. Personally, I like the OpenMP approach and think it has some specific advantages. But I am also very interested in researching how the OpenMP language bindings could be improved, especially for C++. This post is about using C++0x features to build parallelization constructs that have been praised in the context of other approaches (e.g. Parallel Extensions for C#, or Intel’s Threading Building Blocks), but using OpenMP constructs.

Let’s consider the following sequential loop which is very similar to the example used in the Microsoft Parallel Extensions to the .NET Framework 3.5 (June 2008) documentation:

01   double dStart, dEnd;
02   for (int rep = 0; rep < iNumRepetitions; rep++)
03   {04       dStart = omp_get_wtime();
05       for (int i = 0; i < iNumElements; i++)
06       {
07           vec[i] = compute(vec[i], iNumIterations);
08       }
09       dEnd = omp_get_wtime();10   }

The experiment loop (line 05 to line 08) is executed iNumRepetitions times, the time is taken in line 04 and 09 using OpenMP time measurement functions (portability!), and the time required for each element can be controlled via iNumIterations. I will use that parametrization for my performance experiments – for now let’s just look at how this would be parallelized in OpenMP:

#pragma omp parallel for shared(iNumElements, vec, iNumIterations)
        for (int i = 0; i < iNumElements; i++)
        {
            vec[i] = compute(vec[i], iNumIterations);
        }

Pretty straight forward – as this parallel loop is perfectly balanced, we do not need the schedule clause here. How could that loop look like without using pragmas? Maybe as shown here:

omp_pfor (0, iNumElements, [&](int i)
{
    vec[i] = compute(vec[i], iNumIterations);
});

Do you like that? No OpenMP pragma is visible in the user’s code, he just has to specify the loop iteration space and the loop variable, the parallelization is done “under the hood”. The implementation of this is pretty simple using lambda functions of C++0x:

template<typename F>
void omp_pfor(int start, int end, F x)
{
#pragma omp parallel for
    for(int __i = start; __i < end; __i++)
    {
        x(__i);
    }
}

Of course I am still using OpenMP directives here, but they are hidden as an implementation detail. The actual loop body is passed as an argument to the omp_pfor lambda function, as well as the loop boundaries. Please note that this is just a very simple example, of course one can handle all types of loops that are currently supported in OpenMP 3.0 (any maybe even more) and STL-type algorithms!

In this post I only talked about syntax, but there is more to it. A part of my research is looking into how programmers (especially from the background of computation engineering science in Aachen) can be provided with more powerful language-based tools to ease writing parallel and reusable code / components. I am always happy to discuss on such a topic – if you like the Live Space comment functionality as little as I do, just drop me a mail at christian@terboven.com.

You can download this example code from my website. In order to compile that code, I recommend using the latest Intel 11.0 beta compiler.

Written by terboven

September 11, 2008 at 9:18 am