Posts Tagged ‘C++’
Book Review: C# 2008 and 2005 Thread Programming (Beginner’s Guide)
Just recently – in May 2009 – I gave two lectures on Multithreading with C# for Desktop Applications. I found there are quite a few books available that cover the .NET Thread class when talking about Windows programming in general, but the book C# 2008 and 2005 Threaded Programming: Beginner’s Guide is only about, well, Multithreading with C#. The subtitle Exploit the power of multiple processors for faster, more responsive software also states that both algorithmic parallelization as well as the separation of computation from a graphical user interface (GUI) is covered in here, and this is exactly what I was looking for. The book is clearly marked as a Beginner’s Guide and is well-written for that aspect, so if you already know about Multithreading and just want to learn about how to do this with C#, you might find the book to proceed too slowly. If you are uncertain or clearly new to this subject, then this book might do it’s job very well for you.
Chapters one and two start with a brief motivation of why the shift towards multicore processors has such an important influence on how software has to be designed and written nowadays and also contain a brief description of the typical pitfalls you may run into when parallelizing software. Chapter three describes the BackgroundWorker component, which is the simplest facility to separate the computation from the user interface in order to keep it responsible. Chapters four and five cover the most important aspects of the Thread class as well as how to use Visual Studio to debug multithreaded programs. Chapters six to nine describe how to apply parallelization to a range of common problems and design cases, for example howobject-oriented features of C# and the garbage collector of .NET play along with the Thread class and what to take care for when doing Input/Output and Data Access. Chapter ten explains in detail how GUIs and Threads work together (or not) and how to design you GUI and your application to report progress to the GUI from threads, for example. When doing so there are some rules one has to obey and I found the issues that I was not aware of before very well-explained. Chapter eleven gives a brief overview of the .NET Parallel Extensions – which will be part of .NET 4.0 – such as the Parallel class and PLINQ. The final chapter twelve tries to put all things together into a single application.
Most aspects of Multithreading with C# are introduced by first stating a problem / motivation (with respect to the example code), then showing the solution in C# code and discussing the effects of it and finally explaining the concept in some more detail, if needed. The two example codes, a text message encryption and decryption software and an image analysis tool, are consistently extended with the new features that have been introduced. I personally did not like that there is so much example code shown in the book, although people new to Multithreading might find studying the source code helpfull. With a strong focus on explaining and discussing example the book is not well-suited as a reference, but it does not say to do so. Actually I think that once you are familiar with certain aspects of Multithreading with C#, MSDN does a good job of serving as a reference.
The book is published by Packt Publishing and has been released in January 2009. The price of about 30 Euro for about 420 pages at amazon.de in Germany is affordable for students, I think. Regards to Radha Iyer at Packt Publishing for making this book available for me in time.
A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1.5 of 2]
Although it is high time to deliver the second part of this blog post series, I decided to squeeze in one additional post which I named part “1.5″, as it will cover some experiments with SMXV in C#. Since I am currently preparing a lecture named Multi-Threading for Desktop Systems (it will be held in German, though) in which C# plays an important role, we took a closer look into how parallelism has made it’s way into the .NET framework version 3.5 and 4.0. The final post will then cover some more tools and performance experiments (especially regarding cc-NUMA architectures) with the focus back on native coding.
First, let us briefly recap how the SMXV was implemented and examine how this can look like in C#. As explained in my previous post, the CRS format stores just the nonzero elements of the matrix in three vectors: The val-vector contains the values of all nonzero elements, the col-vector contains the column indices for each nonzero element and the row-vector points to the first nonzero element index (in val and col) for each matrix row. Having one class to represent a CRS matrix and using an array of doubles to represent a vector, the SMXV operation encapsulated by the operator* can be implemented like this, independent of whether you use managed or unmanaged arrays:
public static double[] operator *(matrix_crs lhs, double[] rhs) {
double[] result = new double[lhs.getNumRows()];
for (long i = 0; i < lhs.getNumRows(); ++i)
{
double sum = 0;
long rowbeg = lhs.row(i);
long rowend = lhs.row(i + 1);
for (long nz = rowbeg; nz < rowend; ++nz)
sum += lhs.val(nz) * rhs[ lhs.col(nz) ];
result[i] = sum;
}
return result;
}
We have several options to parallelize this code, which I wil present and briefly discuss in the rest of this post.
Threading. In this approach, the programmer is responsible for managing the threads and distributing the work onto the threads. It is not too hard to implement a static work-distribution for any given number of threads, but implementing a dynamic or adaptive work-distribution is a lot of work and also error-prone. In order to implement the static approach, we need an array of threads, have to compute the iteration chunk for each thread, put the threads to work and finally wait for the threads to finish their computation.
//Compute chunks of work:Thread[] threads = new Thread[lhs.NumThreads];
long chunkSize = lhs.getNumRows() / lhs.NumThreads;
//Start threads with respective chunks:
for (int t = 0; t < threads.Length; ++t)
{
threads[t] = new Thread(delegate(object o)
{
int thread = (int)o;
long firstRow = thread * chunkSize;
long lastRow = (thread + 1) * chunkSize;
if (thread == lhs.NumThreads - 1) lastRow = lhs.getNumRows();
for (long i = firstRow; i < lastRow; ++i)
{ /* ... SMXV ... */ }
});
//Start the thread and pass the ID: threads[t].Start(t);
}
//Wait for all threads to complete:
for(int t = 0; t < threads.Length; ++t) threads[t].Join(); return result;
Instead of managing the threads on our own, we could use the thread pool of the runtime system. From a usage point of view, this is equivalent to the version shown above, so I will not discuss this any further.
Tasks. The problem of the approach discussed above is the static work-distribution that may lead to load imbalances, and implementing a dynamic work-distribution is error-prone and depending on the code it also may be a lot of work. The goal should be to distribute the workload into smaller packages, but doing this with threads is not optimal: Threads are quite costly in the sense that creating or destroying a thread takes quite a lot of time (in computer terms) since the OS is involved, and threads also need some amount of memory. A solution for this problem are Tasks. Well, tasks are quite “in” nowadays with many people thinking on how to program multicore systems and therefore there are many definitions of what a task really is. I have given mine in previous posts on OpenMP and repeat it here briefly: A task is a small package consisting of some code to execute and some private data (access to shared data is possible, of course) which the runtime schedules for execution by a team of threads. Actually it is pretty simple to parallelize the code from above using tasks: We have to manage a list of tasks and have to decide how much work a task should do (in terms of matrix lines), and of course we have to create and start the tasks and finally wait for them to finish. See below:
//Set the size of the tasks:
List<Task> taskList = new List<Task>();
int chunkSize = 1000;
//Create the tasks that calculate the parts of the result:
for (long i = 0; i < lhs.getNumRows(); i += chunkSize)
{
taskList.Add(Task.Create(delegate(object o)
{
long chunkStart = (long)o;
for(long index = (long)chunkStart;
index < System.Math.Min(chunkStart + chunkSize, lhs.getNumRows()); index++)
{ /* ... SMXV ... */ }
}, i));
}
//Wait for all tasks to finish:
Task.WaitAll(taskList.ToArray()); return result;
Using the TPL. The downside of the approach discussed so far is that we (= the programmer) has to distribute the work manually. In OpenMP, this is done by the compiler + runtime – at least when Worksharing constructs can be employed. In the case of for-loops, one would use Worksharing in OpenMP, With the upcoming .NET Framework version 4.0 there will be something similar (but not so powerful) available for C#: The Parallel class allows for the parallelization of for-loops, when certain conditions are fulfilled (always think about possible Data Races!). Using it is pretty simple thanks to support for delegates / lambda expressions in C#, as you can see below:
Parallel.For(0, (int)lhs.getNumRows(), delegate(int i)
{
/* ... SMXV ... */
});
return result;
Nice? I certainly like this! It is very similar to Worksharing in the sense that you instrument your code with further knowledge to (incrementally) add parallelization, while it is also nicely integrated in the core language (which OpenMP isn’t). But you have to note that this Worksharing-like functionality is different from OpenMP in certain important aspects:
- Tasks are used implicitly. There is a significant difference between using tasks underneath to implement this parallel for-loop, and Worksharing in OpenMP: Worksharing uses explicit threads that can be bound to cores / numa nodes, while tasks are scheduled onto threads on the behalf of the runtime system. Performance will be discussed in my next blog post, but tasks can easily be moved between numa nodes and that can spoil your performance really. OpenMP has no built-in support for affinity, but the tricks how to deal with Worksharing on cc-NUMA architectures are well-known.
- Runtime system has full control. To my current knowledge, there is no reliably way of influencing how many threads will be used to execute the implicit tasks. Even more: I think this is by design. While it is probably nice for many users and applications when the runtime figures out how many threads should be used, this is bad for the well-educated programmer as he often has better knowledge of the application than the compiler + runtime could ever figure out (about data access pattern, for instance). If you want to fine-tune this parallelization, you have hardly any option (note: this is still beta and the options may change until .NET 4.0 will be released). In OpenMP, you can influence the work-distribution in many aspects.
PLINQ. LINQ stands for language-integrated query and allows for declarative data access. When I first heard about this technology, it was demonstrated in the context of data access and I found it interesting, but not closely related to the parallelism I am interested in. Well, it turned out that PLINQ (+ parallel) can be used to parallelize a SMXV code as well (the matrix_crs class has to implement the IEnumerable / IParallelEnumerable interface):
public static double[] operator *(matrix_crs_plinq lhs, double[] rhs)
{
var res = from rowIndices in lhs.AsParallel().AsOrdered()
select RowSum(rowIndices, lhs, rhs);
double[] result = res.ToArray();
return result;
}
public static double RowSum(long[] rowIndices, matrix_crs_plinq lhs, double[] rhs)
{
double rowSum = 0;
for (long i = rowIndices[0]; i < rowIndices[1]; i++)
{
rowSum += lhs.val(i) * rhs[lhs.col(i)];
}
return rowSum;
}
Did you recognized the AsParallel() in there? That is all you have to do, once the required interfaces have been implemented. Would I recommend using PLINQ for this type of code? No, it is meant to parallelize queries on object collections and more general data sources (think of databases). But (for me at least) it is certainly interesting to see this paradigm applied to a code snippet from the scientific-technical world. As PLINQ uses TPL internally, you will probably have the same issues regarding locality, although I did not look into this too closely yet.
Let me give credit to Ashwani Mehlem, who is one of the student workers in our group. He did some of the implementation work (especially the PLINQ version) and code maintenance of the experiment framework.
What is new in OpenMP 3.0 regarding C++
This blog post has been hanging around in my draft folder for quite some time now. I have been talking about the changes a lot during the last couple of days and I just found a few minutes to finish it to publish my notes here. From a C++ programmer’s point of view, OpenMP 3.0 comes with a few improvements and since I was involved in getting those into the specification, this post is about explaining what we achieved and why we did not (yet) went further with some features…
1: The for-Worksharing has been enhanced to include RandomAccessIterators and both signed and unsigned integers and even C-style pointers.
- RandomAccessIterators allow for constant-time increment, decrement, advance and distance computation, as they basically encapsulate pointer arithmetic. The 3.0 specification allows the use of loop variables of RandomAccessIterator type for loops used with for-Worksharing, as long as only the following relational operators are used inside the loop expression: <, <=, >, >=.
While it is certainly nice to have this in, many programmers (including me) will find the != operator missing. The reason for the exclusion is that not all language committee members could be convinced that the number of loop iterations could be computed beforehand when this operator would be allowed (or: that the overflow-behavior would be equivalent to the integer case). By the time we decided to stop adding / extending features, we did not had answers for all questions and theoretical counter-examples, so != did not make it; nevertheless we hope to get it into the next specification. - The 2.5 specification only allowed signed integer loop variables for loops used with for-Worksharing. This was bad, as it was incompatible with size_t, which is used to query the number of elements in a STL container, for example. The 3.0 specification allows both signed and unsigned integer variables, which is a rather small but nice improvement.
- C-style pointer loops have been allowed as well with the 3.0 specification. The same restrictions apply on the operator list as for the C++ RandomAccessIterator case.
2: It is now possible to threadprivatize static class member variables.
- It is important to note that only *static* class member variables can be made threadprivate. There have been certain use cases for this, for example the implementation of a Singleton pattern and thread-specific allocators, and the changes to the specification were minimal. This was probably just overlooked in the 2.5 specification process.
There were some requests to allow the privatization of class member variables in general, but this cannot be done by using the threadprivate clause since the address of general class member variables is not know at compile time.
3: We specified the lifetime and initialization of non-POD data types used in privatization clauses.
- The lifetime and initialization of non-POD data types was kind of unclear in the 2.5 specification and because of that, the behavior changed between different compilers. It was important to get this right, so we made some updates to the semantics of private variables of non-POD data types. There are a few things that are important to note:
- The order in which constructor calls and destructor calls for different threads happen is undefined. This is because we do not (want to) define the order in which threads are started, and you should never do any assumptions on that.
- The default constructor and destructor have to be accessible. If, for example, the default constructor is private, the program is non-conforming if such an object occurs in a private clause.
- We stated some things explicitly, e.g. that private objects have to be destructed at the end of a Parallel Region. This should force implementations to become consistent. Of course you should be aware that implementations are allowed to introduce additional objects of automatic storage duration if they “like”, this is granted by the C++ standard. The following lists a brief overview what happens with C++ non-POD data types for the different privatization clauses:
- private: There has to be an accessible, unambiguous default constructor which is called for each object, and the object is destructed at the end of the Parallel Region via the accessible, unambigous destructor.
- firstprivate: The private instances are copy-constructed, the argument for the copy constructor call is the original list item. Of course it is required for such a data type to have an accessible, unambiguous copy constructor.
- lastprivate: The value is written back to the original list item by using the accessible, unambiguous copy assignment operator. A suitable constructor has to be available unless the data type is used in a firstprivate clause, then a suitable copy constructor is needed.
- threadprivate: Here we have to differentiate three kinds of initialization: (i) no initialization, then the default constructor is called; (ii) direct initialization, then the constructor accepting the argument is called; (iii) copy initialization, then the copy constructor is called. In any case, the objects have to be constructed before the first reference, and have to be destructed after the last reference and before the program has been terminated.
- threadprivate+copyin and threadprivate+copyprivate: Regarding the initialization the rules for threadprivate apply for the first encountered Parallel Region, at any following Parallel Region the copy assignment operator is invoked.
This is just a brief summary of the changes we made, I hope this is of interest for at least some person other than me
. From my point of view, there is still one “simple” thing missing: Allowing non-POD data types in reductions. By the time we decided to stop adding / extending features, we did not find a consensus on how the initialization for the reduction may occur. This is imporant, because with overloading you basically can implement user-defined reductions. We really hope to have that in the next specification update!
C++0x: OpenMP loop parallelization without pragmas?!
Some people are complaining that OpenMP’s approach of using pragmas to annotate a program is not very nice, as pragmas / the OpenMP directives are not well-integrated into the language. Personally, I like the OpenMP approach and think it has some specific advantages. But I am also very interested in researching how the OpenMP language bindings could be improved, especially for C++. This post is about using C++0x features to build parallelization constructs that have been praised in the context of other approaches (e.g. Parallel Extensions for C#, or Intel’s Threading Building Blocks), but using OpenMP constructs.
Let’s consider the following sequential loop which is very similar to the example used in the Microsoft Parallel Extensions to the .NET Framework 3.5 (June 2008) documentation:
01 double dStart, dEnd; 02 for (int rep = 0; rep < iNumRepetitions; rep++) 03 {04 dStart = omp_get_wtime(); 05 for (int i = 0; i < iNumElements; i++) 06 { 07 vec[i] = compute(vec[i], iNumIterations); 08 } 09 dEnd = omp_get_wtime();10 }
The experiment loop (line 05 to line 08) is executed iNumRepetitions times, the time is taken in line 04 and 09 using OpenMP time measurement functions (portability!), and the time required for each element can be controlled via iNumIterations. I will use that parametrization for my performance experiments – for now let’s just look at how this would be parallelized in OpenMP:
#pragma omp parallel for shared(iNumElements, vec, iNumIterations) for (int i = 0; i < iNumElements; i++) { vec[i] = compute(vec[i], iNumIterations); }
Pretty straight forward – as this parallel loop is perfectly balanced, we do not need the schedule clause here. How could that loop look like without using pragmas? Maybe as shown here:
omp_pfor (0, iNumElements, [&](int i)
{
vec[i] = compute(vec[i], iNumIterations);
});
Do you like that? No OpenMP pragma is visible in the user’s code, he just has to specify the loop iteration space and the loop variable, the parallelization is done “under the hood”. The implementation of this is pretty simple using lambda functions of C++0x:
template<typename F> void omp_pfor(int start, int end, F x) { #pragma omp parallel for for(int __i = start; __i < end; __i++) { x(__i); } }
Of course I am still using OpenMP directives here, but they are hidden as an implementation detail. The actual loop body is passed as an argument to the omp_pfor lambda function, as well as the loop boundaries. Please note that this is just a very simple example, of course one can handle all types of loops that are currently supported in OpenMP 3.0 (any maybe even more) and STL-type algorithms!
In this post I only talked about syntax, but there is more to it. A part of my research is looking into how programmers (especially from the background of computation engineering science in Aachen) can be provided with more powerful language-based tools to ease writing parallel and reusable code / components. I am always happy to discuss on such a topic – if you like the Live Space comment functionality as little as I do, just drop me a mail at christian@terboven.com.
You can download this example code from my website. In order to compile that code, I recommend using the latest Intel 11.0 beta compiler.
Building and Using BOOST.MPI on Windows HPC Server 2008
If you are a MPI programmer, you probably know that there are C++ bindings for MPI, which nowadays come with most MPI distributions. Personally, I find they are ugly and do not provide any advantage over using the plain C bindings. In addition, there are even some disadvantages in using the C++ bindings, as I encountered that they are causing problems for several MPI analysis tools (under some circumstances). If you are intending to use the MPI C++ bindings on Windows, you will find that MS-MPI does not come with them. So if you really need them, I would advise to use Intel MPI on Windows – but if you are interested in better C++ bindings for MPI, I would advise to take a look at the BOOST.MPI bindings.
This brief blog post is intended to get you started building and using BOOST.MPI on Windows HPC Server 2008. I will provide just basic BOOST build instructions and point you to the “Getting Started on Windows”-guide at the BOOST homepage for more details. This is how we typically build BOOST on our systems, our student Christopher Schleiden figured out the nifty details:
- Download boost_1_36_0.zip (53.4 MB) and boost-jam-3.1.16-1-ntx86.zip (120 KB), which were the most current versions by the time of this writing (September 8, 2008).
- Extract both archives, here I used X:\src\boost_1_36_0 as the destination path of the boost package itself and X:\src\bin as the destination path of the bjam tool.
- Start a command shell and put the Visual Studio (2008) compiler in your path. You can easily get a suitable prompt via Start –> All Programs –> Microsoft Visual Studio 2008 –> Visual Studio Tools –> Visual Studio 2008 Command Prompt.
- Put the bjam tool in your path, e.g. via set PATH=%PATH%;x:\src\bin.
- Start the build process via X:\src\boost_1_36_0\boost_1_36_0> bjam –build-type=release –toolset=msvc –build-dir=x:\src\boost_1_36_0\build\90\32 –stagedir=x:\src\boost_1_36_0\stage\90-32 stage. This will take some time… A brief description on my options:
- –build-type: You can choose between release and debug, or build both. Please take into account that the debug build will consume a significant amount of disc space.
- –toolset: msvc stands for the Microsoft Visual Studio C/C++ compiler. On Windows you could also use the Intel C/C++ compiler and maybe even cygwin, but I never tried that.
- –build-dir: In order to avoid trouble you should specify a directory to contain the intermediate files created during the build process. As we support Visual Studio version 2005 and 2008 for 32bit and 64bit targets, we created an appropriate naming scheme.
- –stage-dir: This denotes the directory in which you intend to install boost.
Following this approach, the MPI bindings will be skipped! To enable the MPI library build process, you have to add –with-mpi to the bjam command line from above, or edit the user-config.jam file in subdirectory tools\build\v2 of your BOOST sources and add the line using mpi ; (notice the white spaces) to the bottom of that file – the latter is the approach preferred by me. If you are building on a Windows Compute Cluster 2003 machine, you will end up with the desired library. If you are building on a Windows HPC Server 2008 machine, you will receive the following error message:
MPI auto-detection failed: unknown wrapper compiler mpic++
This is because the MPI configuration just looks for MS-MPI v1, which typically resides in directory C:\Program Files\Microsoft Compute Cluster Pack. In order to make the auto-configuration look for MS-MPI v2, you have to edit the file mpi.jam in subdirectory tools\build\v2\tools and replace line 235 with the following:
local cluster_pack_path_native = “C:\\Program Files\\Microsoft HPC Pack 2008 SDK” ;
Future versions of BOOST will probably look for the MS-MPI v2 automatically. Of course this works as well on a Windows HPC Server 2008 as on a workstation having the HPC Pack SDK installed (e.g. on my notebook running Vista 32-bit). Once the stage target is completed, you will find five additional files in your BOOST target directory:
- libboost_mpi-vc90-mt.lib
- libboost_mpi-vc90-mt-1_36.lib
- boost_mpi-vc90-mt.lib
- boost_mpi-vc90-mt-1_36.dll
- boost_mpi-vc90-mt-1_36.lib
You will find some examples for BOOST.MPI on page describing the BOOST.MPI bindings. In order to build those, you have to make the following changes to your project:
- Add include path (C:\Program Files\Microsoft HPC Pack 2008 SDK\Include) and library path (C:\Program Files\Microsoft HPC Pack 2008 SDK\Lib\i386 for 32-bit applications or C:\Program Files\Microsoft HPC Pack 2008 SDK\Lib\amd64 for 64-bit applications) for MS-MPI v2 and add msmpi.lib to the linker input.
- Add BOOST to your include path (in my example: X:\src\boost_1_36_0\boost_1_36_0) and to your library path (in my example: X:\src\boost_1_36_0\stage\90-32\lib).
That’s all it takes! Note that if you have just compiled the release libraries of BOOST (as done in this example) and not the debug libraries, building your project using BOOST.MPI will fail (or cause you trouble), so for development purposes you should build both.