<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Christian's corner on HPC</title>
	<atom:link href="http://terboven.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://terboven.wordpress.com</link>
	<description>A Blog on Parallel Programming - covering all OSes :-) - by Christian Terboven.</description>
	<lastBuildDate>Wed, 02 Dec 2009 18:03:57 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='terboven.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/ca15ae373518af13020a6ad3d697f507?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Christian's corner on HPC</title>
		<link>http://terboven.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://terboven.wordpress.com/osd.xml" title="Christian&#8217;s corner on HPC" />
		<item>
		<title>Daily cc-NUMA Craziness</title>
		<link>http://terboven.wordpress.com/2009/12/02/daily-cc-numa-craziness/</link>
		<comments>http://terboven.wordpress.com/2009/12/02/daily-cc-numa-craziness/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 18:03:57 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[NUMA]]></category>
		<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[Binding]]></category>
		<category><![CDATA[cc-NUMA]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10138</guid>
		<description><![CDATA[Since cc-NUMA architectures have become ubiquitous in the x86 server world, it is very important to optimize memory and thread or process placement, especially for Shared-Memory parallelization. In doing so I was pretty successful in optimizing several of our user codes for cc-NUMA architectures by introducing manual binding strategies. I like the cpuinfo tool that [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Since cc-NUMA architectures have become ubiquitous in the x86 server world, it is very important to optimize memory and thread or process placement, especially for Shared-Memory parallelization. In doing so I was pretty successful in optimizing several of our user codes for cc-NUMA architectures by introducing manual binding strategies. I like the <em>cpuinfo</em> tool that comes with Intel MPI 3.x a lot; it can be used to query how all the cores are related (e.g. which cores share a cache). Based on that output I used to figure out my strategies for every architecture that we have in our center or that I have access to elsewhere. However, during the last couple of days I observed some benchmark results that did not make much sense to me, and today I stumbled upon the cause &#8211; something I just did not expect. I will tell you in a second, but my statement is: Manual Binding can be a bad thing. Although one can achieve a nice speedup by doing it right, even experts can easily be fooled. Therefore it is high time to get access to a standardized interface to communicate with the threading runtime and the OS!</p>
<p>We have dual-socket Intel Nehalem-EP systems from two different vendors: Sun and HP. The Sun systems are intended for HPC and are equipped with Xeon X5570 (2.93 GHz) CPUs; the HP systems are intended for infrastructure services and are equipped with Xeon E5540 (2.53 GHz) CPUs. Anyhow, I got hold of both, put some jobs on the boxes and was really disappointed by the speedup measurements on the HP system. In investigating the reason for that I found out that the numbering of the logical cores differs between the two systems. Oh dear: two dual-socket systems with Intel Nehalem-EP processors, and in one system the cores 0 and 1 are on the same socket, but in the other system they are on different sockets. Let&#8217;s take a look at the output of <em>cpuinfo</em> on the Sun system:</p>
<table style="height:762px;" border="1" cellspacing="0" cellpadding="0" width="400">
<tbody>
<tr>
<td width="481" valign="top">
<h3 style="text-align:center;">Sun   Nehalem-EP   (linux)</h3>
</td>
</tr>
<tr>
<td width="481" valign="top">
<h4>Processor composition</h4>
<pre>Processors(CPU)   : 16
Packages(sockets) : 2
Cores per package : 4
Threads per core  : 2</pre>
</td>
</tr>
<tr>
<td width="481" valign="top">
<h3>Processor identification</h3>
<pre>Processor  Thread Id.  Core Id.  Package Id.
0          0           0         0
1          0           1         0
2          0           2         0
3          0           3         0
4          0           0         1
5          0           1         1
6          0           2         1
7          0           3         1
8          1           0         0
9          1           1         0
10         1           2         0
11         1           3         0
12         1           0         1
13         1           1         1
14         1           2         1
15         1           3         1</pre>
</td>
</tr>
<tr>
<td width="481" valign="top">
<h3>Placement on packages</h3>
<pre>Package Id.  Core Id.  Processors
0            0,1,2,3   (0,8)(1,9)(2,10)(3,11)
1            0,1,2,3   (4,12)(5,13)(6,14)(7,15)</pre>
</td>
</tr>
<tr>
<td width="481" valign="top">
<h3>Cache sharing</h3>
<pre>Cache  Size    Processors
L1     32 KB   (0,8)(1,9)(2,10)(3,11)
               (4,12)(5,13)(6,14)(7,15)
L2     256 KB  (0,8)(1,9)(2,10)(3,11)
               (4,12)(5,13)(6,14)(7,15)
L3     8 MB    (0,1,2,3,8,9,10,11)
               (4,5,6,7,12,13,14,15)</pre>
</td>
</tr>
</tbody>
</table>
<p>And this is the output on the HP system:</p>
<table style="height:762px;" border="1" cellspacing="0" cellpadding="0" width="400">
<tbody>
<tr>
<td width="481" valign="top">
<h3 style="text-align:center;">HP   Nehalem-EP   (linux)</h3>
</td>
</tr>
<tr>
<td width="481" valign="top">
<h4>Processor composition</h4>
<pre>Processors(CPU)   : 16
Packages(sockets) : 2
Cores per package : 4
Threads per core  : 2</pre>
</td>
</tr>
<tr>
<td width="481" valign="top">
<h3>Processor identification</h3>
<pre>Processor  Thread Id.  Core Id.  Package Id.
0          0           0         1
1          0           0         0
2          0           2         1
3          0           2         0
4          0           1         1
5          0           1         0
6          0           3         1
7          0           3         0
8          1           0         1
9          1           0         0
10         1           2         1
11         1           2         0
12         1           1         1
13         1           1         0
14         1           3         1
15         1           3         0</pre>
</td>
</tr>
<tr>
<td width="481" valign="top">
<h3>Placement on packages</h3>
<pre>Package Id.  Core Id.  Processors
1            0,2,1,3   (0,8)(2,10)(4,12)(6,14)
0            0,2,1,3   (1,9)(3,11)(5,13)(7,15)</pre>
</td>
</tr>
<tr>
<td width="481" valign="top">
<h3>Cache sharing</h3>
<pre>Cache  Size    Processors
L1     32 KB   (0,8)(1,9)(2,10)(3,11)
               (4,12)(5,13)(6,14)(7,15)
L2     256 KB  (0,8)(1,9)(2,10)(3,11)
               (4,12)(5,13)(6,14)(7,15)
L3     8 MB    (0,2,4,6,8,10,12,14)
               (1,3,5,7,9,11,13,15)
</pre>
</td>
</tr>
</tbody>
</table>
<p>Let&#8217;s take a closer look at these tables. Wherever you find the identification &#8216;processor&#8217;, this refers to the logical core as visible to the operating system. A &#8216;package&#8217; is a socket, and we have two &#8216;(hyper-)threads&#8217; per core. On the Sun system, the logical cores 0 and 1 are located on the same socket: the cores 0 to 3 are four full cores on socket 0 sharing one L3 cache, and the cores 0 to 7 are eight full cores on two sockets making use of all caches. On the HP system, the logical cores 0 and 1 are located on two different sockets: the cores 0 to 3 are four full cores spread over both sockets, so they use both L3 caches. I am not saying one of the two numberings is better &#8211; but if you use one machine to determine what is best for your application, put this into a start-up script and change machines in between your measurements, then you will be surprised (and not in a good way).</p>
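<p>To make the difference concrete, here is a minimal Python sketch. It is my own illustration and not part of <em>cpuinfo</em>: the rows are copied by hand from the two &#8216;Processor identification&#8217; tables above, and the helper names <em>sockets_used</em> and <em>numbering_matches</em> are assumptions made up for this example.</p>

```python
# Rows are (processor, thread id, core id, package id), copied by hand
# from the "Processor identification" tables above.
SUN = [
    (0, 0, 0, 0), (1, 0, 1, 0), (2, 0, 2, 0), (3, 0, 3, 0),
    (4, 0, 0, 1), (5, 0, 1, 1), (6, 0, 2, 1), (7, 0, 3, 1),
    (8, 1, 0, 0), (9, 1, 1, 0), (10, 1, 2, 0), (11, 1, 3, 0),
    (12, 1, 0, 1), (13, 1, 1, 1), (14, 1, 2, 1), (15, 1, 3, 1),
]
HP = [
    (0, 0, 0, 1), (1, 0, 0, 0), (2, 0, 2, 1), (3, 0, 2, 0),
    (4, 0, 1, 1), (5, 0, 1, 0), (6, 0, 3, 1), (7, 0, 3, 0),
    (8, 1, 0, 1), (9, 1, 0, 0), (10, 1, 2, 1), (11, 1, 2, 0),
    (12, 1, 1, 1), (13, 1, 1, 0), (14, 1, 3, 1), (15, 1, 3, 0),
]


def sockets_used(rows, processors):
    """Return the sorted package ids on which the given logical processors sit."""
    package_of = {proc: package for proc, _thread, _core, package in rows}
    return sorted({package_of[proc] for proc in processors})


def numbering_matches(rows, expected_package_of):
    """Sanity check: True if every listed processor sits on the expected package."""
    package_of = {proc: package for proc, _thread, _core, package in rows}
    return all(package_of.get(proc) == package
               for proc, package in expected_package_of.items())


if __name__ == "__main__":
    # The same "bind to processors 0-3" recipe lands on different sockets.
    print("Sun, processors 0-3 use packages", sockets_used(SUN, range(4)))
    print("HP,  processors 0-3 use packages", sockets_used(HP, range(4)))
```

<p>With the tables above, processors 0 to 3 sit on a single package on the Sun system but on both packages on the HP system. A check such as <em>numbering_matches(rows, {0: 0, 1: 0})</em> passes for the Sun numbering and fails for the HP numbering, which is the kind of guard a start-up script could run before applying a hard-coded binding.</p>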
<p>How is the core numbering determined? Well, the short answer is &#8220;not by the OS, but by the BIOS&#8221;; the honest answer is &#8220;I don&#8217;t know exactly&#8221;. The BIOS has a lot of influence. For example, section 5.2.17 of the <em>Advanced Configuration and Power Interface Specification</em> (ACPI: <a href="http://www.acpi.info/DOWNLOADS/ACPIspec40.pdf">http://www.acpi.info/DOWNLOADS/ACPIspec40.pdf</a>) describes a <em>System Locality Distance Information Table</em> (SLIT) that lists the distance between hardware resources on different NUMA nodes. In theory the OS kernel can make use of that table; it either does so, or it fills in constant values (e.g. 10 = local, 20 = remote) in case the table is empty. But the ACPI specification does not specify how the table is generated &#8211; that is up to the BIOS implementation itself, and probably up to BIOS settings. The important take-away is that (i) BIOS settings influence the core numbering scheme, (ii) obviously BIOS settings are not the same across vendors, (iii) the numbering can change over time anyhow and other OSes (e.g. Windows) do it differently -&gt; (iv) do not rely on the numbering scheme being static.</p>
<p>What should you do instead? We do not have a standardized way to influence the thread / process binding. Tools such as <em>numactl</em> (Linux) or <em>start /affinity</em> (Windows) accept core ids as argument, which is far from optimal. The same holds for explicit API calls to do the binding. Instead, the Intel compiler is following a good path: The environment variable <em>KMP_AFFINITY</em> can be used to define an explicit thread-to-core mapping, but it also accepts two strategies: <em>scatter</em> and <em>compact</em>. The idea of <em>scatter</em> is to bind the threads as far apart as possible (to use all the caches and to have all the memory bandwidth available); the idea of <em>compact</em> is to bind the threads as close together as possible (to profit from shared caches). Running a program with two threads using the <em>scatter</em> strategy on the Sun system results in binding thread 0 to the core set {0,8} and thread 1 to the core set {4,12} (-&gt; two sockets). The same experiment on the HP system results in binding thread 0 to the core set {1,9} and thread 1 to the core set {0,8} (-&gt; two sockets, again). This abstracts from the hardware / system details and allows the user, who might not be an HPC expert, to concentrate on optimizing the application by choosing from just two strategies, still getting &#8220;portable performance&#8221; on Intel CPUs. A portable thread binding interface is under discussion for OpenMP 3.1 (see my <a href="http://terboven.wordpress.com/2009/10/04/how-openmp-is-moving-towards-version-3-1-4-0/" target="_self">previous blog post</a>), and I am strongly in favor of letting the user choose among strategies. The one shortcoming of Intel&#8217;s current implementation occurs when you have multiple levels of Shared-Memory parallelization in one application and want to mix strategies &#8211; which might make sense. But this could easily be overcome. 
Let&#8217;s see what the future might bring; for now I just fixed my scripts to include a sanity check that the core numbering is indeed as expected&#8230;</p>
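<p>The idea behind the two strategies can be sketched in a few lines of Python. This is a simplified model under my own assumptions and not Intel&#8217;s actual algorithm: threads are bound to whole cores (both hyper-threads), packages and cores are visited in order of their ids, and the function names <em>cores_by_package</em>, <em>scatter</em> and <em>compact</em> are made up for this example. The topology rows are copied from the <em>cpuinfo</em> tables above.</p>

```python
from collections import defaultdict
from itertools import zip_longest

# Rows are (processor, thread id, core id, package id), copied by hand
# from the "Processor identification" tables of the two systems.
SUN = [
    (0, 0, 0, 0), (1, 0, 1, 0), (2, 0, 2, 0), (3, 0, 3, 0),
    (4, 0, 0, 1), (5, 0, 1, 1), (6, 0, 2, 1), (7, 0, 3, 1),
    (8, 1, 0, 0), (9, 1, 1, 0), (10, 1, 2, 0), (11, 1, 3, 0),
    (12, 1, 0, 1), (13, 1, 1, 1), (14, 1, 2, 1), (15, 1, 3, 1),
]
HP = [
    (0, 0, 0, 1), (1, 0, 0, 0), (2, 0, 2, 1), (3, 0, 2, 0),
    (4, 0, 1, 1), (5, 0, 1, 0), (6, 0, 3, 1), (7, 0, 3, 0),
    (8, 1, 0, 1), (9, 1, 0, 0), (10, 1, 2, 1), (11, 1, 2, 0),
    (12, 1, 1, 1), (13, 1, 1, 0), (14, 1, 3, 1), (15, 1, 3, 0),
]


def cores_by_package(rows):
    """Map each package id to its core sets (hyper-thread siblings), ordered by core id."""
    siblings = defaultdict(set)
    for proc, _thread, core, package in rows:
        siblings[(package, core)].add(proc)
    packages = defaultdict(list)
    for package, core in sorted(siblings):
        packages[package].append(sorted(siblings[(package, core)]))
    return packages


def scatter(rows, num_threads):
    """Spread threads over packages: first core of each package, then the second, ..."""
    packages = cores_by_package(rows)
    per_package = [packages[p] for p in sorted(packages)]
    order = [core for group in zip_longest(*per_package)
             for core in group if core is not None]
    return order[:num_threads]


def compact(rows, num_threads):
    """Pack threads together: fill all cores of one package before the next."""
    packages = cores_by_package(rows)
    order = [core for p in sorted(packages) for core in packages[p]]
    return order[:num_threads]


if __name__ == "__main__":
    for name, rows in (("Sun", SUN), ("HP", HP)):
        print(name, "scatter:", scatter(rows, 2), "compact:", compact(rows, 2))
```

<p>For two threads this simple model reproduces the core sets reported above for <em>scatter</em>: {0,8} and {4,12} on the Sun system, {1,9} and {0,8} on the HP system. The point is that the strategy is expressed in terms of packages and cores, so the caller never has to know the logical numbering.</p>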
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/12/02/daily-cc-numa-craziness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>
	</item>
		<item>
		<title>How OpenMP is moving towards version 3.1 / 4.0</title>
		<link>http://terboven.wordpress.com/2009/10/04/how-openmp-is-moving-towards-version-3-1-4-0/</link>
		<comments>http://terboven.wordpress.com/2009/10/04/how-openmp-is-moving-towards-version-3-1-4-0/#comments</comments>
		<pubDate>Sun, 04 Oct 2009 16:44:03 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[C++0X]]></category>
		<category><![CDATA[Loop Parallelization]]></category>
		<category><![CDATA[SC09]]></category>
		<category><![CDATA[Threading]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10125</guid>
		<description><![CDATA[Not yet carved in stone, but the current plan of the OpenMP Language Committee (LC) is to publish a draft OpenMP 3.1 standard for public comment by IWOMP 2010 and to have the OpenMP 3.1 specification finished for SC 2010 &#8211; given that the Architecture Review Board (ARB) accepts the new version. Bronis R. de [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Not yet carved in stone, but the current plan of the OpenMP Language Committee (LC) is to publish a draft OpenMP 3.1 standard for public comment by IWOMP 2010 and to have the OpenMP 3.1 specification finished for SC 2010 &#8211; given that the Architecture Review Board (ARB) accepts the new version. <a href="http://people.llnl.gov/desupinski1" target="_blank">Bronis R. de Supinski</a> (LLNL) has taken on the duty of acting as the chair of the LC and has since introduced some process changes. Besides weekly telephone conference calls, there are three face-to-face meetings per year, and attendance is required for voting rights. The first face-to-face meeting was held on June 1st and 2nd in Dresden, attached to IWOMP 2009; the second one was on September 22nd and 23rd in Chicago. This blog post is intended to report on the latter meeting and to present an overview of what is going on with OpenMP right now, obviously from my personal point of view.</p>
<p>In the course of resuming work on OpenMP after the 3.0 specification was published, the LC voted on the priority of (small) extensions and clarifications for 3.1 as well as new topics for 4.0. We ended up with 12 major topics and 5 subcommittees, as outlined in <a href="https://iwomp.zih.tu-dresden.de/downloads/IWOMP09_State_of_LC-deSupinski.pdf" target="_blank">Bronis&#8217; talk</a> during IWOMP 2009, which are still in use as identifiers of the different topics people are working on.</p>
<p><strong>1: Development of an OpenMP Error Model.</strong> This is the feature the LC people think OpenMP is missing most desperately, yet it has not received much effort so far. A subcommittee has been formed, led by Tim Mattson (Intel) and Michael Wong (IBM), and currently there are three proposals on the table for discussion: (i) an extension of the API routines and some constructs to return error codes or the introduction of a global error indication variable, (ii) an exception-based mechanism to catch errors, and (iii) a callback-based mechanism that allows reacting to errors based on their severity and origin. The absence of an error model is clearly a reason for not using OpenMP in applications with certain requirements on reliability, but introducing the wrong error model could easily spoil OpenMP for that audience. It seems that most LC people do not like error codes too much (I don&#8217;t either), and exceptions are not suitable for C and FORTRAN, so the third approach seems most promising: it allows a program to react to errors depending on the severity and still allows the compiler to ignore OpenMP if it is not enabled. In fact, this mechanism was <a href="http://www.springerlink.com/content/a5w71905t384h660/" target="_blank">proposed back in 2006 by Alex Duran (BSC) and friends</a> already. Since nothing has been decided yet, I guess the error model is targeted for OpenMP 4.0.</p>
<p><strong>2: Interoperability and Composability.</strong> This subcommittee is led by me (RWTH) and Bronis R. de Supinski (LLNL) and is looking for ways of allowing OpenMP to coexist with other threading packages, maybe even with other OpenMP runtime environments in the same application. We are also looking into how to allow the creation of parallel software components that can safely be plugged together, which I consider prominently missing in virtually all threading paradigms. This is a very broad topic and I would not name an OpenMP version as the target for solving it, but with a little bit of luck we can make some progress even for version 3.1. We have some ideas on the table of how to specify some basic aspects of OpenMP interacting with the native threading packages (POSIX-Threads on Linux/Unix, Win32-Threads on Windows), driven by application observations and known deficiencies in current OpenMP implementations. We might also attack the problem of orphaned reductions. I am less certain that we will solve the issue of allowing or detecting nested Worksharing constructs.</p>
<p><strong>3: Incorporating Tools Support into the OpenMP Specification.</strong> This was already on the feature wishlist for OpenMP 3.0, but there is hardly any activity regarding this topic. Most vendors provide their own tools to analyze the performance (or correctness) of OpenMP programs by making their own runtime talk to their specific tool, but this situation is far from optimal for research / academia tools. As early as 2004 there were some proposals (e.g. <a href="http://www.springerlink.com/content/jgrbgk6pqcd4bg59/" target="_blank">POMP by Bernd Mohr and friends</a>), but they did not make it into the specification or into actual implementations.</p>
<p><strong>4: Associating Computation or Memory across Workshares.</strong> Today, the world of OpenMP is flat (memory), so this topic is mostly about supporting cc-NUMA architectures in OpenMP. There are two subcommittees working on this issue. The first is led by Dieter an Mey (RWTH), and the goal is to standardize common practices (used in today&#8217;s applications) of dealing with cc-NUMA optimizations. If nothing gets in the way, OpenMP 3.1 will allow the user to bind threads to cores by either specifying an explicit mapping, or by telling the runtime a strategy (like compact vs. scatter). Of course there are more ideas (and features needed), like influencing the memory allocation scheme or using page migration if supported by the operating system or interacting with resource management systems (batch queuing systems), but these are very hard to specify in a portable and extensible fashion. The other subcommittee is led by Barbara Chapman (UH) and deals with thread team control. Using the Worksharing in OpenMP, it is very hard to dedicate a special task (e.g. I/O) to just one thread of the Parallel Region. There are applications asking for that, but I don&#8217;t see a proposal that the LC would agree on for 3.1. Nevertheless, they presented some interesting ideas at the last F2F based on HPCS language capabilities, which hopefully have the potential to influence OpenMP 4.0.</p>
<p><strong>5: Accelerators, GPUs and More.</strong> Of course we have to follow the trend / hype ;-). But since no one knows for sure in which directions the hardware is evolving, there are many different ideas on how to deal with this. Off the top of my head I can enumerate that PGI has some directives loosely based on OpenMP Worksharing (plus they have CUDA for FORTRAN), IBM has OpenMP for Cell with several ideas on extensions, BSC has a proposal that is in principle based on their *SS concept, and CAPS Entreprise has the HMPP constructs + compiler. In summary: No clear direction yet, nothing for OpenMP in the scope of 3.1.</p>
<p><strong>6: Transactional Memory and Thread Level Speculation.</strong> Some people thought that OpenMP might need something for Transactional Memory. To the best of my knowledge no one from the LC has done any work in this regard.</p>
<p><strong>7: Refinements to the OpenMP Tasking Model.</strong> There are two things that most people agree Tasks are missing: Dependencies and Reductions. With respect to the former, there were three proposals on the table from Grant Haab (Intel), Federico Massaioli (Caspur) and Alex Duran (BSC), and the BSC proposal looks most promising because it avoids deadlocks. It employs existing program variables to define the dependencies between tasks; for example, the result of a computation can be the input of another task. With a good portion of luck, Task Dependencies could actually make it into OpenMP 3.1, I think. With respect to the latter, namely Task Reductions, there has been only little progress so far.</p>
<p><strong>8: Extending OpenMP to C++0x and FORTRAN 2003.</strong> Since the C++0x standard dropped Concepts, the work that Michael Wong (IBM) and I (RWTH) had done so far became obsolete. To the best of my knowledge there has been no progress in investigating the opportunities or issues that could arise with FORTRAN 2003.</p>
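<p>The idea of expressing task dependencies through program variables can be illustrated with a small Python sketch. This is not OpenMP syntax and not the BSC proposal itself: the task names, the variable names and the helper <em>schedule</em> are assumptions made up for this example, and the sketch assumes that each variable is written by at most one task.</p>

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9 or newer


def schedule(tasks):
    """Order tasks so that each one runs after the tasks producing its inputs.

    tasks maps a task name to a pair (inputs, outputs) of sets of variable
    names. Assumption of this sketch: every variable has at most one producer.
    """
    producer = {}
    for name, (_inputs, outputs) in tasks.items():
        for var in outputs:
            producer[var] = name
    # For every task, collect the tasks that write one of its input variables.
    graph = {
        name: {producer[var] for var in inputs if var in producer}
        for name, (inputs, _outputs) in tasks.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    tasks = {
        "sum": ({"a", "b"}, {"c"}),   # reads a and b, writes c
        "scale": ({"a"}, {"b"}),      # reads a, writes b
        "load": (set(), {"a"}),       # reads nothing, writes a
    }
    print(schedule(tasks))
```

<p>The programmer only states which variables a task reads and writes; the ordering (here: load before scale before sum) follows from that. A real runtime would of course run independent tasks concurrently instead of producing one sequential order.</p>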
<p><strong>9: Extending OpenMP to Additional Languages.</strong> Well, there are Java and C#, and at least for Java there are some implementations of OpenMP available (incomplete, though). Anyhow, there was never any real attempt to write a formal specification of OpenMP for Java, nor for C#, and I don&#8217;t think there is one now.</p>
<p><strong>10: Clarifications to the Existing Specifications.</strong> The LC already approved several minor corrections (e.g. mistakes in the examples, improvements in the wording, and the like) that will make their way into OpenMP 3.1. Nothing spectacular, but this is something that has to be done.</p>
<p><strong>11: Miscellaneous Extensions.</strong> I might be wrong, but I think that User-defined Reductions (UDR) belong to this topic. Yes, there is a chance that UDRs will make it into OpenMP 3.1! This will bring obvious things like <em>min</em> and <em>max</em> for C and C++, but we are aiming higher: The goal is to enable the programmer to write any type of reduction operation for any type in the base language (including non-PODs), and this is achieved by introducing an OpenMP declare statement to define a reduction operation that can be specified in a reduction clause. There are two problems that are under discussion right now: (i) C++ templates and (ii) pointers / arrays. The first can be addressed by an extension of the current proposal and I got the feeling that most LC people like the new approach, but the second is a bit more complex. If you want to reduce an array that is described by a pointer, you need to know how much space to allocate for the thread private copy and how many elements the array consists of. There has been some discussion on this, but no strong agreement on how to solve this issue in general, as it also arises with the private, firstprivate, &#8230; clauses. We only agreed that we need a one-size-fits-all solution. With a good portion of luck we can solve this issue; otherwise we hopefully get UDRs with some limitations in OpenMP 3.1 and the full functionality in a later version of the specification.</p>
<p><strong>12: Additional Task / Threads Synchronization Mechanisms.</strong> Again I might be wrong, but I think that the Atomic Extension proposal by Grant Haab (Intel) belongs in here. This is a feature you will also find in threading-aware languages (such as C++0x), but the current base languages of OpenMP are not of that kind. This will almost certainly make it into OpenMP 3.1 and will allow for a portable way to write atomic updates that capture a value, as well as atomic writes. This is already supported by most machines, and using an atomic operation can be so much more efficient than using a Critical Region.</p>
<p>If you are interested in more details, you are invited to stop by the OpenMP booth at SC 2009 in Portland and ask the nice guy on booth duty some good questions :-)</p>
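<p>What a user-defined reduction does conceptually can be sketched in Python. This is not the proposed OpenMP syntax: the function <em>parallel_reduce</em> and its parameters are assumptions made up for this example. Every worker starts from a private copy initialised with the identity value, combines its share of the data with the user-supplied operation, and the partial results are combined at the end.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def parallel_reduce(data, combine, identity, num_threads=4):
    """Reduce data with a user-defined, associative and commutative combine operation.

    combine(a, b) must accept two values of the reduction type and return
    one; identity is the neutral element used to initialise private copies.
    """
    # Deal the elements out to the workers in a round-robin fashion.
    chunks = [data[i::num_threads] for i in range(num_threads)]

    def partial(chunk):
        private = identity  # stands in for the thread-private copy
        for item in chunk:
            private = combine(private, item)
        return private

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(partial, chunks))
    # Combine the per-worker results into the final value.
    return reduce(combine, partials, identity)


def merge_ranges(a, b):
    """User-defined operation on (low, high) pairs: the smallest enclosing range."""
    return (min(a[0], b[0]), max(a[1], b[1]))


if __name__ == "__main__":
    values = [3, 9, 1, 7, 4]
    print(parallel_reduce(values, max, float("-inf")))
    ranges = [(v, v) for v in values]
    print(parallel_reduce(ranges, merge_ranges, (float("inf"), float("-inf"))))
```

<p>The first call is the &#8216;obvious&#8217; <em>max</em> reduction; the second one reduces a small compound type with an operation the runtime knows nothing about. Note that the sketch sidesteps the pointer / array problem described above, because Python values carry their own size.</p>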
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/10/04/how-openmp-is-moving-towards-version-3-1-4-0/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>
	</item>
		<item>
		<title>HPCS 2009 Workshop material: OpenMP + Visual Studio</title>
		<link>http://terboven.wordpress.com/2009/07/01/hpcs-2009-workshop-material-openmp-visual-studio/</link>
		<comments>http://terboven.wordpress.com/2009/07/01/hpcs-2009-workshop-material-openmp-visual-studio/#comments</comments>
		<pubDate>Wed, 01 Jul 2009 11:40:27 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[Private]]></category>
		<category><![CDATA[Windows-HPC]]></category>
		<category><![CDATA[Debugging]]></category>
		<category><![CDATA[HPCS]]></category>
		<category><![CDATA[Loop Parallelization]]></category>
		<category><![CDATA[MPI]]></category>
		<category><![CDATA[Threading]]></category>
		<category><![CDATA[Visual Studio]]></category>
		<category><![CDATA[Windows HPC Server]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10121</guid>
		<description><![CDATA[As announced in a previous post already, I was involved in two workshops attached to the HPCS 2009, hosted by the HPCVL in Kingston, ON, Canada. Being back in the office now I found some time to upload my slide sets. Obviously I can only make my own slides public.
Using OpenMP 3.0 for Parallel Programming [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As announced in a previous post already, I was involved in two workshops attached to <a href="http://www.hpcs2009.org/" target="_blank">HPCS 2009</a>, hosted by the <a href="http://www.hpcvl.org/" target="_blank">HPCVL in Kingston, ON, Canada</a>. Now that I am back in the office, I found some time to upload my slide sets. Obviously I can only make my own slides public.</p>
<h3>Using OpenMP 3.0 for Parallel Programming on Multicore Systems [<a href="http://www.hpcs2009.org/abstracts/OpenMPAbstract.pdf" target="_blank">abstract</a>]</h3>
<p>Ruud van der Pas, Sun Microsystems; Dieter an Mey and Christian Terboven, RWTH Aachen University.</p>
<ul>
<li> <a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_OpenMP%7C_30/Tasking%7C_in%7C_OpenMP%7C_30.pdf" target="_blank">Tasking in OpenMP 3.0</a> (Christian Terboven).</li>
<li><a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_OpenMP%7C_30/OpenMP%7C_Tools%7C_Overview.pdf" target="_blank">Data Race Detection in OpenMP programs using the Sun Thread Analyzer</a> (Christian Terboven).</li>
<li> <a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_OpenMP%7C_30/OpenMP%7C_in%7C_the%7C_Real%7C_World.pdf" target="_blank">OpenMP in the Real World</a> (Christian Terboven and Dieter an Mey).</li>
</ul>
<h3>Parallel Programming in Visual Studio 2008 on Windows HPC Server 2008 [<a href="http://www.hpcs2009.org/abstracts/VisualStudioAbstract.pdf" target="_blank">abstract</a>]</h3>
<p>Christian Terboven, RWTH Aachen University.</p>
<ul>
<li><a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_Parallel%7C_Programming%7C_Visual%7C_Studio/01%7C_Windows%7C_HPC%7C_Server%7C_Overview.pdf" target="_blank">Windows HPC Server 2008: Overview</a> (Christian Terboven).</li>
<li><a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_Parallel%7C_Programming%7C_Visual%7C_Studio/02%7C_Windows%7C_HPC%7C_Server%7C_Users%7C_View.pdf" target="_blank">Windows HPC Server 2008: user&#8217;s point of view</a> (Christian Terboven).</li>
<li><a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_Parallel%7C_Programming%7C_Visual%7C_Studio/03%7C_Visual%7C_Studio%7C_2008.pdf" target="_blank">Using Microsoft Visual Studio 2008</a> (Christian Terboven).</li>
<li><a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_Parallel%7C_Programming%7C_Visual%7C_Studio/04%7C_HPC%7C_Tools%7C_Portfolio%7C_Shared%7C_Memory.pdf" target="_blank">HPC Tools Portfolio: Shared-Memory Parallelization on Windows</a> (Christian Terboven).</li>
<li><a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_Parallel%7C_Programming%7C_Visual%7C_Studio/05%7C_HPC%7C_Tools%7C_Portfolio%7C_Message%7C_Passing.pdf" target="_blank">HPC Tools Portfolio: Message-Passing with MPI on Windows</a> (Christian Terboven).</li>
<li><a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/HPCS09%7C_Workshop%7C_Parallel%7C_Programming%7C_Visual%7C_Studio/06%7C_Case%7C_Studies%7C_and%7C_VS2010.pdf" target="_blank">Case Studies &#8230; and an Outlook into the Future</a> (Christian Terboven).</li>
</ul>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/07/01/hpcs-2009-workshop-material-openmp-visual-studio/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>
	</item>
		<item>
		<title>Re: Book Review: C# 2008 and 2005 Thread Programming (Beginner’s Guide)</title>
		<link>http://terboven.wordpress.com/2009/06/28/re-book-review-c-2008-and-2005-thread-programming-beginner%e2%80%99s-guide/</link>
		<comments>http://terboven.wordpress.com/2009/06/28/re-book-review-c-2008-and-2005-thread-programming-beginner%e2%80%99s-guide/#comments</comments>
		<pubDate>Sun, 28 Jun 2009 16:55:41 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[Book Review]]></category>
		<category><![CDATA[Windows-HPC]]></category>
		<category><![CDATA[PLINQ]]></category>
		<category><![CDATA[Threading]]></category>
		<category><![CDATA[TPL]]></category>
		<category><![CDATA[Visual Studio]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10119</guid>
		<description><![CDATA[I was told that the book I covered in my last review can be had much cheaper over at Packt Publishing. There is also an eBook version available.
]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I was told that the book I covered in my last review can be had much cheaper over at <a href="http://www.packtpub.com/beginners-guide-for-C-sharp-2008-and-2005-threaded-programming/book" target="_blank">Packt Publishing</a>. There is also an eBook version available.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/06/28/re-book-review-c-2008-and-2005-thread-programming-beginner%e2%80%99s-guide/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>
	</item>
		<item>
		<title>Book Review: C# 2008 and 2005 Thread Programming (Beginner&#8217;s Guide)</title>
		<link>http://terboven.wordpress.com/2009/06/07/book-review-c-2008-and-2005-thread-programming-beginners-guide/</link>
		<comments>http://terboven.wordpress.com/2009/06/07/book-review-c-2008-and-2005-thread-programming-beginners-guide/#comments</comments>
		<pubDate>Sun, 07 Jun 2009 17:46:51 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[Book Review]]></category>
		<category><![CDATA[Windows-HPC]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[PLINQ]]></category>
		<category><![CDATA[Threading]]></category>
		<category><![CDATA[TPL]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10108</guid>
		<description><![CDATA[Just recently &#8211; in May 2009 &#8211; I gave two lectures on Multithreading with C# for Desktop Applications. I found there are quite a few books available that cover the .NET Thread class when talking about Windows programming in general, but the book C# 2008 and 2005 Threaded Programming: Beginner&#8217;s Guide is only about, well, [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Just recently &#8211; in May 2009 &#8211; I gave two lectures on Multithreading with C# for Desktop Applications. I found there are quite a few books available that cover the .NET <em>Thread</em> class when talking about Windows programming in general, but the book <a href="http://www.amazon.de/gp/product/1847197108?ie=UTF8&amp;tag=wwwterbovenco-21&amp;linkCode=as2&amp;camp=1638&amp;creative=6742&amp;creativeASIN=1847197108">C# 2008 and 2005 Threaded Programming: Beginner&#8217;s Guide</a><img style="border:none!important;margin:0!important;" src="http://www.assoc-amazon.de/e/ir?t=wwwterbovenco-21&amp;l=as2&amp;o=3&amp;a=1847197108" border="0" alt="" width="1" height="1" /> is only about, well, Multithreading with C#. The subtitle <em>Exploit the power of multiple processors for faster, more responsive software</em> also states that both algorithmic parallelization and the separation of computation from a graphical user interface (GUI) are covered here, and this is exactly what I was looking for. The book is clearly marked as a Beginner&#8217;s Guide and is well written in that respect, so if you already know about Multithreading and just want to learn how to do this with C#, you might find that the book proceeds too slowly. If you are uncertain or clearly new to this subject, then this book might do its job very well for you.</p>
<p>Chapters one and two start with a brief motivation of why the shift towards multicore processors has such an important influence on how software has to be designed and written nowadays and also contain a brief description of the typical pitfalls you may run into when parallelizing software. Chapter three describes the <em>BackgroundWorker</em> component, which is the simplest facility to separate the computation from the user interface in order to keep it responsive. Chapters four and five cover the most important aspects of the <em>Thread</em> class as well as how to use Visual Studio to debug multithreaded programs. Chapters six to nine describe how to apply parallelization to a range of common problems and design cases, for example how object-oriented features of C# and the garbage collector of .NET play along with the <em>Thread</em> class and what to take care of when doing Input/Output and Data Access. Chapter ten explains in detail how GUIs and Threads work together (or not) and how to design your GUI and your application to report progress to the GUI from threads, for example. When doing so there are some rules one has to obey, and I found the issues that I was not aware of before very well explained. Chapter eleven gives a brief overview of the .NET Parallel Extensions &#8211; which will be part of .NET 4.0 &#8211; such as the <em>Parallel</em> class and PLINQ. The final chapter twelve tries to put all things together into a single application.</p>
<p>Most aspects of Multithreading with C# are introduced by first stating a problem / motivation (with respect to the example code), then showing the solution in C# code and discussing the effects of it and finally explaining the concept in some more detail, if needed. The two example codes, a text message encryption and decryption software and an image analysis tool, are consistently extended with the new features that have been introduced. I personally did not like that there is so much example code shown in the book, although people new to Multithreading might find studying the source code helpful. With a strong focus on explaining and discussing examples, the book is not well-suited as a reference, but it does not claim to be one. Actually I think that once you are familiar with certain aspects of Multithreading with C#, <a href="http://msdn.microsoft.com/en-us/default.aspx" target="_blank">MSDN</a> does a good job of serving as a reference.</p>
<p>The book is published by Packt Publishing and was released in January 2009. The price of about 30 Euro for about 420 pages at amazon.de in Germany is affordable for students, I think. Many thanks to Radha Iyer at Packt Publishing for making this book available to me in time.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/06/07/book-review-c-2008-and-2005-thread-programming-beginners-guide/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>

		<media:content url="http://www.assoc-amazon.de/e/ir?t=wwwterbovenco-21&#38;l=as2&#38;o=3&#38;a=1847197108" medium="image" />
	</item>
		<item>
		<title>Upcoming Events in June 2009</title>
		<link>http://terboven.wordpress.com/2009/05/19/upcoming-events-in-june-2009/</link>
		<comments>http://terboven.wordpress.com/2009/05/19/upcoming-events-in-june-2009/#comments</comments>
		<pubDate>Tue, 19 May 2009 20:13:55 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[Private]]></category>
		<category><![CDATA[Windows-HPC]]></category>
		<category><![CDATA[Allinea]]></category>
		<category><![CDATA[HPCS]]></category>
		<category><![CDATA[ISC09]]></category>
		<category><![CDATA[IWOMP]]></category>
		<category><![CDATA[ScaleMP]]></category>
		<category><![CDATA[Supercomputing]]></category>
		<category><![CDATA[Visual Studio]]></category>
		<category><![CDATA[Windows HPC Server]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10102</guid>
		<description><![CDATA[Let me point you to some HPC events in June 2009.
5th International Workshop on OpenMP (IWOMP 2009) in Dresden, Germany. The IWOMP workshop series focuses on the development and usage of OpenMP. This year&#8217;s conference is titled Evolving OpenMP in an Age of Extreme Parallelism &#8211; I think this phrase is a bit funny, but [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Let me point you to some HPC events in June 2009.</p>
<p><strong>5th International Workshop on OpenMP (IWOMP 2009) in Dresden, Germany.</strong> The <a href="https://iwomp.zih.tu-dresden.de/" target="_blank">IWOMP</a> workshop series focuses on the development and usage of OpenMP. This year&#8217;s conference is titled <em>Evolving OpenMP in an Age of Extreme Parallelism</em> &#8211; I think this phrase is a bit funny, but nevertheless one can clearly observe a trend towards Shared-Memory parallelization on the nodes of even the extremely parallel machines. Attached to the conference is a two-day meeting of the OpenMP language committee. The language committee is currently discussing a long list of possible items for a future OpenMP 3.1 or 4.0 specification, including but not limited to my favorites <em>Composability (especially for C++)</em> and <em>Performance on cc-NUMA systems</em>. <a href="http://people.llnl.gov/desupinski1" target="_blank">Bronis de Supinski</a>, the recently appointed Chair of the OpenMP Language Committee, will give a talk on the current activities of the LC and what the future of OpenMP might look like &#8211; I hope the slides will be made public soon after the talk. Right before the conference there will also be a one-day tutorial for all people interested in learning OpenMP (mainly given by Ruud van der Pas &#8211; strongly recommended).</p>
<p><strong>High Performance Computing Symposium 2009 (HPCS) in Kingston, Canada.</strong> HPCS is a multidisciplinary conference that focuses on research involving High Performance Computing and this year it takes place in Kingston. I&#8217;ve never been to that conference series, so I am pretty curious what it will be like. Attached to the conference are a couple of <a href="http://www.hpcs2009.org/Workshops.html" target="_blank">workshops</a>, including <em>Using OpenMP 3.0 for Parallel Programming on Multicore Systems</em> &#8211; run again by Ruud van der Pas and us, and <em>Parallel Programming in Visual Studio 2008 on Windows HPC Server 2008</em> &#8211; organized by us as well. Here in Aachen, the interest in our Windows-HPC compute service is still growing steadily, and thus we usually have around 50 new participants in our bi-yearly training events. The HPCVL people asked explicitly to cover parallel programming on Windows in the OpenMP workshop, so we split this aspect out into a workshop of its own to cover it properly. The workshop program can be found <a href="http://www.hpcs2009.org/abstracts/VisualStudioAbstract.pdf" target="_blank">here</a>.</p>
<p><strong>International Supercomputing Conference (ISC 2009) in Hamburg, Germany.</strong> ISC bills itself as Europe&#8217;s premier HPC event &#8211; while this is probably true, it is of course smaller than the SC events in the US, but usually better organized. Without question you will find numerous interesting exhibits and can listen to several talks (mostly by invited speakers), so please excuse the self-marketing when I point you to the <em>Jülich Aachen Research Alliance (JARA) booth</em> in the research space, where we will show an interactive visualization of a large-scale numerical simulation (damage of blood cells by a ventricular device &#8211; pretty cool) as well as give an overview of our research activities focused on Shared-Memory parallelization (we will distribute OpenMP syntax references again). If you are interested in HPC software development on Windows, feel invited to stop by <em>our demo station at the Microsoft booth</em> where we will have many demos regarding <em>HPC Application Development on Windows</em> (Visual Studio, Allinea DDTlite and Vampir are confirmed, maybe more &#8230;). And if you are closely monitoring the HPC market, you have probably heard about <a href="http://www.scalemp.com/" target="_blank">ScaleMP</a> already, the company aggregating multiple x86 systems into a single (virtual) system over InfiniBand &#8211; obviously very interesting for Shared-Memory parallelization. If you are interested, you can hear about our experiences with this architecture for HPC.</p>
<p>If you want to meet up during any of these events just <a href="mailto:christian@terboven.com">drop me an email</a>.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/05/19/upcoming-events-in-june-2009/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>
	</item>
		<item>
		<title>Recap of the Second Meeting of the German Windows-HPC User Group</title>
		<link>http://terboven.wordpress.com/2009/04/09/recap-of-the-second-meeting-of-the-german-windows-hpc-user-group/</link>
		<comments>http://terboven.wordpress.com/2009/04/09/recap-of-the-second-meeting-of-the-german-windows-hpc-user-group/#comments</comments>
		<pubDate>Thu, 09 Apr 2009 21:20:33 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[Private]]></category>
		<category><![CDATA[Windows-HPC]]></category>
		<category><![CDATA[Windows HPC Server]]></category>
		<category><![CDATA[Windows-HPC UG]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10091</guid>
		<description><![CDATA[The Second Meeting of the German Windows-HPC User Group took place on March 30th and 31st in Dresden with about 80 participants. The ZIH of TU Dresden was the kind host of this event, which included both a series of talks and several booth places for various vendors. Having the event begin on Monday afternoon [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>The <a href="http://www.rz.rwth-aachen.de/aw/cms/rz/Themen/hochleistungsrechnen/aussendarstellung_/projekte/winhpc/~sbb/winhpc_user_group/?lang=en" target="_blank">Second Meeting of the German Windows-HPC User Group</a> took place on March 30th and 31st in Dresden with about 80 participants. The <a href="http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/" target="_blank">ZIH of TU Dresden</a> was the kind host of this event, which included both a series of talks and several booth places for various vendors. Having the event begin on Monday afternoon allowed for comfortable travel, and after the first part was over the social event took place in the <a href="http://www.luisenhof.org/" target="_blank">Luisenhof am Elbhang restaurant</a>, with a great view over the Elbtal. The second part of the presentation program ended on Tuesday afternoon. According to the feedback gathered so far, this format is very well accepted and will also be used for the next meeting on March 08th and 09th at Schloss Birlinghoven, to be hosted by the Fraunhofer Institute SCAI.</p>
<p>Personally, I think this event was successful. We had a good mix of technical talks as well as presentations of how and where Windows-HPC is used today. Some of these users had already presented at last year&#8217;s event and have made progress since, but most were new. The versions of the HPC ISV-Codes that will be released this year will be well-integrated into HPC Server 2008 (for some products it will be the second version supporting Windows-HPC), but it seems most codes still have quite some potential for out-of-the-box performance improvements. The service providers &#8211; such as cluster integrators &#8211; that were presenting and / or had booths at the event have gained their first experiences (good and bad, of course) with HPC on Windows. Many thanks to the sponsors for supporting this event, in alphabetical order: GNS Systems, Intel, Megware, Microsoft, Myricom, Sun and Transtec! The presentation program consisted of several sessions of which I will provide a brief summary in the rest of this post.</p>
<p>The event was opened by a brief welcome talk by Dieter an Mey (RWTH Aachen) and Wolfgang Dreyer (Microsoft) outlining the agenda of the next 1.5 days. The first main presentation was given by Matej Ciesko (Microsoft) and titled Highly-productive Computing with Windows-HPC (no slides). With respect to cc-NUMA awareness and thread scheduling, he explained the new features of the upcoming Windows Server 2008 R2, which will also be the foundation of the next HPC Server (v3). He also gave a brief overview of the activities Microsoft is undertaking to push parallel programming into the mainstream and which tools will be made available with Visual Studio 2010.</p>
<p>The second session was titled <em>HPC Infrastructure </em>and the first talk was on the evaluation of integrating Windows-HPC into existing computing environments (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Harry%7C_Schlagenhauf%7C_-%7C_Evaluation%7C_der%7C_Integration%7C_von%7C_Windows%7C_HPC%7C_in%7C_eine%7C_bestehende%7C_Berechnungsumgebung.pdf" target="_self">slides (de)</a>), given by Harry Schlagenhauf (Science+Computing). It was followed by Jürgen Gretzschel (MegWare) with an overview of MegWare’s activities and developments (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Juergen%7C_Gretzschel%7C_-%7C_HPC%7C_Cluster%7C_aus%7C_Sachsen.pps" target="_self">slides (de)</a>) in and for the HPC market. Markus Fischer (Myricom) presented on Myricom’s low-latency support of the Network Direct interface on both Myrinet and 10 Gb Ethernet networks (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Markus%7C_Fischer%7C_-%7C_Myricoms%7C_Low%7C_Latency%7C_Software%7C_Support%7C_for%7C_Network%7C_Direct.pdf" target="_self">slides (en)</a>). This session as well as the first day was closed by Wolfgang Nagel (TU Dresden), who talked about the HPC and tool development projects at the ZIH of TU Dresden as well as the goals and activities of the Gauß alliance in Germany (no slides).</p>
<p>The first session of the second day was captioned <em>Performance Analysis</em> and the first talk was given by Xavier Pillons (Microsoft) on exactly this topic (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Xavier%7C_Pillons%7C_-%7C_Performance%7C_Analysis%7C_and%7C_Tuning%7C_in%7C_Windows%7C_HPC%7C_Server.pdf" target="_self">slides (en)</a>). He presented and demoed various built-in tools of Windows to measure the performance of the whole system and single applications as well, putting a focus on the xperf tool of the Windows Performance Toolkit. He also talked about his best practices of troubleshooting system performance issues and about his experience of running LINPACK benchmarks for Top500 submissions on Windows. He was followed by Holger Brunst (TU Dresden) who presented the Vampir GUI on Windows (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Holger%7C_Brunst%7C_-%7C_Tuning%7C_Parallel%7C_MPI%7C_Programs%7C_with%7C_Vampir.pdf" target="_self">slides (de)</a>) and how to collect MPI traces. The last talk of this session was given by Christian Terboven (RWTH Aachen); he discussed the HPC tools portfolio for debugging, performance tuning and Shared-Memory parallelization on Windows (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/HPC%7C_Tools%7C_Portfolio%7C_-%7C_Tuning%7C_and%7C_Shared%7C_Memory%7C_Parallelization%7C_on%7C_Windows.pdf" target="_self">slides (en)</a>).</p>
<p>The next session was on <em>Running Windows-HPC Clusters</em> and also included the project presentations of the Microsoft academic program of last year. Michael Wirtz (RWTH Aachen) presented on how the cluster at RWTH Aachen University has been designed and is managed today (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Michael%7C_Wirtz%7C_-%7C_Windows%7C_2008%7C_HPC%7C_Cluster%7C_aus%7C_der%7C_Betreiberperspektive.pdf" target="_self">slides (de)</a>) and what the current challenges look like. Johannes Habich (University of Erlangen) talked about some of their projects on the Windows-HPC platform and about how they are doing the resource accounting for their HPC customers (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Johannes%7C_Habich%7C_-%7C_Erfahrungsbericht%7C_Windows%7C_HPC%7C_in%7C_Erlangen.pdf" target="_self">slides (de)</a>). Thomas Blümel (TU Dresden) reported on the problems of getting their cluster, sponsored by Microsoft and Dell, up and running and what they would like to see in order to make the setup process more straightforward (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Thomas%7C_Bluemel%7C_-%7C_Erste%7C_Erfahrungen%7C_mit%7C_dem%7C_Dell%7C_Cluster%7C_an%7C_der%7C_TU%7C_Dresden.pdf" target="_self">slides (de)</a>).</p>
<p>The 2008 Microsoft academic program was a student competition about bringing HPC codes to the Windows platform. Christopher Schleiden (RWTH Aachen) presented how a C++ 3D Navier-Stokes solver making use of libraries such as ParMETIS and DDD has been ported (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Christopher%7C_Schleiden%7C_-%7C_Windowsportierung%7C_und%7C_Performanceanalyse%7C_eines%7C_C++%7C_Navier%7C_Stokes%7C_Loesers.pdf" target="_self">slides (de)</a>) and how to get good performance. Roman Parys (University of Tübingen) reported on a project about data compression and volumetric rendering of giga-voxel data sets (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Roman%7C_Parys%7C_-%7C_Parallel%7C_Processing%7C_for%7C_Data%7C_Compression%7C_and%7C_Volumetric%7C_Rendering%7C_of%7C_Giga%7C_Voxel%7C_Data%7C_Sets.pdf" target="_self">slides (en)</a>) to drive a large-scale video cluster. Johannes Hoppe and Johannes Hofmeister (both University of Applied Sciences at Heidelberg) demoed their clustered neural network for pattern recognition in image and video data, written in .NET (no slides).</p>
<p>After the lunch break at the student cafeteria, the next session was titled <em>HPC Applications</em>. Karsten Reineck and Horst Schwichtenberg (both Fraunhofer SCAI) compared the performance of different ISV-Codes on Windows and Linux and, well, I got the impression that some ISVs still have quite a way to go (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Karsten%7C_Reineck%7C_-%7C_Linux%7C_Windows%7C_Performancevergleich%7C_ISV%7C_Simulationscodes.pdf" target="_self">slides (de)</a>). Peter Kirsch and Markus Kirsch (both ICT AG) compared the performance of Fluent for Windows-HPC on Myrinet versus Gb Ethernet and explained how to set up this application for a cluster (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Peter%7C_Kirsch%7C_-%7C_CFD%7C_Rechnung%7C_mit%7C_Fluent.pdf" target="_self">slides (de)</a>). Sorin Serban (Visual Numerics) demoed the IMSL numeric library and explained how the library has been parallelized with OpenMP and what performance improvements the user can expect from different parts of the library (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Sorin%7C_Serban%7C_-%7C_Die%7C_Integration%7C_von%7C_IMSL%7C_Numerischen%7C_Bibliotheken%7C_in%7C_HPC%7C_Entwicklungen.pdf" target="_self">slides (de)</a>).</p>
<p>The last session was named <em>HPC Perspectives</em>. Mario Deilmann (Intel) gave a talk that was split into two parts (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Mario%7C_Deilmann%7C_-%7C_Parallele%7C_Programmierparadigmen.pdf" target="_self">slides (en + de)</a>): The first was on the features and capabilities of Intel Parallel Studio and how it integrates in the Windows-HPC development story as an extension to Visual Studio; the second was on the Ct research project, a language-in-the-C++-language for throughput computing that was just announced to become available at Intel’s IDF. The closing talk was given by Torsten Langner (Microsoft), who introduced Microsoft’s offerings for Service Oriented Architectures (SOA) in the HPC world. With a very interesting example from the financial business, he explained what a WCF broker node is doing and what class of applications is expected to make use of this technology (<a href="http://cid-b7be35f701a0d7d4.skydrive.live.com/self.aspx/2nd%7C_German%7C_Windows%7C_HPC%7C_UG%7C_Meeting/Torsten%7C_Langner%7C_-%7C_Extending%7C_HPC%7C_Platform%7C_Technologies%7C_into%7C_the%7C_SOA%7C_Solutions%7C_Arena.pdf" target="_self">slides (en)</a>).</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/04/09/recap-of-the-second-meeting-of-the-german-windows-hpc-user-group/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>
	</item>
		<item>
		<title>A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1.5 of 2]</title>
		<link>http://terboven.wordpress.com/2009/02/18/a-performance-tuning-tale-optimizing-smxv-sparse-matrix-vector-multiplication-on-windows-part-15-of-2/</link>
		<comments>http://terboven.wordpress.com/2009/02/18/a-performance-tuning-tale-optimizing-smxv-sparse-matrix-vector-multiplication-on-windows-part-15-of-2/#comments</comments>
		<pubDate>Wed, 18 Feb 2009 16:01:37 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[Windows-HPC]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Lambda Function]]></category>
		<category><![CDATA[Load Balancing]]></category>
		<category><![CDATA[Loop Parallelization]]></category>
		<category><![CDATA[Loop Scheduling]]></category>
		<category><![CDATA[PLINQ]]></category>
		<category><![CDATA[SMXV]]></category>
		<category><![CDATA[TPL]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10066</guid>
		<description><![CDATA[Although it is high time to deliver the second part of this blog post series, I decided to squeeze in one additional post which I named part &#8220;1.5&#8221;, as it will cover some experiments with SMXV in C#. Since I am currently preparing a lecture named Multi-Threading for Desktop Systems (it will be held in [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Although it is high time to deliver the second part of this blog post series, I decided to squeeze in one additional post which I named part &#8220;1.5&#8221;, as it will cover some experiments with SMXV in C#. Since I am currently preparing a lecture named <em>Multi-Threading for Desktop Systems</em> (it will be held in German, though) in which C# plays an important role, we took a closer look at how parallelism has made its way into versions 3.5 and 4.0 of the .NET Framework. The final post will then cover some more tools and performance experiments (especially regarding cc-NUMA architectures) with the focus back on native coding.</p>
<p>First, let us briefly recap how the SMXV was implemented and examine what this can look like in C#. As explained in my previous post, the CRS format stores just the nonzero elements of the matrix in three vectors: The <em>val</em>-vector contains the values of all nonzero elements, the <em>col</em>-vector contains the column index of each nonzero element, and the <em>row</em>-vector points to the index (in <em>val</em> and <em>col</em>) of the first nonzero element of each matrix row. With one class to represent a CRS matrix and an array of doubles to represent a vector, the SMXV operation encapsulated by the operator* can be implemented like this, independent of whether you use managed or unmanaged arrays:</p>
<pre><span style="color:#000080;">public static double</span>[] <span style="color:#000080;">operator </span>*(<span style="color:#00ffff;">matrix_crs</span> lhs, <span style="color:#000080;">double</span>[] rhs)
{</pre>
<pre>   <span style="color:#000080;">double</span>[] result = <span style="color:#000080;">new double</span>[lhs.getNumRows()];</pre>
<pre>   <span style="color:#000080;">for </span>(<span style="color:#000080;">long </span>i = 0; i &lt; lhs.getNumRows(); ++i)</pre>
<pre>   {</pre>
<pre>      <span style="color:#000080;">double </span>sum = 0;</pre>
<pre>      <span style="color:#000080;">long </span>rowbeg = lhs.row(i);</pre>
<pre>      <span style="color:#000080;">long </span>rowend = lhs.row(i + 1);</pre>
<pre>      <span style="color:#000080;">for </span>(<span style="color:#000080;">long </span>nz = rowbeg; nz &lt; rowend; ++nz)</pre>
<pre>         sum += lhs.val(nz) * rhs[ lhs.col(nz) ];</pre>
<pre>      result[i] = sum;</pre>
<pre>   }</pre>
<pre>   <span style="color:#000080;">return </span>result;
}</pre>
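<p>To make the three CRS vectors concrete, here is a minimal sketch that stores a small 3x3 matrix in plain arrays and multiplies it with a vector. The matrix, the vector and the class name <em>CrsExample</em> are made up for illustration and are not taken from the <em>matrix_crs</em> class used in this post; calling string.Join on a double[] assumes .NET 4 or later.</p>
<pre>using System;

static class CrsExample
{
    static void Main()
    {
        // Dense form of the hypothetical example matrix:
        //   [ 1 0 2 ]
        //   [ 0 3 0 ]
        //   [ 4 0 5 ]
        double[] val = { 1, 2, 3, 4, 5 };   // nonzero values, stored row by row
        int[] col = { 0, 2, 1, 0, 2 };      // column index of each nonzero
        int[] row = { 0, 2, 3, 5 };         // first nonzero of each row, plus an end marker
        double[] x = { 1, 2, 3 };

        double[] y = new double[row.Length - 1];
        for (int i = 0; i &lt; y.Length; ++i)
        {
            double sum = 0;
            for (int nz = row[i]; nz &lt; row[i + 1]; ++nz)
                sum += val[nz] * x[col[nz]];
            y[i] = sum;
        }
        Console.WriteLine(string.Join(" ", y));   // prints "7 6 19"
    }
}</pre>
<p>Note that <em>row</em> has one entry more than the matrix has rows: the extra entry marks the end of the last row, which is why the loop can always read row[i + 1].</p>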
<p>We have several options to parallelize this code, which I will present and briefly discuss in the rest of this post.</p>
<p><strong>Threading.</strong> In this approach, the programmer is responsible for managing the threads and distributing the work onto the threads. It is not too hard to implement a static work-distribution for any given number of threads, but implementing a dynamic or adaptive work-distribution is a lot of work and also error-prone. In order to implement the static approach, we need an array of threads, have to compute the iteration chunk for each thread, put the threads to work and finally wait for the threads to finish their computation.</p>
<pre><span style="color:#008000;">//Compute chunks of work:</span></pre>
<pre><span style="color:#00ffff;">Thread</span>[] threads = <span style="color:#000080;">new </span><span style="color:#00ffff;">Thread</span>[lhs.NumThreads];</pre>
<pre><span style="color:#000080;">long </span>chunkSize = lhs.getNumRows() / lhs.NumThreads;</pre>
<pre><span style="color:#008000;">//Start threads with respective chunks:</span></pre>
<pre><span style="color:#000080;">for </span>(int t = 0; t &lt; threads.Length; ++t)</pre>
<pre>{</pre>
<pre>   threads[t] = <span style="color:#000080;">new </span><span style="color:#00ffff;">Thread</span>(<span style="color:#000080;">delegate</span>(<span style="color:#000080;">object </span>o)</pre>
<pre>   {</pre>
<pre>      <span style="color:#000080;">int </span>thread = (<span style="color:#000080;">int</span>)o;</pre>
<pre>      <span style="color:#000080;">long </span>firstRow = thread * chunkSize;</pre>
<pre>      <span style="color:#000080;">long </span>lastRow = (thread + 1) * chunkSize;</pre>
<pre>      <span style="color:#000080;">if </span>(thread == lhs.NumThreads - 1) lastRow = lhs.getNumRows();</pre>
<pre>      <span style="color:#000080;">for </span>(<span style="color:#000080;">long </span>i = firstRow; i &lt; lastRow; ++i)</pre>
<pre>      { <span style="color:#339966;">/* ... SMXV ... */</span> }</pre>
<pre>   });</pre>
<pre><span style="color:#008000;">   </span><span style="color:#008000;">//Start the thread and pass the ID:</span>
   threads[t].Start(t);</pre>
<pre>}</pre>
<pre><span style="color:#008000;">//Wait for all threads to complete:</span></pre>
<pre><span style="color:#000080;">for</span>(<span style="color:#000080;">int </span>t = 0; t &lt; threads.Length; ++t) threads[t].Join();
<span style="color:#000080;">return </span>result;</pre>
<p>Instead of managing the threads on our own, we could use the thread pool of the runtime system. From a usage point of view, this is equivalent to the version shown above, so I will not discuss this any further.</p>
<p><strong>Tasks.</strong> The problem with the approach discussed above is the static work-distribution, which may lead to load imbalances; implementing a dynamic work-distribution is error-prone and, depending on the code, may also be a lot of work. The goal should be to split the workload into smaller packages, but doing this with threads is not optimal: threads are quite costly in the sense that creating or destroying a thread takes quite a lot of time (in computer terms) since the OS is involved, and each thread also needs some amount of memory. A solution to this problem is <em>tasks</em>. Well, tasks are quite &#8220;in&#8221; nowadays, with many people thinking about how to program multicore systems, and therefore there are many definitions of what a task really is. I have given mine in previous posts on OpenMP and repeat it here briefly: A task is a small package consisting of some code to execute and some private data (access to shared data is possible, of course), which the runtime schedules for execution by a team of threads. It is actually pretty simple to parallelize the code from above using tasks: we have to manage a list of tasks and decide how much work each task should do (in terms of matrix rows), and of course we have to create and start the tasks and finally wait for them to finish. See below:</p>
<pre><span style="color:#339966;">//Set the size of the tasks:</span></pre>
<pre><span style="color:#00ffff;">List</span>&lt;<span style="color:#00ffff;">Task</span>&gt; taskList = <span style="color:#000080;">new </span><span style="color:#00ffff;">List</span>&lt;<span style="color:#00ffff;">Task</span>&gt;();</pre>
<pre><span style="color:#000080;">int </span>chunkSize = 1000;</pre>
<pre><span style="color:#339966;">//Create the tasks that calculate the parts of the result:</span></pre>
<pre><span style="color:#000080;">for </span>(<span style="color:#000080;">long </span>i = 0; i &lt; lhs.getNumRows(); i += chunkSize)</pre>
<pre>{</pre>
<pre>   taskList.Add(<span style="color:#00ffff;">Task</span>.Create(<span style="color:#000080;">delegate</span>(<span style="color:#000080;">object </span>o)</pre>
<pre>   {</pre>
<pre>      <span style="color:#000080;">long </span>chunkStart = (<span style="color:#000080;">long</span>)o;</pre>
<pre>      <span style="color:#000080;">for</span>(<span style="color:#000080;">long </span>index = (<span style="color:#000080;">long</span>)chunkStart;</pre>
<pre>      index &lt; System.<span style="color:#00ffff;">Math</span>.Min(chunkStart + chunkSize, lhs.getNumRows()); index++)</pre>
<pre>      { <span style="color:#339966;">/* ... SMXV ... */</span> }</pre>
<pre>   }, i));</pre>
<pre>}</pre>
<pre><span style="color:#339966;">//Wait for all tasks to finish:</span></pre>
<pre><span style="color:#00ffff;">Task</span>.WaitAll(taskList.ToArray());
<span style="color:#000080;">return </span>result;</pre>
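<p>The Task.Create call above comes from the pre-release TPL. In the released .NET Framework 4, tasks are started with Task.Factory.StartNew instead. The following is a minimal sketch of the same chunked scheme against that API; the plain CRS arrays, the class name <em>TaskSmxv</em> and the example data are assumptions for illustration and stand in for the <em>matrix_crs</em> class.</p>
<pre>using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class TaskSmxv
{
    // Plain CRS arrays stand in for the matrix_crs class (an assumption of this sketch).
    static double[] Multiply(double[] val, int[] col, int[] row, double[] x, int chunkSize)
    {
        int numRows = row.Length - 1;
        double[] result = new double[numRows];
        var tasks = new List&lt;Task&gt;();

        for (int start = 0; start &lt; numRows; start += chunkSize)
        {
            // The chunk start is passed as task state, so every task gets its own value.
            tasks.Add(Task.Factory.StartNew(state =&gt;
            {
                int first = (int)state;
                int last = Math.Min(first + chunkSize, numRows);
                for (int i = first; i &lt; last; ++i)
                {
                    double sum = 0;
                    for (int nz = row[i]; nz &lt; row[i + 1]; ++nz)
                        sum += val[nz] * x[col[nz]];
                    result[i] = sum;   // each row is written by exactly one task
                }
            }, start));
        }

        // Block until all chunks have been computed.
        Task.WaitAll(tasks.ToArray());
        return result;
    }

    static void Main()
    {
        double[] val = { 1, 2, 3, 4, 5 };
        int[] col = { 0, 2, 1, 0, 2 };
        int[] row = { 0, 2, 3, 5 };
        double[] y = Multiply(val, col, row, new double[] { 1, 2, 3 }, 2);
        Console.WriteLine(string.Join(" ", y));   // prints "7 6 19"
    }
}</pre>
<p>The chunk size remains a tuning knob: small chunks balance the load better but create more tasks, while large chunks approach the static distribution of the threading version.</p>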
<p><strong>Using the TPL.</strong> The downside of the approaches discussed so far is that we (= the programmers) have to distribute the work manually. In OpenMP, this is done by the compiler + runtime &#8211; at least when Worksharing constructs can be employed. In the case of for-loops, one would use Worksharing in OpenMP. With the upcoming .NET Framework version 4.0 there will be something similar (but not as powerful) available for C#: The <em>Parallel</em> class allows for the parallelization of for-loops when certain conditions are fulfilled (always think about possible data races!). Using it is pretty simple thanks to the support for delegates / lambda expressions in C#, as you can see below:</p>
<pre><span style="color:#00ffff;">Parallel</span>.For(0, (<span style="color:#000080;">int</span>)lhs.getNumRows(), <span style="color:#000080;">delegate</span>(<span style="color:#000080;">int </span>i)</pre>
<pre>{<span style="color:#339966;">
   /* ... SMXV ... */</span></pre>
<pre>});<span style="color:#000080;">
return </span>result;</pre>
<p>Nice? I certainly like this! It is very similar to Worksharing in the sense that you instrument your code with further knowledge to (incrementally) add parallelization, while it is also nicely integrated in the core language (which OpenMP isn&#8217;t). But you have to note that this Worksharing-like functionality is different from OpenMP in certain important aspects:</p>
<ul>
<li>Tasks are used implicitly. There is a significant difference between this parallel for-loop, which is implemented with tasks underneath, and Worksharing in OpenMP: Worksharing uses explicit threads that can be bound to cores / NUMA nodes, while tasks are scheduled onto threads by the runtime system. Performance will be discussed in my next blog post, but tasks can easily be moved between NUMA nodes, and that can really spoil your performance. OpenMP has no built-in support for affinity, but the tricks for dealing with Worksharing on cc-NUMA architectures are well-known.</li>
<li>The runtime system has full control. To my current knowledge, there is no reliable way of influencing how many threads will be used to execute the implicit tasks. Even more: I think this is by design. While it is probably nice for many users and applications when the runtime figures out how many threads should be used, this is bad for the well-educated programmer, who often has better knowledge of the application than the compiler + runtime could ever figure out (about data access patterns, for instance). If you want to fine-tune this parallelization, you have hardly any options (note: this is still beta and the options may change until .NET 4.0 is released). In OpenMP, you can influence the work-distribution in many respects.</li>
</ul>
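<p>Regarding the beta caveat in the second point: the released .NET Framework 4 offers ParallelOptions.MaxDegreeOfParallelism as an upper bound on the number of concurrently running loop bodies, and Partitioner.Create(from, to, rangeSize) to hand out fixed-size chunks of iterations. Neither of them binds threads to cores or NUMA nodes, so the locality concern from the first point remains. The following is a minimal sketch under these assumptions; the plain CRS arrays, the class name <em>BoundedParallelSmxv</em> and the parameter names are hypothetical.</p>
<pre>using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class BoundedParallelSmxv
{
    // Plain CRS arrays stand in for the matrix_crs class (an assumption of this sketch).
    static double[] Multiply(double[] val, int[] col, int[] row, double[] x,
                             int maxThreads, int rowsPerChunk)
    {
        int numRows = row.Length - 1;
        double[] result = new double[numRows];

        // Upper bound on concurrency; the runtime may still use fewer threads.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        // Each range is a Tuple (fromInclusive, toExclusive) of row indices.
        Parallel.ForEach(Partitioner.Create(0, numRows, rowsPerChunk), options, range =&gt;
        {
            for (int i = range.Item1; i &lt; range.Item2; ++i)
            {
                double sum = 0;
                for (int nz = row[i]; nz &lt; row[i + 1]; ++nz)
                    sum += val[nz] * x[col[nz]];
                result[i] = sum;
            }
        });
        return result;
    }

    static void Main()
    {
        double[] val = { 1, 2, 3, 4, 5 };
        int[] col = { 0, 2, 1, 0, 2 };
        int[] row = { 0, 2, 3, 5 };
        double[] y = Multiply(val, col, row, new double[] { 1, 2, 3 }, 2, 2);
        Console.WriteLine(string.Join(" ", y));   // prints "7 6 19"
    }
}</pre>
<p>This is closer in spirit to a chunked schedule in OpenMP than the plain Parallel.For, but which chunk ends up on which thread is still decided by the runtime.</p>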
<p><strong>PLINQ.</strong> LINQ stands for Language-Integrated Query and allows for declarative data access. When I first heard about this technology, it was demonstrated in the context of data access and I found it interesting, but not closely related to the parallelism I am interested in. Well, it turned out that PLINQ (LINQ + parallel) can be used to parallelize an SMXV code as well (the matrix_crs class has to implement the IEnumerable / IParallelEnumerable interface):</p>
<pre><span style="color:#000080;">public </span><span style="color:#000080;">static </span><span style="color:#000080;">double</span>[] <span style="color:#000080;">operator </span>*(<span style="color:#33cccc;">matrix_crs_plinq</span> lhs, double[] rhs)</pre>
<pre>{</pre>
<pre>   <span style="color:#000080;">var </span>res = <span style="color:#000080;">from </span>rowIndices <span style="color:#000080;">in </span>lhs.AsParallel().AsOrdered()</pre>
<pre>             <span style="color:#000080;">select </span>RowSum(rowIndices, lhs, rhs);</pre>
<pre>   <span style="color:#000080;">double</span>[] result = res.ToArray();</pre>
<pre>   <span style="color:#000080;">return </span>result;
}</pre>
<pre><span style="color:#000080;">public </span><span style="color:#000080;">static </span><span style="color:#000080;">double </span>RowSum(<span style="color:#000080;">long</span>[] rowIndices, <span style="color:#33cccc;">matrix_crs_plinq</span> lhs, <span style="color:#000080;">double</span>[] rhs)</pre>
<pre>{</pre>
<pre>   <span style="color:#000080;">double </span>rowSum = 0;</pre>
<pre>   <span style="color:#000080;">for </span>(<span style="color:#000080;">long </span>i = rowIndices[0]; i &lt; rowIndices[1]; i++)</pre>
<pre>   {</pre>
<pre>      rowSum += lhs.val(i) * rhs[lhs.col(i)];</pre>
<pre>   }</pre>
<pre>   <span style="color:#000080;">return </span>rowSum;
}</pre>
<p>Did you notice the AsParallel() in there? That is all you have to do once the required interfaces have been implemented. Would I recommend using PLINQ for this type of code? No, it is meant to parallelize queries on object collections and more general data sources (think of databases). But (for me at least) it is certainly interesting to see this paradigm applied to a code snippet from the scientific-technical world. As PLINQ uses the TPL internally, you will probably have the same issues regarding locality, although I have not looked into this too closely yet.</p>
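<p>If implementing the enumeration interfaces on the matrix class is not desired, the query can also run over the row indices instead. The following is a minimal sketch assuming the PLINQ API of the released .NET Framework 4 (ParallelEnumerable.Range, AsOrdered) and plain CRS arrays in place of the <em>matrix_crs_plinq</em> class; the class name <em>PlinqSmxv</em> and the example data are made up for illustration.</p>
<pre>using System;
using System.Linq;

static class PlinqSmxv
{
    // Plain CRS arrays stand in for the matrix_crs_plinq class (an assumption of this sketch).
    static double[] Multiply(double[] val, int[] col, int[] row, double[] x)
    {
        return ParallelEnumerable.Range(0, row.Length - 1)
            .AsOrdered()                 // keep the results in row order
            .Select(i =&gt;
            {
                double sum = 0;
                for (int nz = row[i]; nz &lt; row[i + 1]; ++nz)
                    sum += val[nz] * x[col[nz]];
                return sum;
            })
            .ToArray();
    }

    static void Main()
    {
        double[] val = { 1, 2, 3, 4, 5 };
        int[] col = { 0, 2, 1, 0, 2 };
        int[] row = { 0, 2, 3, 5 };
        double[] y = Multiply(val, col, row, new double[] { 1, 2, 3 });
        Console.WriteLine(string.Join(" ", y));   // prints "7 6 19"
    }
}</pre>
<p>The AsOrdered() call matters here: without it, PLINQ is free to return the row sums in a different order than the rows of the matrix.</p>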
<p>Let me give credit to Ashwani Mehlem, who is one of the student workers in our group. He did some of the implementation work (especially the PLINQ version) and code maintenance of the experiment framework.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/02/18/a-performance-tuning-tale-optimizing-smxv-sparse-matrix-vector-multiplication-on-windows-part-15-of-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>
	</item>
		<item>
		<title>Upcoming Events in March 2009</title>
		<link>http://terboven.wordpress.com/2009/02/16/upcoming-events-in-march-2009/</link>
		<comments>http://terboven.wordpress.com/2009/02/16/upcoming-events-in-march-2009/#comments</comments>
		<pubDate>Mon, 16 Feb 2009 18:04:28 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[Private]]></category>
		<category><![CDATA[Windows-HPC]]></category>
		<category><![CDATA[Windows-HPC UG]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10055</guid>
		<description><![CDATA[Let me use this well-linked position to announce three HPC events in March 2009.
Future Directions in High Performance Computing: 2009 &#8211; 2018. On Monday, March 23rd 2009, Dr. Horst Simon (Associate Laboratory Director for Computing Sciences at Lawrence Berkeley National Laboratory) will give a presentation on future directions in HPC. The talk will be given [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Let me use this well-linked position to announce three HPC events in March 2009.</p>
<p><strong>Future Directions in High Performance Computing: 2009 &#8211; 2018.</strong> On Monday, March 23rd 2009, Dr. Horst Simon (Associate Laboratory Director for Computing Sciences at Lawrence Berkeley National Laboratory) will give a presentation on future directions in HPC. The talk will be given in room 637 in the <a href="http://www.rwth-aachen.de/go/id/sgg/" target="_blank">SuperC</a> building of RWTH Aachen University at 11:00h. More information can be found at the <a href="http://www.rz.rwth-aachen.de/li/c/sng/lang/en/" target="_blank">associated webpage</a>.</p>
<p><strong>Parallel Programming in Computational Engineering and Science (PPCES).</strong> This event will continue the tradition of the annual SunHPC events that have taken place in Aachen since 2001, organized by RWTH Aachen University and Sun Microsystems. We are changing the format a little bit each year, reflecting changes in the hardware architecture (Sun UltraSparc to Intel Nehalem), the operating systems (Solaris-only to Linux plus Windows) and the programming languages (Fortran-dominated examples are disappearing). This year we intend to cover Linux, Windows and Solaris and offer all demos, exercises and examples on both Linux and Windows, demonstrating that Windows-HPC is a first-class citizen with respect to our user support. The event will take place in the lecture room and the PC pool of the Center for Computing and Communication at RWTH Aachen University. More information and the option to register can be found at the <a href="http://www.rz.rwth-aachen.de/go/id/sms/lang/en" target="_blank">event website</a>.</p>
<p><strong>2nd Meeting of the German Windows-HPC User Group (Win-HPC UG).</strong> The second meeting of the German Windows-HPC User Group will be hosted by the Technical University of Dresden and take place on March 30th and 31st. The agenda is not online yet (it will be shortly), but I am looking forward to some interesting talks. The event is sponsored by Intel, Microsoft, Megware, Sun and Transtec &#8211; we are hoping and doing our best to repeat the success of the <a href="http://www.rz.rwth-aachen.de/li/c/ryy/lang/en" target="_blank">predecessor event</a> that took place last year in Aachen. It was harder than usual to get technical experts from these companies for a talk, but in the end we are quite happy (if you think you have something interesting to share, please contact us immediately). More information can be found at the <a href="http://www.rz.rwth-aachen.de/go/id/siu/lang/en" target="_blank">event website</a>; our plan is to schedule all talks between Monday 14:00h and Tuesday 17:00h to allow for comfortable travel.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/02/16/upcoming-events-in-march-2009/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>
	</item>
		<item>
		<title>Running Windows 7</title>
		<link>http://terboven.wordpress.com/2009/01/13/running-windows-7/</link>
		<comments>http://terboven.wordpress.com/2009/01/13/running-windows-7/#comments</comments>
		<pubDate>Tue, 13 Jan 2009 07:48:23 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[Private]]></category>
		<category><![CDATA[Windows 7]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10050</guid>
		<description><![CDATA[
Sorry to bother you, but luckily I got the chance to be a beta tester for our Windows system group and just yesterday my work laptop has been upgraded to Windows 7.

Although I really liked Vista, Windows 7 is running so fast that it feels as if I got a new computer :-). In [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><div class="mceTemp mceIEcenter">
<p style="text-align:left;">Sorry to bother you, but luckily I got the chance to be a beta tester for our Windows system group, and just yesterday my work laptop was upgraded to Windows 7.</p>
<div id="attachment_10051" class="wp-caption aligncenter" style="width: 710px"><img class="size-full wp-image-10051" title="screenshot_windows7" src="http://terboven.files.wordpress.com/2009/01/screenshot_windows7.png?w=700&#038;h=259" alt="Windows 7" width="700" height="259" /><p class="wp-caption-text">Windows 7</p></div>
</div>
<p>Although I really liked Vista, Windows 7 is running so fast that it feels as if I got a new computer <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> . There was some whispering in the tech press about kernel changes and improvements, which I am curious to learn more about&#8230;</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2009/01/13/running-windows-7/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>

		<media:content url="http://terboven.files.wordpress.com/2009/01/screenshot_windows7.png" medium="image">
			<media:title type="html">screenshot_windows7</media:title>
		</media:content>
	</item>
	</channel>
</rss>