<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Christian's corner on HPC &#187; DROPS</title>
	<atom:link href="http://terboven.wordpress.com/tag/drops/feed/" rel="self" type="application/rss+xml" />
	<link>http://terboven.wordpress.com</link>
	<description>A Blog on Parallel Programming - covering all OSes :-) - by Christian Terboven.</description>
	<lastBuildDate>Mon, 28 Dec 2009 14:31:25 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='terboven.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/ca15ae373518af13020a6ad3d697f507?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Christian's corner on HPC &#187; DROPS</title>
		<link>http://terboven.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://terboven.wordpress.com/osd.xml" title="Christian&#8217;s corner on HPC" />
		<item>
		<title>A performance tuning tale: Optimizing SMXV (sparse Matrix-Vector-Multiplication) on Windows [part 1 of 2]</title>
		<link>http://terboven.wordpress.com/2008/11/02/a-performance-tuning-tale-optimizing-smxv-sparse-matrix-vector-multiplication-on-windows-part-1-of-2/</link>
		<comments>http://terboven.wordpress.com/2008/11/02/a-performance-tuning-tale-optimizing-smxv-sparse-matrix-vector-multiplication-on-windows-part-1-of-2/#comments</comments>
		<pubDate>Sun, 02 Nov 2008 20:40:03 +0000</pubDate>
		<dc:creator>terboven</dc:creator>
				<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[DROPS]]></category>
		<category><![CDATA[Load Balancing]]></category>
		<category><![CDATA[Loop Scheduling]]></category>
		<category><![CDATA[SMXV]]></category>
		<category><![CDATA[Thread Profiler]]></category>

		<guid isPermaLink="false">http://terboven.wordpress.com/?p=10</guid>
		<description><![CDATA[For a while now I have been involved in several teaching activities on parallel programming, and in my humble opinion this also includes talking about parallel computer architectures. As I am usually responsible for Shared-Memory parallel programming with OpenMP, TBB and the like, examples and exercises include learning about and tuning for [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>For a while now I have been involved in several teaching activities on parallel programming, and in my humble opinion this also includes talking about parallel computer architectures. As I am usually responsible for Shared-Memory parallel programming with OpenMP, TBB and the like, examples and exercises include learning about and tuning for the recent multi-core architectures we are using, namely Opteron-based and Xeon-based multi-socket systems. Understanding the perils of Shared-Memory parallel programming is not easy, and my impression is that several students struggle when they are asked to carry the usual obstacles of parallel programming (e.g. load imbalance) over to the context of different systems (e.g. UMA versus cc-NUMA). So this blog post examines and tunes a sparse Matrix-Vector-Multiplication (SMXV) kernel on several architectures, with two goals: (1) putting my oral explanations into text as a brief reference and (2) showing that one can do all the analysis and tuning work on Windows as well.</p>
<p>From school you probably know how to do a <a href="http://en.wikipedia.org/wiki/Matrix_multiplication" target="_blank">Matrix-Vector-Multiplication</a> for dense matrices. In the field of high performance technical computing, however, you typically have to deal with <a href="http://en.wikipedia.org/wiki/Sparse_matrix" target="_blank">sparse linear algebra</a> (unless you are running a LINPACK benchmark <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />). In my example, the matrix is stored in CRS (Compressed Row Storage) format and has the following structure:</p>
<div id="attachment_12" class="wp-caption aligncenter" style="width: 310px"><a href="http://terboven.files.wordpress.com/2008/11/matrix_sparsity_plot.jpg"><img class="size-medium wp-image-12" title="matrix_sparsity_plot" src="http://terboven.files.wordpress.com/2008/11/matrix_sparsity_plot.jpg?w=300&#038;h=193" alt="DROPS." width="300" height="193" /></a><p class="wp-caption-text">Matrix Structure Plot: DROPS.</p></div>
<p>The CRS format stores just the nonzero elements of the matrix in three vectors: The <em>val</em>-vector contains the values of all nonzero elements. The <em>col</em>-vector has the same length as the <em>val</em>-vector and contains the column index of each nonzero element. The <em>row</em>-vector has one entry per matrix row plus one; entry <em>i</em> is the index (in <em>val</em> and <em>col</em>) of the first nonzero element of row <em>i</em>, and the additional last entry marks the end of the last row. While there are several different formats for storing sparse matrices, the CRS format is well-suited for matrices without special properties and allows for an efficient implementation of the Matrix-Vector-Multiplication kernel. The intuitive approach for a parallel SMXV kernel may look as shown below. Let <span style="font-family:Courier New;">Aval</span>, <span style="font-family:Courier New;">Acol</span> and <span style="font-family:Courier New;">Arow</span> be the vector-based implementations of <em>val</em>, <em>col</em> and <em>row</em>:</p>
<pre><span style="color:blue;"><span style="color:#000000;">01  </span>#pragma </span>omp parallel <span style="color:blue;">for private<span style="color:#000000;">(sum, rowend, rowbeg, nz)
</span></span>02     <span style="color:blue;">for </span>(i = 0; i &lt; num_rows; i++){
03        sum = 0.0;
04        rowend = Arow[i+1];
05        rowbeg = Arow[i];
06        <span style="color:blue;">for </span>(nz=rowbeg; nz&lt;rowend; ++nz)
07        {
08           sum += Aval[nz] * x[Acol[nz]];
09        }
10        y[i] = sum;
11     }</pre>
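<p>To make the storage scheme concrete, here is a minimal sketch of a tiny matrix in CRS format together with a serial version of the kernel. The 4x4 matrix and the function name <em>smxv_serial</em> are made up for illustration and are not taken from DROPS; the sketch assumes double values, int indices and a C99 compiler.</p>

```c
#include <assert.h>

/* Hypothetical 4x4 example matrix (not from DROPS):
 *
 *     | 10  0  0  2 |
 *     |  0  3  0  0 |
 *     |  0  0  0  0 |
 *     |  1  0  4  5 |
 */
const double Aval[] = { 10.0, 2.0, 3.0, 1.0, 4.0, 5.0 }; /* nonzero values, row by row   */
const int    Acol[] = { 0, 3, 1, 0, 2, 3 };              /* column index of each value   */
const int    Arow[] = { 0, 2, 3, 3, 6 };                 /* start of each row + end mark */

/* Serial y = A * x for a matrix in CRS format. */
void smxv_serial(int num_rows, const double *val, const int *col,
                 const int *row, const double *x, double *y)
{
    assert(num_rows >= 0);
    for (int i = 0; i < num_rows; i++) {
        double sum = 0.0;
        /* Entries row[i] .. row[i+1]-1 belong to row i. An empty row
           (row[i] == row[i+1]) skips the inner loop and yields 0. */
        for (int nz = row[i]; nz < row[i + 1]; ++nz) {
            sum += val[nz] * x[col[nz]];
        }
        y[i] = sum;
    }
}
```

<p>With x = (1, 2, 3, 4), calling <em>smxv_serial(4, Aval, Acol, Arow, x, y)</em> gives y = (18, 6, 0, 33). Note how the empty third row shows up as two equal neighbouring entries in <em>Arow</em>.</p>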
<p>How good is this parallelization for the matrix shown above? Let’s take a look at a two-socket quad-core Intel Xeon E5450-based system (3.0 GHz). Below, I am plotting the performance in MFLOP/s for one to eight threads using just the plain <em>Debug</em> configuration of Visual Studio 2008 in which OpenMP has been enabled:</p>
<div id="attachment_11" class="wp-caption aligncenter" style="width: 310px"><a href="http://terboven.files.wordpress.com/2008/11/smxv_intuitive_parallelization.jpg"><img class="size-medium wp-image-11" title="smxv_intuitive_parallelization" src="http://terboven.files.wordpress.com/2008/11/smxv_intuitive_parallelization.jpg?w=300&#038;h=180" alt="Intuitive Parallelization." width="300" height="180" /></a><p class="wp-caption-text">Performance plot of a parallel SMXV: Intuitive Parallelization.</p></div>
<p>The speedup for two threads (about 1.7) is not too bad, but the best speedup of just 2.1 is achieved with eight threads. Using more than four threads does not pay off significantly. This is because the Frontside Bus has a hard limit of about 8 GB/s in total, and dedicated memory bandwidth benchmarks (e.g. <a href="http://www.cs.virginia.edu/stream/" target="_blank">STREAM</a>) show that this limit can already be reached with four threads (sometimes even with just two threads). Since we are working with a sparse matrix, most accesses are quasi-random, so neither the hardware prefetcher nor prefetch instructions inserted by the compiler can help us here.</p>
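<p>A rough back-of-the-envelope estimate illustrates why the memory bandwidth caps the performance. The sketch below is my own simplification, not a measurement: it assumes that only the <em>val</em> and <em>col</em> vectors cause memory traffic (ignoring x, y and the <em>row</em>-vector), that each nonzero costs one multiplication and one addition, and that 1 GB is 10^9 bytes. The function name and the data type sizes passed to it are assumptions as well.</p>

```c
#include <assert.h>

/* Optimistic upper bound on SMXV performance in MFLOP/s when every nonzero
 * has to be streamed from memory exactly once.
 * Assumptions (not from the measurements in this post): only the val and
 * col arrays cause memory traffic; each nonzero costs 2 floating-point
 * operations (one multiply, one add). */
double smxv_bandwidth_bound_mflops(double bandwidth_gb_per_s,
                                   double bytes_per_value,
                                   double bytes_per_index)
{
    assert(bytes_per_value + bytes_per_index > 0.0);
    const double bytes_per_nonzero = bytes_per_value + bytes_per_index;
    const double nonzeros_per_s = bandwidth_gb_per_s * 1.0e9 / bytes_per_nonzero;
    const double flops_per_nonzero = 2.0;
    return nonzeros_per_s * flops_per_nonzero / 1.0e6;
}
```

<p>Assuming 8-byte values and 4-byte column indices, <em>smxv_bandwidth_bound_mflops(8.0, 8.0, 4.0)</em> yields roughly 1333 MFLOP/s, independent of the number of threads. Under these assumptions, adding more cores cannot lift the performance beyond what the bus delivers.</p>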
<p>In many cases, thread binding can help to improve the performance. The result of thread binding is also shown in the plot above as <em>Debug w/ “scatter” binding</em>. With this approach, the threads are distributed over the machine as far away from each other as possible; for example, with two threads, each thread runs on a separate socket. This strategy has the advantage of using the maximum possible cache size, but it does not improve the performance significantly for this application (or: Windows is already doing a similarly good job with respect to thread binding). Nevertheless, I will use the scattered thread binding strategy in all following measurements. Now, what can we do? Let’s try compiler optimization:</p>
<div id="attachment_13" class="wp-caption aligncenter" style="width: 310px"><a href="http://terboven.files.wordpress.com/2008/11/smxv_compiler_optimization.jpg"><img class="size-medium wp-image-13" title="smxv_compiler_optimization" src="http://terboven.files.wordpress.com/2008/11/smxv_compiler_optimization.jpg?w=300&#038;h=178" alt="Compiler Optimization." width="300" height="178" /></a><p class="wp-caption-text">Performance plot of a parallel SMXV: Compiler Optimization.</p></div>
<p>Switching to the <em>Release</em> configuration does not require any work from the user, but results in a pretty nice performance improvement. I usually enable architecture-specific optimization as well (e.g. SSE support is enabled in the <em>ReleaseOpt</em> configuration), but that does not result in any further performance improvement for this memory-bound application / benchmark. As the compiler has optimized our code, for example with respect to cache utilization, the performance also increases when using more than one thread!</p>
<p>In sequential execution (i.e. with one thread only) we get about 570 MFLOP/s. This is only a small fraction of the peak performance one core could deliver theoretically (1 core * 3 GHz * 4 floating-point operations per cycle = 12 GFLOP/s), but this is what you have to live with given the gap between CPU speed and memory speed. In order to improve the sequential performance, we would have to examine the matrix access pattern and re-arrange / optimize it with respect to the given cache hierarchy. But for now, I would rather think about the parallelization again: When you look at the matrix structure plot above, you will find that the density of nonzero elements decreases as the row index increases. Our parallelization did not respect this, so we should expect a load imbalance limiting our parallelization. I used the <a href="http://www.intel.com/cd/software/products/asmo-na/eng/286749.htm" target="_blank">Intel  Thread Profiler</a> (available on Windows as well as on Linux) to visualize this:</p>
<div id="attachment_14" class="wp-caption aligncenter" style="width: 310px"><a href="http://terboven.files.wordpress.com/2008/11/load_imbalance_threadprofiler_4threads.png"><img class="size-medium wp-image-14" title="load_imbalance_threadprofiler_4threads" src="http://terboven.files.wordpress.com/2008/11/load_imbalance_threadprofiler_4threads.png?w=300&#038;h=132" alt="Load Imbalance with SMXV." width="300" height="132" /></a><p class="wp-caption-text">Intel Thread Profiler: Load Imbalance with SMXV.</p></div>
<p>The default for-loop scheduling in OpenMP is <em>static</em> (well, on all implementations I know), thus the iteration space is divided into as many chunks as we have threads, all of approximately equal size. So the first thread (T1 in the image above) gets the part of the matrix containing the denser rows, and thus has more work to do than the other threads. Note: The Thread Profiler reports “Barrier” overhead instead of “Imbalance” overhead for threads two to four because my benchmark kernel looks slightly different from the code snippet above, but let’s ignore that distinction here.</p>
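<p>The imbalance can also be estimated without a profiler, directly from the <em>row</em>-vector. The following sketch counts how many nonzeros each thread would process if the rows were split into contiguous blocks of almost equal size. The function name is hypothetical, and the way leftover rows are handed out (one extra row for the first few threads) is an assumption; an actual OpenMP runtime may split the remainder differently.</p>

```c
#include <assert.h>

/* Count the nonzeros each thread would process if num_rows rows were split
 * into num_threads contiguous blocks of (almost) equal size, as a static
 * schedule does. How leftover rows are distributed is an assumption here:
 * the first (num_rows % num_threads) threads get one extra row. */
void nonzeros_per_thread_static(int num_rows, const int *Arow,
                                int num_threads, int *nnz_per_thread)
{
    assert(num_rows >= 0 && num_threads > 0);
    const int base  = num_rows / num_threads;
    const int extra = num_rows % num_threads;
    int first = 0;                                  /* first row of the current block */
    for (int t = 0; t < num_threads; t++) {
        const int count = base + (t < extra ? 1 : 0);
        const int last  = first + count;            /* one past the block's last row */
        /* Arow[last] - Arow[first] is the number of nonzeros in rows first..last-1. */
        nnz_per_thread[t] = Arow[last] - Arow[first];
        first = last;
    }
}
```

<p>For a made-up 8-row matrix with 8, 6, 4, 3, 2, 1, 1 and 1 nonzeros per row and four threads, the first thread gets 14 nonzeros while the last one gets only 2, although both own two rows. This is the same kind of imbalance the Thread Profiler shows for the real matrix.</p>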
<p>So, what can we do about it? OpenMP offers pretty easy and efficient ways of influencing the for-loop scheduling strategy. We just have to extend line 01 of the code snippet above to look like this:</p>
<pre><span style="font-family:Courier New;"><span style="color:blue;">01 #pragma </span>omp parallel </span><span style="color:blue;"><span style="font-family:Courier New;">for private</span><span style="color:#000000;"><span style="font-family:Courier New;">(sum, rowend, rowbeg, nz) schedule(guided)</span></span></span></pre>
<p>With guided scheduling, the initial chunks have an implementation-specific size, which then decreases exponentially down to the specified chunk size, or 1 in our case since we did not specify one. For the matrix with the structure shown above, this results in a good load balance. So this is the performance we get with all the optimizations we discussed so far:</p>
<div id="attachment_15" class="wp-caption aligncenter" style="width: 310px"><a href="http://terboven.files.wordpress.com/2008/11/smxv_binding_balancing.jpg"><img class="size-medium wp-image-15" title="smxv_binding_balancing" src="http://terboven.files.wordpress.com/2008/11/smxv_binding_balancing.jpg?w=300&#038;h=179" alt="Load Balancing." width="300" height="179" /></a><p class="wp-caption-text">Performance plot of a parallel SMXV: Load Balancing.</p></div>
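<p>For reference, the kernel with guided scheduling can be wrapped into a complete function as sketched below. This is not the exact benchmark kernel used for the measurements; the function name and signature are my own choice for illustration, and the sketch assumes a C99 (or C++) compiler so that the loop variables can be declared inside the loops.</p>

```c
#include <assert.h>

/* y = A * x for a CRS matrix, parallelized over rows with guided scheduling.
 * Declaring sum and nz inside the loop body makes them private to each
 * thread automatically, so no private clause is needed. If OpenMP is not
 * enabled in the compiler, the pragma is ignored and the loop runs serially. */
void smxv_guided(int num_rows, const double *Aval, const int *Acol,
                 const int *Arow, const double *x, double *y)
{
    assert(num_rows >= 0);
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < num_rows; i++) {
        double sum = 0.0;
        for (int nz = Arow[i]; nz < Arow[i + 1]; ++nz) {
            sum += Aval[nz] * x[Acol[nz]];
        }
        y[i] = sum;   /* each row is written by exactly one thread */
    }
}
```

<p>Since every row of y is written by exactly one iteration, the result does not depend on the scheduling strategy; only the distribution of work among the threads changes.</p>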
<p>We started with a non-optimized serial code delivering about 350 MFLOP/s and finished with a parallel code delivering about 1000 MFLOP/s! This is still far away from linear scaling, but this is what you see in reality with complex (aka memory-bound) applications. Regarding these results, please note the following:</p>
<ul>
<li>We did not apply any dataset-specific optimization. That means if the matrix structure changes (which it does over the time of a program run in the application I took this benchmark from) we will still do well and not run into any new load imbalance. This is clearly an advantage of OpenMP over manual threading!</li>
<li>We did not apply any architecture-specific optimization. This code will deliver reasonable performance on most machines. But we did not yet take a look at cc-NUMA machines (e.g. AMD Opteron-based or Intel Nehalem-based systems); this will be done in part 2. On a cc-NUMA system, there is a lot of performance to win or to lose, depending on whether you are doing everything right or making a mistake.</li>
<li>Was anything in here OS-specific? No, it wasn’t. I did the experiments on  Windows, but could have done everything on Linux in exactly the same way. More  on this in the next post as well…</li>
</ul>
</div>]]></content:encoded>
			<wfw:commentRss>http://terboven.wordpress.com/2008/11/02/a-performance-tuning-tale-optimizing-smxv-sparse-matrix-vector-multiplication-on-windows-part-1-of-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">terboven</media:title>
		</media:content>

		<media:content url="http://terboven.files.wordpress.com/2008/11/matrix_sparsity_plot.jpg?w=300" medium="image">
			<media:title type="html">matrix_sparsity_plot</media:title>
		</media:content>

		<media:content url="http://terboven.files.wordpress.com/2008/11/smxv_intuitive_parallelization.jpg?w=300" medium="image">
			<media:title type="html">smxv_intuitive_parallelization</media:title>
		</media:content>

		<media:content url="http://terboven.files.wordpress.com/2008/11/smxv_compiler_optimization.jpg?w=300" medium="image">
			<media:title type="html">smxv_compiler_optimization</media:title>
		</media:content>

		<media:content url="http://terboven.files.wordpress.com/2008/11/load_imbalance_threadprofiler_4threads.png?w=300" medium="image">
			<media:title type="html">load_imbalance_threadprofiler_4threads</media:title>
		</media:content>

		<media:content url="http://terboven.files.wordpress.com/2008/11/smxv_binding_balancing.jpg?w=300" medium="image">
			<media:title type="html">smxv_binding_balancing</media:title>
		</media:content>
	</item>
	</channel>
</rss>