First NDP benchmarks (2007)
Sparse matrix vector multiplication
This benchmark is explained it much detail in Data Parallel Haskell: a status report. Runtimes comparing to sequential C code on Intel Xeon (x86) and Sun SunFire9600 (Sparc) are in timecolour.png. The parallel Haskell code is more efficient from 2 PEs for the SunFire and from 4 PEs for the Xeon processors. We blame the low sequential performance for the Xeon on the lack of effort that has been put into generating good straightline code (both in the NCG and when compiling via C), this includes inadequate register allocation and lack of lowlevel optimisations.
The speedup for the Xeon box and the SunFire are in speedupcolour.png and the speedup for our 8x dualcore Opteron NUMA box is in serenityallspeedupcolour.png. The speedup of on the NUMA machine is limited by the memory bandwidth for smvm. When we only use one core per CPU, the benchmark scales much better. Moreover, the memory traffic/compute ratio is slightly more favourable when processing arrays of Float
s than when processing arrays of Double
s.
Connected components in undirected graphs
This is based on http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#concomp. Currently, two of the tree algorithms described there are implemented: AwerbuchShiloach and the hybrid algorithm. The random mate algorithm needs a dataparallel random number genenerator and is left for later.
The algorithms are interesting because they actually don't do a lot of computations; they mostly filter and copy edges. Thus, it is perhaps not unreasonable to expect that if they scale well, then so will more computationintensive ones. Also, they should make any inefficiencies introduced by missed fusion opportunities quite obvious.
Both algorithms take the number of nodes (n :: Int
) and an array of edges (es :: Int :*: Int
) and yield an array of Int
s of length n
where each node is assigned a number between 0
and n1
. Nodes which are assigned the same number are connected. The algorithms are described and compared in http://citeseer.ifi.unizh.ch/greiner94comparison.html. At the moment, I have only benchmarked them for a random graph with 1000000 nodes and 40000 edges; eventually, I'll add benchmarks for other kinds of graphs described in the paper. The benchmarks have been run on a dual Intel Xeon 2.8 GHz with two cores per processor which effectively gives us 4 processors overall.
The parallel versions are currently very much slower than the sequential ones, particularly so for AwerbuchShiloach. This is because fusion doesn't work for the parallel algorithms at the moment.
No benchmarks on 4 processors yet, as the 4th PE is currently busy.
AwerbuchShiloach
The sequential code is in http://darcs.haskell.org/packages/ndp/Data/Array/Parallel/test/nesl/concomp/AwShU.hs and the parallel in http://darcs.haskell.org/packages/ndp/Data/Array/Parallel/test/nesl/concomp/AwShUP.hs.
Version  Threads  Time (ms)  Speedup 
sequential  1600  
parallel  1  29800  
2  16800  1.8  
3  12800  2.3  
4  ???  ??? 
Hybrid
The sequential code is in http://darcs.haskell.org/packages/ndp/Data/Array/Parallel/test/nesl/concomp/HybU.hs and the parallel in http://darcs.haskell.org/packages/ndp/Data/Array/Parallel/test/nesl/concomp/HybUP.hs.
Version  Threads  Time (ms)  Speedup 
sequential  1850  
parallel  1  7450  
2  4600  1.6  
3  3800  2.0  
4  ???  ??? 
I haven't completely parallelised this one yet (it's only a matter of implementing some parallel combinators).
Attachments (3)

speedupcolour.png (6.0 KB)  added by 12 years ago.
Speedup of smvm on 2x dualcore Xeon and SunFire9600

timecolour.png (6.9 KB)  added by 12 years ago.
Runtime of smvm on 2x dualcore Xeon and SunFire9600, including sequential C reference implementation

serenityallspeedupcolour.png (7.7 KB)  added by 12 years ago.
Speedup of smvm on 8x dualcore Opteron for Float and Double, including the use of one core/CPU and two cores/CPU
Download all attachments as: .zip