Changes between Initial Version and Version 1 of DataParallel/SMP


Ignore:
Timestamp:
Mar 19, 2007 5:54:02 AM (13 years ago)
Author:
chak
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • DataParallel/SMP

    v1 v1  
     1== Data parallelism on shared-memory machines ==
     2
     3The implementation of NDP on shared-memory machines is based on Concurrent Haskell; in the following, a thread is always a Haskell thread. Parallel computations are executed by several threads (a ''gang''), typically as many as there are processors or cores. The gang threads run continuously, waiting for work requests. Non-parallel code is executed in a separate thread (the main Haskell thread). Once it encounters a parallel computation, it passes work requests to the gang threads, such that each thread executes a part of the computation, blocks until the result becomes available and then resumes execution. For instance, `sumP [:1 .. n*n:] + 1` is roughly evaluated as follows:
     4{{{
     5  Main thread                                 Gang threads
     6  -----------                                 ------------
     7
     8  compute (m = n*n)                              sleep
     9      |                                            |
     10      |          request [:1 .. m:]                |
     11      +---------------------------------------->   |
     12      |                                            | 
     13    sleep                                    compute (xs = [:1 .. m:])
     14      |          yield xs                          |
     15      |  <-----------------------------------------+
     16      |                                            |
     17   wake up                                       sleep
     18      |          request (sumP xs)                 |
     19      +---------------------------------------->   |
     20      |                                            |
     21    sleep                                    compute (k = sumP xs)
     22      |          yield k                           |
     23      |  <-----------------------------------------+
     24      |                                            |
     25  compute (k+1)                                  sleep
     26}}}
     27This execution strategy is effectively equivalent to forking off a new gang for each parallel computation. It is more efficient than fork/join parallelism, however, as no new threads are forked during program execution.
     28
     29The above example has 4 synchronisation points represented by arrows in the diagram: two for passing work requests to the gang threads and two for yielding the results. Note, however, that when the main thread receives `xs` from the gang it immediately passes it back as part of the `sumP xs` request. Similar situations arise quite frequently in functional programs which are often formulated as pipelines of (parallel, in our case) computations. Clearly, we would like to eliminate these superfluous synchronisation points and, indeed, our fusion framework allows us to do so automatically. After optimisation, the above example is evaluated as follows:
     30{{{
     31  Main thread                                 Gang threads
     32  -----------                                 ------------
     33
     34  compute (m = n*n)                              sleep
     35      |                                            |
     36      |          request (sumP [:1 .. m:])         |
     37      +---------------------------------------->   |
     38      |                                            | 
     39    sleep                                    compute (k = sumP [:1 .. m:])
     40      |          yield k                           |
     41      |  <-----------------------------------------+
     42      |                                            |
     43  compute (k+1)                                  sleep
     44}}}
     45The superfluous synchronisation points been eliminated; moreover, the resulting distributed computation (`sumP [:1 .. m:]`) can be optimised further, most importantly by performing sequential loop fusion.
     46
     47Ultimately, we intend to implement a more flexible system which would allow gangs to be created dynamically. Unfortunately, it is not entirely clear how to specify what gang is to be used for a particular parallel computation, especially in the presence of laziness as the order of evaluation is usually hard to figure out in this case. More on this in [wiki:DataParallel/SMP/NDPContexts the discussion of NDP contexts].