id,summary,reporter,owner,description,type,status,priority,milestone,component,version,resolution,keywords,cc
18,Permute does not properly write-combine results,tmcdonell,,"When two or more threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The `permute` operation in the CUDA backend does not currently take this into account, so colliding elements are not combined with the supplied function. For example:

{{{
main :: IO ()
main = do
  putStr ""Interpreter : "" ; print     (Interp.run accumulate)
  putStr ""CUDA        : "" ; print =<< (CUDA.run   accumulate)

accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4  (repeat 0)
}}}

This results in:

{{{
*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA        : Array 4 [1,1,1,1]
}}}


Compute 1.0 devices do not support any atomic primitives. At least for integral types, we can work around this by tagging each transaction with a thread ID (or similar). This requires many additional memory transactions and reserves the upper bits of each value for the tag, which reduces the usable range. See attachment:permute_tag.inl

For devices of compute 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit integers (64-bit unsigned integers additionally require compute 1.2), but doesn't require any additional memory transactions (assuming the hardware implementation is efficient). However, I was unable to convince nvcc to reinterpret the bits of a float as an int (say), although in principle this should be possible. See attachment:permute_atomic.inl
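
As a rough sketch of how the compare-and-swap approach might extend to floats (this is not the contents of permute_atomic.inl), the bits can be reinterpreted with the `__float_as_int` and `__int_as_float` device intrinsics and the update retried until the swap succeeds. The names `atomicAddFloatCAS`, `permuteAdd` and `permuteAddOnDevice` are hypothetical, the combining function is fixed to addition, the destination is assumed to start at zero, and error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: atomically add `val` to `*addr` by retrying a 32-bit
// compare-and-swap on the float's bit pattern (needs compute 1.1 or greater).
__device__ float atomicAddFloatCAS(float *addr, float val)
{
    int *bits = reinterpret_cast<int *>(addr);
    int  old  = *bits;
    int  assumed;
    do {
        assumed = old;
        float sum = __int_as_float(assumed) + val;
        // Store the new bits only if nobody changed the location meanwhile.
        old = atomicCAS(bits, assumed, __float_as_int(sum));
    } while (assumed != old);  // lost the race: retry with the value we saw
    return __int_as_float(old);
}

// One thread per source element; colliding indices are combined with (+).
__global__ void permuteAdd(const float *src, const int *idx, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAddFloatCAS(&dst[idx[i]], src[i]);
}

// Host wrapper (assumed): copies inputs over, zero-fills the destination,
// runs the kernel and returns the combined result.
std::vector<float> permuteAddOnDevice(const std::vector<float> &src,
                                      const std::vector<int>   &idx,
                                      int dstSize)
{
    int    n = static_cast<int>(src.size());
    float *dSrc = 0, *dDst = 0;
    int   *dIdx = 0;
    cudaMalloc((void **)&dSrc, n * sizeof(float));
    cudaMalloc((void **)&dIdx, n * sizeof(int));
    cudaMalloc((void **)&dDst, dstSize * sizeof(float));
    cudaMemcpy(dSrc, src.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, idx.data(), n * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemset(dDst, 0, dstSize * sizeof(float));   // all-zero bits == 0.0f

    permuteAdd<<<(n + 255) / 256, 256>>>(dSrc, dIdx, dDst, n);

    std::vector<float> out(dstSize);
    cudaMemcpy(out.data(), dDst, dstSize * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dSrc); cudaFree(dIdx); cudaFree(dDst);
    return out;
}
```

Called with the `idx` vector from the example above and sixteen ones as `src`, this should give the same counts as the interpreter, [4,6,3,3], as floats. The loop compares the integer bit patterns rather than the float values, so it cannot spin forever on a NaN.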
",defect,new,major,,CUDA backend,0.8.1.0,,,
