Ticket #18 (new defect)
Permute does not properly write-combine results
| Reported by: | tmcdonell | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | CUDA backend | Version: | 0.8.1.0 |
| Keywords: | | Cc: | |
Description
When two or more threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The permute operation does not currently take this into account, so colliding updates are lost instead of being combined.
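import qualified Data.Array.Accelerate             as Acc
import qualified Data.Array.Accelerate.CUDA        as CUDA
import qualified Data.Array.Accelerate.Interpreter as Interp
import Data.Array.Accelerate (Acc, Vector)

main :: IO ()
main = do
  putStr "Interpreter : " ; print (Interp.run accumulate)
  putStr "CUDA : " ; print =<< (CUDA.run accumulate)

-- Scatter sixteen 1s into four bins, combining collisions with (+).
accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4 (repeat 0)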
Which results in:
*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA : Array 4 [1,1,1,1]
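To make the two results easier to compare, here is a small pure-Haskell model of what is probably happening. It is only a sketch, not backend code: it assumes each thread reads the initial destination value, adds its source element, and that exactly one write per slot survives. The names `combined` and `lostUpdate` are made up for illustration.

```haskell
import Data.Array (accumArray, elems)

-- Destination index for each of the 16 source elements (from the test case).
idx :: [Int]
idx = [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1]

-- Expected semantics: every colliding write is combined with (+).
combined :: [Int]
combined = elems (accumArray (+) 0 (0, 3) [(i, 1) | i <- idx])

-- Assumed model of the lost update: each "thread" computes
-- (initial dst value 0 + its src element 1), and a later write to the
-- same slot simply replaces the earlier one.
lostUpdate :: [Int]
lostUpdate = elems (accumArray (\_ new -> new) 0 (0, 3) [(i, 0 + 1) | i <- idx])

main :: IO ()
main = do
  print combined    -- [4,6,3,3], matches the interpreter
  print lostUpdate  -- [1,1,1,1], matches the CUDA backend
```

Under this model the CUDA output is exactly what a read-modify-write without any synchronisation would give.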
Compute 1.0 devices do not support any atomic primitives. For integral types, at least, we can work around this by tagging each transaction with a thread ID (or similar). This requires many additional memory transactions and gives up the upper bits of each value to the tag.
attachment:permute_tag.inl
For devices of compute 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit and 64-bit [unsigned] integers, but does not require any additional transactions (assuming the internals are intelligent). However, I was unable to convince nvcc to reinterpret the bits of a float as an int (say), although in principle this should be possible.
attachment:permute_atomic.inl
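The compare-and-swap idea can be sketched on the host in plain Haskell. This is a model of the retry loop only, not device code, and it is not taken from the attachment. `cas` and `atomicAddFloat` are hypothetical helpers, and `atomicModifyIORef'` stands in for the hardware primitive. The float is stored as its 32-bit pattern (via `castFloatToWord32` from `GHC.Float`), which is the reinterpretation step mentioned above.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import Data.Word (Word32)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Hypothetical stand-in for a 32-bit hardware CAS: store 'new' only if the
-- cell still holds 'expected', and always return the previous contents.
cas :: IORef Word32 -> Word32 -> Word32 -> IO Word32
cas ref expected new =
  atomicModifyIORef' ref $ \cur ->
    if cur == expected then (new, cur) else (cur, cur)

-- Add a Float to a cell holding the float's bit pattern, retrying whenever
-- another writer changed the cell between our read and our swap.
atomicAddFloat :: IORef Word32 -> Float -> IO ()
atomicAddFloat ref x = readIORef ref >>= loop
  where
    loop assumed = do
      let new = castFloatToWord32 (castWord32ToFloat assumed + x)
      old <- cas ref assumed new
      if old == assumed then return () else loop old

main :: IO ()
main = do
  cell <- newIORef (castFloatToWord32 0)
  done <- newEmptyMVar
  -- Sixteen concurrent writers all target the same cell.
  forM_ [1 .. 16 :: Int] $ \_ ->
    forkIO (atomicAddFloat cell 1 >> putMVar done ())
  replicateM_ 16 (takeMVar done)
  readIORef cell >>= print . castWord32ToFloat  -- 16.0
```

Comparing the raw bit patterns rather than the float values keeps the loop from spinning forever on NaN, which never compares equal to itself.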
