Ticket #18 (new defect)

Opened 4 years ago

Last modified 4 years ago

Permute does not properly write-combine results

Reported by: tmcdonell
Owned by:
Priority: major
Milestone:
Component: CUDA backend
Version: 0.8.1.0
Keywords:
Cc:

Description

When two or more threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The permute operation does not currently take this into account, so colliding updates are lost instead of being combined.

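import           Data.Array.Accelerate             (Acc, Vector)
import qualified Data.Array.Accelerate             as Acc
import qualified Data.Array.Accelerate.Interpreter as Interp
import qualified Data.Array.Accelerate.CUDA        as CUDA

main :: IO ()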
main = do
  putStr "Interpreter : " ; print     (Interp.run accumulate)
  putStr "CUDA        : " ; print =<< (CUDA.run   accumulate)

accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4  (repeat 0)

This results in:

*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA        : Array 4 [1,1,1,1]
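
The gap between the two results can be reproduced on the host. The following C++ sketch is an illustration only and is not the backend's code: `permute_sequential` applies the updates one at a time, which is the behaviour the interpreter result reflects, while `permute_racy` models every thread reading the destination before any thread writes, so only one update per location survives. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference behaviour: apply each update in turn, so updates that
// target the same index accumulate.
std::vector<int> permute_sequential(std::vector<int> dst,
                                    const std::vector<std::size_t>& idx,
                                    const std::vector<int>& src)
{
    assert(idx.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[idx[i]] += src[i];
    return dst;
}

// Model of the defect: all "threads" read dst first, then all write.
// Writes to the same location overwrite each other, so a single
// update survives per location.
std::vector<int> permute_racy(std::vector<int> dst,
                              const std::vector<std::size_t>& idx,
                              const std::vector<int>& src)
{
    assert(idx.size() == src.size());
    std::vector<int> pending(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        pending[i] = dst[idx[i]] + src[i];   // every thread sees the old value
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[idx[i]] = pending[i];            // one write per location wins
    return dst;
}
```

With the `idx`, `src` and `dst` values from the test program above, the first function yields the interpreter's answer and the second yields the CUDA backend's answer.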

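Compute 1.0 devices do not support any atomic primitives. At least for integral types, we can work around this by tagging each transaction with a thread ID (or similar), so that a thread can read the location back and check whether its own write was the one accepted. This requires many additional memory transactions and wastes the upper bits of each value, which must hold the tag. attachment:permute_tag.inl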
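
The tagging idea can be sketched as a host-side simulation. This is not the content of the attachment: the 5-bit tag layout, the lock-step rounds and the name `permute_add_tagged` are assumptions made for illustration. Each pending thread writes its sum with its own ID in the upper bits, reads the location back, and retries in the next round if a different thread's write was the one that stuck.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: upper 5 bits hold the writer's tag, lower 27 the value.
constexpr unsigned      TAG_BITS   = 5;
constexpr unsigned      VALUE_BITS = 32 - TAG_BITS;
constexpr std::uint32_t VALUE_MASK = (1u << VALUE_BITS) - 1u;

std::vector<std::uint32_t> permute_add_tagged(std::vector<std::uint32_t> dst,
                                              const std::vector<std::size_t>& idx,
                                              const std::vector<std::uint32_t>& src)
{
    const std::size_t n = src.size();
    assert(idx.size() == n);
    assert(n <= (1u << TAG_BITS));           // every thread needs a unique tag

    std::vector<std::uint32_t> attempt(n);
    std::vector<bool>          done(n, false);
    std::size_t                remaining = n;

    while (remaining > 0) {
        // Every pending thread reads the current value and prepares a tagged sum.
        for (std::size_t t = 0; t < n; ++t) {
            if (done[t]) continue;
            const std::uint32_t sum = ((dst[idx[t]] & VALUE_MASK) + src[t]) & VALUE_MASK;
            attempt[t] = (static_cast<std::uint32_t>(t) << VALUE_BITS) | sum;
        }
        // All pending threads write; only one write per location survives
        // (here simply the last one in loop order).
        for (std::size_t t = 0; t < n; ++t)
            if (!done[t]) dst[idx[t]] = attempt[t];
        // Each pending thread reads back; the tag tells it whether it won.
        for (std::size_t t = 0; t < n; ++t) {
            if (!done[t] && dst[idx[t]] == attempt[t]) {
                done[t] = true;
                --remaining;
            }
        }
    }
    for (auto& d : dst) d &= VALUE_MASK;     // strip the tags from the result
    return dst;
}
```

The extra read-back per attempt and the retry rounds correspond to the additional memory transactions mentioned above, and the tag bits are why the usable value range shrinks.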

For devices of compute 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit and 64-bit [unsigned] integers, but doesn't require any additional transactions (assuming the hardware implements it efficiently). However, I was unable to convince nvcc to reinterpret the bits of a float as an int (say), although in principle this should be possible. attachment:permute_atomic.inl
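
The compare-and-swap approach amounts to a retry loop around the combining function. The sketch below is a host-side C++ analogue that uses `std::atomic` in place of the CUDA atomic intrinsic; it is not the attachment's code, and `atomic_combine` is a hypothetical name.

```cpp
#include <atomic>
#include <cstdint>

// Atomically replace the contents of `cell` with f(old, x).
// If another thread changed the cell between our read and our swap,
// the swap fails, `expected` is refreshed with the current value,
// and the combination is recomputed.
template <typename Combine>
void atomic_combine(std::atomic<std::uint32_t>& cell, std::uint32_t x, Combine f)
{
    std::uint32_t expected = cell.load();
    while (!cell.compare_exchange_weak(expected, f(expected, x))) {
        // retry with the updated `expected`
    }
}
```

Because the loop only repeats when a collision actually happened, uncontended updates cost a single read and a single swap.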

Attachments

permute_tag.inl (1.6 kB) - added by tmcdonell 4 years ago.
permute write combining using integer tagging
permute_atomic.inl (1.3 kB) - added by tmcdonell 4 years ago.
permute write combining using atomic intrinsics

Change History

Changed 4 years ago by tmcdonell

permute write combining using integer tagging

Changed 4 years ago by tmcdonell

permute write combining using atomic intrinsics

Changed 4 years ago by tmcdonell

This mess is sufficient to coerce nvcc.

#define INT_AS_FLOAT(x) (*((float*)&(x)))
#define FLOAT_AS_INT(x) (*((int*)&(x)))

Changed 4 years ago by tmcdonell

Actually, pointer casting breaks strict aliasing rules, so a union would be better:

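/* Reinterpret the bits of a 32-bit int as a float through a union. */
float __int_as_float(int a)
{
  union { int a; float b; } u;

  u.a = a;     /* store the integer bit pattern */
  return u.b;  /* read the same bits back as a float */
}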
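
A third way to reinterpret the bits, in host-side C++ at least, is to copy them with `std::memcpy`, which avoids both the pointer cast and the union. This is a minimal sketch assuming a 32-bit IEEE 754 `float`; the helper names are hypothetical, and whether nvcc generates equally good device code for it has not been checked here.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Copy the object representation of a float into an integer.
inline std::uint32_t float_to_bits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Copy an integer bit pattern back into a float.
inline float bits_to_float(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

With helpers like these, a floating-point value can be passed through an integer-only compare-and-swap and converted back afterwards.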

Changed 4 years ago by chak

  • version changed from 0.7.1.0 to 0.8.0.0

Changed 4 years ago by chak

  • version changed from 0.8.0.0 to 0.8.1.0