Ticket #28 (new defect)
reduced performance of small types
|Reported by:||tmcdonell||Owned by:|
CUDA devices do not coalesce memory transfers to global memory of 8- and 16-bit types. Without providing alternate skeletons that process multiple elements per thread (vec4 and vec2 types respectively), we may be able to promote these to 32-bit transactions, and mask off the irrelevant data. Similar issues exist for shared memory bank conflicts.