__color__	ticket	summary	component	version	milestone	type	owner	status	created	_changetime	_description	_reporter
3	38	Create user-defined data structure as instance of Elem	Accelerate language			enhancement	chak	new	2010-12-07T08:59:46Z+0000	2010-12-07T08:59:46Z+0000	"Hi, I'm trying to use Accelerate to write a multipole solver with CUDA as the backend.

I wanted to create a user-defined data structure like:
{{{
data Panel b a
     = Panel { getBounds    :: b
             , getSrcBounds :: (Int, Int)
             -- , getFarExpan  :: Array Int a
             }
       deriving (Show, Typeable)
}}}
But I found that it's hard to declare it as an instance of `Elem` because the class methods are hidden.
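
For comparison, the nested-tuple workaround can be written in ordinary Haskell with no access to `Elem`. This is a minimal sketch, assuming a simplified `Panel` without the commented-out field; `PanelRepr`, `toRepr` and `fromRepr` are hypothetical names, not part of Accelerate:

{{{
-- Simplified version of the Panel type above (assumption: one type
-- parameter, no far-expansion field).
data Panel b = Panel
  { getBounds    :: b
  , getSrcBounds :: (Int, Int)
  } deriving (Eq, Show)

-- Nested-tuple representation built only from pairs.
type PanelRepr b = (b, (Int, Int))

-- Convert a Panel to its tuple representation.
toRepr :: Panel b -> PanelRepr b
toRepr (Panel b s) = (b, s)

-- Rebuild a Panel from its tuple representation.
fromRepr :: PanelRepr b -> Panel b
fromRepr (b, s) = Panel b s
}}}

Every use site has to apply these conversions by hand, which is presumably the boilerplate that a user-definable instance would remove.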

I understand that a user can define a type using nested tuples `(,)`. But it would be nice to expose `Elem` to end users so that they can make any type an instance of `Elem`. Are there any concerns about this?"	fxie
3	32	OpenCL Backend	Accelerate language			feature request	chak	new	2010-08-23T08:43:46Z+0100	2010-08-24T12:45:15Z+0100	Since OpenCL is a standard which works across Graphics-Hardware Vendors, a backend for accelerate using this should make the library useful for a broader range of people.	anonymous
3	37	"Support for ""combining"" operators"	Accelerate language	0.8.0.0		feature request	chak	new	2010-09-13T02:58:15Z+0100	2010-09-13T02:58:15Z+0100	"Multiple arrays can currently be combined into a single one only with zip/zipWith. These operators can only combine two arrays at a time, so in general it would be useful to have operators for combining more than two arrays in more elaborate, but structured, ways.

Some requirements could be:

 * 'combine' operator:
   * a generalised array combining operator
   * the sum of the input array sizes is equal to the output array size - i.e. no elements are lost or duplicated
   * no permutation is performed on elements within an input array - i.e. output array elements are contiguous with respect to their source input array
   * input array elements must all be of the same type and shape (shape would need to be a run-time check)
   * the combination does not need to preserve dimensionality - e.g. multiple 1D arrays could be combined to produce another 1D array (concatenation), or a 2D array (stacked) or even ""maybe"" a 3D array (stack-stacking?)
 * 'append' operator:
   * a specialised array combining operator
   * two input arrays - place one array at the ""end"" of the other
   * input array elements of the same type
   * the higher dimensions of the input arrays must have the same extent
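
The requirements above can be modelled on plain Haskell lists. This is only a sketch of the intended semantics for the 1D case, not Accelerate code; the names `sameLength`, `combine1D`, `stack2D` and `append1D` are hypothetical:

{{{
-- Run-time shape check: all 1D inputs must have the same length.
sameLength :: [[a]] -> Bool
sameLength []       = True
sameLength (r : rs) = all ((== length r) . length) rs

-- 'combine' as concatenation: the output size is the sum of the input
-- sizes and every input stays contiguous and in its original order.
combine1D :: [[a]] -> Maybe [a]
combine1D xss
  | sameLength xss = Just (concat xss)
  | otherwise      = Nothing

-- 'combine' as stacking: each 1D input becomes one row of a 2D result.
stack2D :: [[a]] -> Maybe [[a]]
stack2D xss
  | sameLength xss = Just xss
  | otherwise      = Nothing

-- 'append' for 1D arrays: there are no higher dimensions to compare,
-- so no extent check is needed.
append1D :: [a] -> [a] -> [a]
append1D xs ys = xs ++ ys
}}}

In this model `combine1D` and `stack2D` differ only in the shape of the result, which matches the requirement that the combination need not preserve dimensionality.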


In addition to fulfilling the need of a common pattern, combining operators would allow for further backend optimisations:

 *  input arrays to a combine operator would not require intermediate writing to memory - they can be written directly (by their producer) to their location within the combined output array
 * on architectures such as Fermi, the generation of the input arrays can be done in parallel by using separate streams - 'combine' in this case acts as synchronisation barrier until all computations are complete

"	blever
3	40	liftAcc in CUDA.Execute doesn't handle Let and Let2	Accelerate language			missing functionality	tmcdonell	new	2010-12-21T05:02:26Z+0000	2010-12-21T05:02:26Z+0000	When recovering sharing of computations involving `scanl'`, `scanr'`, etc whose results are used in `size`, `shape`, or `(!)` functions, let-bindings may be encountered by `D.A.A.CUDA.Execute.liftAcc`, which panics as it currently doesn't handle `Let` and `Let2` AST nodes.	chak
3	8	Sharing is lost	Accelerate language	0.8.1.0	0.9 release	missing functionality	chak	assigned	2010-07-18T13:32:46Z+0100	2010-12-21T05:14:55Z+0000	Currently the frontend fails to preserve sharing in Accelerate expressions, which leads to unnecessary recomputation of shared values.	chak
3	41	Comprehensive support for standard Haskell classes and numeric conversions	Accelerate language	0.9.0.0	0.9 release	missing functionality	chak	new	2010-12-21T05:09:20Z+0000	2010-12-21T05:13:58Z+0000		chak
4	42	FFI support to interface with existing CUDA code	Accelerate language	0.9.0.0		defect	chak	new	2010-12-21T05:10:58Z+0000	2010-12-21T05:10:58Z+0000		chak
3	2	No Bool and Char arrays with the CUDA backend	CUDA backend	0.8.1.0		defect	chak	new	2010-07-18T12:41:11Z+0100	2010-10-13T05:41:52Z+0100	The CUDA backend can currently not handle arrays that contain elements of type `Bool` or `Char`.  In `D.A.A.Array.Data`, see the instance declarations for `ArrayElem Bool` and `ArrayElem Char` for details.	chak
3	18	Permute does not properly write-combine results	CUDA backend	0.8.1.0		defect		new	2010-07-24T11:16:09Z+0100	2010-10-13T05:43:13Z+0100	"When two or more threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The `permute` operation does not currently take this into account.

{{{
main :: IO ()
main = do
  putStr ""Interpreter : "" ; print     (Interp.run accumulate)
  putStr ""CUDA        : "" ; print =<< (CUDA.run   accumulate)

accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4  (repeat 0)
}}}

Which results in:

{{{
*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA        : Array 4 [1,1,1,1]
}}}
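
The interpreter result is what a sequential accumulation produces. A minimal sketch in plain Haskell, assuming `accumArray` from `Data.Array` as a stand-in for `permute (+)`; this is not Accelerate code, and `expected` is a hypothetical name:

{{{
import Data.Array (accumArray, elems)

-- Sequential model of 'permute (+)': every source element is added
-- into the destination slot selected by its index, so colliding
-- writes are combined instead of dropped.
expected :: [Int]
expected = elems (accumArray (+) 0 (0, 3) (zip idx src))
  where
    src = replicate 16 1
    idx = [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1]
}}}

Index 0 occurs 4 times, index 1 occurs 6 times, and indices 2 and 3 occur 3 times each, so `expected` is `[4,6,3,3]`, in agreement with the interpreter.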


Compute 1.0 devices do not support any atomic primitives. At least for integral types, we can work around this by tagging each transaction with a thread ID (or similar). This requires many additional memory transactions and wastes the upper bits. attachment:permute_tag.inl

For devices of compute 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit and 64-bit [unsigned] integers, but doesn't require any additional transactions (assuming the internals are intelligent). I was however unable to convince nvcc to reinterpret the bits of a float as an int (say), but in principle we should be able to do this... attachment:permute_atomic.inl
"	tmcdonell
3	39	./Data/Array/Accelerate/CUDA.hs:48 (unhandled): CUDA Exception: invalid argument	CUDA backend	0.8.1.0		defect	tmcdonell	new	2010-12-14T12:17:43Z+0000	2010-12-15T10:32:50Z+0000	"Hello. 

I was trying to build/use the current accelerate version,
and ran into some problems.

I am using ghc-6.12.3 on Linux 2.6.32-bpo.5-amd64 (debian 5.0.6)
with cudatoolkit_3.2.16_linux_64_ubuntu10.04  and gcc-4.5.1
and my hardware is GTX295

* cabal install accelerate  
is trying to build accelerate-0.8.1.0
and it starts with building the dependency cuda-0.2.2
which fails with 

{{{
c2hs: Errors during expansion of binding hooks:

./Foreign/CUDA/Driver/Context.chs:76: (column 15) [ERROR]  >>> Unknown identifier!
  Cannot find a definition for `cuCtxCreate' in the header file.
}}}

* cabal install cuda   
builds cuda-0.3.2  and the installation runs through.


* cabal unpack accelerate
and manually remove the ""cuda < 0.3"" dependency
then cabal install   runs through.

* cd accelerate-0.8.1.0/examples/simple ; make ; ./test
looks OK 

* cd accelerate-0.8.1.0/examples/rasterize ; ghc --make rasterize

{{{
[1 of 2] Compiling RasterizeAcc     ( RasterizeAcc.hs, RasterizeAcc.o )

RasterizeAcc.hs:26:9:
    Not in scope: type constructor or class `NFData'
}}}

I manually add ""import Control.DeepSeq"" to RasterizeAcc.hs
then ""ghc --make"" succeeds

./rasterize  prints 
4 * Haskell (pass), 4 * (Accelerate interpreted) pass

So I figure this isn't using cuda at all.
I change  Data.Array.Accelerate.Interpreter to
Data.Array.Accelerate.CUDA
in both  RasterizeAcc.hs  and   rasterize.hs

Then ""ghc --make rasterize"" is OK,
but running ./rasterize gives

{{{
rasterize-test1.txt (Haskell) - pass
rasterize-test2.txt (Haskell) - pass
rasterize-test3.txt (Haskell) - pass
rasterize-test4.txt (Haskell) - pass
rasterize: 
*** Internal error in package accelerate ***
*** Please submit a bug report at http://trac.haskell.org/accelerate
./Data/Array/Accelerate/CUDA.hs:48 (unhandled): CUDA Exception: invalid argument
}}}"	j.waldmann
3	34	CUDA backend does not support 'stencil'	CUDA backend	0.8.1.0	0.9 release	missing functionality	tmcdonell	new	2010-08-28T05:57:53Z+0100	2010-12-21T05:15:36Z+0000		tmcdonell
4	3	Unhelpful error when using Double on pre-1.3 CUDA devices	CUDA backend	0.8.1.0		defect	None	new	2010-07-18T12:50:10Z+0100	2010-10-13T05:46:21Z+0100	"Only CUDA devices with compute capability 1.3 and up contain hardware support for Double.  Currently, Accelerate will nevertheless generate code for Doubles for these devices, which leads to a failure further down the pipeline.  This is not very user-friendly.
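
One possible shape for an earlier, clearer failure is a check before code generation. This is a sketch under assumptions: `ElemType`, `Compute`, `CapabilityError` and `checkElem` are hypothetical names and do not exist in Accelerate:

{{{
-- Hypothetical element types, for illustration only.
data ElemType = TInt | TFloat | TDouble
  deriving (Eq, Show)

-- Compute capability as (major, minor); pairs compare lexicographically.
type Compute = (Int, Int)

-- Reports the minimum capability that the offending type requires.
data CapabilityError = NeedsCompute Compute ElemType
  deriving (Eq, Show)

-- Reject Double on devices below compute capability 1.3 before any
-- code is generated; every other combination is accepted.
checkElem :: Compute -> ElemType -> Either CapabilityError ()
checkElem cc TDouble
  | cc < (1, 3) = Left (NeedsCompute (1, 3) TDouble)
checkElem _ _   = Right ()
}}}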

This raises the general question of how to handle device capabilities elegantly."	chak
4	26	Internal error in filter test with criterion	CUDA backend	0.8.0.0		defect	tmcdonell	assigned	2010-08-18T06:39:41Z+0100	2010-08-18T10:09:45Z+0100	"On my MBP with a NVIDIA !GeForce 9400M (256MB VRAM) and CUDA 3.1, I get
{{{
benchmarking filter/cuda
collecting 100 samples, 1 iterations each, in estimated 5.192780 s
test: *** Internal error in package accelerate ***
*** Please submit a bug report at http://trac.haskell.org/accelerate
./Data/Array/Accelerate/CUDA.hs:45 (unhandled): CUDA Exception: invalid argument
}}}
The filter test runs fine in the validation phase. It only dies with criterion (probably as that tests a wider range of inputs)."	chak
4	28	reduced performance of small types	CUDA backend	0.8.1.0		defect		new	2010-08-22T08:44:59Z+0100	2010-10-13T05:46:59Z+0100	CUDA devices do not coalesce memory transfers to global memory of 8- and 16-bit types. Without providing alternate skeletons that process multiple elements per thread (vec4 and vec2 types respectively), we may be able to promote these to 32-bit transactions, and mask off the irrelevant data. Similar issues exist for shared memory bank conflicts.	tmcdonell
4	36	scan operations hang indefinitely on devices with Compute Capability 1.0	CUDA backend	0.8.1.0		defect		new	2010-09-11T03:54:56Z+0100	2010-10-13T05:47:42Z+0100	{{{scan_intervals}}} has a {{{__syncthreads()}}} inside a for-loop and also calls another device function, {{{scan_block}}}, which itself contains several {{{__syncthreads()}}} calls. As threads do not all exit the loop at the same time, scan operations hang indefinitely at {{{__syncthreads()}}} on devices with Compute Capability 1.0.	seanl
4	31	support concurrent kernel execution on Fermi architectures	CUDA backend	0.8.0.0		enhancement		new	2010-08-23T01:51:05Z+0100	2010-08-23T01:51:18Z+0100		tmcdonell
