__color__,__group__,ticket,summary,component,version,milestone,type,owner,status,created,_changetime,_description,_reporter
3,Active Tickets,2,No Bool and Char arrays with the CUDA backend,CUDA backend,0.8.1.0,,defect,chak,new,2010-07-18T12:41:11Z+0100,2010-10-13T05:41:52Z+0100,"The CUDA backend currently cannot handle arrays that contain elements of type `Bool` or `Char`.  See the instance declarations for `ArrayElem Bool` and `ArrayElem Char` in `D.A.A.Array.Data` for details.",chak
4,Active Tickets,3,Unhelpful error when using Double on pre-1.3 CUDA devices,CUDA backend,0.8.1.0,,defect,None,new,2010-07-18T12:50:10Z+0100,2010-10-13T05:46:21Z+0100,"Only CUDA devices with compute capability 1.3 and up contain hardware support for Double.  Currently, Accelerate will nevertheless generate code for Doubles for these devices, which leads to a failure further down the pipeline.  This is not very user-friendly.

This raises the general question of how to handle device capabilities elegantly.",chak
3,Active Tickets,8,Sharing is lost,Accelerate language,0.8.1.0,0.9 release,missing functionality,chak,assigned,2010-07-18T13:32:46Z+0100,2010-12-21T05:14:55Z+0000,"Currently the frontend fails to preserve sharing in Accelerate expressions, which leads to unnecessary recomputation of shared values.",chak
3,Active Tickets,18,Permute does not properly write-combine results,CUDA backend,0.8.1.0,,defect,,new,2010-07-24T11:16:09Z+0100,2010-10-13T05:43:13Z+0100,"When two or more threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The `permute` operation does not currently take this into account.

{{{
main :: IO ()
main = do
  putStr ""Interpreter : "" ; print     (Interp.run accumulate)
  putStr ""CUDA        : "" ; print =<< (CUDA.run   accumulate)

accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4  (repeat 0)
}}}

This results in:

{{{
*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA        : Array 4 [1,1,1,1]
}}}
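
For reference, the interpreter result above is what a sequential combining permutation produces. A minimal sketch, using `accumArray` from the standard `array` package; `permuteRef` is a hypothetical helper for illustration only and is not part of the Accelerate API:

{{{
import Data.Array (accumArray, elems)

-- Hypothetical sequential reference (not the Accelerate API):
-- add each source element into the destination slot named by its index.
permuteRef :: Int -> [Int] -> [Int] -> [Int]
permuteRef n idx src = elems (accumArray (+) 0 (0, n - 1) (zip idx src))

main :: IO ()
main = print (permuteRef 4 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] (replicate 16 1))
-- prints [4,6,3,3]
}}}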


Compute 1.0 devices do not support any atomic primitives. At least for integral types, we can work around this by tagging each transaction with a thread ID (or similar). This requires many additional memory transactions and wastes the upper bits. attachment:permute_tag.inl

For devices of compute 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit and 64-bit [unsigned] integers, but doesn't require any additional transactions (assuming the internals are intelligent). I was however unable to convince nvcc to reinterpret the bits of a float as an int (say), but in principle we should be able to do this... attachment:permute_atomic.inl
",tmcdonell
4,Active Tickets,26,Internal error in filter test with criterion,CUDA backend,0.8.0.0,,defect,tmcdonell,assigned,2010-08-18T06:39:41Z+0100,2010-08-18T10:09:45Z+0100,"On my MBP with an NVIDIA !GeForce 9400M (256MB VRAM) and CUDA 3.1, I get
{{{
benchmarking filter/cuda
collecting 100 samples, 1 iterations each, in estimated 5.192780 s
test: *** Internal error in package accelerate ***
*** Please submit a bug report at http://trac.haskell.org/accelerate
./Data/Array/Accelerate/CUDA.hs:45 (unhandled): CUDA Exception: invalid argument
}}}
The filter test runs fine in the validation phase. It only dies with criterion (probably because criterion tests a wider range of inputs).",chak
4,Active Tickets,28,reduced performance of small types,CUDA backend,0.8.1.0,,defect,,new,2010-08-22T08:44:59Z+0100,2010-10-13T05:46:59Z+0100,"CUDA devices do not coalesce memory transfers to global memory of 8- and 16-bit types. Without providing alternate skeletons that process multiple elements per thread (vec4 and vec2 types respectively), we may be able to promote these to 32-bit transactions, and mask off the irrelevant data. Similar issues exist for shared memory bank conflicts.",tmcdonell
4,Active Tickets,31,support concurrent kernel execution on Fermi architectures,CUDA backend,0.8.0.0,,enhancement,,new,2010-08-23T01:51:05Z+0100,2010-08-23T01:51:18Z+0100,,tmcdonell
3,Active Tickets,32,OpenCL Backend,Accelerate language,,,feature request,chak,new,2010-08-23T08:43:46Z+0100,2010-08-24T12:45:15Z+0100,"Since OpenCL is a standard which works across Graphics-Hardware Vendors, a backend for accelerate using this should make the library useful for a broader range of people.",anonymous
3,Active Tickets,34,CUDA backend does not support 'stencil',CUDA backend,0.8.1.0,0.9 release,missing functionality,tmcdonell,new,2010-08-28T05:57:53Z+0100,2010-12-21T05:15:36Z+0000,,tmcdonell
4,Active Tickets,36,scan operations hang indefinitely on devices with Compute Capability 1.0,CUDA backend,0.8.1.0,,defect,,new,2010-09-11T03:54:56Z+0100,2010-10-13T05:47:42Z+0100,"{{{scan_intervals}}} calls {{{__syncthreads()}}} inside a for-loop and also calls another device function, {{{scan_block}}}, which contains several more {{{__syncthreads()}}} calls. As threads do not all exit the loop at the same time, scan operations hang indefinitely at {{{__syncthreads()}}} on devices with Compute Capability 1.0.",seanl
3,Active Tickets,37,"Support for ""combining"" operators",Accelerate language,0.8.0.0,,feature request,chak,new,2010-09-13T02:58:15Z+0100,2010-09-13T02:58:15Z+0100,"Multiple arrays can currently only be combined into a single one using zip/zipWith. These functions can only combine 2 arrays at a time, so in general it would be useful to have operators for combining more than 2 arrays in more elaborate, but structured, ways.

Some requirements could be:

 * 'combine' operator:
   * a generalised array combining operator
   * sum of input array sizes is equal to output array size - i.e. no elements are lost or duplicated
   * no permutation is performed on elements within an input array - i.e. output array elements are contiguous with respect to their source input array
   * input array elements must all be of the same type and shape (shape would need to be a run-time check)
   * the combination does not need to preserve dimensionality - e.g. multiple 1D arrays could be combined to produce another 1D array (concatenation), or a 2D array (stacked) or even ""maybe"" a 3D array (stack-stacking?)
 * 'append' operator:
   * a specialised array combining operator
   * two input arrays - place one array at the ""end"" of the other
   * input array elements of the same type
   * the higher dimensions of the input arrays must have the same extent
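
As a rough illustration of these requirements, here is a list-based model in plain Haskell. The names `combine1D` and `appendRows` are hypothetical, arrays are modelled as (nested) lists, and the choice of appending along the outer dimension is an assumption; this is not a proposal for the actual Accelerate API:

{{{
-- Hypothetical list-based model; not part of the Accelerate API.

-- combine for 1D inputs: concatenation keeps every element exactly once
-- and keeps each input contiguous in the output.
combine1D :: [[a]] -> [a]
combine1D = concat

-- append for 2D arrays modelled as lists of rows: place the second array
-- after the first; only allowed when all row widths agree.
appendRows :: [[a]] -> [[a]] -> Maybe [[a]]
appendRows xs ys
  | all ((== width) . length) rows = Just rows
  | otherwise                      = Nothing
  where
    rows  = xs ++ ys
    width = case rows of
              []      -> 0
              (r : _) -> length r

main :: IO ()
main = do
  print (combine1D [[1, 2], [3], [4, 5, 6 :: Int]])    -- [1,2,3,4,5,6]
  print (appendRows [[1, 2], [3, 4]] [[5, 6 :: Int]])  -- Just [[1,2],[3,4],[5,6]]
  print (appendRows [[1, 2]] [[3 :: Int]])             -- Nothing
}}}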


In addition to supporting a common pattern, combining operators would allow for further backend optimisations:

 * input arrays to a combine operator would not require intermediate writing to memory - they can be written directly (by their producer) to their location within the combined output array
 * on architectures such as Fermi, the generation of the input arrays can be done in parallel by using separate streams - 'combine' in this case acts as a synchronisation barrier until all computations are complete

",blever
3,Active Tickets,38,Create user defined data structure as instance of Elem,Accelerate language,,,enhancement,chak,new,2010-12-07T08:59:46Z+0000,2010-12-07T08:59:46Z+0000,"Hi, I'm trying to use accelerate to write a multipole solver using CUDA as the backend.

I wanted to create a user-defined data structure like:
{{{
data Panel b a
     = Panel { getBounds    :: b
             , getSrcBounds :: (Int, Int)
             -- , getFarExpan  :: Array Int a
             }
       deriving (Show, Typeable)
}}}
But I found that it's hard to declare it as an instance of `Elem` because its methods are hidden.

I understand that a user can define a type using nested (,). But it would be nice to expose `Elem` to end users so that they can use any type as an `Elem`. Is there any concern about this?",fxie
3,Active Tickets,39,./Data/Array/Accelerate/CUDA.hs:48 (unhandled): CUDA Exception: invalid argument,CUDA backend,0.8.1.0,,defect,tmcdonell,new,2010-12-14T12:17:43Z+0000,2010-12-15T10:32:50Z+0000,"Hello. 

I was trying to build and use the current accelerate version,
and ran into some problems.

I am using ghc-6.12.3 on Linux 2.6.32-bpo.5-amd64 (debian 5.0.6)
with cudatoolkit_3.2.16_linux_64_ubuntu10.04  and gcc-4.5.1
and my hardware is GTX295

* cabal install accelerate  
is trying to build accelerate-0.8.1.0
and it starts with building the dependency cuda-0.2.2
which fails with 

{{{
c2hs: Errors during expansion of binding hooks:

./Foreign/CUDA/Driver/Context.chs:76: (column 15) [ERROR]  >>> Unknown identifier!
  Cannot find a definition for `cuCtxCreate' in the header file.
}}}

* cabal install cuda   
builds cuda-0.3.2  and the installation runs through.


* cabal unpack accelerate
and manually remove the ""cuda < 0.3"" dependency
then cabal install   runs through.

* cd accelerate-0.8.1.0/examples/simple ; make ; ./test
looks OK 

* cd accelerate-0.8.1.0/examples/rasterize ; ghc --make rasterize

{{{
[1 of 2] Compiling RasterizeAcc     ( RasterizeAcc.hs, RasterizeAcc.o )

RasterizeAcc.hs:26:9:
    Not in scope: type constructor or class `NFData'
}}}

I manually add ""import Control.DeepSeq"" to RasterizeAcc.hs
then ""ghc --make"" succeeds

./rasterize  prints 
4 * Haskell (pass), 4 * (Accelerate interpreted) pass

So I figure this isn't using cuda at all.
I change  Data.Array.Accelerate.Interpreter to
Data.Array.Accelerate.CUDA
in both  RasterizeAcc.hs  and   rasterize.hs

Then ""ghc --make rasterize"" is OK,
but running ./rasterize gives

{{{
rasterize-test1.txt (Haskell) - pass
rasterize-test2.txt (Haskell) - pass
rasterize-test3.txt (Haskell) - pass
rasterize-test4.txt (Haskell) - pass
rasterize: 
*** Internal error in package accelerate ***
*** Please submit a bug report at http://trac.haskell.org/accelerate
./Data/Array/Accelerate/CUDA.hs:48 (unhandled): CUDA Exception: invalid argument
}}}",j.waldmann
3,Active Tickets,40,liftAcc in CUDA.Execute doesn't handle Let and Let2,Accelerate language,,,missing functionality,tmcdonell,new,2010-12-21T05:02:26Z+0000,2010-12-21T05:02:26Z+0000,"When recovering sharing of computations involving `scanl'`, `scanr'`, etc whose results are used in `size`, `shape`, or `(!)` functions, let-bindings may be encountered by `D.A.A.CUDA.Execute.liftAcc`, which panics as it currently doesn't handle `Let` and `Let2` AST nodes.",chak
3,Active Tickets,41,Comprehensive support for standard Haskell classes and numeric conversions,Accelerate language,0.9.0.0,0.9 release,missing functionality,chak,new,2010-12-21T05:09:20Z+0000,2010-12-21T05:13:58Z+0000,,chak
4,Active Tickets,42,FFI support to interface with existing CUDA code,Accelerate language,0.9.0.0,,defect,chak,new,2010-12-21T05:10:58Z+0000,2010-12-21T05:10:58Z+0000,,chak
