__color__,__group__,ticket,summary,component,version,milestone,type,owner,status,created,_changetime,_description,_reporter
3,Active Tickets,38,Create user-defined data structure as instance of Elem,Accelerate language,,,enhancement,chak,new,2010-12-07T08:59:46Z+0000,2010-12-07T08:59:46Z+0000,"Hi, I'm trying to use Accelerate to write a multipole solver using CUDA as the backend.

I wanted to create a user-defined data structure like:
{{{
data Panel b a
     = Panel { getBounds    :: b
             , getSrcBounds :: (Int, Int)
             -- , getFarExpan  :: Array Int a
             }
       deriving (Show, Typeable)
}}}
But I found that it's hard to declare it as an instance of `Elem` because its methods are hidden.

I understand that a user can define a type using nested tuples `(,)`. But it would be nice to expose `Elem` to end users so that they can make any type an instance of `Elem`. Is there any concern about this?",fxie
3,Active Tickets,32,OpenCL Backend,Accelerate language,,,feature request,chak,new,2010-08-23T08:43:46Z+0100,2010-08-24T12:45:15Z+0100,"Since OpenCL is a standard which works across Graphics-Hardware Vendors, a backend for accelerate using this should make the library useful for a broader range of people.",anonymous
3,Active Tickets,37,"Support for ""combining"" operators",Accelerate language,0.8.0.0,,feature request,chak,new,2010-09-13T02:58:15Z+0100,2010-09-13T02:58:15Z+0100,"Combining multiple arrays into a single one can currently only be achieved using zip/zipWith. These operators can only combine two arrays at a time, so in general it would be useful to have operators for combining more than two arrays in more elaborate, but structured, ways.

Some requirements could be:

 * 'combine' operator:
   * a generalised array combining operator
   * sum of input array sizes is equal to output array size - i.e. no elements are lost or duplicated
   * no permutation is performed on elements within an input array - i.e. output array elements are contiguous with respect to their source input array
   * input array elements must all be of the same type and shape (shape would need to be a run-time check)
   * the combination does not need to preserve dimensionality - e.g. multiple 1D arrays could be combined to produce another 1D array (concatenation), or a 2D array (stacked) or even ""maybe"" a 3D array (stack-stacking?)
 * 'append' operator:
   * a specialised array combining operator
   * two input arrays - place one array at the ""end"" of the other
   * input array elements of the same type
   * the higher dimensions of the input arrays must have the same extent
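
As a rough illustration of the requirements above, the following models the proposed operators on plain Haskell lists. The names `combine1D`, `combine2D`, `append2D` and `uniform` are hypothetical and not part of Accelerate, and the sketch assumes that `append` joins along the outermost dimension:

{{{
-- List-based model only; these names are hypothetical, not Accelerate API.

-- 'combine' into 1D: concatenation. The output length is the sum of the
-- input lengths and each input stays contiguous.
combine1D :: [[a]] -> [a]
combine1D = concat

-- 'combine' into 2D: stack 1D arrays as rows. Equal lengths are checked
-- at run time.
combine2D :: [[a]] -> Maybe [[a]]
combine2D rows
  | uniform (map length rows) = Just rows
  | otherwise                 = Nothing

-- 'append' on 2D arrays (lists of rows): put ys after xs. Assumes the join
-- is along the outermost dimension, so all row widths must agree.
append2D :: [[a]] -> [[a]] -> Maybe [[a]]
append2D xs ys
  | uniform (map length (xs ++ ys)) = Just (xs ++ ys)
  | otherwise                       = Nothing

-- True when all extents in the list are equal (vacuously for []).
uniform :: [Int] -> Bool
uniform []     = True
uniform (n:ns) = all (== n) ns
}}}

For example, `append2D [[1,2]] [[3]]` is `Nothing` because the row widths differ, while `combine1D [[1,2],[3]]` is `[1,2,3]`.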


In addition to fulfilling the need of a common pattern, combining operators would allow for further backend optimisations:

 * input arrays to a combine operator would not need to be written to intermediate memory - their producers can write them directly to their location within the combined output array
 * on architectures such as Fermi, the input arrays can be generated in parallel using separate streams - 'combine' in this case acts as a synchronisation barrier until all computations are complete

",blever
3,Active Tickets,40,liftAcc in CUDA.Execute doesn't handle Let and Let2,Accelerate language,,,missing functionality,tmcdonell,new,2010-12-21T05:02:26Z+0000,2010-12-21T05:02:26Z+0000,"When recovering sharing of computations involving `scanl'`, `scanr'`, etc whose results are used in `size`, `shape`, or `(!)` functions, let-bindings may be encountered by `D.A.A.CUDA.Execute.liftAcc`, which panics as it currently doesn't handle `Let` and `Let2` AST nodes.",chak
3,Active Tickets,8,Sharing is lost,Accelerate language,0.8.1.0,0.9 release,missing functionality,chak,assigned,2010-07-18T13:32:46Z+0100,2010-12-21T05:14:55Z+0000,"Currently the frontend fails to preserve sharing in Accelerate expressions, which leads to unnecessary recomputation of shared values.",chak
3,Active Tickets,41,Comprehensive support for standard Haskell classes and numeric conversions,Accelerate language,0.9.0.0,0.9 release,missing functionality,chak,new,2010-12-21T05:09:20Z+0000,2010-12-21T05:13:58Z+0000,,chak
4,Active Tickets,42,FFI support to interface with existing CUDA code,Accelerate language,0.9.0.0,,defect,chak,new,2010-12-21T05:10:58Z+0000,2010-12-21T05:10:58Z+0000,,chak
3,Active Tickets,2,No Bool and Char arrays with the CUDA backend,CUDA backend,0.8.1.0,,defect,chak,new,2010-07-18T12:41:11Z+0100,2010-10-13T05:41:52Z+0100,"The CUDA backend cannot currently handle arrays that contain elements of type `Bool` or `Char`.  In `D.A.A.Array.Data`, see the instance declarations for `ArrayElem Bool` and `ArrayElem Char` for details.",chak
3,Active Tickets,18,Permute does not properly write-combine results,CUDA backend,0.8.1.0,,defect,,new,2010-07-24T11:16:09Z+0100,2010-10-13T05:43:13Z+0100,"When two or more threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The `permute` operation does not currently take this into account.

{{{
main :: IO ()
main = do
  putStr ""Interpreter : "" ; print     (Interp.run accumulate)
  putStr ""CUDA        : "" ; print =<< (CUDA.run   accumulate)

accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4  (repeat 0)
}}}

Which results in:

{{{
*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA        : Array 4 [1,1,1,1]
}}}
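
For reference, the interpreter's result matches a sequential model in which every colliding write is combined with `(+)`. A minimal sketch of that model, using `accumArray` from the standard `Data.Array` module and independent of Accelerate (the name `expected` is introduced here for illustration):

{{{
import Data.Array (accumArray, elems)

-- Plain-Haskell model of the permute call above: start from four zeros
-- and add each source element (all 1) into the slot named by idx.
expected :: [Int]
expected = elems (accumArray (+) 0 (0, 3) (zip idx src))
  where
    src = replicate 16 1 :: [Int]
    idx = [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: [Int]

main :: IO ()
main = print expected   -- prints [4,6,3,3]
}}}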


Compute 1.0 devices do not support any atomic primitives. At least for integral types, we can work around this by tagging each transaction with a thread ID (or similar). This requires many additional memory transactions and wastes the upper bits. attachment:permute_tag.inl

For devices of compute 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit and 64-bit [unsigned] integers, but doesn't require any additional transactions (assuming the internals are intelligent). I was however unable to convince nvcc to reinterpret the bits of a float as an int (say), but in principle we should be able to do this... attachment:permute_atomic.inl
",tmcdonell
3,Active Tickets,39,./Data/Array/Accelerate/CUDA.hs:48 (unhandled): CUDA Exception: invalid argument,CUDA backend,0.8.1.0,,defect,tmcdonell,new,2010-12-14T12:17:43Z+0000,2010-12-15T10:32:50Z+0000,"Hello. 

I was trying to build/use the current accelerate version,
and ran into some problems.

I am using ghc-6.12.3 on Linux 2.6.32-bpo.5-amd64 (debian 5.0.6)
with cudatoolkit_3.2.16_linux_64_ubuntu10.04  and gcc-4.5.1
and my hardware is GTX295

* cabal install accelerate  
is trying to build accelerate-0.8.1.0
and it starts with building the dependency cuda-0.2.2
which fails with 

{{{
c2hs: Errors during expansion of binding hooks:

./Foreign/CUDA/Driver/Context.chs:76: (column 15) [ERROR]  >>> Unknown identifier!
  Cannot find a definition for `cuCtxCreate' in the header file.
}}}

* cabal install cuda   
builds cuda-0.3.2  and the installation runs through.


* cabal unpack accelerate
and manually remove the ""cuda < 0.3"" dependency
then cabal install   runs through.

* cd accelerate-0.8.1.0/examples/simple ; make ; ./test
looks OK 

* cd accelerate-0.8.1.0/examples/rasterize ; ghc --make rasterize

{{{
[1 of 2] Compiling RasterizeAcc     ( RasterizeAcc.hs, RasterizeAcc.o )

RasterizeAcc.hs:26:9:
    Not in scope: type constructor or class `NFData'
}}}

I manually add ""import Control.DeepSeq"" to RasterizeAcc.hs
then ""ghc --make"" succeeds

./rasterize  prints 
4 * Haskell (pass), 4 * (Accelerate interpreted) pass

So I figure this isn't using cuda at all.
I change  Data.Array.Accelerate.Interpreter to
Data.Array.Accelerate.CUDA
in both  RasterizeAcc.hs  and   rasterize.hs

Then ""ghc --make rasterize"" is OK,
but running ./rasterize gives

{{{
rasterize-test1.txt (Haskell) - pass
rasterize-test2.txt (Haskell) - pass
rasterize-test3.txt (Haskell) - pass
rasterize-test4.txt (Haskell) - pass
rasterize: 
*** Internal error in package accelerate ***
*** Please submit a bug report at http://trac.haskell.org/accelerate
./Data/Array/Accelerate/CUDA.hs:48 (unhandled): CUDA Exception: invalid argument
}}}",j.waldmann
3,Active Tickets,34,CUDA backend does not support 'stencil',CUDA backend,0.8.1.0,0.9 release,missing functionality,tmcdonell,new,2010-08-28T05:57:53Z+0100,2010-12-21T05:15:36Z+0000,,tmcdonell
4,Active Tickets,26,Internal error in filter test with criterion,CUDA backend,0.8.0.0,,defect,tmcdonell,assigned,2010-08-18T06:39:41Z+0100,2010-08-18T10:09:45Z+0100,"On my MBP with an NVIDIA !GeForce 9400M (256MB VRAM) and CUDA 3.1, I get
{{{
benchmarking filter/cuda
collecting 100 samples, 1 iterations each, in estimated 5.192780 s
test: *** Internal error in package accelerate ***
*** Please submit a bug report at http://trac.haskell.org/accelerate
./Data/Array/Accelerate/CUDA.hs:45 (unhandled): CUDA Exception: invalid argument
}}}
The filter test runs fine in the validation phase. It only dies with criterion (probably as that tests a wider range of inputs).",chak
4,Active Tickets,28,reduced performance of small types,CUDA backend,0.8.1.0,,defect,,new,2010-08-22T08:44:59Z+0100,2010-10-13T05:46:59Z+0100,"CUDA devices do not coalesce memory transfers to global memory of 8- and 16-bit types. Without providing alternate skeletons that process multiple elements per thread (vec4 and vec2 types respectively), we may be able to promote these to 32-bit transactions, and mask off the irrelevant data. Similar issues exist for shared memory bank conflicts.",tmcdonell
4,Active Tickets,36,scan operations hang indefinitely on devices with Compute Capability 1.0,CUDA backend,0.8.1.0,,defect,,new,2010-09-11T03:54:56Z+0100,2010-10-13T05:47:42Z+0100,"{{{scan_intervals}}} calls {{{__syncthreads()}}} inside a for-loop and also calls another device function, {{{scan_block}}}, which contains several more {{{__syncthreads()}}} calls. Because the threads do not all exit the loop at the same time, scan operations hang indefinitely at {{{__syncthreads()}}} on devices with Compute Capability 1.0.",seanl
4,Active Tickets,31,support concurrent kernel execution on Fermi architectures,CUDA backend,0.8.0.0,,enhancement,,new,2010-08-23T01:51:05Z+0100,2010-08-23T01:51:18Z+0100,,tmcdonell
4,Active Tickets,3,Unhelpful error when using Double on pre-1.3 CUDA devices,CUDA backend,0.8.1.0,,defect,None,new,2010-07-18T12:50:10Z+0100,2010-10-13T05:46:21Z+0100,"Only CUDA devices with compute capability 1.3 and up contain hardware support for Double.  Currently, Accelerate will nevertheless generate code for Doubles for these devices, which leads to a failure further down the pipeline.  This is not very user-friendly.

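One possible shape for an earlier, more helpful check is sketched below. All types and names (`Compute`, `ScalarTy`, `minCompute`, `checkType`) are hypothetical and exist in neither Accelerate nor the cuda package:

{{{
-- Hypothetical types for illustration; not Accelerate or cuda package API.
data Compute = Compute Int Int        -- major and minor version
  deriving (Eq, Ord, Show)

data ScalarTy = TyInt | TyFloat | TyDouble
  deriving (Eq, Show)

-- Minimum compute capability needed for an element type. Double needs 1.3,
-- as stated above; the 1.0 baseline for the other types is an assumption.
minCompute :: ScalarTy -> Compute
minCompute TyDouble = Compute 1 3
minCompute _        = Compute 1 0

-- Check before code generation, so the failure can name the type and the
-- capability it needs instead of surfacing further down the pipeline.
checkType :: Compute -> ScalarTy -> Either (ScalarTy, Compute) ScalarTy
checkType dev ty
  | dev >= need = Right ty
  | otherwise   = Left (ty, need)
  where need = minCompute ty
}}}

Here `checkType (Compute 1 1) TyDouble` yields `Left (TyDouble,Compute 1 3)`, which a frontend could turn into a readable error message.
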
This raises the general question of how to handle device capabilities elegantly.",chak
