Ticket #32 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

Performance of concatMap and O2/Odph

Reported by: choener Owned by:
Priority: critical Milestone: 0.7.1
Version: 0.6 Keywords:
Cc:

Description

This one is fun, I get slowdowns of up to 10.000x which is especially nice in code that is called often ;-) (why exactly, I do not know yet, it should be more like x5 - 10x) -Odph adversely effects runtime performance here!

The code below differs only in the "makes it slow" line.

module Main where

import qualified Data.Vector.Unboxed as VU
import Criterion.Main

iL = 4
jL = 100

good :: Int -> Int -> VU.Vector (Int,Int)
good i j = {-# CORE "good" #-}
  VU.map (\(k,l) -> (k-l,l)) $
  VU.concatMap (
    \d -> VU.map (\d' -> (d,d'))
          $ VU.enumFromN 3 (d-5))    -- for each distance, all possible left/right combinations
  $ VU.enumFromN 8 (min 23 (j-i-13)) -- diagonal distance or number of unpaired nucleotides -2.
{-# INLINE good #-}


bad :: Int -> Int -> VU.Vector (Int,Int)
bad i j = {-# CORE "bad" #-}
  VU.map (\(k,l) -> (i+k,j-l)) $ -- this part makes it slow!
  VU.map (\(k,l) -> (k-l,l)) $
  VU.concatMap (
    \d -> VU.map (\d' -> (d,d'))
          $ VU.enumFromN 3 (d-5))    -- for each distance, all possible left/right combinations
  $ VU.enumFromN 8 (min 23 (j-i-13)) -- diagonal distance or number of unpaired nucleotides -2.
{-# INLINE bad #-}

main = defaultMain
   [ bench "good" $ whnf (\j -> VU.sum $ VU.map (\(k,l) -> k+l) $ good iL j) jL
   , bench "bad" $ whnf (\j -> VU.sum $ VU.map (\(k,l) -> k+l) $ bad iL j) jL
   ]

Change History

Changed 3 years ago by choener

With -O2 and -funfolding-use-threshold=100, the difference drops to 18x

Changed 3 years ago by rl

  • priority changed from minor to critical
  • milestone set to 0.8

Changed 3 years ago by rl

  • milestone changed from 0.8 to 0.7.1

Changed 3 years ago by rl

There's good news and bad news here. With GHC 7, there is practically no performance difference between the two functions. But both functions are slower when compiled with GHC 7 than the first one when compiled with GHC 6.12. I need to investigate more.

Changed 3 years ago by rl

This is all quite horrible. I rewrote concatMap such that good has the same performance with 7.0 as with 6.12.3 but now bad is about 10x slower than good again. I found one GHC bug which affects concatMap but working around it doesn't help with bad. The problem seems to be that bad's loop uses i, can't be floated out because of this and then GHC doesn't optimise it as well as good's loop which is getting floated out. Why this affects optimisation is a bit of a mystery so far. Perhaps SpecConstr is getting confused by something.

Changed 3 years ago by rl

I found another GHC bug which seems to play a role here. I suspect there are more. This is a great example, thanks!

Changed 3 years ago by choener

You're welcome.

Actually, if this gets faster it makes me very happy. Making above faster has direct consequences for RNA folding algorithms and is responsible for something like 50% of the runtime of 'normal input'. :-)

Gruss, Christian

Changed 3 years ago by rl

Alas, the two bugs I found so far have been fixed but that didn't help. I have a suspicion about what's going on here but I need to produce a small testcase which demonstrates the problem to Simon PJ and that's tricky. I'll keep working on it, though, it's on the top of my list.

That said, concatMap is never going to be really fast. You would probably benefit from having segmented operations in the style of the dph-prim-seq package. I'm going to move those into a separate, nice package eventually but that won't happen in the next couple of weeks. I'll see if I can rewrite your code to use segmentation on the weekend. Are these really the loops you use in your code? If not, where can I find this particular bit? In !RNAFold? Are there benchmarks I could run?

Changed 3 years ago by choener

My first answer was eaten by a timeout:

In http://hackage.haskell.org/packages/archive/RNAFold/0.0.2.1/doc/html/src/BioInf-RNAFold-Functions.html please look for the function 'interiorLoopDistances'. At the call site in 'largeInteriorLoopBase' we basically have the "bad" case.

I'll have to recheck the core again with ghc7-rc1 if there still is an intermediate array being created.

I will add benchmarks and other stuff in the next release on hackage (they are already in my darcs, which is not public due to severely embarassing check-in comments ;-)

Changed 3 years ago by rl

Another GHC bug. This one might actually be the real culprit.

Changed 3 years ago by choener

I should rewrite the benchmarks to make them less messy. That'll probably be in the next release. For now, just use the messy stuff, uploaded here: http://www.tbi.univie.ac.at/~choener/Haskell/benchmarks/ They are a part of Lib-RNAFold.

BadCases?.hs contains the more annoying functions.

Changed 3 years ago by rl

Sigh. I fixed the GHC problem which also affected a couple of DPH programs but that didn't do a lot for this particular example. This is rather annoying.

Nevertheless, I wonder if this GHC patch has any effect at all on your programs:

Thu Nov 18 13:28:39 PST 2010  Roman Leshchinskiy <rl@cse.unsw.edu.au>
 * ForceSpecConstr now forces specialisation even for arguments which aren't scrutinised

   M ./compiler/specialise/SpecConstr.lhs -24 +31

Changed 3 years ago by rl

  • status changed from new to closed
  • resolution set to fixed

Doh, I'm such an idiot. GHC transforms good to:

good :: Int -> Int -> VU.Vector (Int,Int)
good i j
  | 23 <= j-i-13 = 
    VU.map (\(k,l) -> (k-l,l)) $
    VU.concatMap (
      \d -> VU.map (\d' -> (d,d'))
            $ VU.enumFromN 3 (d-5))
    $ VU.enumFromN 8 23

  | otherwise =
    VU.map (\(k,l) -> (k-l,l)) $
    VU.concatMap (
      \d -> VU.map (\d' -> (d,d'))
            $ VU.enumFromN 3 (d-5))
    $ VU.enumFromN 8 (j-i-13)

Now, the first alternative doesn't depend on i and j at all and is floated out:

vec :: VU.Vector (Int,Int)
vec = VU.map (\(k,l) -> (k-l,l)) $
      VU.concatMap (
        \d -> VU.map (\d' -> (d,d'))
              $ VU.enumFromN 3 (d-5))
      $ VU.enumFromN 8 23


good :: Int -> Int -> VU.Vector (Int,Int)
good i j
  | 23 <= j-i-13 = vec
  | otherwise = ...

Of course, j-i-13 is indeed greater than 23 in the example so it's the first alternative that get executed. When Criterion runs this multiple times, the actual computation only happens once. In all subsequent runs, good is basically a noop which, of course, explains why it appears to be fast.

FWIW, the implementation of concatMap has changed in vector 0.7. The result is that bad runs twice as fast as with 0.6 and good now has the same performance as bad. I presume this is because GHC fails to float out the computation.

I'm going to close this. It was a good example, helped find quite a few bugs in SpecConstr.

Note: See TracTickets for help on using tickets.