Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#4978 closed bug (fixed)

Continuation passing style loop doesn't compile into a loop

Reported by: tibbe Owned by:
Priority: normal Milestone: 7.2.1
Component: Compiler Version: 7.0.1
Keywords: Cc: johan.tibell@…, pumpkingod@…, djahandarie@…, leuschner@…, kolmodin@…, dterei
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case: perf/should_run/T4978
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I was investigating some poor performance in Data.Binary.Builder from the binary package. I boiled it down to GHC not turning a loop, expressed in CPS, into tail recursive function.

Here's the test code:

-- Simplification of a problem spotted in Data.Binary.Builder
module Repro (test) where

-- A builder that carries around one 'Int' worth of state.
newtype Builder = Builder { runBuilder ::  (Int -> Int) -> Int -> Int }

empty = Builder id
append (Builder f) (Builder g) = Builder (f . g)
add i = Builder $ \ k n -> k (n + i)
run b = runBuilder b id 0    

loop :: [Int] -> Builder
loop [] = empty
loop (x:xs) = add 1 `append` loop xs
    
test :: Int
test = run (loop [1..100])

Here's the (cleaned up) core:

test4 :: [Int] -> (Int -> Int) -> Int -> Int
test4 =
  \ (ys :: [Int])
    (k :: Int -> Int) ->
    case ys of _ {
      [] -> k;
      : x xs ->
        let {
          k2 :: Int -> Int
          k2 = test4 xs k } in
        \ (n_abz :: Int) ->
          k2
            (case n_abz of _ { I# x# ->
             I# (+# x# 1)
             })
    }

test3 :: [Int]
test3 = eftInt 1 100

test2 :: Int -> Int
test2 = test4 test3 (id @ Int)

test1 :: Int
test1 = I# 0

test :: Int
test = test2 test1

Note how test4 allocates a continuation it uses to call itself. Perhaps it could instead SAT the original continuation.

Attachments (2)

Repro3.hs (2.9 KB) - added by tibbe 9 years ago.
Repro4.hs (3.4 KB) - added by tibbe 9 years ago.

Download all attachments as: .zip

Change History (25)

comment:1 Changed 9 years ago by simonpj

Thank you for digging into the perf problem. (This one doesn't look like a regression btw; I think 6.12.3 behaves the same.)

I'm not sure what you mean by "SAT the original continuation".

To me the bad thing is that test4 gets arity 2. We'd like it to have arity 3. That is, we want to eta-expand test4.

And it's not hard to make it so. I made this 1-line change:

empty = Builder (\ k n -> k n)

Then I get

Main.main5 [Occ=LoopBreaker]
  :: [GHC.Types.Int]
     -> (GHC.Types.Int -> GHC.Types.Int)
     -> GHC.Types.Int
     -> GHC.Types.Int
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType SC(S)L]
Main.main5 =
  \ (ds_dpE :: [GHC.Types.Int])
    (eta_B1 :: GHC.Types.Int -> GHC.Types.Int)
    (eta1_X2 :: GHC.Types.Int) ->
    case ds_dpE of _ {
      [] -> eta_B1 eta1_X2;
      : x_abU xs_abV ->
        Main.main5
          xs_abV
          eta_B1
          (case eta1_X2 of _ { GHC.Types.I# x1_aq7 ->
           GHC.Types.I# (GHC.Prim.+# x1_aq7 1)
           })
    }

Much better! And indeed the program allocates less. With an input list of 100,000 (rather than 100), I get 19Mbytes of allocation before and 14Mbytes after.

Why is this? When compiling test4 GHC is worried that you might say

   test4 [] (error "urk") `seq` True

Now if we eta-expand test4 the before and after would behave differently.

The root of it is that f and (\y. f x) aren't the same, because seq can distinguish them.

Humph. See Note [Dealing with bottom] in CoreArity. In fact I do eta-expand bottoms because not doing so is a massive lose. But I do not eta-expand variables (as above). Maybe I should. Interesting.

First thing: is the transformation above what you were hoping for? If not, what? Write it by hand so we can see!

Simon

comment:2 Changed 9 years ago by tibbe

Cc: johan.tibell@… added

Ignore the comment about using the static argument transformation. This was just my guess at a solution to the problem (get rid of the continuation being passed around by hoisting it.)

I have a more realistic test case (taken directly from Data.Binary.Builder and then simplified). Your one line fix doesn't work here.

module Repro2 (test) where

import qualified Data.ByteString as S
import Data.ByteString.Internal (inlinePerformIO)
import Data.Monoid
import Foreign

newtype Builder = Builder {
        runBuilder :: (Buffer -> [S.ByteString]) -> Buffer -> [S.ByteString]
    }

instance Monoid Builder where
    mempty  = empty
    {-# INLINE mempty #-}
    mappend = append
    {-# INLINE mappend #-}
    mconcat = foldr mappend mempty
    {-# INLINE mconcat #-}

empty :: Builder
empty = Builder (\ k b -> k b)
{-# INLINE empty #-}

singleton :: Word8 -> Builder
singleton = writeN 1 . flip poke
{-# INLINE singleton #-}

append :: Builder -> Builder -> Builder
append (Builder f) (Builder g) = Builder (f . g)
{-# INLINE append #-}

-- Our internal buffer type
data Buffer = Buffer {-# UNPACK #-} !(ForeignPtr Word8)
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- used bytes
                     {-# UNPACK #-} !Int                -- length left

-- | Sequence an IO operation on the buffer
unsafeLiftIO :: (Buffer -> IO Buffer) -> Builder
unsafeLiftIO f =  Builder $ \ k buf -> inlinePerformIO $ do
    buf' <- f buf
    return (k buf')
{-# INLINE unsafeLiftIO #-}

-- | Get the size of the buffer
withSize :: (Int -> Builder) -> Builder
withSize f = Builder $ \ k buf@(Buffer _ _ _ l) -> runBuilder (f l) k buf

-- | Ensure that there are at least @n@ many bytes available.
ensureFree :: Int -> Builder
ensureFree n = n `seq` withSize $ \ l ->
    if n <= l then empty
    else error "Stub implementation: not enough space in buffer"
{-# INLINE ensureFree #-}

-- | Ensure that @n@ many bytes are available, and then use @f@ to write some
-- bytes into the memory.
writeN :: Int -> (Ptr Word8 -> IO ()) -> Builder
writeN n f = ensureFree n `append` unsafeLiftIO (writeNBuffer n f)
{-# INLINE writeN #-}

writeNBuffer :: Int -> (Ptr Word8 -> IO ()) -> Buffer -> IO Buffer
writeNBuffer n f (Buffer fp o u l) = do
    withForeignPtr fp (\p -> f (p `plusPtr` (o+u)))
    return (Buffer fp o (u+n) (l-n))
{-# INLINE writeNBuffer #-}

-- Test case

fromWord8s :: [Word8] -> Builder
fromWord8s [] = mempty
fromWord8s (x:xs) = singleton x `mappend` fromWord8s xs

word8s :: [Word8]
word8s = replicate 10000 97
{-# NOINLINE word8s #-}

test :: Builder
test = fromWord8s word8s

The problem is the same as before (a continuation getting in the way of a simple loop). In particular, lets look at the core for fromWord8s:

fromWord8s [Occ=LoopBreaker]
  :: [Word8] -> Builder
[GblId, Arity=1, Str=DmdType S]
fromWord8s =
  \ (ys :: [Word8]) ->
    case ys of _ {
      [] -> empty;
      : x xs ->
        let {
          a1_sTE [Dmd=Just L] :: Builder
          [LclId, Str=DmdType]
          a1_sTE = fromWord8s xs } in
        (\ (x2_aN1
              :: Buffer -> [Data.ByteString.Internal.ByteString]) ->
           let {
             k [Dmd=Just L]
               :: Buffer -> [Data.ByteString.Internal.ByteString]
             [LclId, Str=DmdType]
             k = (a1_sTE `cast` ...) x2_aN1 } in
           \ (buf :: Buffer) ->
             case buf
             of _ { Buffer addr# rb1_dMB o u l ->
             case <=# 1 l of _ {  -- Buffer size check
               False -> lvl1_rU0 `cast` ...;
               True ->
                 case x of _ { W8# x# ->
                 case writeWord8OffAddr#
                        @ RealWorld
                        (plusAddr# addr# (+# o u))
                        0
                        x#
                        realWorld#
                 of s2_aQv { __DEFAULT ->
                 case touch#
                        @ ForeignPtrContents rb1_dMB s2_aQv
                 of _ { __DEFAULT ->
                 k
                   (Buffer
                      addr#
                      rb1_dMB
                      o
                      (+# u 1)
                      (-# l 1))
                 }
                 }
                 }
             }
             })
        `cast` ...
    }

I don't understand why empty hasn't been inlined here and whether that's the reason we don't get a nice loop.

This particular performance bug is pretty important. The performance difference is almost 10x in my benchmarks, which are quite representative for what people actually use the package for (e.g. serializing lists of things).

comment:3 Changed 9 years ago by pumpkin

Cc: pumpkingod@… added

comment:4 Changed 9 years ago by simonpj

So HEAD (and 7.0.2) does a much better job here. It has a simple arity analyser that (somewhat to my surprise) is clever enough to spot that test2 really has arity 3. Your earlier example was in fact nastier, because test4's arity depended on its parameter k, which isn't the case with your new example.

Anyway try HEAD of 7.0.2.

T4978a.test2 [Occ=LoopBreaker]
  :: [GHC.Word.Word8]
     -> (T4978a.Buffer -> [Data.ByteString.Internal.ByteString])
     -> T4978a.Buffer
     -> [Data.ByteString.Internal.ByteString]
[GblId, Arity=3, Str=DmdType SC(S)L]
T4978a.test2 =
  \ (ds_dQG :: [GHC.Word.Word8])
    (eta_B1 :: T4978a.Buffer -> [Data.ByteString.Internal.ByteString])
    (eta1_X2 :: T4978a.Buffer) ->
    case ds_dQG of _ {
      [] -> eta_B1 eta1_X2;
      : x1_ax6 xs1_ax7 ->
        case eta1_X2
        of _ { T4978a.Buffer rb_dQL rb1_dQM rb2_dQN rb3_dQO rb4_dQP ->
        case GHC.Prim.<=# 1 rb4_dQP of _ {
          GHC.Bool.False ->
            lvl1_rYs
            `cast` (CoUnsafe
                      T4978a.Builder [Data.ByteString.Internal.ByteString]
                    :: T4978a.Builder ~ [Data.ByteString.Internal.ByteString]);
          GHC.Bool.True ->
            case x1_ax6 of _ { GHC.Word.W8# x2_aWt ->
            case GHC.Prim.writeWord8OffAddr#
                   @ GHC.Prim.RealWorld
                   (GHC.Prim.plusAddr# rb_dQL (GHC.Prim.+# rb2_dQN rb3_dQO))
                   0
                   x2_aWt
                   GHC.Prim.realWorld#
            of s2_aWv { __DEFAULT ->
            case GHC.Prim.touch#
                   @ GHC.ForeignPtr.ForeignPtrContents rb1_dQM s2_aWv
            of _ { __DEFAULT ->
            T4978a.test2
              xs1_ax7
              eta_B1
              (T4978a.Buffer
                 rb_dQL
                 rb1_dQM
                 rb2_dQN
                 (GHC.Prim.+# rb3_dQO 1)
                 (GHC.Prim.-# rb4_dQP 1))
            }
            }
            }
        }
        }
    }

comment:5 Changed 9 years ago by tibbe

That looks much better. I'll stil need to figure out if I can get test2 to be strict in the buffer, as there's lots of reboxing going on. That's outside the scope of this ticket.

Could we get this example turned into a test case? I'm not sure how to write one, otherwise I would do so myself. We'll likely depend on the arity analysis in the future, both in binary and attoparsec, and it would be nice to be confident that the performance of those libraries don't degrade dramatically in some future GHC version.

comment:6 Changed 9 years ago by tibbe

Actually, I can unbox Buffer by redefining empty like so:

empty :: Builder
empty = Builder (\ k b -> b `seq` k b)
{-# INLINE empty #-}

That gives me a "perfect" loop in the previous example. However, redefining empty in the real Data.Binary.Builder module doesn't work. The extra seq gets in the way of the arity analysis. I've attached Repro3.hs, which is very similar to the code I posted previously except that now includes code to allocate a new buffer when needed (that part was stubbed out in the above example).

If I don't add the seq to empty the arity analysis works, but Buffer gets passed around boxed. If I add the seq I get an unboxed Buffer, but the arity analysis fails and I get a closure allocated on each iteration of the loop.

Changed 9 years ago by tibbe

Attachment: Repro3.hs added

comment:7 Changed 9 years ago by djahandarie

Cc: djahandarie@… added

comment:8 Changed 9 years ago by dleuschner

Cc: leuschner@… added

comment:9 Changed 9 years ago by kolmodin

Cc: kolmodin@… added

comment:10 Changed 9 years ago by simonmar

Milestone: 7.2.1

Just a note to say that I have exactly the same problem in my parallelism monad (https://github.com/simonmar/monad-par). I believe I've tried a recent HEAD and couldn't get a loop to optimise correctly, just like tibbe's first example. I'll follow up with some concrete details and code shortly.

comment:11 Changed 9 years ago by simonmar

Here's my example boiled down as much as I can.

module Fac (fac) where

newtype Par a = Par {
    runCont :: (a -> Trace) -> Trace
}

instance Monad Par where
    return a = Par $ \c -> c a
    m >>= k  = Par $ \c -> runCont m (\a -> runCont (k a) c)

data Sched = Sched

type Trace = Sched -> IO ()

fac :: Int -> Par Int
fac n = if n <= 1 then return 1
                  else do x <- fac (n-1)
                          return (x * n)

which yields this for fac with today's HEAD:

a2_rnD :: (GHC.Types.Int -> ParFib.Trace) -> ParFib.Trace
[GblId, Arity=1, Caf=NoCafRefs, Str=DmdType C(S)]
a2_rnD = \ (c_agZ :: GHC.Types.Int -> ParFib.Trace) -> c_agZ a1_rnC

Rec {
ParFib.$wfac [Occ=LoopBreaker]
  :: GHC.Prim.Int# -> (GHC.Types.Int -> ParFib.Trace) -> ParFib.Trace
[GblId, Arity=1, Caf=NoCafRefs, Str=DmdType L]
ParFib.$wfac =
  \ (ww_snb :: GHC.Prim.Int#) ->
    case GHC.Prim.<=# ww_snb 1 of _ {
      GHC.Types.False ->
        let {
          a3_sno :: (GHC.Types.Int -> ParFib.Trace) -> ParFib.Trace
          [LclId, Str=DmdType]
          a3_sno = ParFib.$wfac (GHC.Prim.-# ww_snb 1) } in
        \ (c_ah2 :: GHC.Types.Int -> ParFib.Trace) ->
          a3_sno
            (\ (a4_ah3 :: GHC.Types.Int) ->
               c_ah2
                 (case a4_ah3 of _ { GHC.Types.I# x_ams ->
                  GHC.Types.I# (GHC.Prim.*# x_ams ww_snb)
                  }));
      GHC.Types.True -> a2_rnD
    }
end Rec }

It looks like the arity analyser should infer an arity of at least 2 (ultimately we want 3) for fac, but it has inferred only 1. The case should look cheap, and the let is for the recursive call.

I have tried eta-expanding return and >>=, to no avail. (that might still be necessary to get to arity 3, I'm not sure).

comment:12 Changed 9 years ago by batterseapower

Simon's example gets the "correct" arity of 2 if you replace the go_pap case in exprIsCheap' with (all (exprIsCheap' good_app) args). Although the comment there explicitly warns against doing this, I feel that is the right thing to do: why should f get arity 2 here:

f = \a -> let x = map g
              y = map h
          in \b -> .. y .. f ..

But not here:

f = \a -> let y = map (map g)
          in \b -> .. y .. f ..

There is no substantiative difference between the terms, but the current analysis treats them differently because it is sensitive to whether something is let-bound or occurs as an argument.

comment:13 in reply to:  12 Changed 9 years ago by tibbe

Replying to batterseapower:

Simon's example gets the "correct" arity of 2 if you replace the go_pap case in exprIsCheap' with (all (exprIsCheap' good_app) args).

This doesn't help in the Repro3.hs case above. For reference, this is the change I applied to GHC:

diff --git a/compiler/coreSyn/CoreUtils.lhs b/compiler/coreSyn/CoreUtils.lhs
index 70e1db7..1c86825 100644
--- a/compiler/coreSyn/CoreUtils.lhs
+++ b/compiler/coreSyn/CoreUtils.lhs
@@ -588,7 +588,7 @@ exprIsCheap' good_app other_expr    -- Applications and variables
     go _ _ = False
  
     --------------
-    go_pap args = all exprIsTrivial args
+    go_pap args = all (exprIsCheap' good_app) args
        -- For constructor applications and primops, check that all
        -- the args are trivial.  We don't want to treat as cheap, say,
        --      (1:2:3:4:5:[])

comment:14 Changed 9 years ago by tibbe

The above change to go_pap does improve the performance in the binary package's benchmark suite by a bit, but the core still doesn't look great. Here's an example from Repro3.hs:

$wa =
  \ (w_s11t :: [Word8])
    (w1_s11u :: Buffer -> [ByteString])
    (ww_s11x :: Addr#)
    (ww1_s11y :: ForeignPtrContents)
    (ww2_s11z :: Int#)
    (ww3_s11A :: Int#)
    (ww4_s11B :: Int#) ->
    case w_s11t of _ {
      [] -> w1_s11u (Buffer ww_s11x ww1_s11y ww2_s11z ww3_s11A ww4_s11B);
      : x1_ax1 xs1_ax2 ->
        let {
          $weta_s11K =
            \ (ww5_s11j :: Addr#)
              (ww6_s11k :: ForeignPtrContents)
              (ww7_s11l :: Int#)
              (ww8_s11m :: Int#)
              (ww9_s11n :: Int#) ->
              case x1_ax1 of _ { W8# x2_aYW ->
              case writeWord8OffAddr#
                     (plusAddr# ww5_s11j (+# ww7_s11l ww8_s11m)) 0 x2_aYW realWorld#
              of s2_aYY { __DEFAULT ->
              case touch# ww6_s11k s2_aYY of _ { __DEFAULT ->
              $wa
                xs1_ax2
                w1_s11u
                ww5_s11j
                ww6_s11k
                ww7_s11l
                (+# ww8_s11m 1)
                (-# ww9_s11n 1)
              }
              }
              } } in
        case <=# 1 ww4_s11B of _ {
          False ->
            case ww3_s11A of wild2_X18 {
              __DEFAULT ->
                : (PS ww_s11x ww1_s11y ww2_s11z wild2_X18)
                  (case newPinnedByteArray# 32752 realWorld#
                   of _ { (# _, mbarr#_a10h #) ->
                   $weta_s11K
                     (byteArrayContents# (mbarr#_a10h `cast` ...))
                     (PlainPtr mbarr#_a10h)
                     0
                     0
                     32752
                   });
              0 ->
                case newPinnedByteArray# 32752 realWorld#
                of _ { (# _, mbarr#_a10h #) ->
                $weta_s11K
                  (byteArrayContents# (mbarr#_a10h `cast` ...))
                  (PlainPtr mbarr#_a10h)
                  0
                  0
                  32752
                }
            };
          True -> $weta_s11K ww_s11x ww1_s11y ww2_s11z ww3_s11A ww4_s11B
        }
    }

It would be better to duplicate $weta_s11K in both branches instead of allocating a closure.

comment:15 Changed 9 years ago by simonmar

We should look into why $weta_s11K isn't being inlined, it looks like it ought to be. Perhaps touch# or the other primops don't look cheap enough to the simplifier?

comment:16 Changed 9 years ago by simonpj

Johan, if $weta was inlined would you be happy?

It's not being inlined just because GHC thinks it's too big (ie static code size; there's no work-duplication issue here). You can experiment by increasing the inlining threshold -funfolding-use-threshold=N. The default is 6; try increasing it to 15 or so.

Perhaps primops applications should count as practically free for inlining purposes?

Simon

comment:17 Changed 9 years ago by tibbe

With $weta inlined the code looks good enough; There's no longer any allocation in the common case i.e. when there's still space in the write buffer. Hopefully that also means there isn't a heap check in the common path.

It seems to me that at least some primops should look really cheap, like memory writes and reads, which only require a few instruction. Looking at primOpIsCheap (which simply calls primOpOkForSpeculation) we don't consider writeWord8OffAddr# cheap as it has side effects.

If primOpIsCheap would pattern match on the actual primop and return True or False on a case-by-case basis, we could make writeWord8OffAddr# look cheap. Shouldn't calls to out-of-line primops (like newPinnedByteArray#) also look cheap, as the primop itself will never be inlined?

The optimal Core for $wa would look something like:

$wa =
  \ (xs :: [Word8])
    (k :: Buffer -> [ByteString])
    (addr :: Addr#)
    (fp :: ForeignPtrContents)
    (o :: Int#)
    (u :: Int#)
    (l :: Int#) ->
    case xs of ys {
      [] -> k (Buffer addr fp o u l);
      : x xs1_awV ->
        case <=# 1 l of _ {
          False ->
            case u of wild2_X18 {
              __DEFAULT ->
                : (PS addr fp o wild2_X18)
                  (case newPinnedByteArray# 32752 realWorld#
                   of _ { (# _, mbarr#_a10k #) ->
                   let { a2_s10s = byteArrayContents# (mbarr#_a10k `cast` ...) } in
                   let { fp' = PlainPtr mbarr#_a10k } in
                   case touch# fp' realWorld# of _ { __DEFAULT ->
                   $wa ys k a2_s10s fp' 0 0 32752
                   }
                   });
              0 ->
                case newPinnedByteArray# 32752 realWorld#
                of _ { (# _, mbarr#_a10k #) ->
                let { a2_s10w = byteArrayContents# (mbarr#_a10k `cast` ...) } in
                let { fp' = PlainPtr mbarr#_a10k } in
                case touch# fp' realWorld# of _ { __DEFAULT ->
                $wa ys k a2_s10w fp' 0 0 32752
                }
                }
            };
          True ->
            case x of _ { W8# x# ->
            case writeWord8OffAddr#
                   (plusAddr# addr (+# o u)) 0 x# realWorld#
            of s2_aYX { __DEFAULT ->
            case touch# fp s2_aYX of _ { __DEFAULT ->
            $wa
              xs1_awV
              k
              addr
              fp
              o
              (+# u 1)
              (-# l 1)
            }
            }
            }
        }
    }

I'm not sure it's possible to get there as the code transformations I applied manually are non-obvious (i.e. I shared the call to writeWord8OffAddr# between all three branches).

comment:18 in reply to:  16 Changed 9 years ago by simonmar

Replying to simonpj:

Perhaps primops applications should count as practically free for inlining purposes?

I think you could get away with three classifications for the purposes of code size: touch# should be free, simple inline primops (e.g. +#, writeWord8OffAddr#) should be very cheap (they are just one instruction), and all others should look a bit bigger (the same as an application, probably).

comment:19 Changed 9 years ago by dterei

Cc: dterei added

Changed 9 years ago by tibbe

Attachment: Repro4.hs added

comment:20 Changed 9 years ago by tibbe

The current HEAD now generates good core for Repro3.hs, thanks to Simon Marlow. However, I've attached a more difficult (but real life) test case that still causes a closure to be allocated each time around the loop.

The main difference from Repro3.hs is that we now write four bytes at the time, increasing the size of the loop body a little bit.

comment:21 Changed 9 years ago by simonmar

Resolution: fixed
Status: newclosed

Ok, I believe this is now fixed. The relevant commits are:

and I added a test: changeset:cec2c3f6f99c3eea35bf1d1eb952ad9477166999

comment:22 Changed 9 years ago by simonmar

Milestone: 7.4.17.2.1

comment:23 Changed 9 years ago by simonpj

Test Case: perf/should_run/T4978
Note: See TracTickets for help on using tickets.