Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#3116 closed bug (fixed)

missed opportunity for call-pattern specialisation

Reported by: duncan Owned by:
Priority: normal Milestone:
Component: Compiler Version: 6.10.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: eyeball/T3116.hs
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

With strict ByteString and a simple tail-recursive iteration pattern we get perfect code.

First the definitions:

data ByteString
   = BS  {-# UNPACK #-} !(ForeignPtr Word8)  -- payload
         {-# UNPACK #-} !Int                 -- offset
         {-# UNPACK #-} !Int                 -- length

null :: ByteString -> Bool
null (BS _ _ l) = l <= 0

tail :: ByteString -> ByteString
tail (BS p s l)
  | l <= 0     = error "ByteString.tail: empty ByteString"
  | otherwise  = BS p (s+1) (l-1)

Now a trivial iteration pattern:

length :: ByteString -> Int
length = go 0
  where
    go !n bs  | null bs    = n
              | otherwise  = go (n+1) (tail bs)

Perfect core code (edited for clarity):

go :: Int# -> Addr# -> ForeignPtrContents
   -> Int# -> Int# -> Int#
go = \n p fpc o l ->
  case l <=# 0 of
    False  -> go (n +# 1) p fpc (o +# 1) (l -# 1)
    True   -> n

length :: ByteString -> GHC.Base.Int
length = \bs ->
  case bs of
    BS p fpc o l ->
      case go 0 p fpc o l of
        n -> I# n

This worked because strict ByteString is a single-constructor data type, which allows it to be unpacked into separate parameters in the recursive call.

Now, let's try the same with lazy ByteStrings:

data ByteString
   =  Empty
   |  Chunk {-# UNPACK #-} !StrictBS.ByteString ByteString

This of course has two constructors. It's built for call-pattern specialisation.

Now the ops:

null :: ByteString -> Bool
null Empty = True
null _     = False

tail :: ByteString -> ByteString
tail Empty = error "empty tail"
tail (Chunk (S.BS fp s 1) cs) = cs
tail (Chunk (S.BS fp s l) cs) = Chunk (S.BS fp (s+1) (l-1)) cs

We can use the exact same code for length, but now with the lazy ByteString type:

length :: ByteString -> Int
length = go 0
  where
    go !n bs  | null bs    = n
              | otherwise  = go (n+1) (tail bs)

But, oh noes!! The optimisation does not work:

go :: Int# -> ByteString -> Int#
go = \n bs ->
  case bs of
    Empty                 -> n
    Chunk p fpc o l bs'  ->
      go (n +# 1)
         (case l of
            1 -> bs'
            _ -> Chunk p fpc (o +# 1) (l -# 1) bs')

length :: ByteString -> Int
length = \bs ->
  case go 0 bs of
    n -> I# n

However, this is not because call-pattern specialisation didn't do what we wanted. We know this for several reasons. First, if we remove the case

tail (Chunk (S.BS fp s 1) cs) = cs

then call pattern specialisation works and the code ends up as a perfect loop. Of course we need that case for correctness.

Also, if we change the definition of lazy ByteString to be a single constructor and represent the end by an empty chunk then we still get essentially the same code.

Also, if we use uncons instead of null and tail then we effectively perform by hand the missing optimisation transformation and get perfect code (and call-pattern specialisation happens exactly as expected).

length :: ByteString -> Int
length = go 0
  where
    go !n bs = case uncons bs of
      Nothing        ->  n
      Just (_, bs')  ->  go (n+1) bs'

uncons :: ByteString -> Maybe (Word8, ByteString)
uncons Empty = Nothing
uncons (Chunk c cs)
  | StrictBS.length c == 1
  = Just (S.unsafeHead c, cs)

  | otherwise
  = Just (S.unsafeHead c, Chunk (S.unsafeTail c) cs)
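To check that the two formulations really agree, here is a toy model in which a strict chunk is represented only by its length (the one field this loop inspects); `lengthTail` and `lengthUncons` are hypothetical names for the null/tail and uncons versions above:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Toy model: a strict chunk is represented only by its length.  Since
-- the Word8 head is irrelevant to counting, uncons here returns only
-- the rest of the string.  All names are hypothetical stand-ins.
data ByteString = Empty | Chunk !Int ByteString

nullBS :: ByteString -> Bool
nullBS Empty = True
nullBS _     = False

tailBS :: ByteString -> ByteString
tailBS Empty        = error "empty tail"
tailBS (Chunk 1 cs) = cs
tailBS (Chunk l cs) = Chunk (l-1) cs

-- the null/tail formulation
lengthTail :: ByteString -> Int
lengthTail = go 0
  where
    go !n bs | nullBS bs = n
             | otherwise = go (n+1) (tailBS bs)

-- the uncons formulation
uncons :: ByteString -> Maybe ByteString
uncons Empty        = Nothing
uncons (Chunk 1 cs) = Just cs
uncons (Chunk l cs) = Just (Chunk (l-1) cs)

lengthUncons :: ByteString -> Int
lengthUncons = go 0
  where
    go !n bs = case uncons bs of
                 Nothing  -> n
                 Just bs' -> go (n+1) bs'

main :: IO ()
main = print (lengthTail bs, lengthUncons bs)
  where bs = Chunk 3 (Chunk 4 Empty)
```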

This version with uncons gives us perfect code:

go_chunk  :: ByteString -> Int# -> Int#
          -> ForeignPtrContents -> Addr# -> Int# -> Int#
go_chunk = \bs' l o fpc p n ->
  case l of
    1 -> go (n +# 1) bs'
    _ -> go_chunk bs' (l -# 1) (o +# 1) fpc p (n +# 1)

go :: Int# -> ByteString -> Int#
go = \n bs ->
  case bs of
    Empty -> n
    Chunk p fpc o l bs' ->
      case l of
        1 -> go (n +# 1) bs'
        _ -> go_chunk bs' (l -# 1) (o +# 1) fpc p (n +# 1)

length :: ByteString -> Int
length = \bs ->
  case go 0 bs of
    n -> I# n

and we can see the specialisation rule that ghc invented for us:

forall bs l o fpc p n.
     go n (Chunk p fpc o l bs)
  =  go_chunk bs l o fpc p n

Aside: looks like there's an extra/missing reverse in there somewhere.

The problem with the head / tail version is that the following transformation is never performed and so the opportunity for call pattern specialisation (or even simple worker/wrapper unpacking) is never exposed:

    go (n+1)
       (case l of
          1  -> bs'
          _  -> Chunk p fpc (o+1) (l-1) bs')
=
    case l of
      1  -> go (n+1) bs'
      _  -> go (n+1) (Chunk p fpc (o+1) (l-1) bs')

which is the fragment inside the inlined definition of the go worker function:

go !n bs = case bs of
  Empty                -> n
  Chunk p fpc o l bs'  ->
    go (n+1)
       (case l of
          1  -> bs'
          _  -> Chunk p fpc (o+1) (l-1) bs')

So go is tail recursive and strict in both arguments. This situation will arise whenever we have something like

go (tail xs)

and tail involves a case analysis. Since go is just a function call there is no problem in duplicating it into the two branches of tail. The fact that it enables the call-pattern specialisation makes this a huge win.
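Because the context is strict, the two sides of the transformation are interchangeable. A hypothetical spot check, with `f` standing in for the call to go and `b` playing the role of the case scrutinee:

```haskell
-- Spot check: pushing a (strict) call into the branches of a case does
-- not change the result.  f and b are hypothetical stand-ins.
f :: Int -> Int
f x = 2 * x + 1

before, after :: Bool -> Int -> Int
before b x = f (case b of True -> x; False -> negate x)
after  b x = case b of True -> f x; False -> f (negate x)

main :: IO ()
main = print (and [ before b x == after b x
                  | b <- [True, False], x <- [-5 .. 5] ])
```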

This will happen with lazy ByteStrings or any other chunked representation using trees. We really need the per-chunk inner loop to be good with just a single test in the fast path to see if we're at the end of the chunk. When we get to the end of the chunk we can move into the slow path and do more cunning things with chunks and trees and laziness. But the overall speed of these kinds of operations is determined by the quality of the inner loop. If that inner loop has to allocate a fresh constructor each time round then we lose.

In principle, if we can optimise perfectly then there is no reason for strict ByteString to have faster code than lazy ByteString, because their inner loops should look exactly the same.

Change History (8)

comment:1 Changed 10 years ago by duncan

Summary: missed optimisation opportunity with lazy ByteStrings → missed opportunity for call-pattern specialisation
Type: bug → run-time performance bug

comment:2 Changed 10 years ago by duncan

It's not directly related to the missing transformation, but we could actually do slightly better in the call-pattern specialisation in the above example, even in the case where it already works.

As I understand it, for call-pattern specialisation we start with:

go !n bs = case bs of
  Empty                -> n
  Chunk p fpc o l bs'  ->
    go (n+1)
       (case l of
          1  -> bs'
          _  -> Chunk p fpc (o+1) (l-1) bs')

And we decide that it's profitable to specialise on the second argument for Empty and Chunk _ _ _ _ _ patterns. We then define:

go_empty n                = go n Empty
go_chunk n p fpc o l bs'  = go n (Chunk p fpc o l bs')

But, and this is the point, we leave the original go definition as it is. Of course later we apply the specialisation rules in the body of the original go.

However, here is a potential improvement. We could redefine the original go in terms of its specialisations:

go n bs
  = case bs of
      Empty                -> go_empty n
      Chunk p fpc o l bs'  -> go_chunk n p fpc o l bs'

Now we can leave it up to the heuristics of the inliner to decide if it is actually profitable to inline go_chunk here or if it is better to leave it as a call. If we decide not to inline it then we would get:

go n bs
  = case bs of
      Empty                -> n
      Chunk p fpc o l bs'  -> go_chunk n p fpc o l bs'

go_chunk n p fpc o l bs'
  = case l of
      1  -> go        (n+1) bs'
      _  -> go_chunk  (n+1) p fpc (o+1) (l-1) bs'

It seems to me there is little benefit here from inlining go_chunk into go. I've not benchmarked the two so I can't say for sure but this looks like it is the perfect code given the original definitions (though with the obvious unboxing). Indeed in this simple case it may not even be necessary to export the go_chunk specialisation rule because it is implicit in the new definition of go (though this would not be true if the inliner decided to inline go_chunk into the body of go).

Anyway, the point is, there may be such cases and by redefining the original body in terms of the specialisation we give the inliner the opportunity to make that choice. I don't see any downsides to giving it the choice. I may be missing something of course.
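Written out by hand on a toy type (chunks represented only by their length, with the pointer fields dropped — a simplification for illustration), the proposed shape is:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Hand-written version of the proposed shape: go dispatches once on
-- the constructor, and goChunk loops over the unpacked chunk fields,
-- re-entering go only at a chunk boundary.
data ByteString = Empty | Chunk !Int ByteString

go :: Int -> ByteString -> Int
go !n Empty         = n
go !n (Chunk l bs') = goChunk n l bs'

goChunk :: Int -> Int -> ByteString -> Int
goChunk !n 1 bs' = go      (n+1) bs'
goChunk !n l bs' = goChunk (n+1) (l-1) bs'

main :: IO ()
main = print (go 0 (Chunk 2 (Chunk 3 Empty)))
```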

comment:3 Changed 10 years ago by simonpj

difficulty: Unknown

As luck would have it, I had a spare hour and looked at this. I was happy to discover that a few months ago I'd changed the representation of a StrictArg continuation (the SimplCont type) that made it easy to achieve what you want; previously it was nigh impossible which is why GHC was so wimpy.

So a matter of half a dozen lines of code changed gives the result you want. I attach my standalone test case; try compiling it with GHC 6.10 to see the bad behaviour. But with my new patch we get this for length:

T3116.$s$wgo =
  \ (sc_sy6 :: T3116.ByteString)
    (sc1_sy7 :: GHC.Prim.Int#)
    (sc2_sy8 :: GHC.Prim.Int#)
    (sc3_sy9 :: GHC.ForeignPtr.ForeignPtrContents)
    (sc4_sya :: GHC.Prim.Addr#)
    (sc5_syb :: GHC.Prim.Int#) ->
    case sc1_sy7 of ds_XvF {
      __DEFAULT ->
        T3116.$s$wgo
          sc_sy6
          (GHC.Prim.-# ds_XvF 1)
          (GHC.Prim.+# sc2_sy8 1)
          sc3_sy9
          sc4_sya
          (GHC.Prim.+# sc5_syb 1);
      1 -> T3116.$wgo (GHC.Prim.+# sc5_syb 1) sc_sy6
    }

T3116.$wgo =
  \ (ww_sxF :: GHC.Prim.Int#) (w_sxH :: T3116.ByteString) ->
    case w_sxH of _ {
      T3116.Empty -> ww_sxF;
      T3116.Chunk ipv_sw9 ipv1_swa ipv2_swb ipv3_swc ipv4_swd ->
        case ipv3_swc of ds_XvF {
          __DEFAULT ->
            T3116.$s$wgo
              ipv4_swd
              (GHC.Prim.-# ds_XvF 1)
              (GHC.Prim.+# ipv2_swb 1)
              ipv1_swa
              ipv_sw9
              (GHC.Prim.+# ww_sxF 1);
          1 -> T3116.$wgo (GHC.Prim.+# ww_sxF 1) ipv4_swd
        }
    }
end Rec }

T3116.length :: T3116.ByteString -> GHC.Types.Int
T3116.length =
  \ (w_sxH :: T3116.ByteString) ->
    case T3116.$wgo 0 w_sxH of ww_sxK { __DEFAULT ->
    GHC.Types.I# ww_sxK
    }

Can't do much better than that. Expect a patch shortly.

I don't understand your recent comment above, though. The RULES generated from the specialisations are indeed applied in the original function! Can you give a standalone example showing what happens, and what you want to happen instead?

Simon

comment:4 in reply to:  3 Changed 10 years ago by duncan

Replying to simonpj:

Can't do much better than that. Expect a patch shortly.

Woo hoo! That is indeed perfect. Thanks very much! I will update that example in my thesis accordingly! :-)

I don't understand your recent comment above, though. The RULES generated from the specialisations are indeed applied in the original function!

Yes, of course. Sorry, I was not very clear.

Can you give a standalone example showing what happens, and what you want to happen instead?

Ok, so with your fix we're getting essentially:

go_chunk n p fpc o l bs'
  = case l of
      1  -> go        (n+1) bs'
      _  -> go_chunk  (n+1) p fpc (o+1) (l-1) bs'

go !n bs
  = case bs of
      Empty                -> n
      Chunk p fpc o l bs'  ->
        case l of
          1  -> go        (n+1) bs'
          _  -> go_chunk  (n+1) p fpc (o+1) (l-1) bs'

The body of go contains a full unfolded instance of go_chunk (obviously, because that's the way go was defined and go_chunk was derived from a portion of the body of go).

But what if it turned out that calling go_chunk was better than having go_chunk unfolded into the body of go?

That is, it might be that this turns out to be the faster code:

go_chunk n p fpc o l bs'
  = case l of
      1  -> go        (n+1) bs'
      _  -> go_chunk  (n+1) p fpc (o+1) (l-1) bs'

go n bs
  = case bs of
      Empty                -> n
      Chunk p fpc o l bs'  -> go_chunk n p fpc o l bs'

Currently, when we do the specialisation we leave the original definition unmodified. Of course we subsequently apply the rules to it etc., but initially it is kept in its original form.

The suggestion is that perhaps it is profitable to substitute calls to the specialised versions of the function into the original definition. We can always recover the original definition if the inliner decides it's the right thing to do.

So, in this example, we start with the definition:

go !n bs = case bs of
  Empty                -> n
  Chunk p fpc o l bs'  ->
    case l of
      1  -> go (n+1) bs'
      _  -> go (n+1) (Chunk p fpc (o+1) (l-1) bs')

Then we construct the specialised versions:

go_empty n = n

go_chunk n p fpc o l bs'
  = case l of
      1  -> go (n+1) bs'
      _  -> go (n+1) (Chunk p fpc (o+1) (l-1) bs')

Of course we also make the rules. In the current system we then just move on to applying the specialisation rules in the body of these functions and the original.

But suppose instead that we first redefine the original go in terms of its specialisations:

go !n bs = case bs of
  Empty                -> go_empty n
  Chunk p fpc o l bs'  -> go_chunk n p fpc o l bs'

I do not mean doing this by rewrite rules; I mean doing it directly as part of the procedure of generating specialised versions. Pick a pattern, lift the corresponding portion of the body out to become the specialised function, and replace it with a call to the new specialised function. (That's an over-simplification of course, but hopefully it illustrates the idea.)

Then we move on as normal to apply rules and do ordinary simplification. This now gives the inliner the choice. It can choose to inline go_chunk in the body of go or it can choose to keep it as a call. If go_chunk is quite large then it might be better to keep it as a call. In the existing scheme it has no choice because the original function definition is never changed, it always contains the full body of each of its specialisations.

Does that make any sense?

comment:5 Changed 10 years ago by simonpj

Oh, now I see, thanks. But I have no clue how to implement it! Just because 'f' has a profitable call pattern (f (x:xs)) does NOT mean that f starts with case analysis on its first argument. Or, to put it more precisely: what do you propose to replace Section 3.2 of the paper with?

I think you may be misled by a special case.

Simon

comment:6 Changed 10 years ago by duncan

Yes perhaps it only works for top level case analysis. It may still be a frequent special case. I'll think about how it generalises and what a full algorithm would look like.

comment:7 Changed 10 years ago by simonpj

Resolution: fixed
Status: new → closed
Test Case: eyeball/T3116.hs

The original (excellent) suggestion is implemented by this patch:

Wed Mar 25 09:52:05 GMT 2009  simonpj@microsoft.com
  * Improve mkDupableCont; and fix Trac #3116
  
  It turns out that, as a result of a change I made a few months ago to
  the representation of SimplCont, it's easy to solve the optimisation
  challenge posed by Trac #3116.  Hurrah.
  
  Extensive comments in Note [Duplicating StrictArg].
  

    M ./compiler/simplCore/SimplUtils.lhs -1 +1
    M ./compiler/simplCore/Simplify.lhs -19 +67

Now you get perfect code for the example! Test in eyeball/T3116.hs.

I'm going to close this ticket; if you like, you can open another for the more speculative idea you were proposing.

Simon

comment:8 Changed 10 years ago by simonmar

Type of failure: Runtime performance bug