Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#14137 closed bug (fixed)

Do more inlining into non-recursive join points

Reported by: simonpj Owned by: nomeata
Priority: normal Milestone:
Component: Compiler Version: 8.2.1
Keywords: JoinPoints Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case: simplCore/should_compile/T14137
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

In pursuing #14068, Joahim says found this code:

let {                                           
  ds5 :: [[Int]]                                
  ds5 = case ==# x1 ww1 of {                      
        __DEFAULT -> go1 (+# x1 1#);            
        1# -> n                                 
      } } in                                    
join {                                          
  lvl6 :: [[Int]]                               
  lvl6 = (ds4 : y) : ds5} in     
joinrec {                                       
  safe :: Int -> Int -> [Int] -> [[Int]] 
  safe (x2 :: Int) (d :: Int) (ds6 :: [Int])    
    = case ds6 of {                             
        [] -> jump lvl6;                        
        : q l ->                                
          case x2 of wild6 { I# x3 ->           
          case q of { I# y1 ->                  
          case /=# x3 y1 of {                   
            __DEFAULT -> ds5;                   
            1# ->                               
              case d of { I# y2 ->              
              case /=# x3 (+# y1 y2) of {       
                __DEFAULT -> ds5;               
                1# ->                           
                  case /=# x3 (-# y1 y2) of {   
                    __DEFAULT -> ds5;           
                    1# ->                       
                      jump safe                 
                        wild6 (I# (+# y2 1#)) l 
                  }}}}}}
      }; } in                                   
jump safe ds4 lvl y; } in ...

This is terrible! lvl6 is a join point, but ds5 is not. And it's easy to fix: we should simply float ds5 into lvl6's rhs.

Backgound. In general, if GHC sees this

-- Version A
x = f x : g y

it'll float out the (f x) and (g y) parts thus (B):

-- Version B
a1 = f x
a2 = g y
x = a1 : a2

Reasons:

  1. In the scope of this binding we might have case x of (a:b) -> rhs, and the new form allows us to eliminate the case.
  1. Version A builds a thunk (which, when eval'd) builds thunks for (f x) and (g y) and returns a cons; whereas Version B builds the two thunks ahead of time and allocates a cons directly. If x gets evaluated this is a win, by avoiding an extra thunk. We measured this in our let-floating paper, and B is much better on average.

But if x is a join point, all this is wrong wrong wrong:

  1. A join point is never scrutinised by a nested case, so there is no benefit in floating.
  2. A join point is not implemented by a thunk, so there is no thunk alloc/update to avoid.

In fact, for join points we want to turn Version B into Version A!

Specifically:

  • In Simplify, do not call prepareRhs on join RHSs; nor do floating (see tick LetFloatFromLet) from the RHS of a join. In fact this seems to be already the case.
  • In OccurAnal, do not use rhsCtxt for the RHS of a non-recursive join point (see occAnalNonRecRhs). Setting this context makes a1 and a2 get "inside-lam" occurrences (see occAnalApp), and that in turn stops a1 and a2 getting inlined straight back into x's RHS in Version B. But for join points we want them to be inlined, to get back to Version A.

For a recursive join point we probably do not want to inline a1 and a2 because then they'd be allocated each time round the loop. But in fact we can't have a recursive join point whose RHS is just a cons, so it doesn't really arise. The point is: we only need to take care with rhsCtxt for non-recursive join points.

Bottom line: just that one fix to OccurAnal should do the job.

Change History (24)

comment:1 Changed 2 years ago by simonpj

Keywords: JoinPoints added

comment:2 Changed 2 years ago by nomeata

Owner: set to nomeata

comment:3 Changed 2 years ago by nomeata

I added the join-point related line here:

occAnalNonRecRhs env bndr bndrs body
  = occAnalLamOrRhs rhs_env bndrs body
  where
    -- See Note [Cascading inlines]
    env1 | certainly_inline = env
         | Just _ <- willBeJoinId_maybe bndr    = env
         | otherwise        = rhsCtxt env

to set env instead of rhsCtxt, and verified that it matches lvl6 in the example above. But it did not cause any effect.

(The desired effect would be that ds5 gets inlined into lvl6, so that the remaining uses of ds5, all in joinrec safe, are tail-calls and ds5 gets turned into a join point. Right?)

I am attaching the Haskell’ification of the source code above that I was using for testing.

comment:4 Changed 2 years ago by nomeata

I get an trac error when attaching files, so here it is:

module T14137 where
foo x ds4 n = safe x ds4
  where
  ds5 :: [[Int]]
  ds5 = if x == 0 then foo (x-1) ds4 n else n

  lvl6 :: [[Int]]
  lvl6 = (x : ds4) : ds5

  safe :: Int -> [Int] -> [[Int]]
  safe x2 ds6
    = case ds6 of
        [] -> lvl6;
        q : l | x2     == q -> ds5
              | x2 + 1 == q -> ds5
              | x2 + 2 == q -> ds5
              | x2 + 3 == q -> ds5
              | otherwise   -> safe (x+10) l

(I did not simply create a test case directly because I need to see the actualy, desired output in order to write the check.)

comment:5 Changed 2 years ago by simonpj

         | Just _ <- willBeJoinId_maybe bndr    = env

I think isJoinId would be fine, and much faster.

(The desired effect would be that ds5 gets inlined into lvl6, so that the remaining uses of ds5, all in joinrec safe, are tail-calls and ds5 gets turned into a join point. Right?)

Not quite: ds5 won't be a join point; it'll disappear entirely by being inlined.

comment:6 Changed 2 years ago by nomeata

I think isJoinId would be fine, and much faster.

Isn’t it the simplifier that turns a potential join point into one? So if I use isJoinId, won’t this delay the effect of the patch to the first run off the occurrence analyser after a simplifier run, instead of doing it right in the first occurrence analyzer run? (tagNonRecBinder only modifies the idOccInfo).

Not quite: ds5 won't be a join point; it'll disappear entirely by being inlined.

What about the many occurrences of ds5 in safe? Don’t they prevent inlining, due to code duplication? Also, safe is recursive, so we don’t inline into it anyways.

Last edited 2 years ago by nomeata (previous) (diff)

comment:7 Changed 2 years ago by simonpj

So if I use isJoinId, won’t this delay the effect of the patch to the first run off the occurrence analyser after a simplifier run

True. And in fact willBeJoinId is simpler than I thought, so probably ok.

But you'll need to ensure that you pass a occ-anal-tagged binder to occAnalNonRecRhs. And that in turn means you need to pass a tagged binder to occAnalUnfolding (not currently done) and occAnalRules (currently done).

comment:8 Changed 2 years ago by simonpj

Also, safe is recursive, so we don’t inline into it anyways.

Ah! Here is yet another place where join points are special. Really 1vl6 should be inlined. Consider

join j = e in
joinrec j2 x = ...j...

where there is just one occurrence of j in j2. Normally we'd be leery about inlining j under that \x because we might lose sharing. But nullary join points aren't thunks, so the situation is just as if we'd said

join j _ = e in
joinrec j2 x = ...(j ())...

when we'd cerainly inline it (via preInlineUnconditinoally). So let's make join points do that. In preInlineUnconditionally:

  | otherwise = case idOccInfo bndr of
                  IAmDead                    -> True -- Happens in ((\x.1) v)
                  occ@OneOcc { occ_one_br = True }
                                             -> try_once (occ_in_lam occ)
                                                         (occ_int_cxt occ)
                  _                          -> False

In the occ_one_br = True alternative, add a guard

                       | isJoinId bndr -> True    -- New
                       | otherwise     -> try_Once (occ_in_lam occ) ... as before

That should be good for everyone!

Now lvl6 will be inlined at its single call site.

Next question: we'd like the FloatOut pass not to float out tail-calls from a joinrec:

joinrec j x = case x of
                0 -> f True
                n -> j (n-1)

Here we don't want to float out (f True) from j, because although j is recursive, the f True is in tail position and can only be called once. I guess that is also true for

joinrec j x = case x of
                0 -> g x (f True)
                n -> j (n-1)

I'm not sure what the right criterion is and I'm out of time, so just leaving breadcrumbs here.

Last edited 2 years ago by simonpj (previous) (diff)

comment:9 Changed 2 years ago by simonpj

Finally there are the multiple occurrences of ds5. I think that ds5 counts as smallEnoughToInline and hence postInLineUnconditionally inlines it in HEAD. But in this variant I'm not longer sure becuase it's outside the recursive loop. Maybe it didn't start that way?

comment:10 Changed 2 years ago by nomeata

Maybe it didn't start that way?

Quite possible (didn’t check). The problem of things floating out of recursive lets and never back in has beein bothering me for a while, especially in the context of https://ghc.haskell.org/trac/ghc/ticket/10918#comment:5.

Maybe the following thought (which, in a way, is the opposite of what you propose in the second half of comment:8) deserves a ticket of its own:

Currently, we cannot inline x in

  let x = f x0
  in let go 10 y = exit1 x y
         go 20 y = exit2 (y*y)
         go 30 y = exit3 0 
         go i = go (i+1) (y*2)
     in go y

because it goes into a recursive function and under lambdas, and that is, in general, dangerous. But in this case, it would be ok because the occurence of x is not actually on a recursive path. But that is hard to spot in this form!

It would be easy to spot if we take all codepaths of go that do not mention go and float them out of the recursive let as non-recursive jointpoints:

  let x = f x0
  in join j1 y = exit1 x y
     join j2 y = exit2 (y*y)
     join j3 = exit3 0
     let go 10 y = call j1 y
         go 20 y = call j2 y
         go 30 y = call j3
         go i = go (i+1) (y*2)
     in go y

and now the existing machinery will happily inline x into j1 or j2.

In the example of this ticket, we’d move all occurrences of ds5 out of save this way, and the inliner could do its work unhindered by the recursion.

Last edited 2 years ago by nomeata (previous) (diff)

comment:11 Changed 2 years ago by nomeata

Now lvl6 will be inlined at its single call site.

Indeed the change to preInlineUnconditionally has the prognosed effect. I passed it to perf.haskell.org for performance evaluation.

comment:12 Changed 2 years ago by nomeata

Also note that floating out the exit code paths from recursive join points greatly improves the effectivity of the demand analyser, possibly to the point of making Call Arity obsolete (at least in some common cases). Consider this code:

let t = f a -- a thunk
in let go 0 y = t y + 5
       go n y = go (n-1) (y*y)
   in case go 100 2 of …

The cardinality analysis is not able to eta-expand t because it has to assume it is used multiple times. Call Arity, with the rather complicate co-call graph analysis finds that t is called at most once and allows for its eta-expansion.

But let’s see what happens if we apply some of the ideas from this tickets and from #14068. First we apply #14068:

let t = f a -- a thunk
in let go n y = joinrec j 0 y = t y + 5
                        j n y = call j (n-1) (y*y)
                in call j n y
   in case go 100 2 of …

I’d like to call this first joinrec normal form, defined as “every recursive function where all recursive calls are tail-recursive is a recursive join point”.

Next we apply the idea above of floating out all expressions not mentioning the join point

let t = f a -- a thunk
in let go n y = join jexit y = t y + 5
                joinrec j 0 y = call jexit y
                        j n y = call j (n-1) (y*y)
                in call j n y
   in case go 100 2 of …

I’d like to call this second joinrec normal form, defined as “first joinrec normal form, and all subexpressions of a recursive join point j that are in tail-call position and do not mention j are invocations a join point”.

This form minimizes the amount of code that is inside the joinrec, which is good, as many parts of the compiler (simplifier, demand analyzer) have to make conservative assumptions with recursion.

Now the normal demand analyser can infer that the non-recursive go is called at most once, hence the join-point jexit is called at most once (by virtue of being a join-point, it can do so without looking at its use-site, which are inside a recursive group), and hence t is used at most once. Success!

comment:13 in reply to:  11 Changed 2 years ago by nomeata

Replying to nomeata:

Now lvl6 will be inlined at its single call site.

Indeed the change to preInlineUnconditionally has the prognosed effect. I passed it to perf.haskell.org for performance evaluation.

Two changes observed: Runtime of scs improved by 3% (could be noise). But compiler performance regressed slightly (around +1% allocations in a few cases). Biggest loser is the parsing compiler performance benchmark tests/alloc/parsing001, 3% increase of allocations. I am not inclined to hunt down that regression (I’d require building GHC twice with identical settings and staring at lots of ticky profiles, and then staring at core code), I hope that’s ok. My completely unfounded gut feeling is that a join-point is inlined into a recursive function, and then floated out again in a form that is no longer a join point.

Given these results, do you want me to write a Note and push it to master, or not do that just yet and maybe first follow the breadcrumb “we'd like the FloatOut pass not to float out tail-calls from a joinrec”?

(But see above why it seems to me that floating out tail-calls from a joinrec is very desirable…)

comment:14 Changed 2 years ago by simonpj

There are three threads here (maybe we need separate tickets):

  • The original ticket description: this remains valid. Did you try it? It may not affect the queens example because of the other occurrences of ds5 which I'd missed; but it's a valid and beneficial change regardless.
  • preInlineUnconditionally for join points (comment:7). This ok, but it really should not make much difference in either direction. Unlike most inlining, it doesn't eliminate an allocation; it only eliminates an unconditional jump; and that jump will probably be eliminated in Cmm anyway. Doing the inlining cannot (I think) give rise to any knock-on optimisations.

I'm intrigued why scs goes faster (comment:13). Was that runtime or allocation? Also in the parsing001 benchmark, try with -dshow-passes to see if there's a change in the volume of generated code. That's wha usually makes compiler perf change.

  • Floating out more join points. I'm intrigued by your suggestion in comment:10, but I am not persuaded.
    • Generally speaking GHC inlines things that are called once, the exact reverse of ANF. Doing is more direct and reduces clutter.
    • We'd need a way to stop these used-once join points being inlined.
    • Core has an operational interpretation, and introducing a join connotes an extra jump. Maybe Cmm will eliminate it, but still.
    • Join points with parameters pass the parameters according to the calling convention, which we don't need.
    • Join points with parameters might be strictness analysed, specialised, and worker/wrappered. All stupid for called-once join points.
    • Lastly, it doesn't solve the problem in general. For example
      let ys = expensive in
      letrec f xs = case xs of
                     []     -> ys
                     (p:ps) -> p : f ps
      in ...
      
      Even after loopification f is not a joinrec. But, if (and only if!) f is called once in ..., it's safe to inline ys in the RHS of f. Dually, if it was
      letrec f xs = case xs of
                     []     -> expensive
                     (p:ps) -> p : f ps
      in ...
      
      (and f was called once "outside") we would not want to float expensive out.

In short, this business of spotting the exit paths of a recursive loop is an interesting challenge, but it goes beyond join points.

comment:15 Changed 2 years ago by simonpj

Consider the example above:

let ys = expensive in
letrec f xs = case xs of
               []     -> ys
               (p:ps) -> p : f ps
in f ps : qs

Here f is called exactly once in the body of the letrec, but not in tail position.

Question: can cardinality analysis discover that ys is evaluated at most once? I would have thought the answer should be 'yes'. And if so, we can inline it, which is what we want. To make this go

  • We'd need to check that the cardinality analysis fixpoint stuff was up to snuff
  • We'd need to make the inliner take account of that used-once info.

comment:16 Changed 2 years ago by nomeata

That example is what #10918 is about. The Cardinality Analyser does not see that ys is evaluated at most once, but Call Arity does. We can make it inline, but the tricky bit is to prevent it from being floated out again. That’s what I feel that a call to a join point outside of letrec would be more explicit and more stable – although I see the clumsiness of that. There you write:

But it does seem hard. The float-out pass assumes, crudely, that it's good to float out a redex of any called-many lambda. But, as we see here, that's wrong for case branches that are only evaluated on one of those calls (the final one in this case). Not only is that info hard to record in the syntax tree, but it's also potentially quite fragile to program transformation, like other sorts of cardinality information.

And a call to a join-point outside the letrec is so far the only way I can think of to “record it in the syntax tree” – but maybe there are better ways?

comment:17 Changed 2 years ago by simonpj

That’s what I feel that a call to a join point outside of letrec would be more explicit

Yes but it doesn't work in general. What if it was

let ys = expensive in
letrec f xs = case xs of
               []     -> ys : []
               (p:ps) -> p : f ps
in f ps : qs

Then ys is not a join point in any shape or form; but all the argument about being able to inline it is equally valid. I don't want to build a complicated infrastructure that only solves part of the problem!

Ah, silly me. I was struck with a brilliant idea, and then realised that it's exactly what you describe in comment:12. Take any case alternative in the recursive function that does not mention the recursive function, and make it into a non-recursive join point, thus:

letrec f xs = case xs of
                [p]    -> e1   -- no f's
                (p:ps) -> e2[f]

===>

let f as = join j p = e1 in
           joinrec f x = case xs of
                [p] -> j p
                (p:ps) -> e2[f]
           in f as

Now we can readily inline ys into e1 if f if called once (and hence the \as is marked one-shot. Moreover, it works for arbitary expressions e1.

Of course you worked this out already a few days ago; I just didn't read carefully enough. Sorry. I acrually rather like it now, as a part of the loopification transformation.

Still to think about

  • A remaining tricky point is that we need to stop one of these carefully-constructed non-recursive join points being inlined into a recursive join point, even if it is invoked at just one place. That should not be hard. And in a final run of the simplifer (or in CorePrep) we could switch off that restriction and let it inline.

Go for it! (But do these other simpler things first.)

comment:18 Changed 2 years ago by nomeata

Ok, it’s high time time to split the discussions. The exit path join point idea is now at #14152, and this ticket can continue to explore the inlining of join-points into recursive join-points.

In comment:14 you see three threads here. I do not quite see the difference between the first two threads you identified. Isn’t comment:7 precisely the “that one fix to OccurAnal [that] should do the job.” from the description?

comment:19 Changed 2 years ago by simonpj

I do not quite see the difference between the first two threads

The original Description is about floating a non-join point into a nullary join point:

let x = e in
join j = Just x in 
...

where this is the only occurrence of x. Here we want to transform to

join j = Just e

so that the thunk creation occurs only if j is executed rather than always.

But for nullary lets we do the reverse transform

let x = Just e 
===>
let a = e in
let x = Just e

for reasons described above.

In contrast, the second thread is about inlining join points, not about inlining lets. And as I say in comment:14 I'm no longer so keen on this idea, esp in view of #14152

comment:20 Changed 2 years ago by nomeata

The original Description is about floating a non-join point into a nullary join point:

That is happening in HEAD already (at least if I modify the example in comment:4 so that ds5 occurs only in the join point lvl6, then ds5 is floated into lvl6.) So maybe there is actually nothing to do?

comment:21 Changed 2 years ago by simonpj

Ha! I tried

module T14137 where

f xs = let thunk = length xs
           j = Just thunk
           g 0 = j
           g n = g (n-1)
       in
       g 7

Indeed, the inliner does not inline thunk, because of the lack of comment:3. But later FloatIn does push it inwards (it knows about join points); and the simplifier also knows not to float things out of join points. So all is well.

Still, it's more direct to do it right away. I'll do it shortly because I'm meddling there anyway.

That's all for this ticket!

comment:22 Changed 2 years ago by Simon Peyton Jones <simonpj@…>

In 8649535/ghc:

Don't do the RhsCtxt thing for join points

This minor change fixes Trac #14137.

It is described in Note [Join point RHSs] in OccurAnal

comment:23 Changed 2 years ago by simonpj

Resolution: fixed
Status: newclosed

comment:24 Changed 2 years ago by simonpj

Test Case: simplCore/should_compile/T14137
Note: See TracTickets for help on using tickets.