Opened 9 years ago

Closed 5 years ago

#4428 closed bug (fixed)

Local functions lose their unfoldings

Reported by: rl Owned by:
Priority: low Milestone:
Component: Compiler Version: 7.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I'm attaching 2 modules, both should be compiled with -O2. Let's look at the inteface of Foo:

  foo :: (GHC.Types.Int, GHC.Types.Int)
         -> [GHC.Types.Int]
         -> [GHC.Types.Int]
    {- Arity: 2, Strictness: U(U(L)U(L))L, Inline: INLINE (sat-args=2),
       Unfolding: InlineRule (2, False, False)
                  (\ ds :: (GHC.Types.Int, GHC.Types.Int) xs :: [GHC.Types.Int] ->
                   case @ [GHC.Types.Int] ds of wild { (i, n) ->
                   let {
                     step :: forall t.
                             ([t], GHC.Types.Int, GHC.Types.Int)
                             -> Data.Maybe.Maybe (([t], GHC.Types.Int, GHC.Types.Int), t)
                       {- Inline: INLINE[0] (sat-args=1) -}
                     = ...

Note how step, which is local to foo, has an INLINE pragma but no unfolding. Now, let's look at Bar. For some reason, GHC only inlines foo into bar in phase 1. This is probably unrelated but still surprising. The main problem, though, is that step, not having an unfolding, doesn't get inlined at all despite the INLINE pragma.

Attachments (2)

Foo.hs (526 bytes) - added by rl 9 years ago.
Bar.hs (77 bytes) - added by rl 9 years ago.

Download all attachments as: .zip

Change History (28)

Changed 9 years ago by rl

Attachment: Foo.hs added

Changed 9 years ago by rl

Attachment: Bar.hs added

comment:1 Changed 9 years ago by simonpj

A fix for the late inlining of foo:

Mon Oct 25 08:26:22 PDT 2010  simonpj@microsoft.com
  * Do not (ever) use substExprSC in the simplifier
  
  "Short-cut" substitution means "do nothing if the substitution
  is empty". We *never* want do to that in the simplifier because
  even though the substitution is empty, the in-scope set has
  useful information:
  
   * We get up-to-date unfoldings; and that in turn may
     reduce the number of iterations of the simplifier
  
   * We avoid space leaks, because failing to substitute may
     hang on to old Ids from a previous iteration
  
  (This is what was causing the late inlining of foo in
  Trac #4428.)

    M ./compiler/coreSyn/CoreSubst.lhs -6 +8
    M ./compiler/simplCore/SimplEnv.lhs -8 +16

It's only a performance fix, but I believe it could close a space leak, so probably worth merging when convenient.

The main bug is fixed by this:

Mon Oct 25 08:28:17 PDT 2010  simonpj@microsoft.com
  * Serialise nested unfoldings across module boundaries
  
  As Roman reported in #4428, nested let-bindings weren't
  being recorded with their unfoldings.  Needless to say,
  fixing this had more knock-on effects than I expected.

    M ./compiler/coreSyn/CorePrep.lhs -3 +7
    M ./compiler/coreSyn/CoreTidy.lhs -12 +38
    M ./compiler/iface/BinIface.hs -5 +12
    M ./compiler/iface/IfaceSyn.lhs -21 +23
    M ./compiler/iface/IfaceType.lhs -11 +18
    M ./compiler/iface/MkIface.lhs -23 +15
    M ./compiler/iface/TcIface.lhs -35 +40
    M ./compiler/main/TidyPgm.lhs -25 +5

and

Tue Oct 26 01:17:57 PDT 2010  simonpj@microsoft.com
  * Fix initialisation of strictness in the demand analyser
  
  Previously, the demand analyser assumed that every binder 
  starts off with no strictness info.  But now that we are
  preserving strictness on nesting bindings in interface files,
  that assumption is no longer correct, because an inlined function
  might have a nested binding with strictness set.
  
  So we need to know when we are in the initial sweep, so that we can
  set the strictness to 'bottom'. 
  
  See Note [Initialising strictness]

    M ./compiler/stranal/DmdAnal.lhs -112 +151

These are bigger changes, and there's a small change to the format of interface files too, so I'm not certain about merging. Roman: is this a big deal for 'vector'?

Simon

comment:2 Changed 9 years ago by rl

Thanks, Simon! Both vector and DPH are littered with local INLINE functions. Occasionally, they grow big enough for the pragma to matter. I noticed the bug while investigating an issue that occured in the wild. I would be very much in favour of merging this if at all possible, especially because this one is difficult to spot and very hard to work around in the library.

comment:3 Changed 9 years ago by igloo

Merged this one:

Mon Oct 25 16:26:22 BST 2010  simonpj@microsoft.com                   
  * Do not (ever) use substExprSC in the simplifier

comment:4 Changed 9 years ago by igloo

Status: newmerge

We'll merge the others for 7.0.2.

comment:5 Changed 9 years ago by igloo

Milestone: 7.0.2

comment:6 Changed 9 years ago by rl

Alas, it seems that local unfoldings need more work as the patch seems to cause significant compile time increases and sometimes even to performance regressions with vector code. The reason for this is quite simple. Suppose we have this in module Foo:

{-# INLINE foo #-}
foo ... = let {-# INLINE [0] local #-}
              local = <unoptimised rhs>
          in
          <foo rhs>

We now inline this into bar (in a different module), turning this:

bar = ... foo ...

into this:

bar = let {-# INLINE [0] local #-}
          [Tmpl = <unoptimised rhs>]
          local = <unoptimised rhs>]
      in
      ... <foo rhs> ...

Now, GHC will merrily optimise both <foo rhs> and local's rhs but it will not optimise local's unfolding before it is inlined. Thus, right before phase 0 we will have:

bar = let {-# INLINE [0] local #-}
          [Tmpl = <unoptimised rhs>]
          local = <optimised rhs>]
      in
      ... local ...

In phase 0, we will throw away local's optimised rhs, inline the unoptimised unfolding and optimise that. In effect, we are optimising local twice at every call site, leading to increased compile times. Moreover, we only have one phase in which to optimise local's unfolding after inlining it so might produce worse code than before.

I'm not sure what to do here. Local INLINE functions should probably be treated specially somehow. Perhaps their unfoldings ought to be thrown away and then rerecorded right before they can be inlined.

comment:7 Changed 9 years ago by simonpj

I'm really not sure what behaviour you want here. When you write {-# INLINE foo #-} you mean "when you see code using foo, behave as if you'd seen that code using foo's RHS instead". So

bar = ...foo...

behaves as if you had written

bar = ...(let {-# INLINE [0] local #-}
            [Tmpl = <unoptimised rhs>]
            local = <unoptimised rhs>]
          in
          ... <foo rhs> ...) ...

I assume this is what you wanted.

Now suppose you wrote that in the first place. Well then, the {-# INLINE[0] local #-} says "in phase 0 please inline local, replacing it with the original (unoptimised) RHS. And that's exactly what happens.

Without understanding more clearly what you are trying to achieve I don't think I can help much more. Maybe an example showing how you are exploiting these nested inlinings?

Simon

comment:8 in reply to:  7 ; Changed 9 years ago by rl

Replying to simonpj:

Now suppose you wrote that in the first place. Well then, the {-# INLINE[0] local #-} says "in phase 0 please inline local, replacing it with the original (unoptimised) RHS. And that's exactly what happens.

Actually, didn't we say {-# INLINE[0] local #-} means: "in phase 0 please inline local, replacing it with the RHS it would have right before phase 0"? I was under the impression that GHC would inline into unfoldings as long as it didn't affect phasing. Has that changed?

Anyway, the biggest problem is that local INLINE functions are optimised twice (the rhs and the unfolding after it's been inlined) and usually the rhs is just thrown away so it's completely wasted work. For DPH/vector programs, this leads to significantly longer compile times. And compilation of such programs is quite slow even without this problem.

I'm not sure if there is a reason for local INLINE functions to ever have an unfolding that is different from their rhs. I can't think of any situation where this makes a difference semantically.

Without understanding more clearly what you are trying to achieve I don't think I can help much more. Maybe an example showing how you are exploiting these nested inlinings?

Here is an example from vector:

-- | Map a monadic function over a 'Stream'
mapM :: Monad m => (a -> m b) -> Stream m a -> Stream m b
{-# INLINE [1] mapM #-}
mapM f (Stream step s n) = Stream step' s n
  where
    {-# INLINE [0] step' #-}
    step' s = do
                r <- step s
                case r of
                  Yield x s' -> liftM  (`Yield` s') (f x)
                  Skip    s' -> return (Skip    s')
                  Done       -> return Done

We want to inline step' late because doing it early can introduce join points which affect other optimisations (well, perhaps not this particular step' but certainly more complex ones). But we want to make sure that it does get inlined eventually. So we say INLINE [0].

comment:9 in reply to:  8 ; Changed 9 years ago by simonpj

Replying to rl:

Actually, didn't we say {-# INLINE[0] local #-} means: "in phase 0 please inline local, replacing it with the RHS it would have right before phase 0"?

No. We specifically did not want that, because it's so unclear what it means to say "the RHS it would have right before phase 0". The idea with INLINE is that you know exactly what will get inlined, namely the source code of the definition. That was the big change with the new INLINE story.

I was under the impression that GHC would inline into unfoldings as long as it didn't affect phasing. Has that changed?

Only invisibly. If you write INLINE[1], then it's OK for GHC to optimise the rhs-to-be-inlined with phase=1 (only). Not phase=2, nor phase=0, because both would change the above semantics of INLINE. But while its OK to optimise the rhs-to-be-inlined with phase=1, it's not necessary, and it complicated things, so I took it out again.

Anyway, the biggest problem is that local INLINE functions are optimised twice (the rhs and the unfolding after it's been inlined) and usually the rhs is just thrown away so it's completely wasted work. For DPH/vector programs, this leads to significantly longer compile times.

That seems to be an inherent problem with keeping a template to inline separate from the main body of the function.

Let's review what you are trying to acheive. In your example, the key point is that in compositions of map all the step functions get fused together.

mapM f (mapM g (Stream step1 s1 n1)
= mapM f (Stream step2 s n)
  where
    step2 s = ...step1...g...
= Stream step3 s n 
  where
    step2 s = ...step1...g...
    step3 s = ...step2...g...

Now you want step2 to inline into step3 so that they can fuse, and so that f and g meet up and can fuse. (However you don't want step3 to inline into (Stream step3 s n) because Stream is a data constructor.)

Are you saying that you don't want this fusion of the step functions to take place until phase 0? If so, a NOINLINE pragma would do. You don't need to say INLINE, because there's only one occurrence of each step function anyway.

Let's suppose there were multiple occurrences of step. You probably don't mean "optimise as if there was no pragma, and then inline whatever you have in phase 0". Reason: suppose f or g are gigantic. Then the step functions would be gigantic, so duplicating them willy-nilly would make gigantic code.

But you don't want to inhibit the inlining of f,g into the step functions, because that's how f and g fuse together.

What goes wrong if you just don't put a pragma on step?

Simon

comment:10 in reply to:  9 ; Changed 9 years ago by rl

Replying to simonpj:

I was under the impression that GHC would inline into unfoldings as long as it didn't affect phasing. Has that changed?

Only invisibly. If you write INLINE[1], then it's OK for GHC to optimise the rhs-to-be-inlined with phase=1 (only). Not phase=2, nor phase=0, because both would change the above semantics of INLINE. But while its OK to optimise the rhs-to-be-inlined with phase=1, it's not necessary, and it complicated things, so I took it out again.

The previous behaviour was what I meant with "the rhs it would have right before phase 0". So this has changed. It's a pity, actually, I found the previous semantics much cleaner.

Anyway, the biggest problem is that local INLINE functions are optimised twice (the rhs and the unfolding after it's been inlined) and usually the rhs is just thrown away so it's completely wasted work. For DPH/vector programs, this leads to significantly longer compile times.

That seems to be an inherent problem with keeping a template to inline separate from the main body of the function.

Yes. And as I said, doing so makes sense for global functions but IMO not for local ones.

Let's review what you are trying to acheive. In your example, the key point is that in compositions of map all the step functions get fused together.

mapM f (mapM g (Stream step1 s1 n1)
= mapM f (Stream step2 s n)
  where
    step2 s = ...step1...g...
= Stream step3 s n 
  where
    step2 s = ...step1...g...
    step3 s = ...step2...g...

Now you want step2 to inline into step3 so that they can fuse, and so that f and g meet up and can fuse. (However you don't want step3 to inline into (Stream step3 s n) because Stream is a data constructor.)

A better example is this:

foldlM f z (mapM g (Stream step1 s1 n1))
  = let step2 s = case step1 s of
                    Yield x s' -> Yield (g x) s'
                    Skip s'    -> Skip s'
                    Done       -> Done

        loop x s = case step2 s of
                     Yield y s' -> loop (f x y) s'
                     Skip s'    -> loop x s'
                     Done       -> x
    in loop s1 z               

Here, I don't really care (I think...) whether step1 gets inlined into step2 before phase 0 or not. In fact, I would probably prefer step1 to be inlined. But I absolutely don't want step2 to be inlined into the loop before phase 0 because it might introduce unwanted join points there if it was more complex. I could delay inlining step2 at the call site but then I'd still need an INLINE pragma on it because it might not get inlined at all otherwise.

Are you saying that you don't want this fusion of the step functions to take place until phase 0? If so, a NOINLINE pragma would do. You don't need to say INLINE, because there's only one occurrence of each step function anyway.

NOINLINE doesn't work because step might get floated out and then not be inlined. That's what prompted the original bug report, in fact. It must have an INLINE pragma.

Let's suppose there were multiple occurrences of step. You probably don't mean "optimise as if there was no pragma, and then inline whatever you have in phase 0". Reason: suppose f or g are gigantic. Then the step functions would be gigantic, so duplicating them willy-nilly would make gigantic code.

I don't understand how not inlining until phase 0 would prevent this from happening. I think I do mean "optimise as if there was no pragma, and then inline whatever you have in phase 0".

What goes wrong if you just don't put a pragma on step?

It gets inlined into the loop which sometimes (rarely, but sometimes) leads to significantly worse code.

comment:11 in reply to:  10 ; Changed 9 years ago by simonpj

Replying to rl:

The previous behaviour was what I meant with "the rhs it would have right before phase 0". So this has changed. It's a pity, actually, I found the previous semantics much cleaner.

Lets just deal with this first. When I said that the change was "invisible" I meant that it had no observable effect. The only difference (I believe) is that if you say INLINE[n] then the rhs-to-be-inlined gets optimised with phase=n (only). I suppose that potentially might save a bit of work at the inline site. Is that what you mean?

I don't understand what you mean by the previous semantics being much cleaner. The semantics hasn't changed, I believe. Do you think it has?

comment:12 in reply to:  11 Changed 9 years ago by rl

Replying to simonpj:

Replying to rl:

Lets just deal with this first. When I said that the change was "invisible" I meant that it had no observable effect. The only difference (I believe) is that if you say INLINE[n] then the rhs-to-be-inlined gets optimised with phase=n (only). I suppose that potentially might save a bit of work at the inline site. Is that what you mean?

Probably. I'm not sure. I don't think this is too important.

I don't understand what you mean by the previous semantics being much cleaner. The semantics hasn't changed, I believe. Do you think it has?

I think I just liked reading the Core (including unfoldings) generated previously more as compared to now. By "semantics" I just mean the behaviour of the various transformations, not the semantics of the program.

comment:13 Changed 9 years ago by igloo

Status: mergenew

comment:14 in reply to:  10 ; Changed 9 years ago by simonpj

Replying to rl:

NOINLINE doesn't work because step might get floated out and then not be inlined. That's what prompted the original bug report, in fact. It must have an INLINE pragma.

I don't understand that. In your example there is precisely one call to step2, by desgn, so it'll be inlined at its (single) call site regardless of whether it has been floated.

I don't understand how not inlining until phase 0 would prevent this from happening. I think I do mean "optimise as if there was no pragma, and then inline whatever you have in phase 0".

Suppose we had no INLINE pragma on step. Then we'd get this:

mapM <\x.BIG> (Stream step s n)
= Steam step1 s n
  where
    step1 = ...<\x.BIG>...

On the other hand, if step1 does have an INLINE pragma you'd get

mapM <\x.BIG> (Stream step s n)
= Steam step1 s n
  where
    f = <\x.BIG>
    {-# INLINE[0] step1 #-}
      [Tmpl = ...f....]
    step1 = ...f...

In order to preserve the template in source form we must not inline f into the template, so we let-bind it instead. Now we can inline step1 at multiple call sites without duplicating <\x.BIG>. Does that explain the difference? Certainly, the person writing the INLINE pragma on step is saying "duplicate the code for step, but he is absolutely not saying "duplicate the code for whatever function mapM is applied to as well". That's what the INLINE pragma achieves.

Your original complaint was (a) compilation is slow, and (b) you generate worse code. I can see why (a) might happen, but I still don't understand (b) at all. Nor do I understand why (a) is worse than before.

For (a), why does GHC optimise the original body? It's in case step is not inlined at all; then we need something at all. This happens precisely in the mapM case, where step is returned as an argument to the Stream data constructor.

comment:15 in reply to:  14 ; Changed 9 years ago by rl

Replying to simonpj:

I don't understand that. In your example there is precisely one call to step2, by desgn, so it'll be inlined at its (single) call site regardless of whether it has been floated.

I don't have the original example here. It was floated out and then not inlined. There is no guarantee that step will only be used once anyway.

Certainly, the person writing the INLINE pragma on step is saying "duplicate the code for step, but he is absolutely not saying "duplicate the code for whatever function mapM is applied to as well". That's what the INLINE pragma achieves.

I see. Yes, that is a good point. I wouldn't be too concerned even if that happened occasionally in vector code but I understand that this might be bad in general.

Your original complaint was (a) compilation is slow, and (b) you generate worse code. I can see why (a) might happen, but I still don't understand (b) at all. Nor do I understand why (a) is worse than before.

As to (b), I haven't had time to investigate. The slowdown is minimal (a couple of % at worst) but seems quite consistent across the vector benchmarks. Maybe it's something different altogether but I doubt it. I'll find out over the weekend.

(a) is caused by recording unfoldings for local functions. My example from before explains why it happens. We first optimise local's rhs. Then, we inline local, i.e., it's unoptimised rhs (the unfolding) and optimise that again. Before, we would inline the already optimised rhs. In effect, we are now optimising local's rhs twice where before, we only optimised it once.

For (a), why does GHC optimise the original body? It's in case step is not inlined at all; then we need something at all. This happens precisely in the mapM case, where step is returned as an argument to the Stream data constructor.

I understand that (although this never happens in vector code - if it does, it's a bug in the library). But usually, we know that local functions won't be used in this way because they will certainly be inlined once we reach the right simplifier phase. This is something that we can't know for global functions.

Perhaps we shouldn't optimise local functions with INLINE[n] until phase n? That way, we avoid duplicating work if they get inlined in that phase.

comment:16 in reply to:  15 ; Changed 9 years ago by simonpj

Perhaps we shouldn't optimise local functions with INLINE[n] until phase n? That way, we avoid duplicating work if they get inlined in that phase.

I suppose that would be possible, but it would be very odd, because they'd miss out on rules and inlinings that only apply earlier than phase n. So you'd lose the claim that you get just as good optimisation of the function with INLINE(n) as you do without.

I'm still don't really understand your application for all this complexity, and without understanding it it's hard to suggest solutions. You seem to be doing essentially no fusion etc until phase 0. So why not write this?

mapM :: Monad m => (a -> m b) -> Stream m a -> Stream m b
{-# INLINE [1] mapM #-}
mapM f (Stream step s n) = Stream (step' step f) s n

{-# INLINE [0] step' #-}
step' step f s 
    = do        r <- step s
                case r of
                  Yield x s' -> liftM  (`Yield` s') (f x)
                  Skip    s' -> return (Skip    s')
                  Done       -> return Done

I'm still puzzled why the slow down is so great. After all, the template rhs is not acually optimised at all, and presumably it's quite small. Nothing happens to it until it is inlined. So it's not as if two large term are each undergoing extensive transformation.

comment:17 in reply to:  16 Changed 9 years ago by rl

Replying to simonpj:

I'm still don't really understand your application for all this complexity, and without understanding it it's hard to suggest solutions. You seem to be doing essentially no fusion etc until phase 0. So why not write this?

It's much less convenient. But I'll try that and see if it affects compilation times.

The setup isn't all that complex. In phase 2, we inline vector operations and apply rules like stream/unstream. In phase 1, we inline stream operations and let the simplifier work on the loops a bit. Finally, in phase 0 we inline the step functions and the simplifier removes all Step values. Previously, we inline step functions in phase 1, too, but inlining them in phase 0 sometimes leads to better code and speeds up compilation.

Anyway, let's drop this for now. I don't really have time to benchmark the compiler at the moment and I understand that it probably doesn't make sense to do anything about this without hard numbers.

comment:18 Changed 9 years ago by igloo

All three of these now merged:

Mon Oct 25 08:26:22 PDT 2010  simonpj@microsoft.com
  * Do not (ever) use substExprSC in the simplifier

Mon Oct 25 08:28:17 PDT 2010  simonpj@microsoft.com
  * Serialise nested unfoldings across module boundaries

Tue Oct 26 01:17:57 PDT 2010  simonpj@microsoft.com
  * Fix initialisation of strictness in the demand analyser

comment:19 Changed 9 years ago by igloo

Milestone: 7.0.27.2.1

comment:20 Changed 8 years ago by igloo

Milestone: 7.2.17.4.1

comment:21 Changed 8 years ago by igloo

Milestone: 7.4.17.6.1
Priority: normallow

comment:22 Changed 7 years ago by igloo

Milestone: 7.6.17.6.2

comment:23 Changed 5 years ago by thoughtpolice

Milestone: 7.6.27.10.1

Moving to 7.10.1.

comment:24 Changed 5 years ago by thomie

difficulty: Unknown
Status: newinfoneeded

What is the status of this ticket?

comment:25 Changed 5 years ago by simonpj

I believe it's fixed; or at least some things were done, and the subsequent discussion is inconclusive. I'm inclined to close it if no one objects.

Simon

comment:26 in reply to:  25 Changed 5 years ago by thomie

Milestone: 7.10.1
Resolution: fixed
Status: infoneededclosed
Note: See TracTickets for help on using tickets.