Opened 3 years ago

Closed 2 years ago

#13379 closed bug (fixed)

Space leak / quadratic behavior when inlining

Reported by: jberryman Owned by: dfeuer
Priority: normal Milestone: 8.2.2
Component: Compiler Version: 8.0.1
Keywords: Inlining Cc: dfeuer, MikolajKonarski, George
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case: perf/compile/T13379
Blocked By: Blocking:
Related Tickets: #13586, #14281 Differential Rev(s):
Wiki Page:

Description

Compiling the attached example yields quadratic behavior and a big space leak (eventually the OOM killer kills it) during inlining. Tested on both GHC 7.10.3 and 8.0.1.

I compiled with -fforce-recomp --make -O2 -ddump-inlinings Main.hs

In cases where the process was able to run for a while, I could see that each chunk of inlinings was getting slower roughly linearly (every minute or so you can hit enter and watch the line break scroll up your terminal a little more slowly than the last time). I assume the slowdown is due to GC over a heap that keeps growing because of a leak, but maybe the inlining algorithm itself is quadratic.

Apologies if this is a dupe; I had trouble searching for similar issues. (I did find #13316 but don't understand if this could be related)

Attachments (5)

Main.hs (47.6 KB) - added by jberryman 3 years ago.
Main.hs
ghc-wrap.hy.7.10.pdf (4.6 KB) - added by jberryman 3 years ago.
ghc-wrap.h.7.10.pdf (4.8 KB) - added by jberryman 3 years ago.
ghc-wrap.hy.8.0.pdf (15.3 KB) - added by jberryman 3 years ago.
ghc-wrap.h.8.0.pdf (10.4 KB) - added by jberryman 3 years ago.


Change History (39)

Changed 3 years ago by jberryman

Attachment: Main.hs added

Main.hs

comment:1 Changed 3 years ago by rwbarton

Even the space usage is quadratic, so something does seem amiss here. I recommend cutting the example down to at most 100 lines if you want to compile it in a reasonable amount of time.

comment:2 Changed 3 years ago by dfeuer

I'm not too surprised this is a difficult case. >> is infixl. So we start with (((f bs >> f bs) >> f bs) >> f bs) >> f bs and we end up (eventually) re-associating everything the other way. How did an example like this arise in practice?
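
For concreteness, here is a minimal stand-in (not the attached Main.hs; f and bs are assumed definitions) showing the left-nested parse that results from (>>) being infixl 1:

    f :: String -> IO ()
    f = putStrLn
    {-# INLINE f #-}

    bs :: String
    bs = "bs"

    main :: IO ()
    main = f bs >> f bs >> f bs >> f bs >> f bs
    -- parses as: (((f bs >> f bs) >> f bs) >> f bs) >> f bs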

comment:3 Changed 3 years ago by jberryman

How did an example like this arise in practice?

In attempting to space profile some code where there seemed to be a constant small amount of allocation at each call site I tried creating a file like this one. I filed the bug because it seemed likely that this was evidence of performance issues in more realistic code.

I'm not too surprised this is a difficult case ... we end up (eventually) re-associating everything the other way

Ah, interesting! This does seem to be related to (>>) and not the code I marked INLINE in the Main file. The code compiles quickly with:

import Prelude hiding ((>>))
import qualified Prelude as P

infixr 1 >>
{-# NOINLINE (>>) #-}
(>>) = (P.>>)

The NOINLINE is necessary, and we see f get inlined everywhere.

But my naive response, knowing nothing about how to implement an inliner, and looking at a 191kb core file with a bunch of nested:

    of _ [Occ=Dead] { (# ipv, ipv1 #) ->
    case hPutStr2 stdout ipv1 True ipv
    of _ [Occ=Dead] { (# ipv2, ipv3 #) ->
    case hPutStr2 stdout ipv1 True ipv2
    of _ [Occ=Dead] { (# ipv4, ipv5 #) ->
    case hPutStr2 stdout ipv1 True ipv4
    of _ [Occ=Dead] { (# ipv6, ipv7 #) ->
    case hPutStr2 stdout ipv1 True ipv6
    ...

...is that I could do this inlining case manually in milliseconds (vs ~40 seconds for whatever number of lines I was just testing with) with vim, sed, a bash script, etc., and without gigabytes of memory for my working set. And so I am inclined to believe that there is a space leak here. Who knows if real code will benefit from fixing it though.

comment:4 Changed 3 years ago by jberryman

These seem to be the slow inlinings which we don't see in the infixr variation I posted:

Inlining done: System.IO.putStrLn1
Inlining done: GHC.IO.Handle.Text.hPutStrLn1
Inlining done: GHC.Base.thenIO1

comment:5 Changed 3 years ago by dfeuer

Cc: dfeuer added

If the NOINLINE is necessary, then I think I must be quite wrong about the source of the problem.

comment:6 Changed 3 years ago by mpickering

I think the problem is that >> and then thenIO get inlined, which in this case causes the compiler to generate heaps of code to simplify. I tried using Identity and it compiled instantly.

It seems plausible that this shouldn't take a long time, but more investigation into what exactly is slow is needed.
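
A rough sketch of the kind of Identity variant meant here (the exact code tried is an assumption):

    import Data.Functor.Identity (Identity, runIdentity)

    f :: String -> Identity ()
    f _ = pure ()
    {-# INLINE f #-}

    bs :: String
    bs = "bs"

    -- With no thenIO to inline, the long (>>) chain simplifies quickly.
    main :: IO ()
    main = print . runIdentity $ f bs >> f bs >> f bs >> f bs >> f bs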

comment:7 Changed 3 years ago by jberryman

Still looking into this slowly as I have time. It looks like this is a regression between ghc-7.10 and 8.0.1, at least. Below, "# 25 lines", "# 50 lines", etc. refer to that many lines of

f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >>

With ghc 7.10 this looks slow but linear:

jberryman /tmp/again » time ghc-7.10.3  -fforce-recomp -O2 Main.hs # 25 lines
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
ghc-7.10.3 -fforce-recomp -O2 Main.hs  0.90s user 0.09s system 97% cpu 1.018 total
jberryman /tmp/again » time ghc-7.10.3  -fforce-recomp -O2 Main.hs # 50 lines
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
ghc-7.10.3 -fforce-recomp -O2 Main.hs  1.59s user 0.13s system 98% cpu 1.744 total
jberryman /tmp/again » time ghc-7.10.3  -fforce-recomp -O2 Main.hs # 100 lines
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
ghc-7.10.3 -fforce-recomp -O2 Main.hs  3.87s user 0.25s system 99% cpu 4.146 total

And in 8.0 we have a regression and what looks like worse asymptotics:

jberryman /tmp/again » time ghc  -fforce-recomp -O2 Main.hs # 25 lines  
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
ghc -fforce-recomp -O2 Main.hs  2.41s user 0.15s system 99% cpu 2.583 total
jberryman /tmp/again » time ghc  -fforce-recomp -O2 Main.hs # 50 lines
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
ghc -fforce-recomp -O2 Main.hs  7.36s user 0.21s system 99% cpu 7.586 total
jberryman /tmp/again » time ghc  -fforce-recomp -O2 Main.hs # 100 lines
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
ghc -fforce-recomp -O2 Main.hs  27.99s user 0.45s system 99% cpu 28.463 total
jberryman /tmp/again » time ghc  -fforce-recomp -O2 Main.hs # 200 lines
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
ghc -fforce-recomp -O2 Main.hs  122.58s user 6.40s system 84% cpu 2:33.50 total
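
For reference, a hypothetical generator for test modules of this shape (the exact contents of the attached Main.hs are not reproduced here; the names are stand-ins):

    -- Gen.hs: writes a Main.hs with n lines of "f bs >> ... >>" followed by
    -- a final "return ()". Purely illustrative.
    module Main (main) where

    import System.Environment (getArgs)

    genMain :: Int -> String
    genMain n = unlines $
      [ "module Main (main) where"
      , "bs :: String"
      , "bs = \"hello\""
      , "f :: String -> IO ()"
      , "f = putStrLn"
      , "{-# INLINE f #-}"
      , "main :: IO ()"
      , "main ="
      ]
      ++ replicate n "  f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >> f bs >>"
      ++ ["  return ()"]

    main :: IO ()
    main = do
      [n] <- getArgs
      writeFile "Main.hs" (genMain (read n))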

Also attached profiles using the ghc library via https://gist.github.com/nominolo/d18efaabf3e074d03969

Changed 3 years ago by jberryman

Attachment: ghc-wrap.hy.7.10.pdf added

Changed 3 years ago by jberryman

Attachment: ghc-wrap.h.7.10.pdf added

Changed 3 years ago by jberryman

Attachment: ghc-wrap.hy.8.0.pdf added

Changed 3 years ago by jberryman

Attachment: ghc-wrap.h.8.0.pdf added

comment:8 Changed 2 years ago by MikolajKonarski

Cc: MikolajKonarski added

comment:9 Changed 2 years ago by dfeuer

I'm working on bisecting this now. Unfortunately, the version-to-time function is far from monotone, so there may well be multiple issues involved.

comment:10 Changed 2 years ago by dfeuer

All right. I believe I've pinned this down to 45d9a15c4b85a2ed89579106bdafd84accf2cb39 ("Fix a huge space leak in the mighty Simplifier"). For around 100 lines, I get the following:

  • Before: 2.5GB allocated in 4.3s
  • After: 28.2GB allocated in 19s

That's over ten times the allocation and over four times the time! I'll start taking a look at why that may be.

Last edited 2 years ago by dfeuer

comment:11 Changed 2 years ago by simonpj

Keywords: Inlining added

comment:12 Changed 2 years ago by MikolajKonarski

comment:13 Changed 2 years ago by dfeuer

The first weird thing to notice about this problem is that 45d9a15c4b85a2ed89579106bdafd84accf2cb39 is all about LHSes of RULES, but there aren't any rules in the test module. So something seems to have gone wrong with one of the "knock-on" changes.

Compiling before and after with -v3 indicates that the simplifier iteration immediately after float out takes much longer, and produces somewhat more coercions. That iteration takes many times longer than any other. Most of the coercions, before and after, go away after that iteration, stabilizing quickly at 19. Surprisingly, -dverbose-core2core looks exactly the same before and after that commit, suggesting something went funny with the size calculations, although I can't see how.

In HEAD, the number of coercions never goes up above 19. Furthermore, the time problem has moved. Now, the long iteration is the very first one after desugaring! Ugh. I haven't yet tried to find where the time shift took place. But the fact that there was one makes me suspect that the new implementation could have accidentally strictified something it shouldn't have. I don't understand what's going on well enough to know what. Perhaps the fact that a couple of pure functions have turned monadic?

comment:14 Changed 2 years ago by MikolajKonarski

In HEAD [...] the time problem has moved. Now, the long iteration is the very first one after desugaring

May it have anything to do with the "early inlining" changes?

comment:15 in reply to:  14 Changed 2 years ago by dfeuer

Replying to MikolajKonarski:

In HEAD [...] the time problem has moved. Now, the long iteration is the very first one after desugaring

May it have anything to do with the "early inlining" changes?

Simon thinks so. And he thinks this whole problem has to do with simplifying the same expressions many times. Simon, have you worked out how to fix this yet?

comment:16 Changed 2 years ago by simonpj

Yes, just validating

comment:17 Changed 2 years ago by Simon Peyton Jones <simonpj@…>

In a1b753e8/ghc:

Cure exponential behaviour in the simplifier

This patch nails a Bad Bug exposed in Trac #13379. Roughly,
a deeply-nested application like
   f (f (f ....) ) )
could make the simplifier go exponential -- without producing
an exponential-sized result!

The reason was that we
  - simplified a (big) function argument
  - then decided to inline the function
  - then preInlineUnconditionally the argument
  - and then re-simplified the big argument

And if the "big argument" itself had a similar structure
things could get very bad.

Once I'd understood, it was easy to fix:

* See Note [Avoiding exponential behaviour] for an overview

* The key change is that Simplify.simplLam now has a case for
  (isSimplified dup). This is what removes the perf bug.

* But I also made simplCast more parsimonious about simplifying,
  avoiding doing so when the coercion is Refl

* And similarly I now try to avoid simplifying arguments
  where possible before applying rules.
  See Note [Trying rewrite rules]

The latter two points tackle common cases, and in those cases make the
simplifier take fewer iterations.
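
To make the blow-up concrete, here is a toy cost model (an editorial sketch, not GHC code) comparing re-simplifying an already-simplified argument against skipping it, for a nest of applications:

    -- Count "units of simplifier work" for a spine of nested applications.
    data Expr = Leaf | App Expr Expr

    -- Before the fix: inlining re-simplifies the already-simplified argument,
    -- so the argument's work is paid twice at every level of nesting.
    workBefore :: Expr -> Int
    workBefore Leaf        = 1
    workBefore (App f arg) = workBefore f + 2 * workBefore arg

    -- After the fix: the (isSimplified dup) case skips the second pass.
    workAfter :: Expr -> Int
    workAfter Leaf        = 1
    workAfter (App f arg) = workAfter f + workAfter arg

    nest :: Int -> Expr
    nest 0 = Leaf
    nest n = App Leaf (nest (n - 1))

    -- workBefore grows exponentially in the nesting depth; workAfter is linear.
    main :: IO ()
    main = mapM_ (\n -> print (n, workBefore (nest n), workAfter (nest n)))
                 [1, 5, 10, 20]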

comment:18 Changed 2 years ago by Simon Peyton Jones <simonpj@…>

In 29d88ee1/ghc:

Be a bit more eager to inline in a strict context

If we see f (g x), and f is strict, we want to be a bit more eager to
inline g, because it may well expose an eval (on x perhaps) that can
be eliminated or shared.

I saw this in nofib boyer2, function RewriteFuns.onewayunify1.  It
showed up as a consequence of the preceding patch that makes the
simplifier do less work (Trac #13379).  We had

   f d (g x)

where f was a class-op. Previously we simplified both d and
(g x) with a RuleArgCtxt (making g a bit more eager to inline).
But now we simplify only d that way, then fire the rule, and
only then simplify (g x).  Firing the rule produces a strict
function, so we want a strict function to encourage
inlining a bit more.
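
A minimal illustration of the f (g x) shape (an assumed example, not the actual nofib RewriteFuns.onewayunify1 code):

    -- h applies a strict function to (g mx); inlining g exposes the case on
    -- mx, which the simplifier can then eliminate or share with other evals.
    g :: Maybe Int -> Int
    g mx = case mx of
      Nothing -> 0
      Just x  -> x + 1

    f :: Int -> Int          -- strict in its argument
    f y = y * 2

    h :: Maybe Int -> Int
    h mx = f (g mx)

    main :: IO ()
    main = print (h (Just 20))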

comment:19 Changed 2 years ago by simonpj

Status: new → merge
Test Case: perf/compile/T13379

I claim this is fixed. The second patch fixes a problem that the first one exposed; so merge them both.

Would someone like to verify it's fixed in real live contexts? The Main.hs attachment is ok -- it's in the regression tests now.

I forgot to mention, but the first (main) patch produces some perf improvements in tests/perf/compiler:

T9020: 14% reduction in compiler allocation
T12425: 5% reduction in compiler allocation
Last edited 2 years ago by simonpj

comment:20 Changed 2 years ago by bgamari

Milestone: 8.2.1

comment:21 Changed 2 years ago by bgamari

Resolution: fixed
Status: merge → closed

comment:22 Changed 2 years ago by dfeuer

Simon, there are still a couple of regressions from the combination of your two patches:

nofib/time/cryptarithm1   time increases 4.87% from 0.513s to 0.538s
nofib/size/scs            code size increases  5.28% from 1224890 bytes to 1289570 bytes

comment:23 Changed 2 years ago by George

Cc: George added

comment:24 Changed 2 years ago by simonpj

Yes... might you (or anyone else) investigate those two regressions? It's not that they are unacceptable, but sometimes they indicate some low-hanging fruit. Thanks

Simon

comment:25 Changed 2 years ago by bgamari

Resolution: fixed
Status: closed → new

Reopening to ensure we don't lose track of these.

Last edited 2 years ago by bgamari

comment:26 in reply to:  25 Changed 2 years ago by dfeuer

Replying to bgamari:

Reopening to ensure we don't lose track of these.

Would it be better to close this and open a new ticket for the regressions? The original bug is fixed, right?

comment:27 Changed 2 years ago by bgamari

Indeed it probably would be better.

comment:28 Changed 2 years ago by jberryman

In case it's interesting or helpful, this same issue appears to have surfaced independently here: https://www.reddit.com/r/haskell/comments/6bojlj/inlineing_a_case_study/ (see my comment at the bottom of the thread).

It also might be interesting to look at the conclusions being drawn and questions being raised in the original blog post and in that discussion thread.

comment:29 Changed 2 years ago by jberryman

I forgot to add: I haven't checked out a build with the fix here to see if it fixes this other performance issue, but their test case seems to be very similar to mine, and I assume it suffers from the same issue.

comment:30 Changed 2 years ago by simonpj

As far as I can see it's not the same issue as this ticket.

In the blog post, as I read it, there is a function that

  • Is fairly small in source code terms
  • Calls several moderately large combinators
  • But finally optimises to something small again

This function is called quite a bit, so the cost of optimising it is paid many times, rather than once and for all at the definition site. Nothing exponential happens.

Thus motivated, the author asks for a pragma to say "please inline the optimised code for this function". Maybe. The gain would be in compilation time, not in the quality of the resulting code. Is that worth another pragma? Maybe, but it does not seem compelling to me.

comment:31 Changed 2 years ago by jberryman

My thought was that the issue (long compilation times) was not actually primarily due to the linear extra simplifying of each unoptimized call site, but rather to the bug here. At least this seems consistent with the test case phadej posted to me (runtime quadratic in occurrences of "char7 <>"). I don't have any other context for the issue though.
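
For reference, a sketch of the shape of such a test case (an assumption about the code in question, not the actual snippet):

    import Data.ByteString.Builder (Builder, char7, hPutBuilder)
    import Data.Monoid ((<>))
    import System.IO (stdout)

    -- A short chain of (char7 _ <>) occurrences; the real test case would
    -- repeat this many more times.
    chain :: Builder
    chain = char7 'a' <> char7 'b' <> char7 'c' <> char7 'd' <> char7 'e'

    main :: IO ()
    main = hPutBuilder stdout chain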

comment:32 Changed 2 years ago by bgamari

Milestone: 8.2.1 → 8.2.2

dfeuer, could you open another ticket to track the remaining bits of this as suggested in comment:26 and close this?

comment:33 Changed 2 years ago by bgamari

Owner: set to dfeuer

comment:34 Changed 2 years ago by dfeuer

Resolution: fixed
Status: new → closed

As requested, I've opened #14281 for the regressions and am closing this ticket.
