Opened 4 years ago

Closed 3 years ago

Last modified 2 years ago

#11095 closed bug (fixed)

-O0 -g slows GHC down on list literals (compared to -O0 without -g)

Reported by: slyfox Owned by:
Priority: normal Milestone: 8.2.1
Component: Compiler Version: 7.10.2
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): phab:D3001
Wiki Page:

Description

Tried to buld syb-0.6 testsuite with '-O0 -g' flags today and noticed compiler hung for a couple of hours.

Distilled example from syb-0.6 testsuite (exact example attached):

{-# LANGUAGE NoImplicitPrelude #-}
{-# OPTIONS_GHC -O0 #-}
module Bug (bug) where

{-# NOINLINE f #-}
f :: a -> a
f v = v

bug :: [()]
bug = f v

v :: [()]
v =
       [(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),(),()
,(),(),(),(),() ... --- a lot of these
       ]
$ time ghc -c -fforce-recomp -O0 Bug.hs 

real    0m0.333s
user    0m0.288s
sys     0m0.042s

$ time ghc -c -fforce-recomp -O0 -g Bug.hs 

real    0m40.241s
user    0m40.021s
sys     0m0.070s

For some reason -g manages slow things down at a large factor.

Attachments (1)

Bug.hs (4.4 KB) - added by slyfox 4 years ago.
Actual file,

Download all attachments as: .zip

Change History (20)

Changed 4 years ago by slyfox

Attachment: Bug.hs added

Actual file,

comment:1 Changed 4 years ago by slyfox

Type of failure: None/UnknownCompile-time performance bug

Result on "prof" flavour of ghc-HEAD build with -auto-all enabled:

        Sun Nov 15 16:31 2015 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/slyfox/dev/git/ghc-prof/inplace/lib -hide-all-packages -c -g Bug.hs

        total time  =      304.82 secs   (304823 ticks @ 1000 us, 1 processor)
        total alloc = 286,040,926,256 bytes  (excludes profiling overheads)

COST CENTRE      MODULE       %time %alloc

containsSpan     SrcLoc        32.1   55.2
bindIO.\         GHC.Base      10.6    0.0
==               FastString     7.7   13.8
tickishContains  CoreSyn        6.5    0.0
mkTick.mkTick'   CoreUtils      6.4    6.9
eqString         GHC.Base       5.9    0.0
<=               GHC.Classes    4.4    0.0
srcSpanStartLine SrcLoc         4.4    9.2
uniq             FastString     3.8    9.2
compareInt       GHC.Classes    3.6    0.0
srcSpanFile      SrcLoc         3.0    0.0
srcSpanEndLine   SrcLoc         2.3    4.6
>=               GHC.Classes    1.5    0.0
&&               GHC.Classes    1.4    0.0
==               GHC.Classes    1.2    0.0
/=               CoreSyn        1.1    0.0

...
                   mapOL                                                          OrdList                         154083      917328    0.0    0.0    88.9   99.6
                    wrapTicks.wrap                                                CorePrep                        154085      915975    0.0    0.0    88.9   99.6
                     wrapTicks.wrapBind                                           CorePrep                        154086      915975    0.0    0.1    88.9   99.6
                      mkTick                                                      CoreUtils                       154087      915975    1.0    0.0    88.8   99.5
                       mkTick.canSplit                                            CoreUtils                       154113      915975    0.0    0.0     0.0    0.0
                        &&                                                        GHC.Classes                     154115      915975    0.0    0.0     0.0    0.0
                        tickishCanSplit                                           CoreSyn                         154114      915975    0.0    0.0     0.0    0.0
                       mkTick.mkTick'                                             CoreUtils                       154088   411894225    6.4    6.9    87.8   99.5
                        tickishContains                                           CoreSyn                         154120   821043225    6.5    0.0    78.6   92.5
                         containsSpan                                             SrcLoc                          154125   821037825   32.0   55.1    64.3   92.5
...

comment:2 Changed 4 years ago by carter

What c compiler is being invoked. There's known issues with using clang and profiling build way when there's large numbers of literals in a source file. What does gcc --version and ghc --info tell you?

comment:3 in reply to:  2 Changed 4 years ago by slyfox

Replying to carter:

What c compiler is being invoked. There's known issues with using clang and profiling build way when there's large numbers of literals in a source file. What does gcc --version and ghc --info tell you?

The build of a Bug.hs is not a -prof build. GHC was built as a "prof" flavour to pinpoint what it does.

All GHC's I tried are affected:

  • ghc-HEAD (7.11.20151115, 7.11.20151114)
  • ghc-7.10.2 (7.10.2.20151030)

ghc -v shows the slowdown is not in a C part, but in CorePrep:

$ ghc -v -O0 -g Bug.hs
...
[1 of 1] Compiling Bug              ( Bug.hs, Bug.o )
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization)
  = {terms: 2,707, types: 1,360, coercions: 0}
*** Simplifier:
Result size of Simplifier
  = {terms: 2,707, types: 1,360, coercions: 0}
*** Tidy Core:
Result size of Tidy Core
  = {terms: 2,707, types: 1,360, coercions: 0}
writeBinIface: 1 Names
writeBinIface: 3 dict entries
*** CorePrep:
Result size of CorePrep <hangs here for a minute>
...
$ LANG=C gcc --version
gcc (Gentoo 5.2.0 p1.0, pie-0.6.3) 5.2.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

comment:4 Changed 3 years ago by niteria

Differential Rev(s): phab:D3001

This still reproduces on GHC HEAD.

@slyfox: Thanks for the profile, it was helpful to know where to look.

phab:D3001 optimizes the hot tickishContains function a bit.

I've experimented with this a bit and here's what I found:

  • 45 () literals generates 20929 mkTick' calls and 1371 mkTick calls.
  • 90 () literals generates 143074 mkTick' calls and 4746 mkTick calls.
  • 135 () literals generates 457594 mkTick' calls and 10146 mkTick calls.

In the 45 case almost all of the calls come from wrapTicks, and out of these most of them (~20k) come from:

wrapBind t (NonRec binder rhs) = NonRec binder (mkTick t rhs)

comment:5 Changed 3 years ago by Ben Gamari <ben@…>

In 532c6ad/ghc:

Make tickishContains faster

This just reorders some inequality checks to make the common case
cheaper.

The results are quite dramatic for #11095, but that's probably because
something else is causing it to do too much work.

Before (full https://phabricator.haskell.org/P136):
```
  13,589,495,832 bytes allocated in the heap
```

After (full https://phabricator.haskell.org/P137):
```
   7,885,575,872 bytes allocated in the heap
```

This is with `BuildFlavour = devel2`, so take it with a
a grain of salt.

For reference, with no `-g` I get:
```
155,703,112 bytes allocated in the heap
```
so we're still quite a way off.

Test Plan:
harbormaster
I still have to test locally

Reviewers: austin, bgamari

Subscribers: thomie, simonmar

Differential Revision: https://phabricator.haskell.org/D3001

GHC Trac Issues: #11095

comment:6 Changed 3 years ago by simonpj

Is there something non-linear going on? (non-linear in program size that is.) Certainly sounds like it.

If so, it'd be good to know what it is. We should try HARD to avoid anything worse than NlogN in program size.

Simon

comment:7 Changed 3 years ago by bgamari

There is a nubBy in Debug which uses tickishContains which would make this N2. This is likely the reason why the above patch made such a difference.

Last edited 3 years ago by bgamari (previous) (diff)

comment:8 Changed 3 years ago by niteria

I suspect (I have yet to read the paper!) that the resulting term has quadratic size.

We're building:

() : () : () : () : () : () : () : []

which becomes:

a = () : []
b = () : a
c = () : b
d = () : c
e = () : d
f = () : e
g = () : f

We then construct ticks for each of a..g:

ticks(a) = SrcLoc(a)
ticks(b) = ticks(a) ++ SrcLoc(b)
ticks(c) = ticks(b) ++ SrcLoc(c)
ticks(d) = ticks(c) ++ SrcLoc(d)
ticks(e) = ticks(d) ++ SrcLoc(e)
ticks(f) = ticks(e) ++ SrcLoc(f)
ticks(g) = ticks(f) ++ SrcLoc(g)
-- the order of arguments here is important, 
-- we end up doing an equivalent of appending one element to a list
-- we do that over all the expressions, for each SrcLoc

The reason my patch helped is this part of mkTick':

    -- For annotations this is where we make sure to not introduce
    -- redundant ticks.
      | tickishContains t t2              -> mkTick' top rest e
      | tickishContains t2 t              -> orig_expr
      | otherwise                         -> mkTick' top (rest . Tick t2) e

We almost always reach the otherwise case because the source locations are disjoint, so tickishContains is quite hot.

Not only the result is quadratic, the function wrapTicks is quadratic as well:

wrapTicks (Floats flag floats0) expr = (Floats flag floats1, expr')
  where (floats1, expr') = foldrOL go (nilOL, expr) floats0  
        -- ^ iterate over floats0
        go (FloatTick t) (fs, e) = ASSERT(tickishPlace t == PlaceNonLam)
                                   (mapOL (wrap t) fs, mkTick t e)
                                   -- ^ map over fs, proportional in size to floats0
        go other         (fs, e) = (other `consOL` fs, e)
        -- ^ fs can grow proportional in size to floats0
        wrap t (FloatLet bind)    = FloatLet (wrapBind t bind)
        wrap t (FloatCase b r ok) = FloatCase b (mkTick t r) ok
        wrap _ other              = pprPanic "wrapTicks: unexpected float!"
                                             (ppr other)
        wrapBind t (NonRec binder rhs) = NonRec binder (mkTick t rhs)
        wrapBind t (Rec pairs)         = Rec (mapSnd (mkTick t) pairs)

Sorry I can't explain it better, the important point I believe is that the result has quadratic size.

comment:9 Changed 3 years ago by dfeuer

Unfortunately, part of niteria's change was semantically invalid and needed to be reverted. I'm trying to understand a little more of what's going on around it that makes it a hot spot. niteria's observation that we almost always reach the otherwise suggests that perhaps we can find a quicker way to get there (rather than checking if each contains the other immediately, first find a way to discover or approximate that they certainly don't). But working out the asymptotic problems will surely be more important than that. Right now I'm struggling a bit to understand what mkTick is doing and how.

comment:10 Changed 3 years ago by niteria

Comment https://ghc.haskell.org/trac/ghc/ticket/11095#comment:8 is based on invalid semantics, sorry for the noise.

comment:11 Changed 3 years ago by dfeuer

I don't know if this is useful, but it's essential that v is not exported. If you export v, compilation is suddenly fast.

comment:12 Changed 3 years ago by dfeuer

It appears that a big problem here is in wrapTicks in coreSyn/CorePrep.hs. In particular, tracing its go function like this

        go (FloatTick t) (fs, e) = ASSERT(tickishPlace t == PlaceNonLam)
                                   pprTrace "mapOL called with" (text "t = " <+> ppr t $$ text "fs = " <+> ppr fs) (mapOL (wrap t) fs, mkTick t e)
        go other         (fs, e) = (other `consOL` fs, e)

produces a huge amount of output, showing that the fs get deeper and deeper, layering tick on tick on tick at least quadratically. See this paste. Every step in the fold calls mapOL (wrap t) on the (growing) accumulator list. I wonder if it might be possible to find a way to combine these ticks, or drop some of them, instead of layering them arbitrarily deeply. I don't know how many we need.

comment:13 Changed 3 years ago by scpmw

Right, the source note gets floated outside of the (:) application, at which point the inner-most () is covered by source notes of all () in the list. Thus we get a quadratic growth of source notes, and redundancy checks blow this up to O(n^3). Ouch.

Ironically, this would likely not happen without the syntactic sugar, as then a source note would span the whole (a:b) expression (instead of just b), which would get eliminated by the mentioned "contains" check. This is what normally gives this stability: Once a tick gets floated up, it immediately hits a tick that covers a greater span, causing it to get eliminated. So maybe changing desugaring could actually take care of this.

We could also attempt to merge ticks automatically. This would make them slightly less useful, but is clearly better than blowing up the compiler. On the other hand, not sure what the criterion for merging should look like. If we go by the span alone, we run into the tricky question of how far two spans can be apart until we must assume that merging them would stop making sense. Unfortunately, I doubt it would be hard to make up examples to make any given threshold look bad.

Last edited 3 years ago by scpmw (previous) (diff)

comment:14 Changed 3 years ago by simonpj

Thanks Peter. So, what should we do to solve it? It's a bad (cubic), unexpected, and unpredictable performance hole.

Could you perhaps work on a fix?

Simon

comment:15 Changed 3 years ago by scpmw

I think dfeuer has managed to identify the core of the issue, so the hard work should be done. As I sketched there are a good number of options for how to fix it, and we could even fall back to special handling for deeply nested constructors with tick annotations in CorePrep if the more elegant approaches turn out not to work. I'll put some time into it over the weekend if nobody else grabs it first.

comment:16 Changed 3 years ago by scpmw

Well, turns out the "elegant" way does actually not work: The intermediate source notes get floated up, and we still end up with O(n^3) complexity at intermediate wrapTicks stages. Plus the top-level source note had actually been in the list of floats anyway - just at first position, which gets applied by foldr last. So we were applying all sorts of fine-grained ticks on n expressions just to get rid of all of them in the end.

phab:D3037 fixes this behaviour of wrapTicks. It's rather straightforward, really: Instead of using a foldr to apply ticks from the right, we start at the left and drop ticks immediately when we see that they are redundant. This should be O(n) for list literals now.

Sorry for that, rather silly mistake :/

comment:17 Changed 3 years ago by Ben Gamari <ben@…>

In 2912231/ghc:

Improve wrapTicks performance with lots of redundant source notes

The old version had O(n^3) performance for n non-overlapping source
notes and let floats each, which is exactly what happens with -g if we
compile a list literal of length n.

The idea here is simply to establish early which source notes will
actually survive (e.g. use a left fold). The new code should be O(n) for
list literals.

Reviewers: austin, dfeuer, bgamari

Reviewed By: bgamari

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D3037

GHC Trac Issues: #11095

comment:18 Changed 3 years ago by bgamari

Milestone: 8.2.1
Resolution: fixed
Status: newclosed

comment:19 Changed 2 years ago by niteria

Thank you for fixing this! Without this we were looking at 15% compile time regression between -g and no -g.

Note: See TracTickets for help on using tickets.