Opened 18 months ago

Last modified 17 months ago

#14944 new bug

Compile speed regression

Reported by: br1 Owned by: dfeuer
Priority: normal Milestone:
Component: Compiler Version: 8.4.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by br1)

Compile time with -O went from 5 seconds in 7.6.1 to 12 seconds in 8.2.2 and then down to 10 seconds in 8.4.1.

Attachments (8)

Main.hs (5.3 KB) - added by br1 18 months ago.
Paper.hs (10.3 KB) - added by br1 18 months ago.
passes-7.8 (14.5 KB) - added by dfeuer 17 months ago.
ghc -O -dshow-passes -fforce-recomp -fcontext-stack=400 Main.hs
passes-7.10 (9.2 KB) - added by dfeuer 17 months ago.
passes-8.0 (14.4 KB) - added by dfeuer 17 months ago.
passes-8.2 (15.8 KB) - added by dfeuer 17 months ago.
passes-8.4 (16.4 KB) - added by dfeuer 17 months ago.
passes-7.8-redacted (8.6 KB) - added by dfeuer 17 months ago.
7.8 passes with CPSZ lines removed

Download all attachments as: .zip

Change History (24)

Changed 18 months ago by br1

Attachment: Main.hs added

Changed 18 months ago by br1

Attachment: Paper.hs added

comment:1 Changed 18 months ago by br1

Description: modified (diff)

comment:2 Changed 18 months ago by bgamari

Owner: set to dfeuer

David will characterise this.

comment:3 Changed 18 months ago by dfeuer

The reason for the big version gap seems likely to be the fact that the code didn't compile at all under 7.8.4 or 7.10.3 without extra help. Here are the times I get:

7.8.4

This version needed -fcontext-stack=197 to compile.

<<ghc: 2588144288 bytes, 592 GCs, 25821661/80891272 avg/max bytes residency (13 samples), 193M in use, 0.00 INIT (0.00 elapsed), 2.29 MUT (2.62 elapsed), 1.11 GC (1.19 elapsed) :ghc>>

7.10.3

This version needed -fcontext-stack=197 to compile.

<<ghc: 5430939912 bytes, 1222 GCs, 39907279/99169560 avg/max bytes residency (14 samples), 271M in use, 0.001 INIT (0.001 elapsed), 5.319 MUT (5.672 elapsed), 1.918 GC (1.999 elapsed) :ghc>>

8.0.2

<<ghc: 9858859496 bytes, 2263 GCs, 37541990/110403672 avg/max bytes residency (13 samples), 291M in use, 0.000 INIT (0.001 elapsed), 7.649 MUT (8.086 elapsed), 1.850 GC (1.959 elapsed) :ghc>>

8.2.2

<<ghc: 5344419472 bytes, 543 GCs, 33955491/118029880 avg/max bytes residency (14 samples), 265M in use, 0.000 INIT (0.001 elapsed), 5.230 MUT (5.678 elapsed), 1.589 GC (1.674 elapsed) :ghc>>

8.4.1

<<ghc: 5160053968 bytes, 577 GCs, 30157126/79839112 avg/max bytes residency (14 samples), 207M in use, 0.000 INIT (0.005 elapsed), 4.415 MUT (4.930 elapsed), 1.509 GC (1.654 elapsed) :ghc>>

So there was a major regression from 7.8.4 to 7.10.3, another large one from 7.10.3 to 8.0.2, a large improvement from 8.0.2 to 8.2.2, and a smaller improvement from 8.2.2 to 8.4.1. I can verify that we're still running considerably slower, and allocating considerably more, than 7.8.4. FYI, while we no longer need -fcontext-stack, that's because the default limit has been raised or removed. It seems versions 8.0.2 through 8.4.1 need a reduction depth of 199.

comment:4 Changed 18 months ago by simonpj

Does the size of the program change at all between 7.6 and 8.4?

comment:5 Changed 18 months ago by dfeuer

I'm looking into that now.

Last edited 18 months ago by dfeuer (previous) (diff)

comment:6 Changed 18 months ago by dfeuer

Yes, the size of the program changes quite a lot. The most notable thing about this program is the sheer number of types, which is enormous in every case, but varies quite a lot. In each case I give the result of CorePrep from Paper (which always compiles reasonably quickly) along with pertinent-looking information about Main (which takes a while).

7.8.4

Paper:

{terms: 1,719, types: 4,049, coercions: 768}

Main:

Result size of Desugar (after optimization)
  = {terms: 3,973, types: 630,023, coercions: 196}
...
Result size of Simplifier
  = {terms: 3,384, types: 628,456, coercions: 5}
Result size of Specialise
  = {terms: 4,785, types: 865,456, coercions: 5}
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False})
  = {terms: 6,373, types: 1,100,676, coercions: 5}
Result size of Simplifier iteration=1
  = {terms: 8,674, types: 1,377,245, coercions: 211,797}
Result size of Simplifier iteration=2
  = {terms: 4,051, types: 730,227, coercions: 84,460}
-- things go up and down a few times
Result size of Tidy Core
  = {terms: 5,157, types: 1,104,496, coercions: 84,451}

7.10.3

Paper:

Result size of CorePrep
  = {terms: 1,736, types: 3,973, coercions: 778}

Main:

Result size of Desugar (after optimization)
  = {terms: 3,973, types: 630,023, coercions: 196}
Result size of Specialise
  = {terms: 12,088, types: 2,092,764, coercions: 392}
*** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 13,674, types: 2,327,982, coercions: 392}
Result size of Simplifier iteration=1
  = {terms: 14,824, types: 2,416,989, coercions: 371,022}
Result size of Simplifier iteration=2
  = {terms: 10,101, types: 2,096,313, coercions: 321,082}
Result size of Simplifier iteration=1
  = {terms: 12,295, types: 2,650,706, coercions: 321,081}
... nothing much changes for a long time
-- Sudden size drop in core tidy
Result size of Tidy Core
  = {terms: 5,146, types: 1,104,452, coercions: 84,448}

This comment is getting long; I'll continue in the next one.

comment:7 Changed 18 months ago by dfeuer

8.0.2

Paper:

Result size of CorePrep
  = {terms: 2,704, types: 5,400, coercions: 996}

Main:

Result size of Desugar (after optimization)
  = {terms: 4,574, types: 632,608, coercions: 196}
Result size of Specialise
  = {terms: 13,274, types: 2,100,628, coercions: 392}
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 14,860, types: 2,335,846, coercions: 392}
Result size of Simplifier iteration=1
  = {terms: 15,614, types: 2,423,788, coercions: 366,704}
Result size of Simplifier iteration=2
  = {terms: 10,121, types: 2,098,814, coercions: 322,579}
....
Result size of Simplifier
  = {terms: 12,570, types: 2,807,500, coercions: 320,217}
Result size of Tidy Core
  = {terms: 5,208, types: 1,104,414, coercions: 84,359}

8.2.2

Paper:

Result size of CorePrep
  = {terms: 3,280, types: 5,538, coercions: 1,000, joins: 2/50}

Main:

Result size of Desugar (after optimization)
  = {terms: 4,564, types: 632,574, coercions: 196, joins: 0/982}
Result size of Simplifier iteration=1
  = {terms: 6,858, types: 1,023,896, coercions: 86,509, joins: 0/786}
Result size of Simplifier iteration=2
  = {terms: 4,014, types: 726,796, coercions: 85,137, joins: 1/2}
...
Result size of Specialise
  = {terms: 9,329, types: 1,646,724, coercions: 168,346, joins: 2/2}
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [Main]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 10,823, types: 1,939,926, coercions: 168,346, joins: 1/2}
Result size of Simplifier iteration=1
  = {terms: 9,648, types: 1,706,501, coercions: 227,811, joins: 2/2}
Result size of Simplifier iteration=2
  = {terms: 9,349, types: 1,647,213, coercions: 168,353, joins: 2/2}
...
Result size of Tidy Core
  = {terms: 5,212, types: 1,104,424, coercions: 84,359, joins: 2/2}

comment:8 Changed 18 months ago by dfeuer

8.4.1

Paper:

Result size of CorePrep
  = {terms: 3,228, types: 5,423, coercions: 1,008, joins: 3/50}

Main:

Result size of Desugar (after optimization)
  = {terms: 4,564, types: 632,574, coercions: 196, joins: 0/982}
Result size of Simplifier iteration=1
  = {terms: 6,858, types: 1,023,896, coercions: 86,509, joins: 0/786}
Result size of Simplifier iteration=2
  = {terms: 4,014, types: 726,796, coercions: 85,137, joins: 1/2}
Result size of Specialise
  = {terms: 9,329, types: 1,646,724, coercions: 168,346, joins: 2/2}
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 10,431, types: 1,862,310, coercions: 168,346, joins: 1/2}
Result size of Simplifier iteration=1
  = {terms: 9,648, types: 1,706,501, coercions: 227,811, joins: 2/2}
Result size of Simplifier iteration=2
  = {terms: 9,349, types: 1,647,213, coercions: 168,353, joins: 2/2}
...
Result size of Simplifier
  = {terms: 9,334, types: 1,647,199, coercions: 168,352, joins: 2/2}

-- What happened here??????

Result size of Simplifier iteration=1
  = {terms: 110,206,
     types: 1,818,532,
     coercions: 379,933,
     joins: 593/20,293}
Result size of Simplifier
  = {terms: 8,914, types: 1,272,932, coercions: 85,550, joins: 5/203}
....
Result size of Demand analysis
  = {terms: 8,323, types: 1,272,755, coercions: 85,550, joins: 5/6}
!!! Demand analysis [Main]: finished in 200.01 milliseconds, allocated 115.130 megabytes
*** CoreTidy [Main]:
Result size of Tidy Core
  = {terms: 1,867, types: 166,695, coercions: 1,988, joins: 5/6}

comment:9 Changed 18 months ago by simonpj

Thanks for this data. Several interesting things here.

  • There's a big jump in compilation cost between 7.8 and 7.10. (Like a factor of 2 in compiler allocation, according to comment:3. According to comment:6, the result of Specialise is much bigger in 7.10 than in 7.8. This extra code appears to be discarded at the end, but it persists through the compilation pipeline. Questions: what are the extra specialisations? Do we need them? Why are they eventually discarded? Could we discard them earlier?
  • As you say in comment:8 it seems that GHC 8.4 has a new, and huge (temporary) blow-up in porgram size. It'd be really good to work out what this is. (You'll presumably need to cut down the program size to make this tractable.)

comment:10 Changed 18 months ago by dfeuer

In case this helps any, the simplifier runs iteration 1,2,1,2,3,1,2,3, and then the next iteration (1) is the one that blows up the term count in 8.4.1. It appears that the blow-up is non-linear. When I increase the number of lines in the Main.hs source, the ratio of terms in that iteration to terms in the previous one increases. Similarly, the ratio of time and allocation in that simplifier run compared to the previous one increases. I've tried looking at a very cut-down version with -ddump-simpl-iterations, but there are still an awful lot of terms to look at. I'll keep at it.

comment:11 Changed 17 months ago by dfeuer

Running with a development build of either 8.2.2 or 8.4ish (I haven't checked others yet) turns up an interesting warning when compiling Paper.hs: the simplifier bails out after 4 iterations. So it seems at least possible that whatever goes wrong in Main.hs may start in Paper.hs, if we end up with incompletely simplified unfoldings. One problem I'm facing is that this source code is ... not very simple. It's full of rather involved constraint-based programming that's presumably documented in a paper somewhere.

comment:13 Changed 17 months ago by dfeuer

Simon suggests -dshow-passes comparing iterations in different GHC versions.

comment:14 Changed 17 months ago by dfeuer

Get back to bullet 1 of comment 9!

Changed 17 months ago by dfeuer

Attachment: passes-7.8 added

ghc -O -dshow-passes -fforce-recomp -fcontext-stack=400 Main.hs

Changed 17 months ago by dfeuer

Attachment: passes-7.10 added

Changed 17 months ago by dfeuer

Attachment: passes-8.0 added

Changed 17 months ago by dfeuer

Attachment: passes-8.2 added

Changed 17 months ago by dfeuer

Attachment: passes-8.4 added

Changed 17 months ago by dfeuer

Attachment: passes-7.8-redacted added

7.8 passes with CPSZ lines removed

comment:15 Changed 17 months ago by dfeuer

I'm going to attach -v results too, so we can look at how many simplifier passes were used. Is there a way to find out when the simplifier bailed out without a debug build? I don't really know how to get a debug build of 7.8. Building 7.8 with 7.8 didn't work, and I don't have 7.6 lying around.

comment:16 Changed 17 months ago by dfeuer

In 7.8, Main starts compiling like this (from -dshow-passes):

Result size of Desugar (after optimization)
  = {terms: 3,973, types: 630,023, coercions: 196}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 3,384, types: 628,452, coercions: 5}
Result size of Simplifier iteration=2
  = {terms: 3,386, types: 628,458, coercions: 5}
Result size of Simplifier iteration=3
  = {terms: 3,384, types: 628,456, coercions: 5}
Result size of Simplifier
  = {terms: 3,384, types: 628,456, coercions: 5}
*** Specialise:
Result size of Specialise
  = {terms: 4,785, types: 865,456, coercions: 5}

In 7.10, on the other hand, it starts like this:

Result size of Desugar (after optimization)
  = {terms: 3,973, types: 630,023, coercions: 196}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 4,362, types: 630,802, coercions: 196}
Result size of Simplifier
  = {terms: 4,362, types: 630,802, coercions: 196}
*** Specialise:
Result size of Specialise
  = {terms: 12,088, types: 2,092,764, coercions: 392}

So 7.8 immediately tossed a slew of coercions in the first simplifier iteration, and the simplification otherwise kept sizes pretty much the same. In 7.10, we only do one round of simplification, the coercion count doesn't go down, the term count goes up a bit, and then specialization goes wild. No, I don't know what any of this really means.

Note: See TracTickets for help on using tickets.