#15164 closed bug (fixed)

Slowdown in ghc compile times from GHC 8.0.2 to GHC 8.2.1 when doing Called arity analysis

Reported by: flip101 Owned by:
Priority: highest Milestone: 8.6.1
Component: Compiler Version: 8.2.1
Keywords: Cc: nomeata, sgraf
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Compile-time performance bug Test Case: perf/compiler/T15164
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by nomeata)

Hello,

I have a program that compiles against the packages of Stack LTS-9.21. When I compile with GHC 8.0.2 (the compiler version of LTS-9.21), compilation is fast.

72.01user 34.61system 0:52.98elapsed 201%CPU (0avgtext+0avgdata 600684maxresident)k

When I upgrade to LTS-11.9, compilation gets slow. To test whether the problem was a new package or GHC itself (GHC ships some new packages too), I ran:

/usr/bin/time stack build --compiler ghc-8.2.1 --ghc-options="-v3 -j4 -O2 -fexcess-precision -optc-O3 -optc-ffast-math -rtsopts=none -no-rtsopts-suggestions"

with the old package set of LTS-9.21 but with the newer GHC 8.2.1 compiler. Now the compile times are much higher:

2481.26user 663.66system 35:38.82elapsed 147%CPU (0avgtext+0avgdata 674732maxresident)k

I already took out parts of the grammar just to get it to compile at all. With the full grammar, I can let my laptop compile for 24 hours and it is still busy.

I was not sure how to reduce the code further. Maybe with a lot of effort I could cut out a few lines here and there, so I uploaded the entire project at https://github.com/flip111/parser-calledarityanalysis (I already took out a lot of code). If anyone has suggestions on how to reduce it further, I can try.

Because I had debug output on (the -v3 flag), I saw it was stuck for a long time at "Called arity analysis".

I set this bug's priority high because I think the slowdown is very problematic.

Change History (25)

comment:1 Changed 16 months ago by simonpj

Cc: nomeata added

Adding Joachim, who wrote Called Arity Analysis.

comment:2 Changed 16 months ago by nomeata

Thanks for the report. @flip101, are you able to test this with GHC-8.4, and see if the bug is still there?

comment:3 Changed 16 months ago by sgraf

Cc: sgraf added

comment:4 Changed 16 months ago by nomeata

Which module in particular is showing the slow down?

Does the problem go away if you derive fewer classes (in particular, Data)? (Unlikely, since your data types are not big, but still worth checking.)

Does the problem go away with -fno-call-arity? (Using this flag might be a suitable work-around for you to continue with your actual program.)

comment:5 Changed 16 months ago by sgraf

I can confirm that

  • this happens for 8.2.* with lts-9.21 in module Level4
  • -fno-call-arity fixes this, as does using 8.0.2
  • removing all derivings doesn't seem to have any (measurable) effect
  • it can also be reproduced with 8.4.2 (on the lts-11.9 package set; lts-9.21 didn't build at all)
Last edited 16 months ago by sgraf (previous) (diff)

comment:6 Changed 16 months ago by sgraf

The -ddump-simpl-iterations output for Level4 with 8.4.2 has 89,000 terms and contains one gigantic top-level recursive binding. The output for 8.0.2 has 3,055 terms and more, manageable, recursive top-level bindings. That's a huge difference.

There's already a huge difference in compile-time for 8.4.2 with -fno-call-arity vs. 8.0.2 proper (>60s vs. 3s).

comment:7 Changed 16 months ago by flip101

Thank you for your input, sgraf (I think it answers all of nomeata's questions). Yes, the slowdown was in module Level4.

comment:8 Changed 16 months ago by nomeata

Call Arity does a fixed-point iteration over mutually recursive groups; if there is one giant group, I am not surprised that it explodes there. We could detect absurdly large groups and fall back to some conservative approximation.
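For illustration, a fixed-point iteration over a recursive group can be sketched like this (a minimal sketch with invented names; GHC's actual Call Arity analysis iterates a much richer domain than the `Int` used here):

```haskell
import qualified Data.Map.Strict as M

-- Minimal sketch: repeatedly apply one analysis step to the environment
-- of results for a recursive group until nothing changes any more.
-- For a group whose bindings' results all influence each other, the
-- number of iterations grows with the group size, which is why one
-- giant recursive group can be very expensive to analyse.
fixpoint :: Eq a => (M.Map String a -> M.Map String a) -> M.Map String a -> M.Map String a
fixpoint step env
  | env' == env = env                 -- stable: fixed point reached
  | otherwise   = fixpoint step env'
  where env' = step env

-- e.g. a toy "analysis" that bumps every result, capped at 5:
--   fixpoint (M.map (\x -> min 5 (x + 1))) (M.fromList [("f", 0)])
```

The point is only the shape of the loop: each pass over the group is cheap, but the number of passes until stability depends on the whole group at once.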

But as Sebastian finds, there is *another*, independent regression that inflates the Core and compilation time (and blows it up in a way that completely chokes Call Arity)…

comment:9 Changed 16 months ago by simonpj

But as Sebastian finds, there is *another*, independent regression that inflates the Core

Yes, please let's characterise this aspect.

comment:10 Changed 16 months ago by nomeata

Description: modified (diff)

comment:11 Changed 16 months ago by nomeata

I have minimized the test to a self-contained file without package dependencies. More minimization might be possible, but not right now.

comment:12 Changed 16 months ago by nomeata

Uploading to trac seems to fail; I put the minimized file at https://github.com/nomeata/parser-calledarityanalysis

comment:13 Changed 16 months ago by nomeata

We already go from

==================== Desugar (after optimization) ====================  
Result size of Desugar (after optimization)                             
  = {terms: 3,189, types: 4,403, coercions: 132}                        

to

==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 82,881, types: 86,775, coercions: 132, joins: 0/20,633}   

Maybe the use of UndecidableInstances is the problem here?

comment:14 Changed 16 months ago by nomeata

If I focus on this part of the code:

newtype ActualParameterPart = APP (NT AssociationList) 
instance Rule f AssociationList => Rule f ActualParameterPart where
  get = APP <$> n93

then with 8.0 I get this code for the instance method:

-- RHS size: {terms: 12, types: 18, coercions: 0}
$cget_a1BB
  :: forall (f_a1By :: * -> *).
     (Rule f_a1By AssociationList, Decorator f_a1By) =>
     f_a1By ActualParameterPart
[LclId, Str=DmdType]
$cget_a1BB =
  \ (@ (f_a1By :: * -> *))
    ($dRule_a1Bz :: Rule f_a1By AssociationList)
    ($dDecorator_a1BD :: Decorator f_a1By) ->
    <$>
      @ f_a1By
      @ (NT AssociationList)
      @ ActualParameterPart
      (GHC.Base.$p1Applicative
         @ f_a1By
         (GHC.Base.$p1Monad
            @ f_a1By (Level4.$p1Decorator @ f_a1By $dDecorator_a1BD)))
      Level4.APP
      (n93 @ f_a1By $dDecorator_a1BD @ AssociationList $dRule_a1Bz)

but with 8.4 I get

-- RHS size: {terms: 2,474,
              types: 2,544,
              coercions: 0,
              joins: 0/630}
$cget_a2E1 [Occ=LoopBreaker]
  :: forall (f :: * -> *).
     (Rule f AssociationList, Decorator f) =>
     f ActualParameterPart
[LclId]
$cget_a2E1
  = \ (@ (f_a2DY :: * -> *))
      _ [Occ=Dead]
      ($dDecorator_a2E4 :: Decorator f_a2DY) ->
      letrec {
        $dRule_a7WP :: Rule f_a2DY QualifiedExpression
        [LclId]
        $dRule_a7WP
          = Level4.$fRulefQualifiedExpression
              @ f_a2DY $dRule_a7Qw $dRule_a7Wx;
        $dRule_a7VU :: Rule f_a2DY FunctionCall
        [LclId]
        $dRule_a7VU = Level4.$fRulefFunctionCall @ f_a2DY $dRule_a7VQ;
… thousands of lines omitted …
        $dRule_a7Qv [Occ=LoopBreaker] :: Rule f_a2DY Name
        [LclId]
        $dRule_a7Qv
          = Level4.$fRulefName
              @ f_a2DY $dRule_a7Ux $dRule_a7Uy $dRule_a7Uz; } in
      letrec {
        $dRule_a7Th :: Rule f_a2DY FunctionCall
        [LclId]
        $dRule_a7Th = Level4.$fRulefFunctionCall @ f_a2DY $dRule_a7Qy;
… again thousands of lines omitted …
        $dRule_a7Qy [Occ=LoopBreaker] :: Rule f_a2DY Name
        [LclId]
        $dRule_a7Qy
          = Level4.$fRulefName
              @ f_a2DY $dRule_a7Te $dRule_a7Tf $dRule_a7Tg; } in
      letrec {
        $dRule_a7QO :: Rule f_a2DY TypeMark
        [LclId]
        $dRule_a7QO = Level4.$fRulefTypeMark @ f_a2DY $dRule_a7QB;
… again …
        $dRule_a7Qx [Occ=LoopBreaker] :: Rule f_a2DY Expression
        [LclId]
        $dRule_a7Qx = Level4.$fRulefExpression @ f_a2DY $dRule_a7Qz; } in
      letrec {
        $dRule_a7Qo :: Rule f_a2DY QualifiedExpression
        [LclId]
        $dRule_a7Qo
          = Level4.$fRulefQualifiedExpression
              @ f_a2DY $dRule_a7ML $dRule_a7Q6;
…
… and a few more of those …
        $dRule_a7MM
          = Level4.$fRulefName
              @ f_a2DY $dRule_a7MN $dRule_a7MO $dRule_a7MP; } in
      <$>
        @ f_a2DY
        @ (NT AssociationList)
        @ ActualParameterPart
        (GHC.Base.$p1Applicative
           @ f_a2DY
           (GHC.Base.$p1Monad
              @ f_a2DY (Level4.$p1Decorator @ f_a2DY $dDecorator_a2E4)))
        Level4.APP
        (n93
           @ f_a2DY
           $dDecorator_a2E4
           @ AssociationList
           (Level4.$fRulefAssociationList
              @ f_a2DY
              (Level4.$fRulefAssociationElement
                 @ f_a2DY
                 (Level4.$fRulefFormalPart
                    @ f_a2DY
                    (Level4.$fRulefFormalDesignator @ f_a2DY $dRule_a7MM)
                    $dRule_a7MK
                    $dRule_a7ML)
                 (Level4.$fRulefActualPart
                    @ f_a2DY
                    (Level4.$fRulefActualDesignator @ f_a2DY $dRule_a7Qx $dRule_a7Qy)
                    $dRule_a7Qv
                    $dRule_a7Qw))))

The important change seems to be:

  • Previously, the code for get would use the dictionary for Rule f_a1By AssociationList passed to it, here named $dRule_a1Bz.
  • Now, it ignores the dictionary passed to it (note _ [Occ=Dead]) and instead seems to build the dictionary it passes to n93 locally, using many very large local let-recs.

comment:15 Changed 16 months ago by nomeata

Ha, the code still compiles, and does so quickly with all versions, if I simply remove the contexts from the instance declarations, e.g.

-instance (Rule f Expression, Rule f Name) => Rule f ActualDesignator where
+instance Rule f ActualDesignator where

This also lets me remove UndecidableInstances. See this commit: https://github.com/nomeata/parser-calledarityanalysis/commit/cb2e23b081d1016278feaa1397ad62a1672a52cf

@flip101, I understand that this is minimized code. Can you check in your real code whether you really need the contexts in the instance declarations, and whether the problem goes away when you remove them?

comment:16 Changed 16 months ago by simonpj

Great detective work. I see what is happening.

First, a workaround is to use -fno-solve-constant-dicts.

Second, if you don't actually need the contexts on your instances then yes, try omitting them.

What's happening is this. The "short cut solver" TcInteract.shortCutSolver tries to solve goals using top-level evidence only (see Note [Shortcut solving]). But in doing so, it doesn't have enough sharing. To solve G it may need G1 and G2. To solve G1 it may need X; and to solve G2 it may need X. But it doesn't spot that it only needs to solve X once. So it does the "solve X" work (successfully) twice. Then the same thing happens when solving X. And the structure of this particular example is that we get successive doubling.

Solution is simple: just be a bit more intelligent about sharing. I can do that.
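The doubling described above can be modelled with a toy goal graph (all names here are invented; this is not the real solver, only its cost pattern): each goal spawns two copies of the same subgoal, so without a cache the work doubles at every level, while remembering already-solved goals keeps it linear.

```haskell
import qualified Data.Map.Strict as M

-- Toy goal structure mirroring "to solve G1 it may need X; to solve G2
-- it may need X": goal n needs two copies of goal (n - 1).
subgoals :: Int -> [Int]
subgoals n | n <= 0    = []
           | otherwise = [n - 1, n - 1]

-- No sharing: every subgoal is re-solved, so solving goal n does
-- 2^(n+1) - 1 units of work.
naiveCost :: Int -> Int
naiveCost n = 1 + sum (map naiveCost (subgoals n))

-- With sharing: a cache of already-solved goals means each distinct
-- goal costs one unit exactly once, for n + 1 units in total.
sharedCost :: Int -> Int
sharedCost n = snd (go M.empty n)
  where
    go cache g
      | g `M.member` cache = (cache, 0)   -- already solved: free
      | otherwise          = foldl step (M.insert g () cache, 1) (subgoals g)
      where
        step (m, acc) s = let (m', c) = go m s in (m', acc + c)
```

For a chain of depth 10, the unshared version does 2047 units of work against 11 for the shared one, which matches the "successive doubling" Simon describes.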

comment:17 Changed 16 months ago by flip101

Hi nomeata. When I started to develop my program, I got errors saying that certain deductions/proofs could not be made; I don't remember the exact term or error message. After asking around, it was clear I had to add these constraints. I left them in place while further developing my code and using other packages. Now I have tried removing them again: I don't get any errors, and compilation speed looks good too (minutes, not days/weeks). I still need to verify that the program does what I expect, but at least it compiles fast now. I think this is a good solution for me, but it is still a problem for GHC, so I will leave this issue open.

Good job identifying this issue, everyone. It's very nice.

comment:18 Changed 16 months ago by Simon Peyton Jones <simonpj@…>

In f2ce86c/ghc:

Do better sharing in the short-cut solver

Trac #15164 showed that it sometimes really matters to share
sub-proofs when solving constraints.  Without it, we can get
exponentially bad behaviour.

Fortunately, it's easily solved.
Note [Shortcut try_solve_from_instance] explains.

I did some minor associated refactoring.

comment:19 Changed 16 months ago by simonpj

Resolution: fixed
Status: new → closed
Test Case: perf/compiler/T15164

Thanks for the detective work. Should now be fixed in 8.6. And we have a solid workaround for 8.4 (see comment:16).

comment:20 Changed 16 months ago by simonpj

Resolution: fixed
Status: closed → new

Actually, Joachim, you might still want to look at this. Although the code is now the same as when you leave out those instance contexts, the Called Arity pass still allocates 10x more than any other pass, and there is a visible pause when running with -dshow-passes:

[1 of 1] Compiling T15164           ( T15164.hs, T15164.o ) [Optimisation flags changed]
*** Parser [T15164]:
!!! Parser [T15164]: finished in 138.64 milliseconds, allocated 14.548 megabytes
*** Renamer/typechecker [T15164]:
!!! Renamer/typechecker [T15164]: finished in 564.37 milliseconds, allocated 75.649 megabytes
*** Desugar [T15164]:
Result size of Desugar (before optimization)
  = {terms: 8,197, types: 8,602, coercions: 0, joins: 0/1,317}
Result size of Desugar (after optimization)
  = {terms: 7,303, types: 7,589, coercions: 132, joins: 0/888}
!!! Desugar [T15164]: finished in 397.68 milliseconds, allocated 44.898 megabytes
*** Simplifier [T15164]:
Result size of Simplifier iteration=1
  = {terms: 8,488, types: 9,420, coercions: 4,178, joins: 0/1,271}
Result size of Simplifier iteration=2
  = {terms: 7,722, types: 8,271, coercions: 8,086, joins: 0/888}
Result size of Simplifier
  = {terms: 7,722, types: 8,271, coercions: 8,086, joins: 0/888}
!!! Simplifier [T15164]: finished in 1026.47 milliseconds, allocated 163.355 megabytes
*** Specialise [T15164]:
Result size of Specialise
  = {terms: 7,788, types: 8,373, coercions: 8,102, joins: 0/894}
!!! Specialise [T15164]: finished in 49.66 milliseconds, allocated 9.532 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [T15164]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 9,133, types: 18,102, coercions: 8,102, joins: 0/116}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [T15164]: finished in 296.67 milliseconds, allocated 56.540 megabytes
*** Simplifier [T15164]:
Result size of Simplifier iteration=1
  = {terms: 9,019, types: 17,757, coercions: 8,096, joins: 0/116}
Result size of Simplifier iteration=2
  = {terms: 8,860, types: 17,474, coercions: 8,054, joins: 0/109}
Result size of Simplifier
  = {terms: 8,860, types: 17,474, coercions: 8,054, joins: 0/109}
!!! Simplifier [T15164]: finished in 1315.09 milliseconds, allocated 178.537 megabytes
*** Simplifier [T15164]:
Result size of Simplifier iteration=1
  = {terms: 8,716, types: 17,272, coercions: 8,054, joins: 0/109}
Result size of Simplifier
  = {terms: 8,716, types: 17,272, coercions: 8,054, joins: 0/109}
!!! Simplifier [T15164]: finished in 833.40 milliseconds, allocated 123.259 megabytes
*** Simplifier [T15164]:
Result size of Simplifier
  = {terms: 8,716, types: 17,272, coercions: 8,054, joins: 0/109}
!!! Simplifier [T15164]: finished in 317.06 milliseconds, allocated 60.850 megabytes
*** Float inwards [T15164]:
Result size of Float inwards
  = {terms: 8,716, types: 17,272, coercions: 8,054, joins: 0/109}
!!! Float inwards [T15164]: finished in 42.00 milliseconds, allocated 10.019 megabytes
*** Called arity analysis [T15164]:
Result size of Called arity analysis
  = {terms: 8,716, types: 17,272, coercions: 8,054, joins: 0/109}
!!! Called arity analysis [T15164]: finished in 6284.17 milliseconds, allocated 1793.237 megabytes
*** Simplifier [T15164]:
Result size of Simplifier iteration=1
  = {terms: 8,716, types: 17,272, coercions: 8,054, joins: 0/109}
Result size of Simplifier
  = {terms: 8,716, types: 17,272, coercions: 8,054, joins: 0/109}
!!! Simplifier [T15164]: finished in 793.92 milliseconds, allocated 121.870 megabytes
*** Demand analysis [T15164]:
Result size of Demand analysis
  = {terms: 8,716, types: 17,272, coercions: 8,054, joins: 0/109}
!!! Demand analysis [T15164]: finished in 440.65 milliseconds, allocated 49.624 megabytes
*** Worker Wrapper binds [T15164]:
Result size of Worker Wrapper binds
  = {terms: 11,775, types: 29,390, coercions: 8,054, joins: 0/761}
!!! Worker Wrapper binds [T15164]: finished in 12.07 milliseconds, allocated 3.429 megabytes
*** Simplifier [T15164]:
Result size of Simplifier iteration=1
  = {terms: 17,573, types: 104,236, coercions: 890, joins: 0/543}
Result size of Simplifier
  = {terms: 5,574, types: 18,021, coercions: 890, joins: 0/63}
!!! Simplifier [T15164]: finished in 1502.93 milliseconds, allocated 170.271 megabytes
*** Exitification transformation [T15164]:
Result size of Exitification transformation
  = {terms: 5,574, types: 18,021, coercions: 890, joins: 0/63}
!!! Exitification transformation [T15164]: finished in 2.65 milliseconds, allocated 0.638 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [T15164]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = True})
  = {terms: 5,574, types: 18,021, coercions: 890, joins: 0/63}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [T15164]: finished in 139.13 milliseconds, allocated 31.777 megabytes
*** Common sub-expression [T15164]:
Result size of Common sub-expression
  = {terms: 4,921, types: 12,615, coercions: 862, joins: 0/63}
!!! Common sub-expression [T15164]: finished in 57.98 milliseconds, allocated 9.568 megabytes
*** Float inwards [T15164]:
Result size of Float inwards
  = {terms: 4,915, types: 12,604, coercions: 862, joins: 0/60}
!!! Float inwards [T15164]: finished in 8.74 milliseconds, allocated 4.139 megabytes
*** Simplifier [T15164]:
Result size of Simplifier iteration=1
  = {terms: 4,719, types: 12,081, coercions: 862, joins: 0/60}
Result size of Simplifier
  = {terms: 4,719, types: 12,081, coercions: 862, joins: 0/60}
!!! Simplifier [T15164]: finished in 237.42 milliseconds, allocated 48.927 megabytes
*** Demand analysis [T15164]:
Result size of Demand analysis
  = {terms: 4,719, types: 12,081, coercions: 862, joins: 0/60}
!!! Demand analysis [T15164]: finished in 93.32 milliseconds, allocated 22.688 megabytes
*** CoreTidy [T15164]:
Result size of Tidy Core
  = {terms: 4,815, types: 13,321, coercions: 865, joins: 0/60}
!!! CoreTidy [T15164]: finished in 52.63 milliseconds, allocated 14.762 megabytes
Created temporary directory: /tmp/ghc8754_0
*** CorePrep [T15164]:
Result size of CorePrep
  = {terms: 6,049, types: 15,480, coercions: 865, joins: 0/363}
!!! CorePrep [T15164]: finished in 35.54 milliseconds, allocated 10.242 megabytes
*** Stg2Stg:
*** CodeGen [T15164]:
!!! CodeGen [T15164]: finished in 907.70 milliseconds, allocated 209.890 megabytes
*** Assembler:
*** CorePrep [T15164]:
Result size of CorePrep
  = {terms: 6,049, types: 15,480, coercions: 865, joins: 0/363}
!!! CorePrep [T15164]: finished in 223.32 milliseconds, allocated 10.223 megabytes
*** Stg2Stg:
*** CodeGen [T15164]:
!!! CodeGen [T15164]: finished in 859.65 milliseconds, allocated 218.926 megabytes
*** Assembler:

comment:21 Changed 16 months ago by Simon Peyton Jones <simonpj@…>

In 5f3fb71/ghc:

Fix perf numbers for #15164

comment:22 Changed 16 months ago by nomeata

JFTR, there is a patch that fixes most of it at Phab:D4718; I am just waiting for Harbormaster to finish and report the new perf test numbers, then I will push it to master.

comment:23 Changed 16 months ago by Joachim Breitner <mail@…>

In db6085b8/ghc:

Improve performance of CallArity

the hot path contained a call to

    v `elemUnVarSet` (neighbors g v)

and creating the set of neighbors just to check if `v` is inside
accounted for half the allocations of the test case of #15164.

By introducing a non-allocating function `hasLoopAt` for this we shave
off half the allocations. This brings the total cost of Call Arity down
to 20% of time and 23% of allocations, according to a profiled run. Not
amazing, but still much better.

Differential Revision: https://phabricator.haskell.org/D4718
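The shape of that fix can be sketched against a simplified model of the data structure (the names and representation here are invented; GHC's real UnVarGraph stores a bag of complete bipartite graphs, which this toy version only approximates): testing for a self-loop directly avoids allocating the whole neighbour set first.

```haskell
import qualified Data.IntSet as IS

-- Toy model: an undirected graph stored as a list of complete bipartite
-- pieces, each connecting every variable in the first set to every
-- variable in the second.
type Graph = [(IS.IntSet, IS.IntSet)]

-- Old hot path in spirit: materialise the full neighbour set
-- (allocating a union of sets) just to ask one membership question.
hasLoopAtOld :: Graph -> Int -> Bool
hasLoopAtOld g v = v `IS.member` neighbors g v
  where
    neighbors g' w = IS.unions (concatMap pick g')
      where pick (a, b) = [b | w `IS.member` a] ++ [a | w `IS.member` b]

-- The fix in spirit: v has a self-loop iff some bipartite piece
-- contains v on both sides, which is two membership tests per piece
-- and no set allocation.
hasLoopAt :: Graph -> Int -> Bool
hasLoopAt g v = any (\(a, b) -> v `IS.member` a && v `IS.member` b) g
```

Both functions answer the same question; the second simply never builds the intermediate set, which is where the saved allocations come from.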

comment:24 Changed 16 months ago by nomeata

Status: new → merge

This should fix most of the cost of running Call Arity. The patch is small and self-contained, so it may be a good candidate for merging into the stable branch.

Of course more could be done about the cost of running Call Arity, but not here and now.

comment:25 Changed 16 months ago by bgamari

Resolution: fixed
Status: merge → closed

This is already present in 8.6, and there likely won't be another 8.4 release.

Note: See TracTickets for help on using tickets.