#14683 closed bug (fixed)

Slow compile times for Happy-generated source

Reported by: harpocrates Owned by:
Priority: high Milestone: 8.4.1
Component: Compiler Version: 8.2.2
Keywords: Cc: cblp, tdammers, simonmar, goldfire
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by harpocrates)

This came up while investigating the compile times of Happy grammars: https://github.com/simonmar/happy/issues/109. GHC 8 and later take a lot longer to compile grammars with types that rely on PartialTypeSignatures and have lots of wildcards.

There is an example repo (https://github.com/wiz/too-happy) in that thread containing a grammar whose generated code has lots of wildcards.

On my laptop, the project takes about 25 seconds to compile on 7.10.3. With 8.0.2, 8.2.2, and HEAD it takes on the order of 4 minutes.

Attachments (1)

Grammar.hs (243.0 KB) - added by harpocrates 23 months ago.

Download all attachments as: .zip

Change History (20)

comment:1 Changed 23 months ago by simonpj

a grammar whose generated code has lots of wildcards.

Would be possible to upload a standalone example? Perhaps just grab the offending generated file, remove unnecessary imports...

comment:2 Changed 23 months ago by harpocrates

Description: modified (diff)

Changed 23 months ago by harpocrates

Attachment: Grammar.hs added

comment:3 Changed 23 months ago by harpocrates

I've distilled the repo down to a single file. In the process, I've realized that the problem is actually not related to PartialTypeSignatures! On my machine:

$ time ghc-7.10.3 Grammar.hs
17.71s user 1.30s system 97% cpu 19.543 total

$ time ghc-8.0.2 Grammar.hs
139.23s user 1.77s system 99% cpu 2:22.31 total

$ time ghc-8.2.2 Grammar.hs
127.42s user 1.53s system 98% cpu 2:11.08 total

$ time ghc-head Grammar.hs
130.42s user 1.06s system 99% cpu 2:12.58 total

comment:4 Changed 23 months ago by harpocrates

Summary: Hole-y partial type signatures lead to slow compile timesSlow compile times for Happy-generated source

comment:5 Changed 23 months ago by simonpj

Type of failure: None/UnknownCompile-time performance bug

comment:6 Changed 23 months ago by simonpj

Thanks for distilling it. Ben and colleagues: worth looking into this to characterise what's going on. Thanks.

comment:7 Changed 23 months ago by cblp

Cc: cblp added

comment:8 Changed 23 months ago by bernie

I believe this issue is causing a massive slow down in the compilation of the language-python library (https://github.com/bjpop/language-python). It used to take a couple of minutes to compile the library, now it takes 7 hours with ghc 8.2.1.

comment:9 Changed 23 months ago by bgamari

Cc: tdammers added
Milestone: 8.4.1
Priority: normalhigh

Tobias, perhaps you want to have a look at this when you are done with your current task?

comment:10 Changed 23 months ago by simonmar

Cc: simonmar added

comment:11 Changed 23 months ago by tdammers

First thing that stands out here is long lists of type vars in the generated code. Maybe something similar to #7258 is going on here? Going to investigate.

comment:12 Changed 23 months ago by alexbiehl

Tobias, I saw you mentioning #7258 here.

I ran over Grammar.hs with my sampling code from for #14461:

  • there are 245 non-toplevel StgRhsClosures
  • the deepest StgRhsClosure is nested 5 levels
  • Of that 245,
    • 168 have 0 free variables
    • 53 have 1 free variables
    • 16 have 2 free variables
    • 6 have 3 free variables
    • 1 has 4 free variables
    • 1 has 7 free variables
    • 206 share 0 free variables with their parent
    • 32 share 1 free variable with their parent
    • 6 share 2 free variables with their parent
    • 1 shares 3 free variables with their parent

I think this is not related to code generation. My timing shows most of the time is spent before CoreToStg. Codegen is a lot less. Our pathological cases #7258 had A LOT more (basically around 100 StgRhsClosures sharing around 150 fvs with their parents) free variable business going on.

comment:13 in reply to:  12 Changed 23 months ago by tdammers

Replying to alexbiehl:

Tobias, I saw you mentioning #7258 here.

I ran over Grammar.hs with my sampling code from for #14461:

  • there are 245 non-toplevel StgRhsClosures
  • the deepest StgRhsClosure is nested 5 levels
  • Of that 245,
    • 168 have 0 free variables
    • 53 have 1 free variables
    • 16 have 2 free variables
    • 6 have 3 free variables
    • 1 has 4 free variables
    • 1 has 7 free variables
    • 206 share 0 free variables with their parent
    • 32 share 1 free variable with their parent
    • 6 share 2 free variables with their parent
    • 1 shares 3 free variables with their parent

I think this is not related to code generation. My timing shows most of the time is spent before CoreToStg. Codegen is a lot less. Our pathological cases #7258 had A LOT more (basically around 100 StgRhsClosures sharing around 150 fvs with their parents) free variable business going on.

Thanks for stepping in, you seem to be much quicker at getting a clearer picture here than me. Good to know that this isn't the direction to look in.

comment:14 Changed 23 months ago by tdammers

SCC profiling so far points at simplLazyBind in Simplify.hs.

comment:15 Changed 23 months ago by tdammers

Further profiling result:

COST CENTRE         MODULE   SRC                                        %time %alloc

coercionKind        Coercion compiler/types/Coercion.hs:1707:3-7         87.5   88.7
CoreTidy            HscMain  compiler/main/HscMain.hs:1253:27-67          3.1    3.3
simplCast-addCoerce Simplify compiler/simplCore/Simplify.hs:1225:53-71    2.8    2.8
simplCast           Simplify compiler/simplCore/Simplify.hs:871:62-87     2.1    1.5
Stg2Stg             HscMain  compiler/main/HscMain.hs:1489:12-44          1.6    1.9

In other words, we're spending close to 90% time and alloc on the coercionKind function.

alexbiehl mentioned that #11735 might be related, in the sense that possible optimizations mentioned there could solve this performance problem here.

comment:16 Changed 23 months ago by simonpj

Cc: goldfire added

In other words, we're spending close to 90% time and alloc on the coercionKind function.

Ha! A smoking gun. Well done!

Moreover, it's a gun that has fired before: see #5631, and the comment in Coercion/hs

Note [Nested InstCos]
~~~~~~~~~~~~~~~~~~~~~
In Trac #5631 we found that 70% of the entire compilation time was
being spent in coercionKind!  The reason was that we had
   (g @ ty1 @ ty2 .. @ ty100)    -- The "@s" are InstCos
where 
   g :: forall a1 a2 .. a100. phi
If we deal with the InstCos one at a time, we'll do this:
   1.  Find the kind of (g @ ty1 .. @ ty99) : forall a100. phi'
   2.  Substitute phi'[ ty100/a100 ], a single tyvar->type subst
But this is a *quadratic* algorithm, and the blew up Trac #5631.
So it's very important to do the substitution simultaneously.

cf Type.applyTys (which in fact we call here)

But clearly the fix for #5631 isn't solving the current problem.

Does the profiling info tell us anything about which calls to coercionKind are so expensive?

I see two ways forward

  1. Identify the non-linearity in coercionKind. It really should not be so expensive. The fix to #5631 fixed one, but presumably there is another.

It's not immediately obvious how to do this. One way might be to instrument coercionKind so that it returns (as well as the kind) the number of recursive invocations of coercionKind and of substTy, and then print out coercions that produce big numbers for either of these.

Alternatively, just fix #11735, and see if that helps. That's easy to do.

  1. Resurrect #11598. I have some ideas.

Actually we might want to do both. Even if we did (2) it'd just cover up badness in (1).

comment:17 Changed 23 months ago by tdammers

I'll take a stab at 1., and then maybe look into 2. Would love to see some of those ideas, see if I can make any sense of it given my limited understanding.

comment:18 Changed 22 months ago by tdammers

#1 has been addressed fairly thoroughly in #11735, bringing execution time down from ~250 s to ~12 s. Closing this for now, especially considering how #11598 doesn't seem very promising and how we have implemented some role caching in more specific spots as part of #11735.

comment:19 Changed 22 months ago by tdammers

Resolution: fixed
Status: newclosed

Fixed / will be fixed via #11735.

Note: See TracTickets for help on using tickets.