Opened 11 months ago

Last modified 11 months ago

#15703 new bug

Significant compilation time blowup when refactoring singletons-heavy code

Reported by: RyanGlScott Owned by:
Priority: high Milestone: 8.8.1
Component: Compiler Version: 8.6.1
Keywords: Cc: bgamari, xnning
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #15725 Differential Rev(s):
Wiki Page:

Description

At ICFP, I griped to Ben about some code I'm writing (which makes use of singletons to the nth degree) that takes absolutely forever to compile with optimizations. I suspected that type families (i.e., coercions) were the culprit, but Ben requested that I put the code on Trac anyways, so here it is.

I've attached two modules (Lib.hs and Lib2.hs). You'll need GHC 8.6.1 or later to compile it. To compile it, run:

$ time /opt/ghc/8.6.1/bin/ghc -O1 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    3m28.367s
user    3m28.512s
sys     0m0.212s

Note that if you compile without optimizations, it'll finish almost immediately:

$ time /opt/ghc/8.6.1/bin/ghc -O0 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    0m0.528s
user    0m0.480s
sys     0m0.036s

Also, if you look at the source code for Lib.hs, you'll notice some code which says:

instance SApplicative f => SApplicative (M1 i c f) where
  sPure x = singFun3 @(.@#@$) (%.) @@ singFun1 @M1Sym0 SM1 @@ (singFun1 @PureSym0 sPure) @@ x
  -- If I change the implementation of sPure above to be this:
  --
  --   sPure x = SM1 (sPure x)
  --
  -- Then Lib.hs compiles quickly again (< 1 second) with -O1.

  SM1 f %<*> SM1 x = SM1 (f %<*> x)

If you apply this suggested change, then you can see the difference when you compile it with optimizations again:

$ time /opt/ghc/8.6.1/bin/ghc -O1 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    0m0.986s
user    0m0.940s
sys     0m0.040s

Given that this particular change of code causes such a dramatic increase in compilation, I have to wonder if there's more going on here than just the usual type-family slowness.

Attachments (3)

Lib2.hs (4.5 KB) - added by RyanGlScott 11 months ago.
Lib2.hs
Lib.hs (5.1 KB) - added by RyanGlScott 11 months ago.
Lib.hs
dshow-passes.txt (13.9 KB) - added by RyanGlScott 11 months ago.

Download all attachments as: .zip

Change History (17)

Changed 11 months ago by RyanGlScott

Attachment: Lib2.hs added

Lib2.hs

comment:1 Changed 11 months ago by RyanGlScott

If you're wondering why I split the code up into two separate modules (instead of combining it into a single module), I observed that combining things into a single module actually made the compilation slowdown less severe (it took about 1 minute to compile, as opposed to the ~3m30s compile time shown above). I suspect that dividing up the code into further modules might exacerbate the compile times even more, since in the original project that I took this code from, it actually takes longer than 3m30s to compile the whole thing with -O1.

Changed 11 months ago by RyanGlScott

Attachment: Lib.hs added

Lib.hs

comment:2 Changed 11 months ago by RyanGlScott

I built this code using a profiled GHC 8.6.1. Here are the highlights:

        Wed Oct  3 22:18 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/u/rgscott/Software/ghc4/inplace/lib -O1 -fforce-recomp Lib.hs
                                                                                                   
        total time  =      276.05 secs   (276053 ticks @ 1000 us, 1 processor)                     
        total alloc = 403,405,509,720 bytes  (excludes profiling overheads)                        
                                                                                                   
COST CENTRE             MODULE   SRC                                            %time %alloc       
                                                                                                   
simplCast-simplCoercion Simplify compiler/simplCore/Simplify.hs:1248:57-77       75.8   76.8       
substTyWith             TyCoRep  compiler/types/TyCoRep.hs:(2213,23)-(2214,50)   23.8   23.0

Changed 11 months ago by RyanGlScott

Attachment: dshow-passes.txt added

comment:3 Changed 11 months ago by RyanGlScott

I attached the results of compiling Lib.hs with -dshow-passes enabled. Of particular note is:

Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 950, types: 14,499, coercions: 83,275, joins: 0/12}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [Lib]: finished in 28.01 milliseconds, allocated 19.390 megabytes
*** Simplifier [Lib]:
Result size of Simplifier iteration=1
  = {terms: 924, types: 12,141, coercions: 235,672, joins: 0/46}
Result size of Simplifier iteration=2
  = {terms: 755, types: 5,948, coercions: 312,927, joins: 0/17}
Result size of Simplifier iteration=3
  = {terms: 715, types: 3,828, coercions: 403,226, joins: 0/2}
Result size of Simplifier
  = {terms: 703, types: 3,759, coercions: 228,342, joins: 0/2}
!!! Simplifier [Lib]: finished in 208488.74 milliseconds, allocated 373891.763 megabytes
*** Simplifier [Lib]:
Result size of Simplifier
  = {terms: 703, types: 3,759, coercions: 228,342, joins: 0/2}

The number of coercions appears to jump between float-out and the first simplifier pass. However, it's a later simplifier pass that takes up the bulk of the time (208488.74 milliseconds of it).

comment:4 Changed 11 months ago by goldfire

It might be worth checking whether optCoercion is helping or hurting: you can check the size of the input to the function and the size of the output. The output shouldn't be bigger than the input!

comment:5 Changed 11 months ago by RyanGlScott

I added {-# OPTIONS_GHC -fprof-auto #-} to the top of OptCoercion, which gives me some slightly more fine-grained information:

        Mon Oct  8 13:50 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/u/rgscott/Software/ghc3/inplace/lib -O1 -fforce-recomp Lib.hs

        total time  =      282.87 secs   (282872 ticks @ 1000 us, 1 processor)
        total alloc = 403,927,138,248 bytes  (excludes profiling overheads)

COST CENTRE    MODULE      SRC                                             %time %alloc

matchAxiom     OptCoercion compiler/types/OptCoercion.hs:(903,1)-(916,11)   47.3   43.4
substTyWith    TyCoRep     compiler/types/TyCoRep.hs:(2213,23)-(2214,50)    23.0   22.9
opt_trans_rule OptCoercion compiler/types/OptCoercion.hs:(512,1)-(681,30)   18.4   21.5
etaAppCo_maybe OptCoercion compiler/types/OptCoercion.hs:(985,1)-(996,11)    6.1    9.2
opt_co4        OptCoercion compiler/types/OptCoercion.hs:(174,1)-(386,71)    3.4    2.2

comment:6 Changed 11 months ago by simonpj

Those are staggeringly large coercions decorating relatively tiny types.

I wonder: could any of them be very large representations of Refl? I note that OptCoercion doesn't use isReflexiveCo (which tests for equality of the two types) because doing so at every node would be expensive. But you could try testing for isReflexiveCo, and spit out the smallest coercion which says True to isReflexiveCo but is not actually Refl. You'd want to make this test for every sub-tree in the coercion.

That's a guess. But it's very hard to imagine that you really need such large proofs. The trick is to find where the redundancy is.

comment:7 Changed 11 months ago by RyanGlScott

I don't know if this is related to the performance issue witnessed here, but it turns out that Lib.hs compiles cleanly with -dcore-lint -O0, but crashes with -dcore-lint -O1. I've opened #15725 for this. (Interestingly, in the more minimal example I provide in #15725, the Core Lint error happens regardless of optimization level.)

comment:8 Changed 11 months ago by RyanGlScott

I tried compiling this program using the experimental Phab:D4766 (Zap coercions when not building with -dcore-lint) branch:

$ time ~/Software/ghc3/inplace/bin/ghc-stage2 -O0 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    0m0.497s
user    0m0.460s
sys     0m0.028s

$ time ~/Software/ghc3/inplace/bin/ghc-stage2 -O1 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    0m17.164s
user    0m16.980s
sys     0m0.200s

This looks promising! Perhaps this really is a duplicate of #8095 in disguise?

comment:9 Changed 11 months ago by bgamari

Priority: normalhigh

comment:10 Changed 11 months ago by simonpj

I still think (comment:6) that it may well be that there is massive redundancy here, and if so it'd be a pity to miss out on fixing that, good though #8095 is.

comment:11 Changed 11 months ago by RyanGlScott

Hold on a second... now that I've compiled GHC HEAD, it turns out that HEAD compiles this program much faster than 8.6.1 does:

$ time ~/Software/ghc2/inplace/bin/ghc-stage2 -O0 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    0m0.462s
user    0m0.404s
sys     0m0.048s

$ time ~/Software/ghc2/inplace/bin/ghc-stage2 -O1 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    0m22.205s
user    0m22.180s
sys     0m0.048s

In fact, much of the speed gain that I saw in comment:8 would more accurately be attributed to using HEAD, not necessarily Phab:D4766 (although Phab:D4766 does seem to shave off about five seconds, which is nothing to shake a stick at). It's not clear to me what makes HEAD so much faster, but that's quite encouraging!

comment:12 Changed 11 months ago by simonpj

Interesting. It's great that HEAD is faster; and it'd be even better to understand why.

comment:13 Changed 11 months ago by RyanGlScott

After some digging, I've found the commit that made compiling this program become so much faster. It's commit 55a3f8552c9dc9b84e204ec6623c698912795347 (Refactor coercion rule). In the commit prior, it took this long to build:

$ time ~/Software/ghc2/inplace/bin/ghc-stage2 -O1 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    3m17.647s
user    3m17.776s
sys     0m0.168s

But if I apply that commit, it then takes this long to build:

$ time ~/Software/ghc2/inplace/bin/ghc-stage2 -O1 -fforce-recomp Lib.hs
[1 of 2] Compiling Lib2             ( Lib2.hs, Lib2.o )
[2 of 2] Compiling Lib              ( Lib.hs, Lib.o )

real    0m26.102s
user    0m26.032s
sys     0m0.108s

This is quite a humorous coincidence, since up to this point I had been under the impression that that commit had made compilation performance slightly worse overall. But this shows I was completely wrong!

comment:14 Changed 11 months ago by goldfire

Cc: xnning added

Ha! One of the motivations for that refactoring was increased performance, and we were befuddled why that goal didn't materialize. And now it has. :)

Note: See TracTickets for help on using tickets.