Opened 3 years ago

Closed 2 years ago

#12545 closed bug (fixed)

Compilation time/space regression in GHC 8.0/8.1 (search in type-level lists and -O)

Reported by: mikhail.vorozhtsov Owned by:
Priority: normal Milestone: 8.2.1
Component: Compiler Version: 8.0.1
Keywords: Inlining Cc: nh2
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case: perf/compiler/T12545
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D3632
Wiki Page:

Description

After upgrading to GHC 8.0.1 I've noticed that compilation time of one of my libraries got unacceptably long. Please find the attached test case. Here are some numbers from my machine:

$ time ~/Prefixes/ghc-7.10.3/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs 
<<ghc: 19103924976 bytes, 1441 GCs, 73048470/256579104 avg/max bytes residency (18 samples), 542M in use, 0.000 INIT (0.002 elapsed), 14.163 MUT (14.257 elapsed), 5.707 GC (5.830 elapsed) :ghc>>

real	0m20.160s
user	0m19.937s
sys	0m0.197s
$ time ~/Prefixes/ghc-8.0.1/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs 
<<ghc: 147662211104 bytes, 9635 GCs, 190336418/433097448 avg/max bytes residency (38 samples), 1210M in use, 0.000 INIT (0.001 elapsed), 101.707 MUT (101.840 elapsed), 27.907 GC (28.259 elapsed) :ghc>>

real	2m10.195s
user	2m9.773s
sys	0m0.400s

Both versions are the official Debian 8 x86_64 binaries from haskell.org. I also compiled the git master and unfortunately it is affected as well.

It is worth mentioning that turning off optimization makes the problem disappear:

$ time ~/Prefixes/ghc-8.0.1/bin/ghc -O0 -c -Rghc-timing TypeList.hs Regression.hs 
<<ghc: 3779057184 bytes, 225 GCs, 17923355/81566792 avg/max bytes residency (8 samples), 164M in use, 0.000 INIT (0.001 elapsed), 2.847 MUT (2.941 elapsed), 0.553 GC (0.572 elapsed) :ghc>>

real	0m3.593s
user	0m3.507s
sys	0m0.073s

Attachments (2)

TypeList.hs (1.7 KB) - added by mikhail.vorozhtsov 3 years ago.
Regression.hs (1.6 KB) - added by mikhail.vorozhtsov 3 years ago.

Download all attachments as: .zip

Change History (20)

Changed 3 years ago by mikhail.vorozhtsov

Attachment: TypeList.hs added

Changed 3 years ago by mikhail.vorozhtsov

Attachment: Regression.hs added

comment:1 Changed 3 years ago by mpickering

Thanks for the great test case!

comment:2 Changed 3 years ago by mikhail.vorozhtsov

The results of putting NOINLINE on elemWitness:

$ time ~/Prefixes/ghc-7.10.3/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs 
<<ghc: 4483083448 bytes, 546 GCs, 31794039/124563608 avg/max bytes residency (13 samples), 294M in use, 0.000 INIT (0.001 elapsed), 3.247 MUT (3.279 elapsed), 1.677 GC (1.753 elapsed) :ghc>>

real	0m5.092s
user	0m4.983s
sys	0m0.097s
$ time ~/Prefixes/ghc-8.0.1/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs 
<<ghc: 25719759160 bytes, 2550 GCs, 51924484/183848928 avg/max bytes residency (16 samples), 510M in use, 0.000 INIT (0.000 elapsed), 19.883 MUT (20.050 elapsed), 3.770 GC (3.880 elapsed) :ghc>>

real	0m23.988s
user	0m23.747s
sys	0m0.190s

comment:3 Changed 3 years ago by mpickering

Keywords: Inlining added

comment:4 Changed 3 years ago by mpickering

The interface file produced by both versions is absolutely huge (17mb). A very large number of coercions are generated by the instance declarations in Regression.hs.

Interestingly, one way to improve the performance on 7.10.3 is marking elemWitness as inlinable.. This stops the unfoldings being optimised but it doesn't bring performance to the same level as omitting unfoldings completely.

Last edited 3 years ago by mpickering (previous) (diff)

comment:5 Changed 3 years ago by simonpj

Several points here

  • We don't yet know why 8.0 is worse than 7.10. That would e helpful.
  • The coercions record a proof of all the list manipulations involved. It's not surprising that they can get big. What is surprising is why they are bigger with 8.0
  • GHC includes a "coercion optimiser" OptCoercion.hs, which tries to shrink a big coercion to a smaller but equivalent coercion. It's possible that 8.0 is somehow missing a crucial optimisation that fires in 7.10.
  • We also have a plan, recorded in Trac #8095, to discard coercions altogether. But neither Richard or I have found time to execute on that plan. See also #11598

Bottom line: more insight into what is going on would be really helpful.

comment:6 Changed 3 years ago by mikhail.vorozhtsov

Here are some numbers from -v4 logs:

$ grep -E '(=====|^Result|coercions)' out-7_10.log | grep -v 'RHS size' | head -40
==================== Parser ====================
==================== Source Statistics ====================
==================== Occurrence analysis ====================
==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 2,144, types: 128,368, coercions: 415,496}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1
  = {terms: 3,200, types: 173,424, coercions: 415,528}
==================== Occurrence analysis ====================
Result size of Simplifier
  = {terms: 3,200, types: 173,424, coercions: 415,528}
==================== Specialise ====================
Result size of Specialise
  = {terms: 12,416, types: 658,320, coercions: 455,704}
==================== FloatOut stats: ====================
Result size of Float out(FOS {Lam = Just 0,
  = {terms: 13,472, types: 739,808, coercions: 455,704}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1
  = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Occurrence analysis ====================
Result size of Simplifier
  = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Occurrence analysis ====================
Result size of Simplifier
  = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Occurrence analysis ====================
Result size of Simplifier
  = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
Result size of Float inwards
  = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Called arity analysis ====================
Result size of Called arity analysis
  = {terms: 30,336, types: 1,848,096, coercions: 6,417,128}
==================== Occurrence analysis ====================

for 7.10.3 versus

$ grep -E '(=====|^Result|coercions)' out-8_0.log | grep -v 'RHS size' | head -40
==================== Parser ====================
==================== Source Statistics ====================
==================== Occurrence analysis ====================
==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 3,835, types: 147,262, coercions: 549,848}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1
  = {terms: 4,897, types: 161,681, coercions: 549,849}
==================== Occurrence analysis ====================
Result size of Simplifier
  = {terms: 4,897, types: 161,681, coercions: 549,849}
==================== Specialise ====================
Result size of Specialise
  = {terms: 17,793, types: 899,977, coercions: 3,094,577}
==================== FloatOut stats: ====================
Result size of Float out(FOS {Lam = Just 0,
  = {terms: 20,769, types: 1,109,785, coercions: 3,094,577}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1
  = {terms: 16,553, types: 871,953, coercions: 2,232,281}
==================== Occurrence analysis ====================
Result size of Simplifier
  = {terms: 16,429, types: 864,761, coercions: 2,184,448}
==================== Occurrence analysis ====================
Result size of Simplifier
  = {terms: 16,429, types: 864,761, coercions: 2,184,448}
==================== Occurrence analysis ====================
Result size of Simplifier iteration=1
  = {terms: 87,357, types: 4,366,893, coercions: 10,008,352}
==================== Occurrence analysis ====================
Result size of Simplifier
  = {terms: 87,357, types: 4,366,893, coercions: 10,008,352}
Result size of Float inwards
  = {terms: 87,357, types: 4,366,893, coercions: 10,008,352}
==================== Called arity analysis ====================
Result size of Called arity analysis
  = {terms: 87,357, types: 4,366,893, coercions: 10,008,352}
==================== Occurrence analysis ====================

for 8.0.1

comment:7 Changed 3 years ago by goldfire

This, to me, does not look like a cut-and-dried type-level issue. Note that the number of terms is much greater in 8.0. And the ratio of coercions to terms is lower in 8.0 than 7.10.

It looks like the occurrence analysis in 8.0 is to blame for the blowup in terms... whereas that same pass in 7.10 caused more of a blowup in coercions. Weird.

comment:8 Changed 2 years ago by RyanGlScott

A sizable chunk of the regression came from commit 6746549772c5cc0ac66c0fce562f297f4d4b80a2 (Add kind equalities to GHC.)

Commit 6e56ac58a6905197412d58e32792a04a63b94d7e (Fix infix record field fixity (#11167 and #11173).)
-----
$ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 19608856112 bytes, 2265 GCs, 63045253/234165360 avg/max bytes residency (19 samples), 498M in use, 0.001 INIT (0.001 elapsed), 9.087 MUT (9.117 elapsed), 3.649 GC (3.647 elapsed) :ghc>>

real    0m12.829s
user    0m12.688s
sys     0m0.140s

Commit 6746549772c5cc0ac66c0fce562f297f4d4b80a2 (Add kind equalities to GHC.)
-----
$ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 68214058248 bytes, 5147 GCs, 71712416/267022672 avg/max bytes residency (20 samples), 586M in use, 0.001 INIT (0.001 elapsed), 29.850 MUT (29.871 elapsed), 4.806 GC (4.804 elapsed) :ghc>>

real    0m34.744s
user    0m34.572s
sys     0m0.176s

But that's not the full story, since I get a realtime of 1m12.849s with GHC 8.0.1. I'll try to figure out what other commit(s) are contributing.

comment:9 Changed 2 years ago by RyanGlScott

Found it. The other culprit is commit 1722fa106e10e63160bb2322e2ccb830fd5b9ab3 (Fix #11230.):

Commit ae86eb9f72fa7220fe47ac54d6d21395691c1308 (Fix tcTyClTyVars to handle SigTvs)
-----
$ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 68211633024 bytes, 4916 GCs, 73233972/211211872 avg/max bytes residency (20 samples), 571M in use, 0.001 INIT (0.001 elapsed), 28.299 MUT (28.320 elapsed), 4.831 GC (4.828 elapsed) :ghc>>

real    0m33.216s
user    0m33.040s
sys     0m0.176s

Commit 1722fa106e10e63160bb2322e2ccb830fd5b9ab3 (Fix #11230.)
-----
$ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs
<<ghc: 161224532608 bytes, 8789 GCs, 195013385/460659576 avg/max bytes residency (37 samples), 1272M in use, 0.001 INIT (0.001 elapsed), 66.710 MUT (66.753 elapsed), 18.669 GC (18.660 elapsed) :ghc>>

real    1m25.494s
user    1m25.252s
sys     0m0.260s

What's alarming is that that commit is so small! There must be something interesting happening there...

comment:10 Changed 2 years ago by nh2

Cc: nh2 added

comment:11 Changed 2 years ago by Ben Gamari <ben@…>

In a786b136/ghc:

Use lengthIs and friends in more places

While investigating #12545, I discovered several places in the code
that performed length-checks like so:

```
length ts == 4
```

This is not ideal, since the length of `ts` could be much longer than 4,
and we'd be doing way more work than necessary! There are already a slew
of helper functions in `Util` such as `lengthIs` that are designed to do
this efficiently, so I found every place where they ought to be used and
did just that. I also defined a couple more utility functions for list
length that were common patterns (e.g., `ltLength`).

Test Plan: ./validate

Reviewers: austin, hvr, goldfire, bgamari, simonmar

Reviewed By: bgamari, simonmar

Subscribers: goldfire, rwbarton, thomie

Differential Revision: https://phabricator.haskell.org/D3622

comment:12 Changed 2 years ago by simonpj

See this ghc-devs thread, and this one.

There seem to be several issues:

  • Why does the small change identified by Ryan have such a huge effect? We need insight.
  • The change that Richard made to CoreOpt.simple_opt_expr actually fixed an outright bug. But it's still wrong now. See Trac #13798.
  • The fuss that is caused by this map/coerce rule is out of all proportion to its importance. We should think of a better way.

comment:13 Changed 2 years ago by simonpj

Incidentally, HEAD is much much better:

7.10   19.2G allocated by GHC
       50 seconds

HEAD    7.1G allocated by GHC
        30 seconds

I gave up waiting for 8.0. So apparently the original problem has gone away, although I'd quite like to know how!

comment:14 Changed 2 years ago by RyanGlScott

Well hot dang. It turns out that commit 2b74bd9d8b4c6b20f3e8d9ada12e7db645cc3c19 (Stop the specialiser generating loopy code) is once again responsible for fixing the problem!

Commit 92a4f908f2599150bec0530d688997f03780646e (Spelling typos)
<<ghc: 161228659408 bytes, 6342 GCs, 101585298/269563168 avg/max bytes residency (15 samples), 712M in use, 0.001 INIT (0.001 elapsed), 53.742 MUT (53.674 elapsed), 5.431 GC (5.420 elapsed) :ghc>>

real    0m59.150s
user    0m59.020s
sys     0m0.224s

Commit 2b74bd9d8b4c6b20f3e8d9ada12e7db645cc3c19 (Stop the specialiser generating loopy code)
<<ghc: 7146221432 bytes, 755 GCs, 30379988/108444792 avg/max bytes residency (10 samples), 240M in use, 0.001 INIT (0.001 elapsed), 3.583 MUT (3.603 elapsed), 0.890 GC (0.888 elapsed) :ghc>>

real    0m4.549s
user    0m4.440s
sys     0m0.096s

For reference, the time for GHC 7.10.3 was:

<<ghc: 19108000616 bytes, 1539 GCs, 62086227/176263896 avg/max bytes residency (20 samples), 453M in use, 0.001 INIT (0.001 elapsed), 9.188 MUT (9.210 elapsed), 3.974 GC (3.972 elapsed) :ghc>>

real    0m13.477s
user    0m13.128s
sys     0m0.104s

So now it's even better than it used to be! That's pretty cool.

I don't have any particular insight as to why that commit made things so much better, but sure enough, TypeList.hs contains some classes with significant superclasses:

class (FindElem 'HeadElem a l ~ 'Elem p) => ElemAt p (a :: k) (l :: [k]) where
  elemWitness :: Proxy a -> Proxy l -> ElemWitness p a l

class IsElem t (ElemsOf a) => ElemOf a t where

So it's entirely possible that their corresponding dictionaries are just getting optimized better. It would be great to have someone verify this at the Core level.

comment:15 Changed 2 years ago by nh2

I've backported the above mentioned commit to GHC 8.0.2 and created bindist for Linux:

https://github.com/nh2/ghc/releases/tag/ghc-8.0.2-bugfix-12545-backport-2017-06-08

Last edited 2 years ago by nh2 (previous) (diff)

comment:16 Changed 2 years ago by RyanGlScott

Differential Rev(s): Phab:D3632
Status: newpatch

comment:17 Changed 2 years ago by Ben Gamari <ben@…>

In 6ddb3aaf/ghc:

Add perf test for #12545

Commit 2b74bd9d8b4c6b20f3e8d9ada12e7db645cc3c19 did wonders for the
program reported in #12545. Let's add a perf test for it to make sure it
stays fast.

Test Plan: make test TEST=T12545

Reviewers: austin, bgamari

Reviewed By: bgamari

Subscribers: rwbarton, thomie

GHC Trac Issues: #12545

Differential Revision: https://phabricator.haskell.org/D3632

comment:18 Changed 2 years ago by bgamari

Milestone: 8.2.1
Resolution: fixed
Status: patchclosed
Test Case: perf/compiler/T12545
Note: See TracTickets for help on using tickets.