Opened 3 years ago
Closed 2 years ago
#12545 closed bug (fixed)
Compilation time/space regression in GHC 8.0/8.1 (search in type-level lists and -O)
Reported by: | mikhail.vorozhtsov | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | 8.2.1 |
Component: | Compiler | Version: | 8.0.1 |
Keywords: | Inlining | Cc: | nh2 |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | Compile-time performance bug | Test Case: | perf/compiler/T12545 |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Rev(s): | Phab:D3632 | |
Wiki Page: |
Description
After upgrading to GHC 8.0.1 I've noticed that compilation time of one of my libraries got unacceptably long. Please find the attached test case. Here are some numbers from my machine:
$ time ~/Prefixes/ghc-7.10.3/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 19103924976 bytes, 1441 GCs, 73048470/256579104 avg/max bytes residency (18 samples), 542M in use, 0.000 INIT (0.002 elapsed), 14.163 MUT (14.257 elapsed), 5.707 GC (5.830 elapsed) :ghc>> real 0m20.160s user 0m19.937s sys 0m0.197s $ time ~/Prefixes/ghc-8.0.1/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 147662211104 bytes, 9635 GCs, 190336418/433097448 avg/max bytes residency (38 samples), 1210M in use, 0.000 INIT (0.001 elapsed), 101.707 MUT (101.840 elapsed), 27.907 GC (28.259 elapsed) :ghc>> real 2m10.195s user 2m9.773s sys 0m0.400s
Both versions are the official Debian 8 x86_64 binaries from haskell.org. I also compiled the git master and unfortunately it is affected as well.
It is worth mentioning that turning off optimization makes the problem disappear:
$ time ~/Prefixes/ghc-8.0.1/bin/ghc -O0 -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 3779057184 bytes, 225 GCs, 17923355/81566792 avg/max bytes residency (8 samples), 164M in use, 0.000 INIT (0.001 elapsed), 2.847 MUT (2.941 elapsed), 0.553 GC (0.572 elapsed) :ghc>> real 0m3.593s user 0m3.507s sys 0m0.073s
Attachments (2)
Change History (20)
Changed 3 years ago by
Attachment: | TypeList.hs added |
---|
Changed 3 years ago by
Attachment: | Regression.hs added |
---|
comment:1 Changed 3 years ago by
comment:2 Changed 3 years ago by
The results of putting NOINLINE
on elemWitness
:
$ time ~/Prefixes/ghc-7.10.3/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 4483083448 bytes, 546 GCs, 31794039/124563608 avg/max bytes residency (13 samples), 294M in use, 0.000 INIT (0.001 elapsed), 3.247 MUT (3.279 elapsed), 1.677 GC (1.753 elapsed) :ghc>> real 0m5.092s user 0m4.983s sys 0m0.097s $ time ~/Prefixes/ghc-8.0.1/bin/ghc -O -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 25719759160 bytes, 2550 GCs, 51924484/183848928 avg/max bytes residency (16 samples), 510M in use, 0.000 INIT (0.000 elapsed), 19.883 MUT (20.050 elapsed), 3.770 GC (3.880 elapsed) :ghc>> real 0m23.988s user 0m23.747s sys 0m0.190s
comment:3 Changed 3 years ago by
Keywords: | Inlining added |
---|
comment:4 Changed 3 years ago by
The interface file produced by both versions is absolutely huge (17mb). A very large number of coercions are generated by the instance declarations in Regression.hs.
Interestingly, one way to improve the performance on 7.10.3 is marking elemWitness
as inlinable.. This stops the unfoldings being optimised but it doesn't bring performance to the same level as omitting unfoldings completely.
comment:5 Changed 3 years ago by
Several points here
- We don't yet know why 8.0 is worse than 7.10. That would e helpful.
- The coercions record a proof of all the list manipulations involved. It's not surprising that they can get big. What is surprising is why they are bigger with 8.0
- GHC includes a "coercion optimiser"
OptCoercion.hs
, which tries to shrink a big coercion to a smaller but equivalent coercion. It's possible that 8.0 is somehow missing a crucial optimisation that fires in 7.10.
- We also have a plan, recorded in Trac #8095, to discard coercions altogether. But neither Richard or I have found time to execute on that plan. See also #11598
Bottom line: more insight into what is going on would be really helpful.
comment:6 Changed 3 years ago by
Here are some numbers from -v4
logs:
$ grep -E '(=====|^Result|coercions)' out-7_10.log | grep -v 'RHS size' | head -40 ==================== Parser ==================== ==================== Source Statistics ==================== ==================== Occurrence analysis ==================== ==================== Desugar (after optimization) ==================== Result size of Desugar (after optimization) = {terms: 2,144, types: 128,368, coercions: 415,496} ==================== Occurrence analysis ==================== Result size of Simplifier iteration=1 = {terms: 3,200, types: 173,424, coercions: 415,528} ==================== Occurrence analysis ==================== Result size of Simplifier = {terms: 3,200, types: 173,424, coercions: 415,528} ==================== Specialise ==================== Result size of Specialise = {terms: 12,416, types: 658,320, coercions: 455,704} ==================== FloatOut stats: ==================== Result size of Float out(FOS {Lam = Just 0, = {terms: 13,472, types: 739,808, coercions: 455,704} ==================== Occurrence analysis ==================== Result size of Simplifier iteration=1 = {terms: 30,336, types: 1,848,096, coercions: 6,417,128} ==================== Occurrence analysis ==================== Result size of Simplifier = {terms: 30,336, types: 1,848,096, coercions: 6,417,128} ==================== Occurrence analysis ==================== Result size of Simplifier = {terms: 30,336, types: 1,848,096, coercions: 6,417,128} ==================== Occurrence analysis ==================== Result size of Simplifier = {terms: 30,336, types: 1,848,096, coercions: 6,417,128} Result size of Float inwards = {terms: 30,336, types: 1,848,096, coercions: 6,417,128} ==================== Called arity analysis ==================== Result size of Called arity analysis = {terms: 30,336, types: 1,848,096, coercions: 6,417,128} ==================== Occurrence analysis ====================
for 7.10.3 versus
$ grep -E '(=====|^Result|coercions)' out-8_0.log | grep -v 'RHS size' | head -40 ==================== Parser ==================== ==================== Source Statistics ==================== ==================== Occurrence analysis ==================== ==================== Desugar (after optimization) ==================== Result size of Desugar (after optimization) = {terms: 3,835, types: 147,262, coercions: 549,848} ==================== Occurrence analysis ==================== Result size of Simplifier iteration=1 = {terms: 4,897, types: 161,681, coercions: 549,849} ==================== Occurrence analysis ==================== Result size of Simplifier = {terms: 4,897, types: 161,681, coercions: 549,849} ==================== Specialise ==================== Result size of Specialise = {terms: 17,793, types: 899,977, coercions: 3,094,577} ==================== FloatOut stats: ==================== Result size of Float out(FOS {Lam = Just 0, = {terms: 20,769, types: 1,109,785, coercions: 3,094,577} ==================== Occurrence analysis ==================== Result size of Simplifier iteration=1 = {terms: 16,553, types: 871,953, coercions: 2,232,281} ==================== Occurrence analysis ==================== Result size of Simplifier = {terms: 16,429, types: 864,761, coercions: 2,184,448} ==================== Occurrence analysis ==================== Result size of Simplifier = {terms: 16,429, types: 864,761, coercions: 2,184,448} ==================== Occurrence analysis ==================== Result size of Simplifier iteration=1 = {terms: 87,357, types: 4,366,893, coercions: 10,008,352} ==================== Occurrence analysis ==================== Result size of Simplifier = {terms: 87,357, types: 4,366,893, coercions: 10,008,352} Result size of Float inwards = {terms: 87,357, types: 4,366,893, coercions: 10,008,352} ==================== Called arity analysis ==================== Result size of Called arity analysis = {terms: 87,357, types: 4,366,893, coercions: 10,008,352} ==================== Occurrence analysis ====================
for 8.0.1
comment:7 Changed 3 years ago by
This, to me, does not look like a cut-and-dried type-level issue. Note that the number of terms is much greater in 8.0. And the ratio of coercions to terms is lower in 8.0 than 7.10.
It looks like the occurrence analysis in 8.0 is to blame for the blowup in terms... whereas that same pass in 7.10 caused more of a blowup in coercions. Weird.
comment:8 Changed 2 years ago by
A sizable chunk of the regression came from commit 6746549772c5cc0ac66c0fce562f297f4d4b80a2 (Add kind equalities to GHC.)
Commit 6e56ac58a6905197412d58e32792a04a63b94d7e (Fix infix record field fixity (#11167 and #11173).) ----- $ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 19608856112 bytes, 2265 GCs, 63045253/234165360 avg/max bytes residency (19 samples), 498M in use, 0.001 INIT (0.001 elapsed), 9.087 MUT (9.117 elapsed), 3.649 GC (3.647 elapsed) :ghc>> real 0m12.829s user 0m12.688s sys 0m0.140s Commit 6746549772c5cc0ac66c0fce562f297f4d4b80a2 (Add kind equalities to GHC.) ----- $ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 68214058248 bytes, 5147 GCs, 71712416/267022672 avg/max bytes residency (20 samples), 586M in use, 0.001 INIT (0.001 elapsed), 29.850 MUT (29.871 elapsed), 4.806 GC (4.804 elapsed) :ghc>> real 0m34.744s user 0m34.572s sys 0m0.176s
But that's not the full story, since I get a realtime of 1m12.849s
with GHC 8.0.1. I'll try to figure out what other commit(s) are contributing.
comment:9 Changed 2 years ago by
Found it. The other culprit is commit 1722fa106e10e63160bb2322e2ccb830fd5b9ab3 (Fix #11230.):
Commit ae86eb9f72fa7220fe47ac54d6d21395691c1308 (Fix tcTyClTyVars to handle SigTvs) ----- $ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 68211633024 bytes, 4916 GCs, 73233972/211211872 avg/max bytes residency (20 samples), 571M in use, 0.001 INIT (0.001 elapsed), 28.299 MUT (28.320 elapsed), 4.831 GC (4.828 elapsed) :ghc>> real 0m33.216s user 0m33.040s sys 0m0.176s Commit 1722fa106e10e63160bb2322e2ccb830fd5b9ab3 (Fix #11230.) ----- $ time ghc4/inplace/bin/ghc-stage2 -fforce-recomp -O -c -Rghc-timing TypeList.hs Regression.hs <<ghc: 161224532608 bytes, 8789 GCs, 195013385/460659576 avg/max bytes residency (37 samples), 1272M in use, 0.001 INIT (0.001 elapsed), 66.710 MUT (66.753 elapsed), 18.669 GC (18.660 elapsed) :ghc>> real 1m25.494s user 1m25.252s sys 0m0.260s
What's alarming is that that commit is so small! There must be something interesting happening there...
comment:10 Changed 2 years ago by
Cc: | nh2 added |
---|
comment:12 Changed 2 years ago by
See this ghc-devs thread, and this one.
There seem to be several issues:
- Why does the small change identified by Ryan have such a huge effect? We need insight.
- The change that Richard made to
CoreOpt.simple_opt_expr
actually fixed an outright bug. But it's still wrong now. See Trac #13798.
- The fuss that is caused by this
map/coerce
rule is out of all proportion to its importance. We should think of a better way.
comment:13 Changed 2 years ago by
Incidentally, HEAD is much much better:
7.10 19.2G allocated by GHC 50 seconds HEAD 7.1G allocated by GHC 30 seconds
I gave up waiting for 8.0. So apparently the original problem has gone away, although I'd quite like to know how!
comment:14 Changed 2 years ago by
Well hot dang. It turns out that commit 2b74bd9d8b4c6b20f3e8d9ada12e7db645cc3c19 (Stop the specialiser generating loopy code) is once again responsible for fixing the problem!
Commit 92a4f908f2599150bec0530d688997f03780646e (Spelling typos) <<ghc: 161228659408 bytes, 6342 GCs, 101585298/269563168 avg/max bytes residency (15 samples), 712M in use, 0.001 INIT (0.001 elapsed), 53.742 MUT (53.674 elapsed), 5.431 GC (5.420 elapsed) :ghc>> real 0m59.150s user 0m59.020s sys 0m0.224s Commit 2b74bd9d8b4c6b20f3e8d9ada12e7db645cc3c19 (Stop the specialiser generating loopy code) <<ghc: 7146221432 bytes, 755 GCs, 30379988/108444792 avg/max bytes residency (10 samples), 240M in use, 0.001 INIT (0.001 elapsed), 3.583 MUT (3.603 elapsed), 0.890 GC (0.888 elapsed) :ghc>> real 0m4.549s user 0m4.440s sys 0m0.096s
For reference, the time for GHC 7.10.3 was:
<<ghc: 19108000616 bytes, 1539 GCs, 62086227/176263896 avg/max bytes residency (20 samples), 453M in use, 0.001 INIT (0.001 elapsed), 9.188 MUT (9.210 elapsed), 3.974 GC (3.972 elapsed) :ghc>> real 0m13.477s user 0m13.128s sys 0m0.104s
So now it's even better than it used to be! That's pretty cool.
I don't have any particular insight as to why that commit made things so much better, but sure enough, TypeList.hs
contains some classes with significant superclasses:
class (FindElem 'HeadElem a l ~ 'Elem p) => ElemAt p (a :: k) (l :: [k]) where elemWitness :: Proxy a -> Proxy l -> ElemWitness p a l class IsElem t (ElemsOf a) => ElemOf a t where
So it's entirely possible that their corresponding dictionaries are just getting optimized better. It would be great to have someone verify this at the Core level.
comment:15 Changed 2 years ago by
I've backported the above mentioned commit to GHC 8.0.2 and created bindist for Linux:
https://github.com/nh2/ghc/releases/tag/ghc-8.0.2-bugfix-12545-backport-2017-06-08
comment:16 Changed 2 years ago by
Differential Rev(s): | → Phab:D3632 |
---|---|
Status: | new → patch |
comment:18 Changed 2 years ago by
Milestone: | → 8.2.1 |
---|---|
Resolution: | → fixed |
Status: | patch → closed |
Test Case: | → perf/compiler/T12545 |
Thanks for the great test case!