Opened 5 years ago

Last modified 21 months ago

#9669 new bug

Long compile time/high memory usage for modules with many deriving clauses

Reported by: snoyberg Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.8.3
Keywords: deriving-perf Cc: gidyn, RyanGlScott, crockeea, steshaw
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I've seen many people complain about this in a variety of contexts. One of the most common cases I personally see is using Persistent with a large number of entities, which results in multi-minute build times for that single module. Usually these reports have come from private repositories, which makes a reproduction difficult, and the use of Template Haskell in those modules tends to confuse the issue.

So I'd like to report this issue from a completely separate project, with the guess that this represents an issue that is affecting many other users. Steps to reproduce:

  • cabal unpack fpco-api-1.2.0.1
  • cd fpco-api-1.2.0.1
  • cabal install
  • ghc --make -isrc/library src/library/FP/API/Types.hs
  • touch src/library/FP/API/Types.hs

Then, to see the time and memory usage of compiling just the one module, run:

  • time ghc -O2 --make -isrc/library src/library/FP/API/Types.hs +RTS -s

On my system (16GB RAM, quadcore i7), the results are:

$ time ghc -O2 --make -isrc/library src/library/FP/API/Types.hs +RTS -s
[3 of 3] Compiling FP.API.Types     ( src/library/FP/API/Types.hs, src/library/FP/API/Types.o )
  51,846,533,568 bytes allocated in the heap
   6,799,246,288 bytes copied during GC
     389,297,584 bytes maximum residency (22 sample(s))
      11,330,472 bytes maximum slop
            1041 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2041 colls,     0 par    9.19s    9.19s     0.0045s    0.0998s
  Gen  1        22 colls,     0 par    4.33s    4.33s     0.1969s    0.5436s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   34.21s  ( 37.74s elapsed)
  GC      time   13.52s  ( 13.53s elapsed)
  EXIT    time    0.07s  (  0.07s elapsed)
  Total   time   47.81s  ( 51.34s elapsed)

  Alloc rate    1,515,662,937 bytes per MUT second

  Productivity  71.7% of total user, 66.8% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    0m51.375s
user    0m50.864s
sys     0m0.456s

By contrast, with optimizations turned off:

$ time ghc -O0 --make -isrc/library src/library/FP/API/Types.hs +RTS -s
[3 of 3] Compiling FP.API.Types     ( src/library/FP/API/Types.hs, src/library/FP/API/Types.o )
  12,767,593,936 bytes allocated in the heap
   1,078,202,664 bytes copied during GC
     179,551,768 bytes maximum residency (13 sample(s))
       6,684,544 bytes maximum slop
             439 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       623 colls,     0 par    1.24s    1.24s     0.0020s    0.0651s
  Gen  1        13 colls,     0 par    0.76s    0.76s     0.0586s    0.2397s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    8.28s  (  9.23s elapsed)
  GC      time    2.00s  (  2.00s elapsed)
  EXIT    time    0.05s  (  0.05s elapsed)
  Total   time   10.34s  ( 11.28s elapsed)

  Alloc rate    1,542,443,897 bytes per MUT second

  Productivity  80.6% of total user, 73.9% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    0m11.304s
user    0m11.108s
sys     0m0.172s

and -O1:

$ time ghc -O1 --make -isrc/library src/library/FP/API/Types.hs +RTS -s
[3 of 3] Compiling FP.API.Types     ( src/library/FP/API/Types.hs, src/library/FP/API/Types.o )
  45,550,443,664 bytes allocated in the heap
   5,721,700,512 bytes copied during GC
     358,036,456 bytes maximum residency (21 sample(s))
       9,167,176 bytes maximum slop
             906 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1642 colls,     0 par    7.70s    7.70s     0.0047s    0.1031s
  Gen  1        21 colls,     0 par    3.68s    3.69s     0.1756s    0.4968s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   30.21s  ( 33.09s elapsed)
  GC      time   11.38s  ( 11.39s elapsed)
  EXIT    time    0.07s  (  0.07s elapsed)
  Total   time   41.68s  ( 44.55s elapsed)

  Alloc rate    1,507,571,106 bytes per MUT second

  Productivity  72.7% of total user, 68.0% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

real    0m44.577s
user    0m44.164s
sys     0m0.368s

Note that this module is essentially just 900 lines of data declarations, with a large number of derived classes, and a few manually written helper functions and instances.
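For illustration, a representative declaration from such a module looks roughly like the following. This is a hypothetical sketch (the type and field names are invented, not taken from FP.API.Types): a plain record carrying a long deriving clause, repeated hundreds of times per module.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE DeriveGeneric #-}

-- Hypothetical example of the kind of declaration that dominates such
-- modules: an ordinary record with a heavy deriving clause.
import Data.Data (Data)
import Data.Typeable (Typeable)
import GHC.Generics (Generic)

data ProjectMessage = ProjectMessage
  { pmTitle    :: String
  , pmBody     :: String
  , pmSeverity :: Int
  } deriving (Show, Read, Eq, Ord, Data, Typeable, Generic)
```

Each derived class here generates its own instance code at compile time, so a module with hundreds of such declarations multiplies that cost accordingly.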

This high memory usage hasn't just resulted in user complaints: automated Jenkins and Travis jobs will often fail unless optimizations are disabled, which can be problematic for proper testing and production code. In the case of fpco-api, I've worked around this by adding -O0 to the cabal file, but it would be much nicer not to have to do that.

For some other examples of complaints along these lines:

I believe there are a few other threads discussing this, if it would be helpful. I did my testing on GHC 7.8.3 64-bit, Ubuntu 12.04.

Attachments (4)

generate.hs (947 bytes) - added by snoyberg 5 years ago.
show-passes-fpco-api.txt (145.4 KB) - added by snoyberg 5 years ago.
ghc.prof (69.5 KB) - added by snoyberg 5 years ago.
bar-hc.png (220.6 KB) - added by thoughtpolice 4 years ago.
+RTS -hc -RTS of 'Bar.hs' at 50 data types.


Change History (23)

Changed 5 years ago by snoyberg

Attachment: generate.hs added

comment:1 Changed 5 years ago by snoyberg

I've attached a file (generate.hs) to hopefully give a simpler view of the problem. This module (which depends only on base) will generate a Haskell source file which itself depends only on base. You can test this out with runghc generate.hs > Bar.hs && time ghc --make Bar.hs +RTS -s. You can also edit the generate.hs file to change which classes are derived and how many datatypes are generated.

In my testing, without switching optimization flags, I noticed the following (all results generated when using 200 types):

  • All classes: 247MB max residency, 17.7s runtime
  • Removing Generic: 109MB, 13.4s
  • Removing Generic and Data: 78MB, 9.4s
  • Removing all but Typeable: 23MB, 1.1s
  • Removing all deriving: 13MB, 0.6s
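A generator along these lines can be sketched as follows. This is a minimal reconstruction of the idea, not the attached generate.hs: it emits N small data types, each with a heavy deriving clause, so the cost of deriving can be varied in isolation.

```haskell
-- Minimal sketch of a Bar.hs generator (hypothetical; see the attached
-- generate.hs for the authoritative version). Each generated type gets
-- the full deriving clause; edit the list to measure individual classes.
oneType :: Int -> String
oneType i = unlines
  [ "data T" ++ n ++ " = A" ++ n ++ " | B" ++ n ++ " Int"
  , "  deriving (Show, Read, Eq, Ord, Data, Typeable, Generic)"
  ]
  where n = show i

generateModule :: Int -> String
generateModule count = unlines
  [ "{-# LANGUAGE DeriveDataTypeable, DeriveGeneric #-}"
  , "module Bar where"
  , "import Data.Data (Data)"
  , "import Data.Typeable (Typeable)"
  , "import GHC.Generics (Generic)"
  ] ++ concatMap oneType [1 .. count]
```

Printing `generateModule 200` from a `main` (or with `writeFile "Bar.hs"`) reproduces the workflow described above.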

comment:2 Changed 5 years ago by rehno-lindeque

I have a largish repository of (private) Persistent entities that I could clean up and make public, if it would be helpful for anyone working on this.

On my macbook air running ubuntu it takes several minutes to build from scratch:

cabal clean && cabal configure && time cabal build

real	9m3.969s
user	16m2.376s
sys	4m9.406s

(This is with conservative use of deriving as well as {-# OPTIONS_GHC -O0 #-}; it could probably be made much slower.)

Last edited 5 years ago by rehno-lindeque (previous) (diff)

comment:3 Changed 5 years ago by simonpj

One thing that would help would be if someone could characterise where the time is going. Build a profiled compiler (GhcProfiled=YES in your build.mk) and do some of the above tests.

Using -dshow-passes is also good at showing whether there is a big size blow-up anywhere.

My fear is that these deriving clauses simply generate a lot of code, and that (naturally) takes a long time to compile. My hope is that there is a "big eater" somewhere that we can kill. But we need more information, and would appreciate help in getting it.

Thanks

Simon

Changed 5 years ago by snoyberg

Attachment: show-passes-fpco-api.txt added

comment:4 Changed 5 years ago by snoyberg

I've attached show-passes-fpco-api.txt, which contains the -dshow-passes output for fpco-api's FP.API.Types module. I'll see about getting a profiled compiler built as well.

Changed 5 years ago by snoyberg

Attachment: ghc.prof added

comment:5 Changed 5 years ago by snoyberg

I've attached ghc.prof, which is a profile from compiling the Bar.hs module generated by the attached generate.hs. The hotspot appears to be the SimplTopBinds cost center, though my lack of experience with the GHC codebase prevents me from understanding what that means.

If there are any other reports (e.g. heap profiles) that would be useful, let me know.

comment:6 Changed 5 years ago by gidyn

Cc: gidyn added

comment:7 Changed 5 years ago by simonpj

Thanks. The code is indeed pretty big, with a "term size" of over 200,000. A module of a few hundred lines is more like 40,000. The simplifier is run many times over this code so it's perhaps not surprising that it takes most of the compile time.

Still, I wonder if there is anything non-linear going on, tickled by the form of the boilerplate code. The occurrence analyser runs exactly as often as the simplifier, yet only uses 5% of the time.

You could try adding -auto-all (or whatever the flag is) for Simplify to get more detailed cost centre information on the simplifier?

comment:8 Changed 5 years ago by snoyberg

I'd be happy to add the appropriate flag... but I'm afraid I don't really know the right way to do it. Is anyone reading this able to give me some advice?

comment:9 Changed 5 years ago by simonpj

It's -fprof-auto. Manual page here. Sorry, I was on a train.

comment:10 Changed 5 years ago by snoyberg

I don't know how useful that profiling output will be; it gives:

bindIO.\         GHC.Base                  96.5   17.5

I've uploaded the full output at:

https://gist.github.com/snoyberg/ca67756944235e4e2d96

comment:11 Changed 4 years ago by gidyn

Duplicate of #9557?

Changed 4 years ago by thoughtpolice

Attachment: bar-hc.png added

+RTS -hc -RTS of 'Bar.hs' at 50 data types.

comment:12 Changed 4 years ago by thoughtpolice

I've attached a quick profile from looking at this; it looks like there's still room for improvement around TidyCore/CorePrep in this area.

comment:13 Changed 4 years ago by bgamari

I suspect this will be positively affected by my recent work on #7450 (which focussed on compile time of datatypes with large numbers of constructors). This work has thus far produced two fixes (Phab:D1012 and Phab:D1041). Let's characterize their effect on this and a couple of other related test cases with 7.11.

This test will start with 7a3d85e705665fbf2c28f83bb3997e8979f2b88c, the parent commit of the merge of D1012 (4681f55970cabc6e33591d7e698621580818f9a2), as the base commit. I will also test D1041 (which I have yet to land) applied directly on top of this.

I will look at the "maximum residency", "MUT time", "Total time", and "allocated in the heap" statistics from +RTS -s.

testcase   commit        max residency   MUT time   total time   allocated
T9669      base commit   410 MB          46.0 s     79.4 s       81.0 GB
           + D1012       438 MB          46.0 s     79.7 s       80.9 GB
           + D1041       438 MB          46.2 s     79.9 s       80.8 GB
T7450      base commit   404 MB          35.5 s     48.4 s       61.0 GB
           + D1012       460 MB          23.0 s     35.0 s       38.3 GB
           + D1041       438 MB          22.3 s     35.0 s       37.0 GB
T7450-2k   base commit   919 MB          112.1 s    140.0 s      191.5 GB
           + D1012       961 MB          57.7 s     84.1 s       102.0 GB
           + D1041       986 MB          54.6 s     82.1 s       91.9 GB
T7450-4k   base commit   not tested
           + D1012       2.4 GB          174.0 s    235.0 s      338.3 GB
           + D1041       1.9 GB          158.8 s    219.3 s      254.5 GB
T7258      base commit   191 MB          30.5 s     42.0 s       49.5 GB
           + D1012       171 MB          31.4 s     41.3 s       49.5 GB
           + D1041       179 MB          31.2 s     43.0 s       49.5 GB

Here the T9669 testcase refers to the Bar.hs produced by generate.hs on this ticket. The T7258 case is the W2.hs on #7258, which features a large number of fields in a single record constructor. The T7450 case refers to a program like that described in #7450 (comment:18), defining a single type with 1024 constructors deriving Read. The T7450-2k case is identical to T7450 but with 2048 constructors; T7450-4k is again identical but with 4096 constructors.
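The T7450-style inputs can likewise be produced mechanically. A hypothetical sketch (mirroring the description above, not the actual test harness): one data type with N nullary constructors and a derived Read instance.

```haskell
-- Hypothetical sketch of a T7450-style input generator: a single data
-- type with many nullary constructors deriving Read (and Show).
import Data.List (intercalate)

bigType :: Int -> String
bigType n = unlines
  [ "module W where"
  , "data T = " ++ intercalate " | " [ "C" ++ show i | i <- [1 .. n] ]
  , "  deriving (Read, Show)"
  ]
```

Writing `bigType 1024` (or 2048, 4096) to a file yields the corresponding testcase variant.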

comment:14 Changed 4 years ago by bgamari

From the above we can safely assume that my work on #7450 has not had any measurable impact on #9669 or #7258. Hrm.

comment:15 Changed 4 years ago by bgamari

comment:16 Changed 4 years ago by ezyang

Keywords: deriving-perf added

comment:17 Changed 3 years ago by RyanGlScott

Cc: RyanGlScott added

comment:18 Changed 2 years ago by crockeea

Cc: crockeea added

comment:19 Changed 21 months ago by steshaw

Cc: steshaw added