Changes between Version 32 and Version 33 of CompilerPerformance


Timestamp: Aug 25, 2015 1:17:00 PM
Author: bgamari
[[PageOutline]]

'''Note:''' this page has been superseded; see the links at the end of this page.

== Nofib results ==

=== Austin, 5 May 2015 ===
Full results [https://gist.githubusercontent.com/thoughtpolice/498d51153240cc4d899c/raw/9a43f6bbfd642cf4e7b15188f9c0b053d311f7b9/gistfile1.txt are here] (updated '''May 5th, 2015''')

'''NB''': The baseline here is 7.6.3

=== Ben, 31 July 2015 ===

http://home.smart-cactus.org/~ben/nofib.html

Baseline is 7.4.2.

=== Nofib outliers ===

==== Binary sizes ====

===== 7.6 to 7.8 =====

  - Solid average binary size increase of '''5.3%'''.

==== Allocations ====

===== 7.4 to 7.6 =====

  - '''fannkuch-redux''': allocations increased by a factor of roughly 10,000(!)
    - 7.6.3: `<<ghc: 870987952 bytes, 1668 GCs (1666 + 2), 0/0 avg/max bytes residency (0 samples), 84640 bytes GC work, 1M in use, 0.00 INIT (0.00 elapsed), 2.43 MUT (2.43 elapsed), 0.00 GC (0.00 elapsed), 0.00 GC(0) (0.00 elapsed), 0.00 GC(1) (0.00 elapsed), 1 balance :ghc>>`
    - 7.4.2: `<<ghc: 74944 bytes, 1 GCs (0 + 1), 0/0 avg/max bytes residency (0 samples), 3512 bytes GC work, 1M in use, 0.00 INIT (0.00 elapsed), 2.25 MUT (2.25 elapsed), 0.00 GC (0.00 elapsed), 0.00 GC(0) (0.00 elapsed), 0.00 GC(1) (0.00 elapsed), 1 balance :ghc>>`
    - According to [FoldrBuildNotes] this test is very sensitive to list fusion (see the sketch below).
    - Filed #10717 to track this.

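For context, here is a minimal sketch (not taken from the benchmark, which is more involved) of the kind of pipeline that foldr/build fusion is meant to collapse. The failure mode is the same either way: if the rewrite rules don't fire, the intermediate lists get materialised and allocation explodes.

{{{
-- A toy pipeline in the style that foldr/build fusion targets.  With -O
-- and the fusion rules firing, sumSquares compiles to a loop with no
-- intermediate list allocation; if the rules fail to fire (e.g. because
-- an intermediate binding isn't inlined), both [1 .. n] and the mapped
-- list are allocated in full.
module FusionSketch where

sumSquares :: Int -> Int
sumSquares n = sum (map (\x -> x * x) [1 .. n])
}}}

Comparing `+RTS -s` output for something like this with and without `-O` shows the same flavour of allocation cliff as the numbers above.
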
===== 7.6 to 7.8 =====

  - '''spectral-norm''': increases by '''17.0%'''.
    - A '''lot''' more calls to `map`, over 100 more! Maybe an inliner failure?
    - Over '''twice''' as many calls to `ghc-prim:GHC.Classes.$fEqChar_$c=={v r90O}` (and similar functions). Also over twice as many calls to `elem` (see the sketch below).
    - Similarly, many more calls to other specializations, like `base:Text.ParserCombinators.ReadP.$fMonadPlusP_$cmplus{v r1sr}`, which adds even more allocation (from 301 to 3928 for this one entry!).
    - Basically the same story up to `HEAD`!

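The `$fEqChar_$c==` counter is the `(==)` method of the `Eq Char` instance in `ghc-prim:GHC.Classes`, so a jump in its entry count usually means some overloaded code (here, most likely `elem` at type `Char`) is going through the dictionary rather than being inlined down to a primitive comparison. A hypothetical illustration of the shape of call involved:

{{{
-- Not from the benchmark; just the kind of site the ticky counters point
-- at.  If `elem` is inlined/specialised at Char, the (==) calls become
-- direct eqChar# comparisons; if not, every comparison is an entry of
-- the Eq Char dictionary method that ticky reports as
-- GHC.Classes.$fEqChar_$c==.
module SpecSketch where

isVowel :: Char -> Bool
isVowel c = c `elem` "aeiou"
}}}
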
===== 7.8 to 7.10 =====

  - '''gcd''': increases by '''20.7%'''.
    - Ticky tells us that this seems to be a combination of a few things. Almost everything looks fairly similar, but there is a large amount of allocation attributable to 7.10 that I can't trace to a source, aside from the new `integer-gmp`: `integer-gmp-1.0.0.0:GHC.Integer.Type.$WS#{v rwl}` accounts for 106696208 extra bytes of allocation! It also seems like there are actual extant calls to `GHC.Base.map` in 7.10, and none in 7.8. These are the main differences (a rough sketch of the kind of loop involved follows this list).
  - '''pidigits''': increases by '''7.4%'''.
    - Ticky tells us that this seems to be, in large part, due to `integer-gmp` (which is mostly what this benchmark exercises anyway). Part of this is probably an artifact of the comparison: the old `integer-gmp` did a lot of its work in C-- code, while the new `integer-gmp` does everything in Haskell, so much more Haskell code shows up in the profile and the results aren't 1-to-1. One thing that does seem to be happening is that many specializations are being called repeatedly; there are many occurrences of things like `sat_sad2{v} (integer-gmp-1.0.0.0:GHC.Integer.Type) in rfK` which don't exist in the 7.8 profiles, each with a lot of entries and allocation.
  - '''primetest''': dropped '''27.5%''' from 7.6 to 7.8, but is now '''8.8%''' above the 7.6 figure, i.e. roughly '''50%''' worse than 7.8.
    - Much like '''pidigits''', a lot more `integer-gmp` activity shows up in these profiles. Beyond that there are some genuine regressions; for example, `GHC.Integer.Type.remInteger` shows 245901 calls / 260800 bytes allocated, vs 121001 / 200000 for 7.8.

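(These counts were presumably gathered the usual way: compiling and linking the benchmarks with `-ticky` and running with `+RTS -r<file>`.) The `$WS#` counter above is the wrapper of the small-`Integer` constructor `S#` in `integer-gmp-1.0`, so those extra bytes are boxed small-integer results. A rough, hypothetical sketch of the kind of loop where that shows up:

{{{
-- Not the actual nofib program; just the shape of loop the gcd benchmark
-- exercises.  Every intermediate Integer result that the simplifier
-- cannot keep unboxed is one allocation of the S# constructor, i.e. one
-- hit on ticky's $WS# counter, so a change in what gets inlined or
-- unboxed between 7.8 and 7.10 shows up directly as extra allocation.
module GcdSketch where

gcdLoop :: Integer -> Integer -> Integer
gcdLoop a 0 = a
gcdLoop a b = gcdLoop b (a `rem` b)
}}}
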
TODO: Lots of fusion changes have happened in the last few months too - but these should all be pretty diagnosable with some reverts, since they're usually very localized. Maybe worth looking through `base` changes.

==== Runtime ====

===== 7.6 to 7.8 =====

  - `lcss`: increases by '''12.6%'''.
    - Ticky says it seems to be `map` calls yet again! These jump hugely here, from 21014 to 81002.
    - Another inner loop, `algb2` (inside `algb`), is also called a huge number of times: '''2001056 vs 7984760 calls'''!
      - Same with `algb` and `algb1`, which also seem to be called more often.
    - Some other similar things; a few regressions in the number of calls to things like `Text.ParserCombinators.ReadP` specializations, I think.
    - Same story with HEAD!

===== 7.8 to 7.10 =====

  - `lcss`: decreased by ~5% in 7.10, but still '''7%''' slower than 7.6.
    - See above for the real regressions.
  - `multiplier`: increases by '''7.6%'''.
    - `map` strikes again? 2601324 vs 3597333 calls, with an accompanying allocation delta.
    - But some of the other inner loops here (mainly `go`) are compiled away as expected, unlike in e.g. `lcss`; see the sketch below.

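A hypothetical illustration of that last point: an explicit `go`-style worker can be compiled to a loop that ticky counts but that allocates nothing per iteration, whereas a `map`-based pipeline that fails to fuse pays a cons cell per element, which is what the inflated `GHC.Base.map` entry counts suggest.

{{{
{-# LANGUAGE BangPatterns #-}
-- Not from the benchmark; two ways of writing the same sum.
module LoopSketch where

-- Explicit worker/accumulator loop: with -O this typically becomes a
-- tight, non-allocating loop; ticky shows entries for go but little or
-- no allocation attributed to it.
sumDoubles :: Int -> Int
sumDoubles n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + 2 * i) (i + 1)

-- Pipeline version relying on fusion: if the map/sum rules don't fire,
-- every element costs a cons cell and GHC.Base.map's entry count grows
-- with n.
sumDoubles' :: Int -> Int
sumDoubles' n = sum (map (2 *) [1 .. n])
}}}
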
==== Comparing integer-gmp 0.5 and 1.0 ====

One of the major factors that has changed recently is `integer-gmp`. Namely, GHC 7.10 includes `integer-gmp-1.0`, a major rework of `integer-gmp-0.5`. I've compiled GHC 7.10.1 with `integer-gmp` 0.5 and 1.0. [http://home.smart-cactus.org/~ben/nofib.html Here] is a nofib comparison. There are a few interesting points here,

  - Binary sizes dropped dramatically and consistently (typically around 60 to 70%) from 0.5 to 1.0.
  - Runtime is almost always within error. A few exceptions,
      - `binary-trees`: 6% slower with 1.0
      - `pidigits`: 5% slower
      - `integer`: 4% slower
      - `cryptarithm1`: 2.5% slower
      - `circsim`: 3% faster
      - `lcss`: 5% faster
      - `power`: 17% faster
  - Allocations are typically similar. The only test that improves significantly is `prime`, whose allocations decreased by 24%. Many more tests regress considerably,
      - `bernoulli`: +15%
      - `gcd`: +21%
      - `kahan`: +40%
      - `mandel`: +34%
      - `primetest`: +50%
      - `rsa`: +53%

The allocation issue is actually discussed in the commit message (c774b28f76ee4c220f7c1c9fd81585e0e3af0e8a),
> Due to the different (over)allocation scheme and potentially different
> accounting (via the new `{shrink,resize}MutableByteArray#` primitives),
> some of the nofib benchmarks actually results in increased allocation
> numbers (but not necessarily an increase in runtime!).  I believe the
> allocation numbers could improve if `{resize,shrink}MutableByteArray#`
> could be optimised to reallocate in-place more efficiently.
The message then goes on to list exactly the nofib tests mentioned above. Given that there isn't a strong negative trend in runtime corresponding with these increased allocations, I'm leaning towards ignoring these for now.
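For reference, a minimal sketch (not taken from the commit) of the new primitive it refers to. `shrinkMutableByteArray#` releases the unused tail of an over-allocated buffer in place, which changes how the over-allocation is accounted relative to the old copy-into-a-fresh-array approach:

{{{
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module ShrinkSketch where

import GHC.Exts
import GHC.IO (IO (..))

-- Allocate a deliberately over-sized 1024-byte buffer, then shrink it in
-- place to the 16 bytes actually needed.  Previously the result would
-- typically have been copied into a fresh, exactly-sized array, so the
-- two schemes report allocation differently even for the same useful
-- work.
shrinkDemo :: IO ()
shrinkDemo = IO $ \s0 ->
  case newByteArray# 1024# s0 of
    (# s1, mba #) ->
      case shrinkMutableByteArray# mba 16# s1 of
        s2 -> (# s2, () #)
}}}
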

== `tests/perf/compiler` results ==

=== 7.6 vs 7.8 ===

  - A bit difficult to decipher, since a lot of the stats/surrounding numbers were totally rewritten due to some Testsuite API overhauls.
  - The results are a mix; there are things like `peak_megabytes_allocated` being bumped up a lot, but a lot of tests also had `bytes_allocated` go down. This one seems pretty mixed.

=== 7.8 vs 7.10 ===

  - Things mostly got '''better''' according to these, not worse!
  - Many of them had drops in `bytes_allocated`, for example `T4801`.
  - The average improvement is something like 1-3%.
  - But one got much worse: `T5837`'s `bytes_allocated` jumped from 45520936 to 115905208, about 2.5x worse!

=== 7.10 vs HEAD ===

  - Most results actually got '''better''', not worse!
  - Silent superclasses made the numbers drop in several places on HEAD, some noticeably by over 2x.
    - `max_bytes_used` increased in some cases, but not by much; probably GC wibbles.
  - No major regressions, mostly wibbles.

== Compile/build times ==

(NB: Sporadically updated)

'''As of April 22nd''':

  - GHC HEAD: 14m9s  (via 7.8.3) (because of Joachim's call-arity improvements)
  - GHC 7.10: 15m43s (via 7.8.3)
  - GHC 7.8:  12m54s (via 7.8.3)
  - GHC 7.6:  8m19s  (via 7.4.1)

Random note: GHC 7.10's build system actually disabled DPH (half a dozen more packages and probably a hundred extra modules), yet things ''still'' got slower over time!

== Performance-related tickets ==

Relevant tickets:

 * #10370: OpenGLRaw
 * #10289: 2.5k static HashSet takes too much memory to compile
   - Memory usage significantly improved by the work on #10370, but overall wall-clock time got worse!
 * #9583, #9630: code blowup in Generics/Binary
 * #10228: regression from 7.8.4 to 7.10.1
 * #7428: Non-linear compile time: addFingerprint??
   - Still a huge problem with GHC 7.10.1: looks like quadratic behavior around `TidyCore`/`CorePrep`.
 * #2346: desugaring let-bindings
 * #10491: Huge explosion in compilation time for `Accelerate`

https://ghc.haskell.org/trac/ghc/query?status=!closed&failure=Runtime+performance+bug&type=bug

== Compile time ==

 * #9557: Deriving instances is slow
 * #8731: long compilation time for module with large data type and partial record selectors
 * #7258: Compiling DynFlags is jolly slow
 * #7450: Regression in optimisation time of functions with many patterns (6.12 to 7.4)?
   * Phab:D1041, Phab:D1012
   * Unnecessary recomputation of free variables (Phab:D1012)
   * Thunk leak in `Bitmap` (Phab:D1040)
 * #9669: Long compile time/high memory usage for modules with many deriving clauses

https://ghc.haskell.org/trac/ghc/query?status=!closed&failure=Compile-time+performance+bug

This page has been superseded by:

 * [[Performance/Runtime]] for issues pertaining to the performance of code generated by GHC.
 * [[Performance/Compiler]] for issues pertaining to the performance of GHC itself.