Opened 8 years ago

Last modified 8 months ago

#5793 new task

make nofib awesome

Reported by: dterei Owned by: michalt
Priority: normal Milestone:
Component: NoFib benchmark suite Version:
Keywords: Cc: ndmitchell@…, rwbarton@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: #9571 Blocking:
Related Tickets: #15357 #15333 #15999 Differential Rev(s):
Wiki Page:

Description (last modified by sgraf)

Nofib is the standard tool GHC developers use to benchmark changes to the compiler. Its overall design is OK, but it has had no love and care for many years and has bitrotted to the point that it isn't useful in a lot of situations.

This task is about making nofib useful again.

The breakdown for this is something like:

  1. Think about, and maybe fix, the nofib framework design. It has 'ways', which I think correspond to the compilation method, but more in the sense of 'dynamic' vs 'static'; it seems ways may not be suited to expressing 'fasm' vs 'fllvm'. There is also the concept of 'modes', which corresponds to different benchmark inputs, so 'normal' and 'slow' give different run-times. At the moment there is no easy way to select which benchmark groups to run, so we may want to change that. I guess we should just decide which knobs we want to be able to tweak easily, and see how well the current design allows that.

Note: there is a Shake build system attached that does a lot of this (written by Neil Mitchell!). An explanation of it can be found here: http://neilmitchell.blogspot.com/2013/02/a-nofib-build-system-using-shake.html

The design discussion is mostly lost, unfortunately, as it took place over private email.
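For flavour, here is a minimal sketch of what Shake rules for building and running a single benchmark could look like. The file layout, the .bin/.stdout naming and the plain 'ghc' invocation are made up for illustration; the attached Main.2.hs is the real, much more complete version.

    import Development.Shake
    import Development.Shake.FilePath

    -- Hypothetical layout: a benchmark lives at e.g. spectral/primes.hs; its
    -- compiled binary and captured output go under _build/.
    main :: IO ()
    main = shakeArgs shakeOptions{shakeFiles = "_build"} $ do
        want ["_build/spectral/primes.stdout"]

        -- compile a benchmark with whatever 'ghc' is on the PATH
        "_build//*.bin" %> \out -> do
            let src = dropDirectory1 (dropExtension out) <.> "hs"
            need [src]
            command_ [] "ghc" ["-O2", "-o", out, src]

        -- run it and record its stdout, so other rules can compare outputs
        "_build//*.stdout" %> \out -> do
            let bin = dropExtension out <.> "bin"
            need [bin]
            Stdout output <- command [] bin []
            writeFileChanged out output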

  2. Fix up the runtimes of the benchmarks so that they are significant. This might be best done by changing the way we run benchmarks and collect results, to make sure they are meaningful (cf. #15357). Done in #15999.

For example, there is lots of great discussion, and links to papers, in this thread:

http://www.haskell.org/pipermail/ghc-devs/2013-February/000307.html

  3. The above task is about fixing the 'normal' mode, but we may want to fix up 'slow' as well, and perhaps add a 'fast' mode where benchmarks run in around 1 second. Done: we determined that 0.1-0.2s / 1-2s / 5-10s for fast/norm/slow are the brackets to aim for.
  4. Maybe add more benchmarks to the suite (text, bytestring, performance regressions from the GHC testsuite, vector, ...).

Attachments (1)

Main.2.hs (11.6 KB) - added by dterei 8 years ago.
Shake build system


Change History (43)

comment:1 Changed 8 years ago by tibbe

I assume you mean nofib, not fibon.

comment:2 Changed 8 years ago by dterei

Blocking: 5794 added

comment:3 Changed 8 years ago by dterei

Pushed a whole bunch of fixes so that, at the least, all of fibon compiles now. Next we need to do some work on the framework and make sure all the tests have significant run-times.

comment:4 Changed 8 years ago by dterei

Summary: make fibon not suck → make nofib not suck

Yes, I meant nofib :). With fibon, nofib, nobench... it gets a little jumbled in my head at times.

comment:5 Changed 8 years ago by dterei

Description: modified (diff)

comment:6 Changed 8 years ago by dterei

comment:7 Changed 8 years ago by simonmar

difficulty: Unknown

Here are a few notes I've been collecting about what we should do with nofib.

  • Build system:
    • get rid of runstdtest
    • Make it independent of a GHC build
    • Maybe use Shake instead of make (but retain the ability to compile and run individual benchmarks, with custom options etc.)
  • beef up nofib-analyse:
    • generate gnuplot graphs directly
    • measure std.dev. better (omit results based on std.dev. rather than < 0.2s?)
  • benchmarks themselves:
    • we need a suite of ~10 large real-world programs for publishing results in papers.
    • generally: add new benchmarks, retire old ones

If we modify the programs and inputs to run for longer, then we should do them all at once: we can't do this incrementally, because each change invalidates all the old logs and they have to be regenerated. Note that we need the benchmarks to build with old compilers, so that we can run comparisons (or at least make sure we can get results even if some of the programs fail to compile).

comment:8 Changed 8 years ago by dterei

What do you think about using Criterion, Simon?

comment:9 in reply to:  8 Changed 8 years ago by simonmar

Replying to dterei:

What do you think about using Criterion, Simon?

I have thought about this from time to time. There might be a place for criterion here, but it won't be a complete replacement for the benchmark infrastructure, and it might be difficult to integrate it.

We measure lots of things that aren't runtime (allocations, residency, GC time, etc.), and we don't really want criterion messing up these figures. So we would have to measure these things from within the program itself. We have some new infrastructure for doing that (GHC.Stats), but I don't think it is quite up to the job yet.
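As an illustration of what measuring from within the program could look like, here is a minimal sketch against today's GHC.Stats API (base >= 4.10); the field names are real, but the measure wrapper and the whole setup are hypothetical and not part of nofib:

    import Control.Exception (evaluate)
    import GHC.Stats

    -- Run an action and report allocation/GC figures around it. Requires the
    -- program to be run with "+RTS -T" (or compiled with "-with-rtsopts=-T")
    -- so that the RTS actually collects these statistics.
    measure :: IO a -> IO a
    measure act = do
        enabled <- getRTSStatsEnabled
        if not enabled
          then putStrLn "run with +RTS -T to enable RTS stats" >> act
          else do
            before <- getRTSStats
            result <- act
            after  <- getRTSStats
            putStrLn $ "allocated bytes: " ++ show (allocated_bytes after - allocated_bytes before)
            putStrLn $ "GCs:             " ++ show (gcs after - gcs before)
            putStrLn $ "max live bytes:  " ++ show (max_live_bytes after)
            putStrLn $ "GC elapsed (ns): " ++ show (gc_elapsed_ns after - gc_elapsed_ns before)
            return result

    main :: IO ()
    main = do
        _ <- measure (evaluate (sum [1 .. 10000000 :: Int]))
        return ()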

Measurements of GC time are only meaningful over long runs, but criterion is good at measuring very quick things. IIRC criterion wants to do a large number of runs in order to get good stats, and that might make our runs take too long (perhaps I'm wrong here, but this was the case last time I tried criterion out).

comment:10 Changed 8 years ago by dterei

Of the list you gave, Simon, where do you feel is best to start? I'm not really interested in implementing any of the build system items, as for my use cases make and an in-place GHC are fine. What is the issue with runstdtest?

I'd most like to just get the benchmarks in good order. How should we proceed here? I've fixed up the Fibon benchmarks to all work again, so they might be useful. It also seems useful to pull in the benchmarks from, say, bytestring, text, attoparsec, and vector. If I start reworking the 'imaginary', 'spectral' and 'real' inputs to get more significant times, what should we be aiming for? Around 10 seconds a run seems a 'good' value to me. We also need to be careful that the benchmarks hit the GC.

comment:11 Changed 8 years ago by simonmar

I agree working on the benchmarks themselves should be the highest priority.

Some of the benchmarks aren't very amenable to running for longer - we end up just repeating the same task many times. I think for benchmarks where we can't come up with a suitable input that keeps the program busy for long enough, we should just put these in a separate category and use them for regression testing only. Measuring allocation still works reliably even for programs that run for a tiny amount of time.

I said "retire old benchmarks" but on seconds thoughts a better plan is to not throw anything away, just keep them all around as regression tests. Make an exception only for programs which are broken beyond repair, or are definitely not measuring anything worthwhile (I occasionally come across a program that has been failing immediately with an error, and somebody accepted the output as correct in 1997...).

I'd still like to keep the microbenchmarks collected together. These are very useful for spot testing and debugging.

Keep an eye out for good candidates for a real-world benchmark suite. I'm thinking ~10 or so programs that have complex behaviour, preferably with multiple phases or multiple algorithms.

I think ~10s is perhaps slightly on the high side, but I don't feel too strongly about it. Currently it takes ~20min to run the real+spectral+imaginary suites, I think a good target to aim for is less than an hour, with the option to have a longer run.

comment:12 Changed 8 years ago by simonpj

Yes, please keep imaginary and spectral! It doesn't matter at all if they have a tiny runtime... but spotting when their allocation jumps is a very useful signal that some optimisation has gone wonky, and far easier to narrow down than in some giant program.

We should have substantial programs too, with significant runtime, but let's not lose the tiny ones!

comment:13 Changed 8 years ago by dterei

For the benchmarks we want to keep for regression testing, would everyone be happy moving them to a new folder? So some of the benchmarks in 'imaginary' and 'spectral' would be moved to a new folder called, say, 'regression'? I would prefer this, as I agree that we shouldn't just throw benchmarks away, but I want a clear distinction between the use cases of individual benchmarks.

comment:14 Changed 8 years ago by simonpj

Aren't "imaginary" and "spectral" reasonable folder names already. They are all, without exception, regression tests. The one ones that can possibly be considered real benchmarks are in "real". In short, isn't the distinction you want to make made already?

comment:15 Changed 8 years ago by dterei

Maybe. I was under the impression 'imaginary' and 'spectral' were microbenchmarks. However, I guess the distinction between a microbenchmark and a regression benchmark is pretty minimal to non-existent? So sure, let's keep the folder structure the same and just make sure to document this somewhere.

comment:16 Changed 8 years ago by simonmar

The rationale for the naming is described in Will's paper http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.4124. The distinction between spectral and imaginary is mainly size: the imaginary programs are microbenchmarks, whereas spectral programs are algorithmic kernels - larger than microbenchmarks but not real programs. I think it's a useful distinction.

Perhaps we also want another distinction - programs that run for long enough that we can obtain reliable timing results. However that's an orthogonal dimension, and we can't sensibly use both dimensions for organising the directory structure. So I suggest we just have a flag per benchmark that is true if it runs for long enough, and have a simple way to run just those benchmarks.

comment:17 Changed 8 years ago by igloo

Milestone: _|_

comment:18 Changed 8 years ago by NeilMitchell

Cc: ndmitchell@… added

I might try to have a go at a Shake version of the makefile soup that is currently in there.

Changed 8 years ago by dterei

Attachment: Main.2.hs added

Shake build system

comment:19 Changed 7 years ago by dterei

Summary: make nofib not suckmake nofib awesome

comment:20 Changed 7 years ago by dterei

Description: modified (diff)

comment:21 Changed 7 years ago by dterei

Description: modified (diff)

comment:22 Changed 7 years ago by morabbin

Not sure if this is the place to report this, but without

cabal install html

nofib-analyse won't compile, as it uses Text.Html. Where is the correct place to put this dependency?

comment:23 Changed 7 years ago by dterei

I put a note in http://hackage.haskell.org/trac/ghc/wiki/Building/RunningNoFib and also in the README file about the dependency. I don't think we should otherwise include the actual dependency if that was your implication. Thanks morabbin!

comment:24 Changed 7 years ago by dterei

Description: modified (diff)

comment:25 in reply to:  22 Changed 7 years ago by simonmar

Replying to morabbin:

Not sure if this is the place to report this, but without

cabal install html

nofib-analyse won't compile, as it uses Text.Html. Where is the correct place to put this dependency?

Moving nofib-analyse out into a Cabal package might be a good idea, and would resolve this.

comment:26 Changed 6 years ago by rwbarton

Cc: rwbarton@… added

Another nofib build-system improvement I'd like to see is putting all the generated files off to the side in a build/ directory. I often want to do two sequential nofib runs with different compiler options, and afterwards I might want to investigate one test (e.g. whichever one got slower by the most) by running the two executables with some RTS option, or comparing the disassembly of the two .o files, or whatever. That would be a lot easier if I could just mv build/ build1/ before starting the second nofib run.

I'll probably take a look at some of these projects after 7.8 is released.

comment:27 Changed 6 years ago by jstolarek

One thing I really dislike about nofib's design is that for some reason it checks the correctness of the results produced by the programs. Perhaps I wouldn't mind if it didn't try to compare floating point values for equality. But still, these are benchmarks, not tests.

Having said that here, I can now close #7968 as a duplicate :-)

comment:28 Changed 5 years ago by ezyang

Blocked By: 9571 added

comment:29 Changed 5 years ago by thomie

Blocking: 5794 removed

comment:30 Changed 5 years ago by rwbarton

We should consider incorporating something along the lines of Stabilizer in order to reduce the magnitude of the effects seen in #8279.

comment:31 Changed 5 years ago by hvr

Somewhat related, there's a new serious attempt at Shaking up GHC

comment:32 Changed 5 years ago by NeilMitchell

Reflecting on why this ticket didn't go anywhere, perhaps the scope is too broad. Rather than "make nofib awesome" maybe we need a few separate tickets - the one to switch nofib to Shake just requires integrating the code that is already written. The other issues such as criterion/stabilizer and changing the set of benchmarks might do best being tackled as individual pieces.

comment:33 Changed 2 years ago by mpickering

Owner: changed from dterei to michalt

This is Michal's domain currently.

comment:34 Changed 20 months ago by AndreasK

Fix up the runtimes of the benchmarks so that they are significant. This might be best done by changing the way we run benchmarks and collect results, to make sure they are meaningful.

Windows suffers especially from this as the cpuTime measurement has less accuracy there. See #14714

comment:35 Changed 14 months ago by AndreasK

comment:36 Changed 13 months ago by sgraf

comment:37 Changed 10 months ago by sgraf

Description: modified (diff)

comment:38 in reply to:  16 ; Changed 10 months ago by sgraf

Replying to simonmar:

The rationale for the naming is described in Will's paper http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.4124. The distinction between spectral and imaginary is mainly size: the imaginary programs are microbenchmarks, whereas spectral programs are algorithmic kernels - larger than microbenchmarks but not real programs. I think it's a useful distinction.

Perhaps we also want another distinction - programs that run for long enough that we can obtain reliable timing results. However that's an orthogonal dimension, and we can't sensibly use both dimensions for organising the directory structure. So I suggest we just have a flag per benchmark that is true if it runs for long enough, and have a simple way to run just those benchmarks.

I think flagging offending benchmarks is the last resort to get reliable measurements, but sadly I don't see any way around that for some benchmarks.

Let me elaborate. I experienced fluctuations (random improvements and regressions) for benchmarks with any of the following traits:

  1. Runtime too short, so that we mostly measure system overhead (partly addressed by Phab:D4989, in response to #15357)
  2. Programs like paraffins, binary-trees, etc., which consist of a relatively long and allocation-intensive build-up phase, followed by a fold that immediately collapses the maximum residency. These kinds of benchmarks are very sensitive to changes in allocations. Also see #15333.
  3. Microbenchmarks like the real/eff suite, which optimise to a very small counting loop. Even when they run long enough for runtime measurements to be reproducible (e.g. VSM runs for about 670ms-700ms on my machine), a mostly irrelevant change in base leads to wibbles.

System overhead

Easy to fix and Phab:D4989 did most (if not all?) of the work.

GC

The GC parameterisation is much harder to remove from the equation. There's an example in #15333 and https://ghc.haskell.org/trac/ghc/ticket/9476#comment:55 ff. The idea is the following:

Varying allocations leads to different points in time at which nursery and gen 2 heap run out of space, in turn triggering GC at different points in time.

I.e. imagine that, because of an overall reduction in allocations, the 5th major GC run happens immediately before the entire structure is collapsed, whereas without the reduction the 5th run would happen some time earlier (left: baseline; right: with fewer total allocations):

https://i.imgur.com/C5224Ua.jpg https://i.imgur.com/C05SC5K.jpg

The residency at the 5th GC run would approach the maximum residency (not the number sampled by GC, but the one depending on program semantics), which entails a maximally costly GC run, for an impact of as much as 10% on total runtime. This is something that could be tuned by profile-guided optimisation, but otherwise there is no way to predict when a program will hit its maximum residency.

The above profile is from paraffins, which has the typical structure of an allocating Haskell benchmark: build up and then consume a large data structure. Although total allocations can be measured reliably on them, these benchmarks aren't particularly indicative of runtime performance and contribute to nofib's noise.
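As a toy illustration of that shape (made up, not taken from paraffins itself; needs the deepseq package): forcing the whole structure drives residency up towards its maximum, and the following fold collapses it again.

    import Control.DeepSeq (force)
    import Control.Exception (evaluate)

    main :: IO ()
    main = do
        -- build-up phase: force the whole list so it is live on the heap
        xs <- evaluate (force [ i * i + 1 | i <- [1 .. 1000000 :: Int] ])
        -- collapse phase: a single fold consumes it and residency drops
        print (sum xs)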

I can think of ways to fix the situation without flagging these benchmarks away, but not without touching them:

  1. Modify every benchmark in a way that at least one collection will happen at the point where we think maximum residency is hit, through performMajorGC or some other kind of allocation in a loop. Yuck.
  2. Iterate every benchmark $n times from within the same Haskell process (see the sketch below). In particular, this won't reset the GC between runs and will lead to a better distribution of heap samples without actually changing the characteristics of the benchmark. We could shrink input problems to get $n sufficiently high while still running in acceptable time (e.g. 0.5-5s).
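A minimal sketch of option 2 (all names hypothetical; the real changes landed later via #15999): iterate the benchmarked logic within one process, and make each iteration depend on the loop counter so that GHC cannot float the work out and share a single result across iterations.

    import Control.Monad (forM_)
    import System.Environment (getArgs)

    -- stand-in for the real benchmark logic
    benchmark :: Int -> Int
    benchmark m = sum [ i * i | i <- [1 .. m] ]

    main :: IO ()
    main = do
        [n, size] <- map read <$> getArgs
        -- run the logic n times; the dependency on k defeats sharing
        forM_ [1 .. n] $ \k -> print (benchmark (size + k `mod` 2))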

TODO Identify them

Microbenchmarks

I blame non-determinism in code layout or some other caching effect until I have a better hypothesis. These are the candidates we want to flag away for runtime measurements, especially real/eff, or else make them actually do something meaningful.

TODO Identify them

comment:39 in reply to:  38 Changed 10 months ago by AndreasK

I blame non-determinism in code layout or some other caching effect until I have a better hypothesis. These are the candidates we want to flag away for runtime measurements, especially real/eff, or else make them actually do something meaningful.

They generally compile down to a loop of a handful of instructions. So flipping an if (inverting the condition, swapping the then and else expressions) can already lead to noticeable differences, as the CPU might process one more instruction.

For the really small loops we can also get counterintuitive behaviour. E.g. I reduced one of them by one instruction and performance got slower because of instruction scheduling effects.

Having an "effective runtime" modus that excludes these seems worthwhile. But I'm torn on excluding them by default.

Ideally people would see these unexpected results, then look at the generated code and be able to tell whether the code is actually worse or it's just one of these edge cases.

But that requires a lot of low-level knowledge, which not everyone has. So people might throw out perfectly good ideas/patches because of misleading nofib results.

Last edited 10 months ago by AndreasK

comment:40 Changed 10 months ago by sgraf

I opened #15999 to track the implicit task in comment:38.

comment:41 Changed 8 months ago by Sebastian Graf <sebastian.graf@…>

In 8632268/nofib:

Stabilise benchmarks wrt. GC

Summary:
This is due to #15999, a follow-up on #5793 and #15357 and changes all
benchmarks, some of them (i.e. `wheel-sieve1`, `awards`) rather drastically.

The general plan is outlined in #15999: Identify GC-sensitive benchmarks by
looking at how productivity rates change over different nursery sizes and
iterate `main` of these benchmarks often enough for the non-monotony and
discontinuities to go away.

I took care that the benchmarked logic is actually run $n times more often,
rather than just benchmarking IO operations that print the result of CAFs.

When I found benchmarks with insignificant runtime (#15357), I made sure that
parameters/input files were adjusted so that runtime of the different modes
fall within the ranges proposed in
https://ghc.haskell.org/trac/ghc/ticket/15357#comment:4

- fast: 0.1-0.2s
- norm: 1-2s
- slow: 5-10s

This is what I did:

- Stabilise bernoulli
- Stabilise digits-of-e1
- Stabilise digits-of-e2
- Stabilise gen_regexp
- Adjust running time of integrate
- Adjust running time of kahan
- Stabilise paraffins
- Stabilise primes
- Adjust running time of rfib
- Adjust running time of tak
- Stabilise wheel-sieve1
- Stabilise wheel-sieve2
- Adjust running time of x2n1
- Adjust running time of ansi
- Adjust running time of atom
- Make awards benchmark something other than IO
- Adjust running time of banner
- Stabilise boyer
- Adjust running time of boyer2
- Adjust running time of queens
- Adjust running time of calendar
- Adjust runtime of cichelli
- Stabilise circsim
- Stabilise clausify
- Stabilise constraints with moderate success
- Adjust running time of cryptarithm1
- Adjust running time of cryptarythm2
- Adjust running time of cse
- Adjust running time of eliza
- Adjust running time of exact-reals
- Adjust running time of expert
- Stabilise fft2
- Stabilise fibheaps
- Stabilise fish
- Adjust running time for gcd
- Stabilise comp_lab_zift
- Stabilise event
- Stabilise fft
- Stabilise genfft
- Stabilise ida
- Adjust running time for listcompr
- Adjust running time for listcopy
- Adjust running time of nucleic2
- Attempt to stabilise parstof
- Stabilise sched
- Stabilise solid
- Adjust running time of transform
- Adjust running time of typecheck
- Stabilise wang
- Stabilise wave4main
- Adjust running time of integer
- Adjust running time of knights
- Stabilise lambda
- Stabilise lcss
- Stabilise life
- Stabilise mandel
- Stabilise mandel2
- Adjust running time of mate
- Stabilise minimax
- Adjust running time of multiplier
- Adjust running time of para
- Stabilise power
- Adjust running time of primetest
- Stabilise puzzle with mild success
- Adjust running time for rewrite
- Stabilise simple with mild success
- Stabilise sorting
- Stabilise sphere
- Stabilise treejoin
- Stabilise anna
- Stabilise bspt
- Stabilise cacheprof
- Stabilise compress
- Stabilise compress2
- Stabilise fem
- Adjust running time of fluid
- Stabilise fulsom
- Stabilise gamteb
- Stabilise gg
- Stabilise grep
- Adjust running time of hidden
- Stabilise hpg
- Stabilise infer
- Stabilise lift
- Stabilise linear
- Attempt to stabilise maillist
- Stabilise mkhprog
- Stabilise parser
- Stabilise pic
- Stabilise prolog
- Attempt to stabilise reptile
- Adjust running time of rsa
- Adjust running time of scs
- Stabilise symalg
- Stabilise veritas
- Stabilise binary-trees
- Adjust running time of fasta
- Adjust running time of k-nucleotide
- Adjust running time of pidigits
- Adjust running time of reverse-complement
- Adjust running time of spectral-norm
- Adjust running time of fannkuch-redux
- Adjust running time for n-body

Problematic benchmarks:

- `last-piece`: Unclear how to stabilise. Runs for 300ms and I can't make up smaller inputs because I don't understand what it does.
- `pretty`: It's just much too small to be relevant at all. Maybe we want to get rid of this one?
- `scc`: Same as `pretty`. The input graph for which SCC analysis is done is much too small and I can't find good directed example graphs on the internet.
- `secretary`: Apparently this needs `-package random` and consequently hasn't been run for a long time.
- `simple`: Same as `last-piece`. Decent runtime (70ms), but it's unstable and I see no way to iterate it ~100 times in fast mode.
- `eff`: Every benchmark is problematic here. Not from the point of view of allocations, but because the actual logic is vacuous. IMO, these should be performance tests, not actual benchmarks. Alternatively, write an actual application that makes use of algebraic effects.
- `maillist`: Too trivial. It's just String/list manipulation, not representative of any Haskell code we would write today (no use of base library functions which could be fused, uses String instead of Text). It's only 75 loc according to `cloc`, that's not a `real` application.

Reviewers: simonpj, simonmar, bgamari, AndreasK, osa1, alpmestan, O26 nofib

GHC Trac Issues: #15999

Differential Revision: https://phabricator.haskell.org/D5438

comment:42 Changed 8 months ago by sgraf

Description: modified (diff)