Opened 10 months ago

Closed 8 months ago

#15999 closed task (fixed)

Stabilise nofib runtime measurements

Reported by: sgraf
Owned by:
Priority: normal
Milestone:
Component: NoFib benchmark suite
Version: 8.6.2
Keywords:
Cc: osa1
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets: #5793 #9476 #15333 #15357
Differential Rev(s): Phab:D5438
Wiki Page:

Description (last modified by sgraf)

With Phab:D4989 (cf. #15357) having hit nofib master, there are still many benchmarks that are unstable in one way or another. I identified three causes of instability in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38. With system overhead mostly out of the equation, there are still two related tasks left:

  1. Identify benchmarks with GC wibbles. Plan: Look at how the productivity rate changes while increasing the gen 0 heap size (see the sampling sketch below). A GC-sensitive benchmark should have a non-monotonic or discontinuous productivity-rate-over-nursery-size curve. Then fix these by iterating main often enough for the curve to become smooth and monotone.
  2. Now, all benchmarks should have monotonically decreasing instruction count for increasing nursery sizes. If not, maybe there's another class of benchmarks I didn't identify yet in #5793. Of these benchmarks, there are a few, like real/eff/CS, that still have highly code layout-sensitive runtimes. Fix these 'microbenchmarks' by hiding them behind a flag.
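
For illustration, step 1 can be pictured as the following sketch (the attached prod.py does the actual sampling and plotting; the benchmark path `./paraffins`, the nursery sizes and the parsing here are assumptions):

{{{#!haskell
-- Minimal sketch: run the benchmark once per nursery size and grab the
-- "Productivity" line from the +RTS -s summary, which the RTS writes to
-- stderr. The binary must be built with -rtsopts for the +RTS flags to
-- be accepted.
import Control.Monad (forM_)
import Data.List (isInfixOf)
import System.Process (readProcessWithExitCode)

main :: IO ()
main = forM_ [1, 2, 4, 8, 16, 32, 64 :: Int] $ \mb -> do
  let rts = ["+RTS", "-A" ++ show mb ++ "m", "-s", "-RTS"]
  (_code, _out, stats) <- readProcessWithExitCode "./paraffins" rts ""
  let prodLines = filter ("Productivity" `isInfixOf`) (lines stats)
  putStrLn (show mb ++ "m nursery: " ++ unwords (map (dropWhile (== ' ')) prodLines))
}}}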

Attachments (1)

prod.py (1.6 KB) - added by sgraf 9 months ago.
Python script for sampling and plotting productivity rates


Change History (11)

comment:1 Changed 9 months ago by sgraf

I figured out a more useful metric to identify GC wibbles: watching how the productivity rate changes while increasing the nursery size.

Here's the current state of affairs for paraffins:

https://i.imgur.com/XxdNJQ8.png

While there are some wibbles due to short running time, it's clear that the function is highly discontinuous. We want something more like this:

https://i.imgur.com/fA4S38S.png

That's a much smoother curve (which of course is still subject to measurement inaccuracy). I got this just by iterating main 100 times. Obviously, this has implications for the running time of the benchmark, so the *MODEs have to be adjusted accordingly.
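
Roughly, the change amounts to something like this sketch (a made-up workload, not the actual paraffins code):

{{{#!haskell
-- Hypothetical sketch: the whole benchmark body is iterated in-process,
-- and the input is re-read from getArgs on every iteration, so GHC
-- cannot float the computation out and share one result across the runs.
import Control.Monad (replicateM_)
import System.Environment (getArgs)

main :: IO ()
main = replicateM_ 100 $ do
  [n] <- getArgs
  print (sum [1 .. read n :: Integer])
}}}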

Changed 9 months ago by sgraf

Attachment: prod.py added

Python script for sampling and plotting productivity rates

comment:2 Changed 9 months ago by osa1

Cc: osa1 added
Differential Rev(s): Phab:D5438

Thanks for doing this!

I think running the benchmarks multiple times is a good idea. That's what criterion does, and it provides quite reliable results, even for very fast programs.

That said, looking at the patch and your paraffins example, I have some questions:

  • I wonder if it'd be better to run the process multiple times, instead of running the main function multiple times in the program. Why? That way we know GHC won't fuse or somehow optimize the replicateM_ 100 call in the program, and we properly reset all the resources/global state (both the program's and the runtime system's, e.g. weak pointers, threads, stable names). It just seems more reliable.
    • Of course this would make the analysis harder, as each run will print GC stats which we need to parse and somehow combine ...
    • That said, I wonder if GC numbers are important for the purposes of nofib. In nofib we care about allocations and runtimes; as long as these numbers are stable it should be fine. So perhaps it's not too hard to repeat the process run instead of the main function.
  • You say "GC wibbles", but I'm not sure if these are actually GC wibbles. I just checked paraffins: it doesn't do any IO (other than printing the results), and it's not even threaded (does not use the threaded runtime, does not do forkIO). So I think it should be quite deterministic, and I think any wibbles are due to the OS side of things. In other words, if we could have an OS that only runs paraffins and nothing else, I think the results would be quite deterministic.

Of course this doesn't change the fact that we're getting non-deterministic results and we should do something about it, I'm just trying to understand the root cause here.

On my first point: if a solution exists for benchmarking "processes" (instead of "functions") with criterion-style iteration (by which I mean one that provides stable results), I think it may be worth trying. A few years back we used hsbencher for this purpose at IU, but IIRC it's a bit too heavy (lots of dependencies), and it seems unmaintained now. I vaguely recall another program for this purpose but I can't remember the name...

comment:3 Changed 9 months ago by osa1

I'm also wondering if having a fixed iteration number is a good idea. What if, in 5 years, 100 iterations for paraffins is not enough to get reliable numbers? I think criterion also has a solution for this (it does a different number of repetitions depending on the results).

comment:4 Changed 9 months ago by osa1

Here's a library that can be useful for this purpose: http://hackage.haskell.org/package/bench

comment:5 in reply to:  3 Changed 9 months ago by sgraf

Replying to osa1:

  • I wonder if it'd be better to run the process multiple times, instead of running the main function multiple times in the program. Why? That way we know GHC won't fuse or somehow optimize the replicateM_ 100 call in the program, and we properly reset all the resources/global state (both the program's and the runtime system's, e.g. weak pointers, threads, stable names). It just seems more reliable.

The whole point of this patch is that we iterate within the process, so that GC parameterisation doesn't affect performance (counted instructions, even) on the same program in 'unforeseen' ways, like the discontinuous, non-monotonic paraffins example. There we would expect that increasing the nursery size always leads to higher productivity, because the GC has to run less often. That clearly isn't the case, due to effects outlined in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38.

My hypothesis here is that fixing the productivity curve above has the following effect on benchmarking a patch that changes allocations:

Basically, we fix the GC parameterisation and vary allocations instead of fixing allocations and varying GC parameters. For a fixed GC parameterisation (which is always the case when running NoFib), we would expect an optimisation that reduces allocations by 10% to lead to less GC time, thus higher productivity. Yet, in https://ghc.haskell.org/trac/ghc/ticket/9476#comment:55, there is a regression in total allocations by 18.6%, while counted instructions improve by 11.7% (similar results when measuring actual runtime). As the thread reveals, I had to do quite some digging to find out why (https://ghc.haskell.org/trac/ghc/ticket/9476#comment:71): The program produces more garbage, leading to more favorable points at which GC is done. We don't want that to dilute our comparison!

This is also the reason why iterating the same program by restarting the whole process won't help: GC parameterisation is deterministic (rightly so, IMO) and restarting the process will only measure the same dilution over and over.

On the other hand, iterating the logic $n times from within the program leads to different GC states at which the actual benchmark logic starts, and thus to a more uniform distribution of the points in the program at which GC happens.

For every benchmark in the patch, I made sure that replicateM_ 100 is actually a sensible thing to do. Compare that to the current version of awards where that is not the case: GHC will float out the actual computation and all that is measured is IO. In paraffins I used replicateM_ 100 (or forM_ [1..100] $ const, rather, TODO), because the inner action has a call to getArgs, the result of which is used to build up the input data. GHC can't know that the result of getArgs doesn't change, so it can't memoize the benchmark result, and this measures what we want. In other cases I actually had to use forM_ and make use of the counter somewhere in generating the input data.
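
To make that last pattern concrete, here is a rough sketch (made-up workload, not actual nofib code):

{{{#!haskell
-- The loop counter feeds into the input, so every iteration computes
-- something new and no result can be shared between iterations.
import Control.Monad (forM_)
import System.Environment (getArgs)

benchmark :: Int -> Int
benchmark m = sum [ x * x | x <- [1 .. m] ]  -- stand-in for the real work

main :: IO ()
main = do
  [n] <- getArgs
  forM_ [1 .. 100 :: Int] $ \i ->
    print (benchmark (read n + i))
}}}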

  • You say "GC wibbles", but I'm not sure if these are actually GC wibbles. I just checked paraffins: it doesn't do any IO (other than printing the results), and it's not even threaded (does not use the threaded runtime, does not do forkIO). So I think it should be quite deterministic, and I think any wibbles are due to the OS side of things. In other words, if we could have an OS that only runs paraffins and nothing else, I think the results would be quite deterministic.

Of course this doesn't change the fact that we're getting non-deterministic results and we should do something about it, I'm just trying to understand the root cause here.

The numbers are deterministic, but they are off in the sense above. By GC wibble, I mean that varying tiny parameters of the program or the GC has a huge, non-monotonic, 'discontinuous' impact on the perceived performance, which makes the benchmark suite ill-suited to evaluating any optimisation that changes allocations.

Replying to osa1:

I'm also wondering if having a fixed iteration number is a good idea. What if, in 5 years, 100 iterations for paraffins is not enough to get reliable numbers? I think criterion also has a solution for this (it does a different number of repetitions depending on the results).

The point of iterating 100 times is not to adjust runtime so that we measure something other than system overhead (type 1 in https://ghc.haskell.org/trac/ghc/ticket/5793#comment:38). It's solely to get smoother results in the above sense. There's still the regular fast/norm/slow mechanism to tune for specific runtimes, which are quite likely to vary in the future.

Of course, we could just as well increase 100 to 243 for even smoother results instead of modifying the actual input problem. But having it fixed means that the curve for fast is as smooth as for slow, which is a good thing, I think. It means we can measure counted instructions on fast without having to worry too much about said dilutions.

Replying to osa1:

Here's a library that can be useful for this purpose: http://hackage.haskell.org/package/bench

Ah, yes. That could indeed be worthwhile as a replacement for the current runner, but only if it supported measuring more than just time.

comment:6 Changed 9 months ago by simonmar

So as I understand it, the GC "wibbles" you're talking about are caused by the number of GCs we run? Making a small change to the nursery size can make the difference between N and N+1 GC runs, which could be a large difference in runtime.

You're only looking at -G1, right? Generational GC often has weird effects based on the timing of when a GC runs. I think there will still be issues when there's an old-gen collection right around the end of the program run - making a small change may mean the difference between running or not running the expensive GC.

comment:7 in reply to:  6 Changed 9 months ago by sgraf

Replying to simonmar:

So as I understand it, the GC "wibbles" you're talking about are caused by the number of GCs we run? Making a small change to the nursery size can make the difference between N and N+1 GC runs, which could be a large difference in runtime.

Yes, that's one source of wibble (in hindsight, that may have been a bad term to use here). But it's not exactly the reason why I'm doing this: Have a look at the numbers in https://ghc.haskell.org/trac/ghc/ticket/9476#comment:55. The ./default had significantly fewer Gen 0 collections and the same number of Gen 1 collections as ./allow-cg (which produces more garbage but is faster in total). Gen 1 collections were cheaper for ./allow-cg for some reason. Also note how this correlates with the productivity rate: 10% vs 15% for the latter. The findings in the thread led me to plot the above curves.

You're only looking at -G1, right? Generational GC often has weird effects based on the timing of when a GC runs. I think there will still be issues when there's an old-gen collection right around the end of the program run - making a small change may mean the difference between running or not running the expensive GC.

This is not -G1, and I agree that a single old-gen collection might make the difference. But when we modify the program in such a way that there are more Gen 1 collections, at more uniformly distributed points in the program, I argue we will have a much better experience comparing nofib numbers. There are multiple ways to achieve this, but I think the simplest one is what I outline above, and it more closely corresponds to the workload of real applications (e.g. long running time, growing and shrinking working sets).

comment:8 Changed 9 months ago by sgraf

Description: modified (diff)

comment:9 Changed 8 months ago by Sebastian Graf <sebastian.graf@…>

In 8632268/nofib:

Stabilise benchmarks wrt. GC

Summary:
This is due to #15999, a follow-up on #5793 and #15357, and changes all
benchmarks, some of them (e.g. `wheel-sieve1`, `awards`) rather drastically.

The general plan is outlined in #15999: Identify GC-sensitive benchmarks by
looking at how productivity rates change over different nursery sizes and
iterate `main` of these benchmarks often enough for the non-monotonicity and
discontinuities to go away.

I took care that the benchmarked logic is actually run $n times more often,
rather than just benchmarking IO operations that print the result of CAFs.

When I found benchmarks with insignificant runtime (#15357), I made sure that
parameters/input files were adjusted so that the runtimes of the different modes
fall within the ranges proposed in
https://ghc.haskell.org/trac/ghc/ticket/15357#comment:4

- fast: 0.1-0.2s
- norm: 1-2s
- slow: 5-10s

This is what I did:

- Stabilise bernoulli
- Stabilise digits-of-e1
- Stabilise digits-of-e2
- Stabilise gen_regexp
- Adjust running time of integrate
- Adjust running time of kahan
- Stabilise paraffins
- Stabilise primes
- Adjust running time of rfib
- Adjust running time of tak
- Stabilise wheel-sieve1
- Stabilise wheel-sieve2
- Adjust running time of x2n1
- Adjust running time of ansi
- Adjust running time of atom
- Make awards benchmark something other than IO
- Adjust running time of banner
- Stabilise boyer
- Adjust running time of boyer2
- Adjust running time of queens
- Adjust running time of calendar
- Adjust runtime of cichelli
- Stabilise circsim
- Stabilise clausify
- Stabilise constraints with moderate success
- Adjust running time of cryptarithm1
- Adjust running time of cryptarithm2
- Adjust running time of cse
- Adjust running time of eliza
- Adjust running time of exact-reals
- Adjust running time of expert
- Stabilise fft2
- Stabilise fibheaps
- Stabilise fish
- Adjust running time for gcd
- Stabilise comp_lab_zift
- Stabilise event
- Stabilise fft
- Stabilise genfft
- Stabilise ida
- Adjust running time for listcompr
- Adjust running time for listcopy
- Adjust running time of nucleic2
- Attempt to stabilise parstof
- Stabilise sched
- Stabilise solid
- Adjust running time of transform
- Adjust running time of typecheck
- Stabilise wang
- Stabilise wave4main
- Adjust running time of integer
- Adjust running time of knights
- Stabilise lambda
- Stabilise lcss
- Stabilise life
- Stabilise mandel
- Stabilise mandel2
- Adjust running time of mate
- Stabilise minimax
- Adjust running time of multiplier
- Adjust running time of para
- Stabilise power
- Adjust running time of primetest
- Stabilise puzzle with mild success
- Adjust running time for rewrite
- Stabilise simple with mild success
- Stabilise sorting
- Stabilise sphere
- Stabilise treejoin
- Stabilise anna
- Stabilise bspt
- Stabilise cacheprof
- Stabilise compress
- Stabilise compress2
- Stabilise fem
- Adjust running time of fluid
- Stabilise fulsom
- Stabilise gamteb
- Stabilise gg
- Stabilise grep
- Adjust running time of hidden
- Stabilise hpg
- Stabilise infer
- Stabilise lift
- Stabilise linear
- Attempt to stabilise maillist
- Stabilise mkhprog
- Stabilise parser
- Stabilise pic
- Stabilise prolog
- Attempt to stabilise reptile
- Adjust running time of rsa
- Adjust running time of scs
- Stabilise symalg
- Stabilise veritas
- Stabilise binary-trees
- Adjust running time of fasta
- Adjust running time of k-nucleotide
- Adjust running time of pidigits
- Adjust running time of reverse-complement
- Adjust running time of spectral-norm
- Adjust running time of fannkuch-redux
- Adjust running time for n-body

Problematic benchmarks:

- `last-piece`: Unclear how to stabilise. Runs for 300ms and I can't make up smaller inputs because I don't understand what it does.
- `pretty`: It's just much too small to be relevant at all. Maybe we want to get rid of this one?
- `scc`: Same as `pretty`. The input graph for which SCC analysis is done is much too small and I can't find good directed example graphs on the internet.
- `secretary`: Apparently this needs `-package random` and consequently hasn't been run for a long time.
- `simple`: Same as `last-piece`. Decent runtime (70ms), but it's unstable and I see no way to iterate it ~100 times in fast mode.
- `eff`: Every benchmark is problematic here. Not from the point of view of allocations, but because the actual logic is vacuous. IMO, these should be performance tests, not actual benchmarks. Alternatively, write an actual application that makes use of algebraic effects.
- `maillist`: Too trivial. It's just String/list manipulation, not representative of any Haskell code we would write today (no use of base library functions which could be fused, uses String instead of Text). It's only 75 LOC according to `cloc`; that's not a `real` application.

Reviewers: simonpj, simonmar, bgamari, AndreasK, osa1, alpmestan, O26 nofib

GHC Trac Issues: #15999

Differential Revision: https://phabricator.haskell.org/D5438

comment:10 Changed 8 months ago by sgraf

Resolution: fixed
Status: new → closed