Opened 15 months ago

Last modified 9 months ago

#15333 new bug

Nursery size adulterates cachegrind metrics in nofib

Reported by: sgraf Owned by:
Priority: normal Milestone: Research needed
Component: NoFib benchmark suite Version: 8.5
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #5793 #15999 Differential Rev(s):
Wiki Page:

Description (last modified by sgraf)

I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm relying heavily on cachegrind's counted instructions/memory accesses for my findings.

Check out the most recent version of nofib, enter shootout/binary-trees and run the following script:

#! /bin/sh
sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent
echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace

# bt1: Vanilla
sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip the `trace "" $ ` prefix from the call to `bit`
ghc -O2 -XBangPatterns Main.hs -o bt1

# bt2: Additional trace call
sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace "" $ ` to the call to `bit`
ghc -O2 -XBangPatterns Main.hs -o bt2

valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace
valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace

This compiles two versions of binary-trees: the original, unchanged version, and one with an extra trace "" $ call before the only call to the bit function. One would expect the version with the trace call (bt2) to allocate more than the version without (bt1). Indeed, the output of +RTS -s confirms that:

$ ./bt1 12 +RTS -s
...
43,107,560 bytes allocated in the heap
...
$ ./bt2 12 +RTS -s
...
43,116,888 bytes allocated in the heap
...

That's fine. A few benchmark runs by hand also suggested that the tracing version is a little slower (probably due to IO).

Compare that to the output of the above cachegrind calls:

$ valgrind --tool=cachegrind ./bt1 12 > /dev/null
...
I   refs:      118,697,583
...
D   refs:       43,475,212
...
$ valgrind --tool=cachegrind ./bt2 12 > /dev/null
...
I   refs:      116,340,710
...
D   refs:       42,523,369
...

It's the other way round here! How's that possible?

Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. By the way, I couldn't reproduce this with my own no-op trace _ a = a; {-# NOINLINE trace #-}.
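For reference, the no-op replacement mentioned above as a standalone module (a sketch; unlike the real Debug.Trace.trace it performs no IO at all, only the extra call and closure allocation remain):

```haskell
module Main where

-- No-op stand-in for Debug.Trace.trace: same shape of call site,
-- but it ignores the message and performs no IO.
-- NOINLINE keeps GHC from optimising the extra call away.
trace :: String -> a -> a
trace _ a = a
{-# NOINLINE trace #-}

main :: IO ()
main = print (trace "" (1 + 1 :: Int))
```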

I checked this on GHC 8.2.2, 8.4.3 and a semi-recent HEAD commit (bb539cfe335eeec7989332b859b1f3162c5e105a).

Change History (8)

comment:1 Changed 15 months ago by sgraf

Description: modified (diff)

comment:2 Changed 13 months ago by sgraf

I'm still not really sure what's going on, but I suspect that alignment instructions might skew cachegrind's instruction counts, according to section 5.2.10 of the valgrind manual. I've hit a few other such cases in the meantime. Passing -fllvm -optlo -Os to use the LLVM backend, optimising for code size, helps.

Is there some equivalent for the NCG? That would make the counted-instruction metric our benchmark CI relies on far more reliable.

comment:3 Changed 13 months ago by sgraf

Description: modified (diff)

comment:4 Changed 13 months ago by sgraf

Passing -fllvm -optlo -Os doesn't seem to remedy the symptoms of the original bug.

comment:5 Changed 13 months ago by sgraf

Description: modified (diff)

comment:6 Changed 13 months ago by sgraf

I think I figured this out. This is due to the nursery size, specifically the points in time at which the GC kicks in. For benchmarks with a working set of varying size, Gen 0 collection performance heavily depends on the size of the working set within the nursery at the moment collection is triggered. There obviously is no way to predict this reliably.
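The mechanism can be illustrated with a toy model (my own sketch, not the actual RTS: it assumes a Gen 0 collection fires exactly every nursery's worth of allocation and costs the number of bytes live at that point, with a made-up sawtooth live set):

```haskell
module Main where

-- Default -A nursery size.
nursery :: Int
nursery = 1024 * 1024

-- Hypothetical live-set size as a function of bytes allocated so far:
-- a sawtooth, i.e. the working set grows and then mostly dies.
liveAt :: Int -> Int
liveAt n = n `mod` (3 * nursery)

-- Total bytes copied over a run that allocates 43 MB (roughly what
-- binary-trees does), with every collection point shifted earlier
-- by `extra` bytes of additional allocation.
copiedBytes :: Int -> Int
copiedBytes extra =
  sum [ liveAt n | n <- [nursery - extra, 2 * nursery - extra .. 43 * nursery] ]

main :: IO ()
main = do
  print (copiedBytes 0)      -- baseline
  print (copiedBytes 10000)  -- a few KB more allocation, very different total
```

The point is only that the total copied bytes is not monotone in the allocation total: shifting every collection point by a few KB can move the copied-bytes sum in either direction, depending on where the sawtooth happens to be sampled.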

In this case, the default nursery size seemed to be particularly unfortunate, as +RTS -S reveals:

    Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
    bytes     bytes     bytes   user   elap     user     elap
  1097344    490864    548288  0.001  0.001    0.002    0.002    0    0  (Gen:  0)
  1044752    645640    706848  0.001  0.001    0.003    0.003    0    0  (Gen:  0)
  1046272    592608    650032  0.001  0.001    0.004    0.004    0    0  (Gen:  1)
  1079040     17808    699952  0.000  0.000    0.005    0.005    0    0  (Gen:  0)
  1046272     17808    717104  0.000  0.000    0.006    0.006    0    0  (Gen:  0)
  1046272     17808    734256  0.000  0.000    0.006    0.006    0    0  (Gen:  0)
  1079040     17808    784176  0.000  0.000    0.007    0.007    0    0  (Gen:  0)
  1046272     17808    801328  0.000  0.000    0.007    0.007    0    0  (Gen:  0)
  1078160    111840    912488  0.000  0.000    0.007    0.007    0    0  (Gen:  0)
  1036312     96056    911464  0.000  0.000    0.008    0.008    0    0  (Gen:  0)
  1036288      6608    915560  0.000  0.000    0.008    0.008    0    0  (Gen:  0)
  1036288      6608    919656  0.000  0.000    0.009    0.009    0    0  (Gen:  0)
  1036288      6608    923752  0.000  0.000    0.009    0.009    0    0  (Gen:  0)
  1036288      6608    927848  0.000  0.000    0.009    0.009    0    0  (Gen:  0)
  1036288      6608    931944  0.000  0.000    0.010    0.010    0    0  (Gen:  0)
  1036288      6608    936040  0.000  0.000    0.010    0.010    0    0  (Gen:  0)
  1036288      6608    940136  0.000  0.000    0.011    0.011    0    0  (Gen:  0)
  1044920     29264    966864  0.000  0.000    0.011    0.011    0    0  (Gen:  0)
  1045504     25496    963536  0.000  0.000    0.011    0.011    0    0  (Gen:  0)
  1045504      6160    964560  0.000  0.000    0.012    0.012    0    0  (Gen:  0)
  1045504      6160    965584  0.000  0.000    0.012    0.012    0    0  (Gen:  0)
  1045504      6160    966608  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
  1045504      6160    967632  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
  1045504      6160    968656  0.000  0.000    0.014    0.014    0    0  (Gen:  0)
  1045504      6160    969680  0.000  0.000    0.014    0.014    0    0  (Gen:  0)
  1047464     59208   1023728  0.000  0.000    0.015    0.015    0    0  (Gen:  0)
  1047808     57968   1022896  0.000  0.000    0.015    0.015    0    0  (Gen:  0)
  1047808     52912   1023120  0.000  0.000    0.016    0.016    0    0  (Gen:  0)
  1047808     52944   1023376  0.000  0.000    0.016    0.016    0    0  (Gen:  0)
  1047808     52944   1023632  0.000  0.000    0.017    0.017    0    0  (Gen:  0)
  1047808     52944   1023888  0.000  0.000    0.017    0.017    0    0  (Gen:  0)
  1047808     52944   1024144  0.000  0.000    0.018    0.018    0    0  (Gen:  0)
  1047808     52944   1024400  0.000  0.000    0.018    0.018    0    0  (Gen:  0)
  1048296    176376   1148064  0.000  0.000    0.018    0.018    0    0  (Gen:  0)
  1048384    175792   1147856  0.000  0.000    0.019    0.019    0    0  (Gen:  0)
  1048384    174288   1147856  0.000  0.000    0.019    0.019    0    0  (Gen:  0)
  1048384    174288   1147856  0.000  0.000    0.020    0.020    0    0  (Gen:  0)
  1048384    174288   1147856  0.000  0.000    0.020    0.020    0    0  (Gen:  0)
  1048384    174288   1147856  0.000  0.000    0.021    0.021    0    0  (Gen:  0)
  1048384    174288   1147856  0.000  0.000    0.021    0.021    0    0  (Gen:  0)
  1048384    174288   1147856  0.000  0.000    0.022    0.022    0    0  (Gen:  0)
   116160      3488     44576  0.000  0.000    0.022    0.022    0    0  (Gen:  1)
        0                      0.000  0.000

This is the log for bt1 (i.e. without the call to trace). Note the higher numbers in the Copied bytes column compared to the log for bt2 (with the redundant trace call) below, in particular throughout the second half:

    Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
    bytes     bytes     bytes   user   elap     user     elap

  1097448    488888    546312  0.001  0.001    0.002    0.002    0    0  (Gen:  0)
  1044664    645424    706840  0.001  0.001    0.004    0.004    0    0  (Gen:  0)
  1046272    591480    648904  0.001  0.001    0.005    0.005    0    0  (Gen:  1)
  1079040     17488    698824  0.000  0.000    0.006    0.006    0    0  (Gen:  0)
  1046272     17488    715976  0.000  0.000    0.007    0.007    0    0  (Gen:  0)
  1046272     17488    733128  0.000  0.000    0.008    0.008    0    0  (Gen:  0)
  1079040     17488    783048  0.000  0.000    0.008    0.008    0    0  (Gen:  0)
  1046272     17488    800200  0.000  0.000    0.009    0.009    0    0  (Gen:  0)
  1078376    108336    908176  0.000  0.000    0.010    0.010    0    0  (Gen:  0)
  1036312     96600    911232  0.000  0.000    0.011    0.011    0    0  (Gen:  0)
  1036288      7088    915248  0.000  0.000    0.012    0.012    0    0  (Gen:  0)
  1036288      7120    919344  0.000  0.000    0.012    0.012    0    0  (Gen:  0)
  1036288      7120    923440  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
  1036288      7120    927536  0.000  0.000    0.014    0.014    0    0  (Gen:  0)
  1036288      7120    931632  0.000  0.000    0.014    0.014    0    0  (Gen:  0)
  1036288      7120    935728  0.000  0.000    0.015    0.015    0    0  (Gen:  0)
  1036288      7120    939824  0.000  0.000    0.015    0.015    0    0  (Gen:  0)
  1045264     24608    961384  0.000  0.000    0.015    0.015    0    0  (Gen:  0)
  1045504     20824    958056  0.000  0.000    0.016    0.016    0    0  (Gen:  0)
  1045504      1488    959080  0.000  0.000    0.016    0.016    0    0  (Gen:  0)
  1045504      1488    960104  0.000  0.000    0.016    0.016    0    0  (Gen:  0)
  1045504      1488    961128  0.000  0.000    0.017    0.017    0    0  (Gen:  0)
  1045504      1488    962152  0.000  0.000    0.017    0.017    0    0  (Gen:  0)
  1045504      1488    963176  0.000  0.000    0.017    0.017    0    0  (Gen:  0)
  1045504      1488    964200  0.000  0.000    0.018    0.018    0    0  (Gen:  0)
  1047600     53192   1016904  0.000  0.000    0.018    0.018    0    0  (Gen:  0)
  1047808     51952   1016072  0.000  0.000    0.019    0.019    0    0  (Gen:  0)
  1047808     46896   1016296  0.000  0.000    0.019    0.019    0    0  (Gen:  0)
  1047808     46928   1016552  0.000  0.000    0.019    0.019    0    0  (Gen:  0)
  1047808     46928   1016808  0.000  0.000    0.020    0.020    0    0  (Gen:  0)
  1047808     46928   1017064  0.000  0.000    0.020    0.020    0    0  (Gen:  0)
  1047808     46928   1017320  0.000  0.000    0.020    0.020    0    0  (Gen:  0)
  1047808     46928   1017576  0.000  0.000    0.021    0.021    0    0  (Gen:  0)
  1048176    168760   1139640  0.000  0.000    0.021    0.021    0    0  (Gen:  0)
  1048384    168176   1139432  0.000  0.000    0.022    0.022    0    0  (Gen:  0)
  1048384    166672   1139432  0.000  0.000    0.022    0.022    0    0  (Gen:  0)
  1048384    166672   1139432  0.000  0.000    0.023    0.023    0    0  (Gen:  0)
  1048384    166672   1139432  0.000  0.000    0.023    0.023    0    0  (Gen:  0)
  1048384    166672   1139432  0.000  0.000    0.023    0.023    0    0  (Gen:  0)
  1048384    166672   1139432  0.000  0.000    0.024    0.024    0    0  (Gen:  0)
  1048384    166672   1139432  0.000  0.000    0.024    0.024    0    0  (Gen:  0)
   123776      2504     43592  0.000  0.000    0.024    0.025    0    0  (Gen:  1)
        0                      0.000  0.000

I see no easy way around this other than 'deactivating' the GC by choosing a big enough nursery, such as -A11M in this case. Taking the GC out of the equation in this way may be somewhat misrepresentative, but at least we get cachegrind numbers that can be trusted.
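Concretely, the measurements could be redone with a pinned nursery (a command sketch; the binaries may need to be compiled with -rtsopts for the +RTS flags to be accepted):

```shell
# Re-run both binaries with an 11 MB nursery (-A11M) so that no Gen 0
# collection fires mid-run; the collection points can then no longer
# shift between the two versions.
valgrind --tool=cachegrind ./bt1 12 +RTS -A11M -RTS > /dev/null
valgrind --tool=cachegrind ./bt2 12 +RTS -A11M -RTS > /dev/null
```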

There are other cases where I hit this, like bspt. There I improved allocation a bit, which triggered GC at later points when the working set was bigger. The result was an alleged regression of 6% in counted instructions, even though the relevant optimised snippet got faster in isolation.

comment:7 Changed 13 months ago by sgraf

Summary: Weird cachegrind results for binary-trees → Nursery size adulterates cachegrind metrics in nofib

comment:8 Changed 9 months ago by sgraf

See my recent concerns about the non-monotone impact of allocations on productivity. Here's a productivity curve:

https://i.imgur.com/kiZwF0R.png

It's not as bad as with paraffins in #15999, but not really monotone either.
