Opened 6 years ago

Last modified 9 months ago

#8611 patch bug

nofib’s cacheprof’s allocations nondeterministic

Reported by: nomeata
Owned by:
Priority: normal
Milestone:
Component: NoFib benchmark suite
Version: 8.5
Keywords:
Cc: osa1, sgraf, simonmar
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets: #4450 #16065
Differential Rev(s): Phab:D5470
Wiki Page:

Description

This seems to be neither expected nor desired, and hence worth investigating:

./cacheprof +RTS -t 
<<ghc: 127588752 bytes, 243 GCs, 2261288/6551032 avg/max bytes residency (6 samples), 15M in use, 0.00 INIT (0.00 elapsed), 0.12 MUT (0.12 elapsed), 0.05 GC (0.05 elapsed) :ghc>>
./cacheprof +RTS -t 
<<ghc: 127392144 bytes, 242 GCs, 2261304/6551576 avg/max bytes residency (6 samples), 15M in use, 0.00 INIT (0.00 elapsed), 0.11 MUT (0.11 elapsed), 0.04 GC (0.05 elapsed) :ghc>>

The allocation numbers change when they shouldn’t.

Reproduced with 7.6.3 and current HEAD.

Change History (16)

comment:1 Changed 6 years ago by nomeata

I could not find the cause immediately, so I’ll report my partial findings here.

  • It is not the IO part. Inlining the input as one big string, and skipping the output, does not make the symptom disappear.
  • Ticky-ticky profiling yields identical results for the cacheprof modules; the only difference is in the global counters (HEAP_CHK_ctr and STK_CHK_ctr, to be precise).

comment:2 Changed 6 years ago by nomeata

The cause does not seem to be tied to a particular part of the code. Luckily, doFile has a very plain pipeline. The symptoms disappear when I stop the pipeline at with_ccs and reappear if I go on until with_synth_2. But if I read with_ccs from a file instead of calculating it and only do the synth_2 call, the result also becomes deterministic. Tricky.
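For readers unfamiliar with the benchmark, the bisection described here amounts to cutting a pure pipeline at different stages. The following is only an illustrative sketch of that shape, not cacheprof’s actual code; processCcs and synth2 are hypothetical stand-ins:

```haskell
import Data.Char (toUpper)

-- Hypothetical sketch of the doFile pipeline being bisected above; the
-- transformation stages are placeholders, not cacheprof's real definitions.
doFile :: String -> String
doFile input =
  let with_ccs     = processCcs input   -- stopping here: deterministic
      with_synth_2 = synth2 with_ccs    -- going this far: non-deterministic
  in with_synth_2
  where
    processCcs = map toUpper            -- placeholder stage
    synth2     = reverse                -- placeholder stage
```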

comment:3 Changed 2 years ago by Joachim Breitner <mail@…>

In c241954/nofib:

Notes: Link to #8611 (cacheprof nondeterministic)

comment:4 Changed 15 months ago by osa1

Cc: osa1 added
Version: 7.6.3 → 8.5

This has been tainting nofib results for years; should we maybe prioritize and fix it?

comment:5 Changed 15 months ago by nomeata

It just needs someone determined enough to find the cause…

comment:6 Changed 9 months ago by sgraf

Cc: sgraf added

Although serialising and deserialising the result of with_ccs gets rid of the non-determinism in total allocations AFAICT (using Read/Show is pretty slow, so I only did ~20 iterations), the number of residency samples (and with it the average residency) still seems to vary by one (23 vs. 24). Not sure if it's important.

comment:7 Changed 9 months ago by sgraf

If I seq with_ccs like this:

foldr seq () with_ccs `seq` return final

where final = final_cleanup with_synth_2, the non-determinism seems to be gone. Still, the number of samples (and consequently the average residency) varies between 10 and 11.
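As an aside on the idiom: foldr seq () xs builds x1 `seq` (x2 `seq` (… `seq` ())), so evaluating it forces the list’s spine and each element to weak head normal form. A minimal, self-contained sketch, where the list and final are stand-ins for cacheprof’s actual values:

```haskell
-- `foldr seq ()` forces the spine of a list and each element to WHNF.
forceSpine :: [a] -> ()
forceSpine = foldr seq ()

main :: IO ()
main = do
  let with_ccs = map (* 2) [1 .. 5 :: Int]  -- stand-in for the real with_ccs
      final    = sum with_ccs               -- stand-in for the real `final`
  forceSpine with_ccs `seq` print final
```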

comment:8 Changed 9 months ago by sgraf

If I force data_areas in synth_2 like this:

foldr seq () data_areas `seq` synthd_assy ++ data_areas

the non-determinism is irreproducible. If I instead force the whole list, like this:

let res = synthd_assy ++ data_areas
in foldr seq () res `seq` res

it's still non-deterministic. It seems like this isn't really possible to localise?! I'll leave this for another day.
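The difference between the two placements is only when the spine of data_areas gets forced relative to building the append. A sketch with hypothetical stand-in lists (not cacheprof’s real synth_2 data):

```haskell
-- Hypothetical stand-ins for synth_2's lists; only the forcing order differs.
synthd_assy, data_areas :: [Int]
synthd_assy = [1, 2, 3]
data_areas  = [4, 5, 6]

-- Variant 1 (reported deterministic above): force data_areas eagerly,
-- then build the append lazily.
variant1 :: [Int]
variant1 = foldr seq () data_areas `seq` (synthd_assy ++ data_areas)

-- Variant 2 (reported still non-deterministic): build the append first and
-- force the whole result, reaching data_areas only after synthd_assy.
variant2 :: [Int]
variant2 = let res = synthd_assy ++ data_areas
           in foldr seq () res `seq` res
```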

comment:9 Changed 9 months ago by sgraf

Some notes from my Windows machine:

  • In the default build configuration with -O2, allocations are completely stable. Maximum residency is unstable, though.
  • The instability in max residency goes away if I apply similar surgical seqs as above.
  • In -O0, I even get unstable allocations on Windows, with the samples spread wider than on Linux the other day.
  • In -O0 -prof -fprof-auto there's still a rare, minimal instability with a delta of 40 bytes.
  • Judging from +RTS -S output, the culprit seems to be an erroneous live set. Comparing two reports from different runs, there was always a point at which the live bytes differed, and only much later would total bytes allocated differ. I had a sample where the difference was 1488 bytes in (minor) collection no. 11. These allocations seem to be kept live until the end of the program, and more live data accumulates during the run.

comment:10 Changed 9 months ago by sgraf

Just a few notes while it's still fresh in my head. I always compared two traces via rr.

  • Based on my observation that bytes copied was the first metric to diverge, I did some printf debugging in the RTS.
  • I noticed that the first divergence comes from copying additional blackholes. After evacuating the blackhole in trace B, it resumes 'normal' operation, i.e. it does the same things trace A does instead of evacuating the black hole.
  • The black holes are evacuated when the stack is scavenged. They're pointed to by update frames, all of which originally pointed to stg_sel_0_upd selector thunks.
  • I also noticed that the mut_list_size output of +RTS -Dg varies between the two traces, and it's never because of any mutable primitives (e.g., arrays, MVars, etc.).

I'm not sure what pushes (or fails to push) the update frames. Maybe it has to do with ThreadPaused.c:stackSqueeze()? Maybe this is related to #4450?

comment:11 Changed 9 months ago by sgraf

Cc: simonmar added

Indeed, never making the call to ThreadPaused.c:stackSqueeze() gets rid of the non-determinism. So does +RTS -V0 (which disables the master tick timer that triggers context switches, if I understand correctly) as proposed in #4450, and so does +RTS -Z (which just deactivates stack squeezing completely).

The latter seems like an easy fix for this particular benchmark, but this will probably come up again in the future. Maybe we can teach the RTS not to do stack squeezing during context switches?

CCing simonmar for input, as I'm not particularly familiar with RTS internals.

comment:12 Changed 9 months ago by sgraf

Differential Rev(s): Phab:D5460
Status: new → patch

comment:13 in reply to:  11 Changed 9 months ago by sgraf

Replying to sgraf:

The latter seems like an easy fix for this particular benchmark, but this will probably come up again in the future. Maybe we can teach the RTS not to do stack squeezing during context switches?

CCing simonmar for input, as I'm not particularly familiar with RTS internals.

Continuing discussion in #16065.

comment:14 Changed 9 months ago by nomeata

Whoohoo, thanks for drilling down here!

comment:15 Changed 9 months ago by sgraf

Differential Rev(s): Phab:D5460 → Phab:D5470

comment:16 Changed 9 months ago by Sebastian Graf <sgraf1337@…>

In e2d614e/nofib:

Disable timer-based context switches

Summary:
In the past, we repeatedly had problems with non-deterministic allocations due to stack squeezing during context switches (#4450, #8611).
This patch adds `+RTS -V0 -RTS` as extra `RUNTEST_OPTS` for every single-threaded benchmark.

Is this the right place to add the flags? Should we also do this for all the other benchmarks?

Reviewers: simonmar, osa1, nomeata, O26 nofib

GHC Trac Issues: #8611

Differential Revision: https://phabricator.haskell.org/D5470