Opened 11 years ago

Closed 11 years ago

Last modified 10 years ago

#2712 closed bug (fixed)

Parallel GC scheduling problems

Reported by: simonmar
Owned by: simonmar
Priority: high
Milestone: 6.12 branch
Component: Runtime System
Version: 6.11
Keywords:
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:


The parallel GC uses its own gang of threads separate from those used to run the program. This is causing performance loss, severe in some cases, especially when the number of GC threads and mutator threads equals the number of processor cores. In this case, when the GC spins up, the OS has to schedule N threads onto N cores, where all cores already have other threads running. It has to correctly choose to bump the old mutator threads off to make room for the new GC threads, but at least on Linux it doesn't always succeed in doing this, and there can be a delay while the scheduler sorts things out (as much as 50ms). The measurements I've been using to test the parallel GC so far have been mostly on single-threaded programs, so this problem only emerged recently.

Really we ought to be using the mutator threads as GC threads too. Things are made slightly more complicated by the fact that some of the mutator threads might not be awake when we GC, if not all cores are busy. Perhaps we should bite the bullet and try to set affinity masks.

If this is affecting you, try turning off the parallel GC, or reducing the number of threads it uses, with e.g. +RTS -g1.
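
To check whether a program is affected, it can help to have a small parallel, allocation-heavy test case to run with and without the workaround. The following is only a sketch and is not taken from this ticket: the names work and parMapSum and the workload sizes are made up for illustration, and the only library functions assumed are par, pseq and numCapabilities from GHC.Conc.

```haskell
-- Hypothetical test program (not from the ticket): several sparks,
-- each allocating enough to trigger frequent garbage collections.
module Main (main) where

import Data.List (foldl')
import GHC.Conc (numCapabilities, par, pseq)

-- | Allocation-heavy work: 'reverse' materialises the whole list, so
-- the collector has live data to copy while the sum is computed.
work :: Int -> Int
work n = foldl' (+) 0 (reverse [1 .. n])

-- | Sum of 'f' over a list, sparking each application so that idle
-- capabilities can pick it up.
parMapSum :: (a -> Int) -> [a] -> Int
parMapSum f = go
  where
    go []       = 0
    go (x : xs) =
      let y    = f x
          rest = go xs
      in y `par` (rest `pseq` (y + rest))

main :: IO ()
main = do
  putStrLn ("capabilities: " ++ show numCapabilities)
  -- 64 tasks of 200000 elements each; sizes are arbitrary.
  print (parMapSum work (replicate 64 200000))
```

Compile it with -threaded, then compare a run using one capability per core (for example +RTS -N4 -s on a four-core machine) against the same run with -g1 added. The -s summary reports GC time and elapsed time, which is where a scheduling delay of this kind would be expected to show up. Newer compilers may also need -rtsopts at link time before they accept these options.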

Change History (3)

comment:1 Changed 11 years ago by simonmar

Milestone: branch
Resolution: fixed
Status: new → closed

Done in the HEAD, but I probably won't backport because the changes are too large and potentially destabilising. Here's the main patch, for reference:

Fri Nov 21 15:12:33 GMT 2008  Simon Marlow <>
  * Use mutator threads to do GC, instead of having a separate pool of GC threads
  Previously, the GC had its own pool of threads to use as workers when
  doing parallel GC.  There was a "leader", which was the mutator thread
  that initiated the GC, and the other threads were taken from the pool.
  This was simple and worked fine for sequential programs, where we did
  most of the benchmarking for the parallel GC, but falls down for
  parallel programs.  When we have N mutator threads and N cores, at GC
  time we would have to stop N-1 mutator threads and start up N-1 GC
  threads, and hope that the OS schedules them all onto separate cores.
  In practice it doesn't, as you might expect.
  Now we use the mutator threads to do GC.  This works quite nicely,
  particularly for parallel programs, where each mutator thread scans
  its own spark pool, which is probably in its cache anyway.
  There are some flag changes:
    -g<n> is removed (-g1 is still accepted for backwards compat).
    There's no way to have a different number of GC threads than mutator
    threads now.
    -q1       Use one OS thread for GC (turns off parallel GC)
    -qg<n>    Use parallel GC for generations >= <n> (default: 1)
  Using parallel GC only for generations >=1 works well for sequential
  programs.  Compiling an ordinary sequential program with -threaded and
  running it with -N2 or more should help if you do a lot of GC.  I've
  found that adding -qg0 (do parallel GC for generation 0 too) speeds up
  some parallel programs, but slows down some sequential programs.
  Being conservative, I left the threshold at 1.
  ToDo: document the new options.
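
The thread arithmetic in the patch description can be written out as a toy model. This is a sketch for illustration only: the Handoff type and the oldScheme and newScheme functions are hypothetical names, not part of the RTS, and the model assumes N mutator threads on N cores, all of them awake when the GC starts.

```haskell
-- Toy model (hypothetical, not RTS code) of the OS-level thread
-- hand-offs needed at the start of one parallel GC.
module Main (main) where

data Handoff = Handoff
  { stopped :: Int  -- OS threads that must give up their core
  , woken   :: Int  -- OS threads the scheduler must place on a core
  } deriving (Eq, Show)

-- | Separate GC pool: the leader keeps running, the other n-1 mutator
-- threads stop and n-1 pool threads start, as the patch describes.
oldScheme :: Int -> Handoff
oldScheme n = Handoff { stopped = n - 1, woken = n - 1 }

-- | Mutator threads do the GC themselves, so no OS thread changes core.
newScheme :: Int -> Handoff
newScheme _ = Handoff { stopped = 0, woken = 0 }

main :: IO ()
main = mapM_ report [2, 4, 8]
  where
    report n =
      putStrLn (show n ++ " cores: old " ++ show (oldScheme n)
                       ++ ", new " ++ show (newScheme n))
```

The model says nothing about how long each hand-off takes; it only shows that the number of placements the OS has to get right grows with the core count under the old scheme and stays at zero under the new one.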

comment:2 Changed 10 years ago by simonmar

difficulty: Moderate (1 day) → Moderate (less than a day)

comment:3 Changed 10 years ago by simonmar

Type of failure: Runtime performance bug