Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#3518 closed bug (fixed)

GHC GC rises greatly on -N8 compared to -N7

Reported by: nccb Owned by:
Priority: normal Milestone:
Component: Runtime System Version: 6.10.4
Keywords: Cc:
Operating System: Linux Architecture: x86
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I've recently been benchmarking some parallel code on an 8-core x86 machine. I have graphed performance for -N1 through to -N8, but I often get a huge spike on -N8. I can produce this fairly reliably with the attached program, that simply spawns 8 threads that each spin writing to a TVar. Here is an example of the timings I get:

time ./Main +RTS -N1
real	0m43.697s
user	0m43.683s
sys	0m0.016s

time ./Main +RTS -N4
real	0m24.424s
user	1m13.621s
sys	0m0.208s

time ./Main +RTS -N7
real	0m18.631s
user	1m42.778s
sys	0m0.260s

time ./Main +RTS -N8
real	0m37.859s
user	3m38.506s
sys	0m0.632s

Spot the odd one out! It's not to do with having exactly 8 threads for 8 cores -- I get this problem with ~150 threads on 8 cores. When I run with -s, it seems like the GC is the reason:

time ./Main +RTS -N7 -s 
   7,313,502,412 bytes allocated in the heap
     108,812,664 bytes copied during GC
          11,924 bytes maximum residency (100 sample(s))
          61,828 bytes maximum slop
               5 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  4207 collections,     0 parallel,  1.18s,  1.17s elapsed
  Generation 1:   100 collections,    99 parallel,  0.18s,  0.04s elapsed

  Parallel GC work balance: 3.25 (285256 / 87661, ideal 7)

  Task  0 (worker) :  MUT time:   5.56s  ( 11.70s elapsed)
                      GC  time:   0.01s  (  0.01s elapsed)

  Task  1 (worker) :  MUT time:   6.65s  ( 11.70s elapsed)
                      GC  time:   1.16s  (  1.13s elapsed)

  Task  2 (worker) :  MUT time:   0.03s  ( 11.70s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  3 (worker) :  MUT time:   0.00s  ( 11.70s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  4 (worker) :  MUT time:   9.93s  ( 11.70s elapsed)
                      GC  time:   0.01s  (  0.01s elapsed)

  Task  5 (worker) :  MUT time:   6.97s  ( 11.70s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  6 (worker) :  MUT time:   8.25s  ( 11.70s elapsed)
                      GC  time:   0.04s  (  0.02s elapsed)

  Task  7 (worker) :  MUT time:   6.79s  ( 11.72s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  8 (worker) :  MUT time:   3.70s  ( 11.70s elapsed)
                      GC  time:   0.14s  (  0.03s elapsed)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   61.91s  ( 11.70s elapsed)
  GC    time    1.35s  (  1.21s elapsed)
  EXIT  time    0.01s  (  0.02s elapsed)
  Total time   63.27s  ( 12.93s elapsed)

  %GC time       2.1%  (9.4% elapsed)

  Alloc rate    118,112,344 bytes per MUT second

  Productivity  97.9% of total user, 478.9% of total elapsed

recordMutableGen_sync: 383
gc_alloc_block_sync: 2
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 16
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 175
gen[1].steps[0].sync_large_objects: 0

real	0m12.933s
user	1m3.276s
sys	0m0.248s
time ./Main +RTS -N8 -s 
   7,316,581,044 bytes allocated in the heap
     153,424,584 bytes copied during GC
          11,964 bytes maximum residency (132 sample(s))
          65,916 bytes maximum slop
               6 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  4711 collections,  4711 parallel, 136.08s, 21.38s elapsed
  Generation 1:   132 collections,   132 parallel,  6.12s,  0.93s elapsed

  Parallel GC work balance: 1.01 (38276826 / 38012855, ideal 8)

  Task  0 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  30.31s  (  4.74s elapsed)

  Task  1 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  25.09s  (  3.68s elapsed)

  Task  2 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:   0.09s  (  0.01s elapsed)

  Task  3 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  4 (worker) :  MUT time:   0.98s  ( 14.56s elapsed)
                      GC  time:   7.62s  (  1.17s elapsed)

  Task  5 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  13.09s  (  2.01s elapsed)

  Task  6 (worker) :  MUT time:   5.81s  ( 14.56s elapsed)
                      GC  time:   1.92s  (  0.28s elapsed)

  Task  7 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  25.12s  (  4.24s elapsed)

  Task  8 (worker) :  MUT time:   0.95s  ( 14.56s elapsed)
                      GC  time:   8.01s  (  1.20s elapsed)

  Task  9 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  30.94s  (  4.98s elapsed)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   65.37s  ( 14.56s elapsed)
  GC    time  142.20s  ( 22.31s elapsed)
  EXIT  time    0.00s  (  0.01s elapsed)
  Total time  207.57s  ( 36.87s elapsed)

  %GC time      68.5%  (60.5% elapsed)

  Alloc rate    111,922,106 bytes per MUT second

  Productivity  31.5% of total user, 177.3% of total elapsed

recordMutableGen_sync: 597
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 35296
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 557515
gen[1].steps[0].sync_large_objects: 0

real	0m36.873s
user	3m27.569s
sys	0m0.524s

See how the GC goes from 1s to 142s, just by virtue of -N8 instead of -N7. Increasing the heap size seems to alleviate the problem, but the difference is so severe (and unexpected, at least to me) that I thought I'd file a ticket. I am happy to try other flags if you want. This is all on GHC 6.10.4.

Attachments (1)

Main.hs (583 bytes) - added by nccb 10 years ago.

Download all attachments as: .zip

Change History (4)

Changed 10 years ago by nccb

Attachment: Main.hs added

comment:1 Changed 10 years ago by NeilMitchell

I've seen this too. At -N8 some extra parallelism kicks in with the garbage collector, which destroys everything. The parallel GC in 6.12 is far better, so this issue will be obsolete then.

Until then, I always pass -N8 -g1 which turns off parallel GC and makes everything much faster again.

comment:2 Changed 10 years ago by simonmar

difficulty: Unknown
Resolution: fixed
Status: newclosed

This is fixed in 6.12.1. Please re-open if you still have problems with 6.12.1.

comment:3 Changed 10 years ago by simonmar

Type of failure: Runtime performance bug
Note: See TracTickets for help on using tickets.