Opened 6 years ago

Closed 6 years ago

#8513 closed bug (worksforme)

Parallel GC increases CPU load while slowing down program

Reported by: blitzcode
Owned by: simonmar
Priority: normal
Milestone:
Component: Runtime System
Version: 7.6.3
Keywords:
Cc: simonmar
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:


I noticed this issue with a lot of my programs. I have no idea if this is a widely known issue or if I'm just particularly unlucky and/or unskilled when it comes to the GHC GC, but I thought it might be worth reporting as a bug.

Here's a fairly simple program showing the issue:

(Note the 'GHC.Conc.getNumProcessors >>= setNumCapabilities' call; it needs to be removed for testing.)
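For readers unfamiliar with the call mentioned in the note, here is a minimal sketch of what it does. This is not the reporter's program; `useAllProcessors` is a hypothetical helper name introduced only for illustration. The sketch assumes the program is built with `-threaded`, since the non-threaded runtime does not support more than one capability.

```haskell
import Control.Concurrent (getNumCapabilities, setNumCapabilities)
import GHC.Conc (getNumProcessors)

-- Hypothetical helper (not from the ticket): set the number of
-- capabilities to the number of processors and return the new value.
-- Because it runs at program start, it replaces whatever was requested
-- with +RTS -N, which is presumably why it has to go when comparing
-- different -N settings.
useAllProcessors :: IO Int
useAllProcessors = do
  n <- getNumProcessors
  setNumCapabilities n
  getNumCapabilities

main :: IO ()
main = do
  before <- getNumCapabilities   -- value chosen via +RTS -N (default 1)
  after  <- useAllProcessors
  putStrLn ("capabilities before: " ++ show before ++ ", after: " ++ show after)
```

With the call removed, `getNumCapabilities` simply reports whatever `+RTS -N` requested, so each run exercises the setting under test.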

On my quad core machine, this simple (non-parallel, some concurrency for draw & compute) Game-of-Life program runs as follows:

+RTS -N1 = ~520G/s, CPU Load ~100%
+RTS -N2 = ~505G/s, CPU Load ~135%
+RTS -N3 = ~485G/s, CPU Load ~150%
+RTS -N4 = ~485G/s, CPU Load ~160%

Specifying -qg1 caps the CPU load increase at ~135%, and throughput does not drop below ~505G/s. The statistics from +RTS -s also suggest a decrease in GC time and an increase in productivity with -qg1. The program is a bit crummy, but it's the shortest example of this I have at hand. I've seen this in many different programs; serial GC just seems to be faster for a lot of workloads.
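The GC time and productivity figures from +RTS -s can also be read from inside the program, which makes it easier to compare settings such as -N4 against -N4 -qg1. The following is a rough sketch under some assumptions: it uses `GHC.Stats.getRTSStats`, which is available in base 4.10 (GHC 8.2) and later rather than in the 7.6.3 release this ticket was filed against; `productivityOf` is a hypothetical helper, and its ratio is only an approximation of the "Productivity" line that -s prints; the workload is a placeholder.

```haskell
import Control.Exception (evaluate)
import Data.Int (Int64)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Hypothetical helper: fraction of (mutator + GC) time spent in the mutator.
-- Arguments are mutator time and GC time, both in nanoseconds.
productivityOf :: Int64 -> Int64 -> Double
productivityOf mutatorNs gcNs
  | total == 0 = 1
  | otherwise  = fromIntegral mutatorNs / fromIntegral total
  where
    total = mutatorNs + gcNs

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS statistics are off; run with +RTS -T (or -s)"
    else do
      -- Placeholder workload that allocates enough to trigger some GCs.
      _ <- evaluate (length (show (product [1 .. 5000 :: Integer])))
      s <- getRTSStats
      putStrLn ("collections:     " ++ show (gcs s))
      putStrLn ("major GCs:       " ++ show (major_gcs s))
      putStrLn ("GC CPU ns:       " ++ show (gc_cpu_ns s))
      putStrLn ("GC elapsed ns:   " ++ show (gc_elapsed_ns s))
      putStrLn ("productivity:    "
                ++ show (productivityOf (mutator_elapsed_ns s) (gc_elapsed_ns s)))
```

Building with `-threaded -rtsopts` and running once with `+RTS -N4 -T` and once with `+RTS -N4 -qg1 -T` should give two sets of numbers to compare. A GC CPU time well above GC elapsed time would be consistent with the extra CPU load described above.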

I think it might at least be helpful to improve documentation a bit, suggesting some things to try for a GC speedup etc. Apologies if this is already a well-known issue or if I'm just doing something obviously dumb here that makes the GC perform poorly.

Change History (1)

comment:1 Changed 6 years ago by simonmar

Resolution: worksforme
Status: new → closed

Your results seem to be in line with what I would expect. The parallel GC improves performance for (a) parallel programs and (b) sequential programs that have a large residency. For (b) you should use +RTS -qg1.

The documentation for +RTS -qg already mentions the points above, and seems reasonably clear to me:

Your program looks like its main heap structure is a single Vector, which is not very parallelisable in the GC. This would explain why you don't see much speedup with parallel GC.
