Opened 3 years ago

Last modified 3 years ago

#13164 new feature request

idle time full GCs (idle cpu usage)

Reported by: lspitzner Owned by:
Priority: normal Milestone:
Component: Runtime System Version: 8.0.1
Keywords: idle GC Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

In an interactive program I noticed idle CPU usage of ~12% due to the full GCs done by the RTS on default settings. For this specific usecase this is a lot of wasted CPU cycles for no gain. Passing -I0 to the RTS "fixes" this; nonetheless I think that the default behaviour could/should be improved, even when considering other usecases where full GCs are a good choice.

The program in question keeps ~50MB allocated on the heap but does close to no short-lived allocations in between the idle GCs. Some relevant data from the summary:

     930,569,008 bytes allocated in the heap
   4,847,446,928 bytes copied during GC
      46,255,056 bytes maximum residency (111 sample(s))
         378,200 bytes maximum slop
              94 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1784 colls,     0 par    0.123s   0.100s     0.0001s    0.0006s
  Gen  1       111 colls,     0 par    9.377s   9.418s     0.0848s    0.1099s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    2.250s  ( 95.029s elapsed)
  GC      time    9.500s  (  9.518s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.007s  (  0.008s elapsed)
  Total   time   11.760s  (104.557s elapsed)

The detailed GC statistics of the idle GCs look like

    Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
    bytes     bytes     bytes   user   elap     user     elap
    13304  45958728  46210536  0.070  0.068    0.920    1.550    0    0  (Gen:  1)
    11568  45958720  46210536  0.100  0.102    1.043    2.584    0    0  (Gen:  1)
    15432  45958720  46210536  0.103  0.105    1.167    3.588    0    0  (Gen:  1)
    11568  45958720  46210536  0.100  0.102    1.287    4.586    0    0  (Gen:  1)
    11568  45958720  46210536  0.107  0.107    1.413    5.592    0    0  (Gen:  1)
    11568  45958720  46210536  0.073  0.080    1.503    6.566    0    0  (Gen:  1)
    11568  45958720  46210536  0.073  0.073    1.593    7.560    0    0  (Gen:  1)
    11568  45958720  46210536  0.067  0.068    1.677    8.557    0    0  (Gen:  1)

I don't really know what exactly "alloc bytes" means, but my guess is that those numbers are indeed fairly low and those GCs are mostly useless.

To clarify the intention of this request: Tweak the idle GC (default) configuration in some way so that interactive programs with moderate heap use reasonable amount of CPU time while idle.

And "reasonable" is intentionally vague as I don't know how to weight other usecases.

Minimal example:

import Control.Concurrent

main :: IO ()
main = do
  let large = [1..1000000]
  print $ length large
  [1..20] `forM_` \_ -> do
    threadDelay 400000
  print $ sum large

Change History (8)

comment:1 Changed 3 years ago by simonmar

I'm open to ideas here. The tradeoff is that without the idle GC you don't get any deadlock detection (BlockedOnDeadMVar and friends) when the process is idle, so provided you're OK with that then it's fine to use -I0.

comment:2 Changed 3 years ago by lspitzner

I think it would be a good approach to keep the current delay, but to insert some kind of conditional around actually running the full GC. This condition could take into account e.g.:

  • The time since the last (actually-performed) idle GC. This would reset if a non-idle full GC is performed.
  • The number of bytes collected by the last idle GC, perhaps in relation to the residency. (If you collect less than 10% of heap as garbage, it is probably not worth to run the GC too often).
  • The number of bytes allocated since the last full GC (idle or not).

I don't know if these statistics are available, but at least the first should be easy to implement.

comment:3 Changed 3 years ago by simonmar

The main purpose of the idle GC is not really to collect garbage, it's for deadlock detection. The program may have deadlocked even if it has only allocated a tiny amount since the previous GC, and we have no way of telling that.

On the other hand, if you want the idle GC for collecting garbage, and you don't mind about not having prompt deadlock detection, then yes having those heuristics would be a good idea. I'm just not sure how often this is the case.

comment:4 Changed 3 years ago by lspitzner

I just noticed that this means that you do not get deadlock detection if you forkIO (forever (threadDelay 100000)). That's interesting.

You are right, I was focused on GC performance instead of deadlock detection because the latter was not even mentioned when i chatted with rwbarton. Maybe he can comment on this.

Even for deadlock detection it makes sense to delay things based on the cost of the full GCs. (I'd rather have to wait 15secs to get the deadlock detected when I manage to create deadlocks than constantly be wasting CPU time - but if full GCs are cheap, I would not mind quicker detection.)

comment:5 Changed 3 years ago by rwbarton

FWIW I've never cared about deadlock detection, and it sounds unreliable in view of lspitzner's most recent comment. I do care that interactive programs I'm not actively using don't consume 12% of my CPU. Doing a GC when the user isn't interacting with the program sounds like a good idea to hide latency.

Judging from the detailed GC statistics lspitzner's program isn't totally idle; it must be doing a little bit of work every second, repeatedly triggering a new idle GC. But one would expect it to be cheap to do a little bit of work every second. The heap could be a lot bigger than 50 MB...

The user's guide suggests enabling the idle GC for interactive programs, but when there is a little work to do every second it seems basically unusable. And the effects of the idle GC would be lost completely by doing work every 0.2 seconds, which is strange.

comment:6 Changed 3 years ago by simonmar

I just noticed that this means that you do not get deadlock detection if you forkIO (forever (threadDelay 100000)). That's interesting.

Probably because it never runs a major GC, only minor GCs. Admittedly that's surprising and counter-intuitive.

I do care that interactive programs I'm not actively using don't consume 12% of my CPU.

I'm sympathetic to that. Like I said, I'm open to suggestions here. I do worry that not having deadlock detection by default will be surprising - it's quite useful when developing/debugging, for instance. We have lots of tests in testsuite/tests/concurrent that rely on it.

How about we give the idle GC a time budget expressed as a % of wall clock time?

comment:7 Changed 3 years ago by lspitzner

Firstly: I said "I just noticed that this means", i.e. this was a pure deduction, not an observation.

Apart from that, I never meant to suggest making -I0 the default, and I think the two goals (deadlock detection and latency reduction) do align here. So between options a) status quo b) default -I0 c) something more clever that both does deadlock detection and uses, lets say.. 3% idle cpu time, I'd choose c) over a) over b).

I am not sure how the budget would be enforced, but it sounds like a good idea, especially if that is easier to implement than the heuristics I mentioned above.

Another thought: If there was an upper bound on the interval between full GCs (idle or not) you'd get reliable deadlock detection even with "forever threaddelay" cases.

comment:8 Changed 3 years ago by rwbarton

How about we give the idle GC a time budget expressed as a % of wall clock time?

This sounds sensible, e.g., if the idle GC last ran for m milliseconds, wait (at least) 100*m milliseconds before running it again.

Another thought: If there was an upper bound on the interval between full GCs (idle or not) you'd get reliable deadlock detection even with "forever threaddelay" cases.

This makes sense, too. So the 100*m timer should be reset whenever we do a full GC. Then we probably no longer need the "idle" condition.

(But when the RTS is truly idle then we should not do another GC, of course. For example, currently when you leave ghci idle in a terminal it does one idle GC after (GHC's custom idle GC time of) 5 seconds, but then doesn't do another GC after 5 more seconds if you haven't interacted with it at all.)

Note: See TracTickets for help on using tickets.