
Detailed design rationale for the changes in GHC for emulating +RTS -s through eventlog events

The following subsections are titled after the GHC commits they describe. The remaining sections contain problem analysis and design drafts and are retained to show the motivation and the history of the changes.

Calculate the total memory allocated on a per-capability basis

This is in addition to the existing global method. For now we do it both ways and assert that they give the same grand total. At some stage we can simplify the global method to just take the sum of the per-cap counters.
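A minimal sketch of the scheme (hypothetical names and sizes; the real RTS fields and call sites differ in detail):

  #include <assert.h>
  #include <stdint.h>

  #define N_CAPS 4

  typedef struct {
      uint64_t total_allocated;        /* bytes allocated by this capability */
  } Capability;

  static Capability caps[N_CAPS];
  static uint64_t g_total_allocated;   /* the existing global counter */

  /* Bump both counters on each allocation accounting step. */
  static void count_allocation(Capability *cap, uint64_t bytes) {
      cap->total_allocated += bytes;
      g_total_allocated    += bytes;
  }

  /* At stats time, assert that the two methods agree. */
  static void check_totals(void) {
      uint64_t sum = 0;
      for (int i = 0; i < N_CAPS; i++) sum += caps[i].total_allocated;
      assert(sum == g_total_allocated);
  }

  int main(void) {
      count_allocation(&caps[0], 4096);
      count_allocation(&caps[1], 8192);
      check_totals();                  /* 12288 == 12288 */
      return 0;
  }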

Change the presentation of parallel GC work balance in +RTS -s

Also rename internal variables to make the names match what they hold. The parallel GC work balance is calculated using the total amount of memory copied by all GC threads and the maximum copied by any individual thread. GC is serial when the max equals the total copied, and perfectly balanced when total/max == n_caps.

Previously we presented this as the ratio total/max and told users that the serial value was 1 and the ideal value N, for N caps, e.g.

  Parallel GC work balance: 1.05 (4045071 / 3846774, ideal 2)

The downside of this is that the user always has to keep in mind the number of cores being used. Our new presentation uses a normalised 0--1 scale rendered as a percentage: 0% means completely serial and 100% means perfect balance, e.g.

  Parallel GC work balance: 4.56% (serial 0%, perfect 100%)
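The two endpoints above (total/max == 1 for serial, total/max == n_caps for perfect balance) pin down a linear normalisation; a sketch, assuming that linear mapping is the one used:

  #include <stdio.h>

  /* Map total/max == 1 to 0 (serial) and total/max == n_caps to 1 (perfect). */
  static double work_balance(double copied_total, double copied_max, int n_caps) {
      if (n_caps <= 1 || copied_max <= 0) return 0.0;   /* serial GC */
      return ((copied_total / copied_max) - 1.0) / (n_caps - 1);
  }

  int main(void) {
      /* The figures from the old-style example line above: prints 5.15%. */
      printf("%.2f%%\n", 100.0 * work_balance(4045071, 3846774, 2));
      return 0;
  }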

Add new eventlog events for various heap and GC statistics

They cover much the same info as is available via the GHC.Stats module or via the '+RTS -s' textual output, but via the eventlog and with a better sampling frequency.

We have three new generic heap info events and two very GHC-specific ones. (The hope is that the general ones are usable by other implementations that use the same eventlog system, and are in any case less sensitive to changes in GHC itself.)

The general ones are:

  • total heap mem allocated since prog start, on a per-HEC basis
  • current size of the heap (MBlocks reserved from OS for the heap)
  • current size of live data in the heap

Currently these are all emitted by GHC at GC time (live data only at major GC).

The GHC specific ones are:

  • an event giving various static heap parameters:
    • number of generations (usually 2)
    • maximum heap size, if any
    • nursery size
    • MBlock and block sizes
  • an event emitted on each GC containing:
    • GC generation (usually just 0,1)
    • total bytes copied
    • bytes lost to heap slop and fragmentation
    • the number of threads in the parallel GC (1 for serial)
    • the maximum number of bytes copied by any par GC thread
    • the total number of bytes copied by all par GC threads (these last three can be used to calculate an estimate of the work balance in parallel GCs)
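A rough sketch of the two decoded payloads, with assumed field names and widths (the authoritative layout is whatever EventLogFormat.h ends up specifying):

  #include <stdint.h>

  typedef struct {             /* static heap parameters event */
      uint32_t heap_capset;
      uint16_t gens;           /* number of generations (usually 2) */
      uint64_t max_heap_size;  /* 0 if no maximum is set */
      uint64_t nursery_size;
      uint32_t mblock_size;
      uint32_t block_size;
  } HeapInfoGHC;

  typedef struct {             /* per-GC statistics event */
      uint32_t heap_capset;
      uint16_t gen;            /* generation collected (usually 0 or 1) */
      uint64_t copied;         /* total bytes copied */
      uint64_t slop;           /* bytes lost to heap slop */
      uint64_t frag;           /* bytes lost to fragmentation */
      uint32_t par_n_threads;  /* 1 for serial GC */
      uint64_t par_max_copied; /* max bytes copied by any par GC thread */
      uint64_t par_tot_copied; /* total bytes copied by all par GC threads */
  } GCStatsGHC;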

Adjust the eventlog description header for the spark counter event

The EventLogFormat.h described the spark counter fields in a different order from the one in which GHC emits them (the GC'd and fizzled fields were reversed). At this stage it is easier to fix the ghc-events lib and to have GHC continue to emit them in the current order.

Move trace of cap delete from shutdownCapability to freeCapability

This will let us emit final per-cap trace events from stat_exit(). Otherwise we would end up with eventlogs containing events for caps that have already been deleted.

Emit final heap alloc events and rearrange code to calculate alloc totals

In stat_exit we want to emit a final EVENT_HEAP_ALLOCATED for each cap so that we get the same total allocation count as reported via +RTS -s. To do so we need to update the per-cap total_allocated counts.

Previously we had a single calcAllocated(rtsBool) function that counted the large allocations and optionally the nurseries for all caps. The GC would always call it with false, and the stat_exit always with true. The reason for these two modes is that the GC counts the nurseries via clearNurseries() (which also updates the per-cap total_allocated counts), so it's only the stat_exit() path that needs to count them.

We now split the calcAllocated() function into two: countLargeAllocated and updateNurseriesStats. As the name suggests, the latter now updates the per-cap total_allocated counts, in addition to returning a total.
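A loose sketch of the split (the two function names are from the commit; the bookkeeping variables and bodies here are hypothetical):

  #include <stdint.h>

  #define N_CAPS     4
  #define BLOCK_SIZE 4096

  typedef struct { uint64_t total_allocated; } Capability;

  static Capability caps[N_CAPS];
  static uint64_t large_alloc_blocks;           /* large allocs since last count */
  static uint64_t nursery_used_blocks[N_CAPS];  /* current nursery occupancy */

  /* Counted on both the GC and the stat_exit() paths. */
  static uint64_t countLargeAllocated(void) {
      uint64_t blocks = large_alloc_blocks;
      large_alloc_blocks = 0;
      return blocks * BLOCK_SIZE;
  }

  /* Only the stat_exit() path needs this: the GC path already counts the
   * nurseries via clearNurseries(). Updates the per-cap totals too. */
  static uint64_t updateNurseriesStats(void) {
      uint64_t total = 0;
      for (int i = 0; i < N_CAPS; i++) {
          uint64_t used = nursery_used_blocks[i] * BLOCK_SIZE;
          caps[i].total_allocated += used;
          total += used;
      }
      return total;
  }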

Fix the timestamps in GC_START and GC_END events on the GC-initiating cap

There was a discrepancy between GC times reported in +RTS -s and the timestamps of GC_START and GC_END events on the cap on which the +RTS -s stats for the given GC are based. This is fixed by posting the events with exactly the same timestamp as generated for the stat calculation. The calls posting the events are moved too, so that the events are emitted close to the time instant they claim to be emitted at. The GC_STATS_GHC event was also moved, ensuring it is emitted before the moved GC_END event on all caps, which simplifies tools code.

Add the GC_GLOBAL_SYNC event marking that all caps are stopped for GC

Quoting design rationale by dcoutts: The event indicates that we're doing a stop-the-world GC and all other HECs should be between their GC_START and GC_END events at that moment. We don't want to use GC_STATS_GHC for that, because GC_STATS_GHC is for extra GHC-specific info, not something we have to rely on to be able to match the GC pauses across HECs to a particular global GC.

The analysis of the semantics of +RTS -s

Here is a sample output of +RTS -s, annotated with a discussion of the new events required to simulate it in ThreadScope (for a user-selected time interval). A list of the new required events is in the second part of this page. Here is a screenshot of what we can already do using the current set of events. It so happens we can do as much for the whole runtime as for selected time intervals with the currently available events, but in general, intervals require more kinds of events and more samples. Similarly, when we visualize some of this as graphs, especially graphs of rates of change of some values (e.g., memory usage), more frequent sampling will be required.

The first line of +RTS -s follows.

     237,179,528 bytes allocated in the heap

We'd need an extra event, emitted at each GC, with the allocation since the previous GC. (We really don't want an event for every memory allocation; that would be impractical and very slow.)
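If instead the event carries a cumulative per-HEC total (as the EVENT_HEAP_ALLOCATED event described later on this page does), the allocation since the previous GC falls out as a difference of consecutive samples; a sketch with illustrative names:

  #include <stdint.h>
  #include <stdio.h>

  #define N_HECS 2

  static uint64_t last_sample[N_HECS];   /* previous cumulative sample per HEC */

  static uint64_t alloc_since_last(int hec, uint64_t alloc_bytes_now) {
      uint64_t delta = alloc_bytes_now - last_sample[hec];
      last_sample[hec] = alloc_bytes_now;
      return delta;
  }

  int main(void) {
      printf("%llu\n", (unsigned long long)alloc_since_last(0, 1000));  /* 1000 */
      printf("%llu\n", (unsigned long long)alloc_since_last(0, 2500));  /* 1500 */
      return 0;
  }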

      52,785,584 bytes copied during GC

An event with a summary of all copying done, emitted after the end of each GC.

      17,272,336 bytes maximum residency (5 sample(s))

A separate event for that, perhaps emitted only after major GC, when we know how much memory is really used by the program. The docs explain the "n sample(s)" above, saying residency is "only checked during major garbage collections".

       6,493,328 bytes maximum slop

We also need an extra event for slop, probably emitted rarely.

              45 MB total memory in use

The peak "total memory in use" to date is stored in the peak_mblocks_allocated global var. It changes often, but we can't spam too much, so let's emit it only after each GC, and not the peak value to date, but the current value.

                  (0 MB lost due to fragmentation)

Fragmentation is calculated in the RTS -s code as follows:

 (peak_mblocks_allocated * BLOCKS_PER_MBLOCK * BLOCK_SIZE_W - hw_alloc_blocks * BLOCK_SIZE_W) / (1024 * 1024 / sizeof(W_))

Note that it uses the peak and high-water values, whereas we want the current values. It's calculated as the difference of mblock and block allocations, so we need an extra event for allocated blocks (mblocks are already recorded in "total memory in use" above).
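A sketch of the current-value variant, assuming the usual GHC sizes of 1MB megablocks and 4KB blocks, and ignoring megablock/block headers (the real BLOCKS_PER_MBLOCK is slightly smaller because of them):

  #include <stdint.h>

  #define MBLOCK_SIZE       (1024 * 1024)
  #define BLOCK_SIZE        4096
  #define BLOCKS_PER_MBLOCK (MBLOCK_SIZE / BLOCK_SIZE)   /* headers ignored */

  /* Current fragmentation: memory held in megablocks but not covered by
   * any allocated block, using current rather than peak/high-water values. */
  static uint64_t current_fragmentation(uint64_t mblocks_allocated,
                                        uint64_t n_alloc_blocks) {
      return mblocks_allocated * BLOCKS_PER_MBLOCK * BLOCK_SIZE
           - n_alloc_blocks * BLOCK_SIZE;
  }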

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       448 colls,     0 par    0.04s    0.04s     0.0001s    0.0016s
  Gen  1         5 colls,     0 par    0.07s    0.07s     0.0149s    0.0386s

The current GC events (in particular RequestParGC) seem to be enough to distinguish between seq and par GC. We'd need to split the current GC events into generations, though, to report for every generation separately. We may end up with two tables for the same GC info: one aggregated by cap, another by generations. Or, as long as there are only 2 generations, one table with both caps and generations, with the following rows: cap0&gen0, cap0&gen1, cap1&gen0, etc. Note that we don't want to report the CPU time, only the elapsed time, and that's fine.

  Parallel GC work balance: 1.00 (6391526 / 6375794, ideal 2)

Let's ignore that one for now. JaffaCake says we probably don't care about work balance and that he thinks it is computed in the simplest way. Details are in http://community.haskell.org/~simonmar/papers/parallel-gc.pdf. The current computation seems to be the following: total words copied during parallel GCs divided by the sum over all parallel GCs of the maximal number of words copied by any thread in a given GC. Events needed: the events added above probably suffice, but quite a bit of extra state will have to be maintained when reading the events, due to the rather complex formula for work balance.
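The extra state amounts to two accumulators maintained while reading the events; a sketch of the reader-side computation just described (names illustrative):

  #include <stdint.h>

  typedef struct {
      uint64_t tot_copied;      /* sum over par GCs of total words copied */
      uint64_t tot_max_copied;  /* sum over par GCs of the per-GC maximum */
  } BalanceState;

  /* Fold in one parallel GC, given its totals from the GC-stats event. */
  static void on_par_gc(BalanceState *s, uint64_t copied, uint64_t max_copied) {
      s->tot_copied     += copied;
      s->tot_max_copied += max_copied;
  }

  /* Total copied / sum of per-GC maxima, as in the formula above. */
  static double overall_balance(const BalanceState *s) {
      return s->tot_max_copied
           ? (double)s->tot_copied / (double)s->tot_max_copied
           : 1.0;
  }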

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.04s    (  0.38s)       0.13s    (  0.17s)
  Task  1 (worker) :    0.00s    (  0.55s)       0.00s    (  0.00s)
  Task  2 (bound)  :    0.41s    (  0.41s)       0.13s    (  0.13s)
  Task  3 (worker) :    0.00s    (  0.55s)       0.00s    (  0.00s)

JaffaCake says the task information is of questionable usefulness, so let's ignore that one for now. It's much more natural for us to present the same info per cap, not per OS thread (which is basically what tasks are). Actually, we already present the GC info per cap (not only in total, as in +RTS -s), and the total activity time per cap (which includes the mutator time) is much better conveyed by the graphs in ThreadScope.

BTW, the time between the GCIdle and GCWork events is still counted as GC time, so we may ignore these events when calculating the time spent on GC. OTOH, a summary of the GCIdle times, per HEC, then the total, also as a percentage of all GC time, could be useful. Probably we can do that cheaply along the way, since we have to identify and sift out the GCIdle, GCDone and GCWork events anyway.

  SPARKS: 300001 (17739 converted, 282262 overflowed, 0 dud, 0 GC'd, 0 fizzled)

Tell JaffaCake that the example and description for the SPARKS count at

http://www.haskell.org/ghc/docs/latest/html/users_guide/runtime-control.html#rts-options-gc

needs updating (not sure for which GHC version, though). Otherwise, we have enough events for that (we calculate this using the SparkCounters events, but we could also use the precise per-spark events).

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.42s  (  0.41s elapsed)
  GC      time    0.29s  (  0.13s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.71s  (  0.55s elapsed)

(Note that there may be more times listed above, e.g., the time overhead of profiling.) We can sum up the GC time from GC events. We get the total of GC and MUT time (and PROF, etc.) as the time from the start of the first event to the (end of) the last event, so from the total and the GC time we can normally compute MUT. We can assume that INIT and EXIT are negligible (I wonder when they are not) and anyway they don't make sense when we summarize an interval. If we insist on them, a separate event for each would be required.
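A sketch of that derivation for an interval, assuming eventlog-style nanosecond timestamps and INIT/EXIT treated as negligible:

  #include <stdint.h>

  /* MUT = total elapsed in the interval minus GC elapsed in the interval. */
  static uint64_t mut_time_ns(uint64_t interval_start, uint64_t interval_end,
                              uint64_t gc_time_in_interval) {
      uint64_t total = interval_end - interval_start;
      return total > gc_time_in_interval ? total - gc_time_in_interval : 0;
  }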

  %GC     time      89.5%  (75.3% elapsed)

This line does not appear in threaded RTS, so we disregard it.

  Alloc rate    564,074,971 bytes per MUT second

The events added above should suffice, except that we use elapsed time, not CPU time.
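For an interval this reduces to the following (a sketch; elapsed-time based, per the note above):

  #include <stdint.h>

  /* Bytes allocated in the interval divided by elapsed MUT seconds. */
  static double alloc_rate(uint64_t alloc_bytes, uint64_t mut_elapsed_ns) {
      return mut_elapsed_ns
           ? (double)alloc_bytes * 1e9 / (double)mut_elapsed_ns
           : 0.0;
  }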

  Productivity  59.0% of total user, 76.6% of total elapsed

The events added above should be enough. Again, we only do the elapsed case, so we'd show elapsed/elapsed, while the figures above are cpu/cpu and cpu/elapsed. JaffaCake thinks the latter mixture is OK. However, it mixes the productivity of CPU mutation vs elapsed mutation with the productivity of mutation vs GC. In this light, our figure will not be that bad, because it's consistent, even if not as accurate as the equally consistent first figure above.
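The elapsed/elapsed figure is then simply (sketch):

  #include <stdint.h>

  /* Mutation time as a share of total elapsed time in the interval. */
  static double productivity(uint64_t mut_elapsed_ns, uint64_t total_elapsed_ns) {
      return total_elapsed_ns
           ? (double)mut_elapsed_ns / (double)total_elapsed_ns
           : 0.0;
  }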

BTW, the fact that the second figure is higher (I can reproduce it) points to a problem with the clocks or something similar. I'd guess the elapsed time should always be higher than the CPU time, shouldn't it?

The list of considered new events or new event parameters (not final)

There is a wealth of statistics around heaps and GC. Some of the stats are reasonably common, shared by different implementations, while many more are highly specific to a particular implementation. Even if we ignore RTSs other than GHC, we still have an issue of flexibility for future changes in GHC's RTS.

Our solution is to split the stats into two groups: a general group that makes sense in many implementations, and a second group that is highly GHC-specific. Analyses and visualisations based on the first group are likely to be portable across RTS instances and across changes in GHC's RTS. The second group is likely to have to change when GHC changes, but it at least contains the less frequently used info and does not need as much visualisation.

New general memory stats events

  • EVENT_HEAP_ALLOCATED (alloc_bytes): the total bytes allocated over the whole run by this HEC. That is, we count allocation on a per-HEC basis. This event is in principle not tied to GC; it could be emitted at any time.
  • EVENT_HEAP_SIZE (heap_capset, size_bytes): the current bytes allocated from the OS for the heap. For the current GHC RTS this is the MBlocks, kept in the mblocks_allocated var. Again, this could in principle be emitted at any time. Maximum accuracy would be to emit the event exactly when MBlocks are allocated or freed.
  • EVENT_HEAP_LIVE (heap_capset, live_bytes): the current amount of live/reachable data in the heap. This is almost certainly only known after a major GC.

New GHC-specific general memory stats events

  • EVENT_HEAP_INFO_GHC (heap_capset, gens, max_size, nursery_size): various static parameters relating to the heap. In particular, it tells us the number of generations.
  • EVENT_GC_STATS_GHC (heap_capset, gen, copied, slop, frag, was_par, max_copied, avg_copied): various less used GC stats. These are all GHC-specific, and specific to current GC design. It includes the generation of this GC so along with knowing the total number of generations we can identify major/minor GCs. We also include memory lost to slop and fragmentation. The final three are to do with parallel GC: the first is just a flag to indicate if this GC was a parallel GC, the ratio of the other two gives the parallel work balance (this ratio can be extended to multiple GCs to get an average figure).
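For example, pairing the two events lets a tool classify collections as major or minor (a sketch; generations assumed to be numbered from 0, as in the gen field above):

  #include <stdbool.h>
  #include <stdint.h>

  /* A collection of the oldest generation is a major GC. */
  static bool is_major_gc(uint16_t gen, uint16_t n_gens) {
      return gen == n_gens - 1;
  }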

Identifying heaps in eventlogs

In the above events, the "allocated since program start" is done per-HEC, but the others apply to the heap as a whole, not a particular HEC.

For completeness we explicitly identify the heap to which the events apply. (Remember that we can merge eventlogs from multiple processes, so there is already no truly global notion of a heap; implicitly it would be the single heap belonging to the HEC that emits the event. We would also have to assume that there is a single heap per OS process (we can already identify which HECs belong to the same OS process). Alternatively, we can explicitly identify heaps using the existing capset (capability set) mechanism.)

The GHC RTS already puts all of its HECs into a capset for the OS process. We can reuse this capset.

If in future we allow multiple independent heaps in the same OS process (e.g. separate RTS instances) then this scheme would let us cope by making a separate capset. Similarly it'd cope with implementations like GdH which use a global heap spanning multiple OS processes.

Tinkering

While we tinker with these events, we could try to ensure that RequestSeqGC and RequestParGC are emitted before their corresponding GC_START events, by using the following tip from JaffaCake: "actually you could try moving the interruptAllCapabilities() call in Schedule.c:1500 down below the traceEvent calls". Currently they can be emitted in arbitrary order, which complicates analysis, especially for selected intervals (and complicates, and weakens, the future validation of the events).