|Version 30 (modified by MikolajKonarski, 2 years ago)|
The analysis of the semantics of +RTS -s
Here is a sample output of +RTS -s, annotated with a discussion of new events required to simulate it in ThreadScope? (for a user-selected time interval). A list of the new required events is in the second part of this page. Here is a screenshot of what we can already do using the current set of events. It so happens we can do as much for the whole runtime as for the selected time intervals with the currently available events, but in general, intervals require more kinds of events and more samples. Similarly, when we visualize some of this as graphs and especially graphs of rates of change of some values (e.g., memory usage), more frequent sampling will be required.
The first line of +RTS -s follows.
237,179,528 bytes allocated in the heap
We'd need an extra event, emitted at each GC, with the allocation since the previous GC. (We really don't want an event for every memory allocation, that would be impractical and very slow.)
52,785,584 bytes copied during GC
An event with a summary of all copying done, emitted after the end of each GC.
17,272,336 bytes maximum residency (5 sample(s))
A separate event for that, perhaps emitted only after major GC when we know how much memory is really used by the program. The docs explain the "n samples" above saying "only checked during major garbage collections".
6,493,328 bytes maximum slop
We also need an extra event for slop, probably emitted rarely.
45 MB total memory in use
The peak "total memory in use" to date is stored in the peak_mblocks_allocated global var. It changes often, but we can't spam too much, so let's emit it only after each GC, and not the peak value to date, but the current value.
(0 MB lost due to fragmentation)
Fragmentation is calculated in the RTS -s code as follows:
(peak_mblocks_allocated * BLOCKS_PER_MBLOCK * BLOCK_SIZE_W - hw_alloc_blocks * BLOCK_SIZE_W) / (1024 * 1024 / sizeof(W_))
Note that it uses the peak and high-water values and we instead want the current values. It's calculated as the difference of mblock and block allocations, so we need an extra events for allocated block (mblocks are already recorded in "total memory in use" above).
Tot time (elapsed) Avg pause Max pause Gen 0 448 colls, 0 par 0.04s 0.04s 0.0001s 0.0016s Gen 1 5 colls, 0 par 0.07s 0.07s 0.0149s 0.0386s
The current GC events (in particular RequestParGC) seem to be enough to distinguish between seq and par GC. We'd need to split the current GC events into generations, though, to report for every generation separately. We may and up with two tables for the same GC info: one aggregated by cap, another by generations. Or, as long as there are only 2 generations, one table with both caps and generations, with the following rows: cap0&gen0, cap0&gen1, cap1&gen0, etc. Note that we don't want to report the CPU time, only the elapsed time, and that's fine.
Parallel GC work balance: 1.00 (6391526 / 6375794, ideal 2)
Let's ignore that one for now. JaffaCake? says we probably don't care about work balance and that he thinks it is computed in the simplest way. Detail are in http://community.haskell.org/~simonmar/papers/parallel-gc.pdf. The current computation seems to be the following: total words copied during parallel GCs divided by the average over all parallel GCs of maximal number of words copied by any thread in a single par GC. Events needed: the events added above probably suffice, but quite a bit of extra state will have to be maintained when reading the events due to the rather complex formula for work balance.
MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 0.04s ( 0.38s) 0.13s ( 0.17s) Task 1 (worker) : 0.00s ( 0.55s) 0.00s ( 0.00s) Task 2 (bound) : 0.41s ( 0.41s) 0.13s ( 0.13s) Task 3 (worker) : 0.00s ( 0.55s) 0.00s ( 0.00s)
JaffaCake? says the task information has questionable usefulness, so let's ignore that one for now. It's much more natural for us to present the same info per cap, not per OS thread (which the tasks basically are). Actually we do present the GC info per cap (not only total, as in +RTS -s) already and the total activity time per cap (which includes the mutator time) is much better conveyed by the graphs in ThreadScope?.
BTW, the time between events GCIdle and GCWork is still counted as GC time, so we may ignore the events for calculating the times spent on GC. OTOH, a summary of the GCIdle times, per hec, then the total, also as the percentage of all GC time could be useful. Probably we can do that cheaply along the way since we have to identify and sift out the GCIdle, GCDone and GCWork events anyway.
SPARKS: 300001 (17739 converted, 282262 overflowed, 0 dud, 0 GC'd, 0 fizzled)
Tell JaffaCake? that the example and description for the SPARKS count at
needs updating (not sure for which GHC version, though). Otherwise, we have enough events for that (we calculate this using the SparkCounters? events, but we could also use the precise per-spark events).
INIT time 0.00s ( 0.00s elapsed) MUT time 0.42s ( 0.41s elapsed) GC time 0.29s ( 0.13s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.71s ( 0.55s elapsed)
(Note that there may be more times listed above, e.g., the time overhead of profiling.) We can sum up the GC time from GC events. We get the total of GC and MUT time (and PROF, etc.) as the time from the start of the first event to the (end of) the last event, so from the total and the GC time we can normally compute MUT. We can assume that INIT and EXIT are negligible (I wonder when they are not) and anyway they don't make sense for when we summarize an interval. If we insist on them, a separate event for each would be required.
%GC time 89.5% (75.3% elapsed)
This line does not appear in threaded RTS, so we disregard it.
Alloc rate 564,074,971 bytes per MUT second
The events added above should suffice, except that we use elapsed time, not CPU time.
Productivity 59.0% of total user, 76.6% of total elapsed
The events added above should be enough. Again. we only do the elapsed case, so we'd show elapsed/elapsed, while the figures above are cpu/cpu and cpu/elapsed. JaffCake? thinks the latter mixture is OK. However, it mixes the productivity of CPU mutation vs elapsed mutation with the productivity of mutation vs GC. In this light, out figure will not be that bad, because it's consistent, even if not as accurate as the equally consistent first figure above.
BTW, the fact that the second figure is higher (I can reproduce it), shows a problem with clocks or some other problem. I'd guess the elapsed time should always be higher than the CPU time, shouldn't it?
The list of needed new events or even parameters
A rough proposal, in particular the names are ad hoc and the units are provisional (e.g., words or blocks would be more natural for memory events, but TS does not know their sizes).
New memory stats events
* MEM_ALLOCATED(number_of_bytes) emitted after (or just before?) each GC, with the number of bytes allocated in the heap since the previous GC, or a running total
* MEM_COPIED(number_of_bytes) emitted after each GC, with the number of bytes copied during that GC (this one could also be counted as a GC_ event, not MEM_ event, because it's only about the particular memory operations of a copying GC)
* MEM_RESIDENCY(number_of_bytes) emitted after each major GC (too rarely, but after minor GC the figure is too inaccurate, probably), with the total number of bytes of memory actually used (as opposed to allocated from the OS) by the program
* MEM_SLOP(number_of_bytes) emitted after each GC, with the current slop in bytes
* MEM_TOTAL(number_of_bytes) emitted after each GC, with the total number of mblocks allocated from the OS
* MEM_TOTAL_BLOCK(number_of_bytes) emitted after each GC, with the total memory of blocks allocated inside the mblocks allocated from the OS, used to calculated the memory lost due to fragmentation of mblocks
New parameters for GC stats events
Extra "generation" parameter is needed for one of GC_START or GC_END. If we have the extra parameter in both, we can do more with partial eventlogs that lack starts of ends of some GCs, and it's easier to calculate stats for selected time intervals. When the RequestSeqGC and RequestParGC events are emitted, it's not yet know if the GC will be major or minor, so no extra parameters should be added to them.
While we tinker with these events, we could try to ensure that RequestSeqGC and RequestParGC are emitted before their corresponding GC_START events, by using the following tip from JaffaCake?: "actually you could try moving the interruptAllCapabilities() call in Schedule.c:1500 down below the traceEvent calls". Currently they can be emitted in arbitrary order and it complicates analysis, especially for selected intervals (and complicates and makes less strict the future validation of the events).