|Version 23 (modified by MikolajKonarski, 4 years ago)|
The analysis of the semantics of +RTS -s and the list of needed new events
Here is a sample output of +RTS -s, annotated with the new events required to simulate it in ThreadScope?. More concrete event proposals will follow after discussion. Note that eventually we'd like to generate such a summary for any user-selected time interval of runtime and we may need more or different events for this or we may choose to skip some information (for example the INIT or EXIT times do not make sense for an interval). Here is a screenshot of what we can already do using the current set of events. It happens we can do as much for the whole runtime as for the time intervals in this case.
The first line of +RTS -s follows.
237,179,528 bytes allocated in the heap
We'd need an event, emitted at each GC, with the allocation since the previous GC. (We really don't want an event for every memory allocation, that would be impractical and very slow.)
52,785,584 bytes copied during GC
An event with a summary of all copying done, emitted after the end of each GC.
17,272,336 bytes maximum residency (5 sample(s))
A separate event for that, perhaps emitted only after major GC when we know how much memory is really used by the program. The docs explain the "n samples" above saying "only checked during major garbage collections".
6,493,328 bytes maximum slop
We also need an extra event for slop, probably emitted rarely.
45 MB total memory in use
The "total memory in use" is stored in the mblocks_allocated global var. Check how often it changes and emit accordingly.
(0 MB lost due to fragmentation)
Fragmentation is calculated in the RTS -s code as follows:
(peak_mblocks_allocated * BLOCKS_PER_MBLOCK * BLOCK_SIZE_W - hw_alloc_blocks * BLOCK_SIZE_W) / (1024 * 1024 / sizeof(W_))
so it's a difference between total memory allocated from the OS (peak_mblocks_allocated), and total memory in use by the GC and other parts of the RTS (hw_alloc_blocks). Presumably, all the events needed so far are of the latter kind, so the former may need a new event.
Tot time (elapsed) Avg pause Max pause Gen 0 448 colls, 0 par 0.04s 0.04s 0.0001s 0.0016s Gen 1 5 colls, 0 par 0.07s 0.07s 0.0149s 0.0386s
The current GC events (in particular RequestParGC) seem to be enough to distinguish between seq and par GC. We'd need to split the current GC events into generations, though, to report for every generation separately. Note that we don't want to report the CPU time, only the elapsed time, and that's fine.
Parallel GC work balance: 1.00 (6391526 / 6375794, ideal 2)
This is computed in a quite convoluted way in the +RTS -s code. Ask JaffaCake? if it could be simplified. The current computation seems to be the following: total words copied during parallel GCs divided by the average over all parallel GCs of maximal number of words copied by any thread in a single par GC. Events needed: the events added above probably suffice, but quite a bit of extra state will have to be maintained when reading the events, especially if we want to stick to the current, complex computation.
MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 0.04s ( 0.38s) 0.13s ( 0.17s) Task 1 (worker) : 0.00s ( 0.55s) 0.00s ( 0.00s) Task 2 (bound) : 0.41s ( 0.41s) 0.13s ( 0.13s) Task 3 (worker) : 0.00s ( 0.55s) 0.00s ( 0.00s)
Ask JaffaCake? how the "tasks" relate to the "threads" for which we generate events. For now, to the existing GC events we can add the info about which task does the job, but we may miss something this way. BTW, ask JaffaCake? if the time between event GCIdle and GCWork is included in the GC time of the task or not. Generally, is GCIdle useful for us here in any way? Similarly, can we calculate the MUT (elapsed) times just but taking the total time a thread (or cap?) is run and subtracting the GC time and any pauses (and are all pauses visible through events we have already?).
SPARKS: 300001 (17739 converted, 282262 overflowed, 0 dud, 0 GC'd, 0 fizzled)
Tell JaffaCake? that the example and description for the SPARKS count at
needs updating (not sure for which GHC version, though). Otherwise, we have enough events for that (we calculate this using the SparkCounters? events, but we could also use the precise per-spark events).
INIT time 0.00s ( 0.00s elapsed) MUT time 0.42s ( 0.41s elapsed) GC time 0.29s ( 0.13s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.71s ( 0.55s elapsed)
(Note that there may be more times listed above, e.g., the time overhead of profiling.) We can sum up the GC time from GC events. We'd also like to have the MUT figure, but it's not obvious if we can manage to get it from all the thread (task) events that we have (or add above). It's also not clear if adding events needed to get the other times is worth it. After any extra events are added, let's see if we can get any more of these summary times, perhaps by adding a minor event emitted just once. Note that the INIT time is necessary for the Productivity figure below (INIT does not count as "productive time" in +RTS -s)..
%GC time 89.5% (75.3% elapsed)
This line does not appear in threaded RTS, so we disregard it.
Alloc rate 564,074,971 bytes per MUT second
The events added above should suffice, except that we use elapsed time, not CPU time.
Productivity 59.0% of total user, 76.6% of total elapsed
The events added above should be enough. Again. we only do the elapsed case. Ask JaffaCake? if it's OK that the "% of total elapsed" in +RTS -s is actually the CPU time divided by elapsed time, not elapsed divided by elapsed.