Changes between Version 40 and Version 41 of RTSsummaryEvents

Timestamp: 04/06/12 23:12:08
Author: MikolajKonarski (IP: 95.160.111.162)
Comment: Move the IRC logs about the final solution to the end

because while they are needed for GC, their cause is (slow)
OS scheduling, just as for any other delays when switching threads.
The subsections below discuss in detail the problems with the original
GC event order and the design choices initially considered.

==== Summary of the problems ====

The main issue is how we measure the time spent doing GC. The +RTS -s
measurement is taken from stat_startGC() to stat_endGC() on the
"primary" GC thread. In ThreadScope the main GC events we have access to
are GC_START and GC_END on each capability.

It turns out that these two measurements often differ quite
substantially. In particular, there can often be a significant delay between
the point at which +RTS -s takes its measurement and the point at which
the other capabilities start the mutator
again (and hence emit GC_END). Our guess is that this is because the
other caps are suspended at the end of the GC while the primary GC
thread does its stats etc., and when it releases them it can take a
while for the other caps to be rescheduled by the OS.
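
To make the contrast concrete, here is a minimal sketch of the two
measurements. The CapPause type and the timestamp representation are
invented for illustration; this is not the real ghc-events API.

{{{
-- Contrast of the two measurements for a single GC (a sketch; the
-- CapPause type is invented for illustration, not the ghc-events API).
data CapPause = CapPause { capId :: Int, gcStartT :: Int, gcEndT :: Int }

-- +RTS -s measures only on the "primary" GC thread, roughly from its
-- GC_START to the "GC stats" event:
rtsSElapsed :: Int -> Int -> Int
rtsSElapsed primaryGcStart statsEventTime = statsEventTime - primaryGcStart

-- ThreadScope can instead average, over all caps, the time each cap's
-- mutator was actually stopped (GC_END minus GC_START); a late GC_END
-- on any cap inflates this figure relative to the +RTS -s one:
tsElapsed :: [CapPause] -> Double
tsElapsed ps =
  sum [ fromIntegral (gcEndT p - gcStartT p) | p <- ps ]
    / fromIntegral (length ps)
}}}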

In this example, GC_END on one of the caps comes very late:

{{{
9378252000: cap 1: GC stats for heap capset 0: generation 0, 45160
bytes copied, 58296 bytes slop, 1757184 bytes fragmentation, 2 par
threads, 45080 bytes max par copied, 45112 bytes total par copied
9378253000: cap 1: finished GC
9378254000: cap 1: spark stats: 1913071 created, 1229807 converted,
221 remaining (0 overflowed, 0 dud, 58923 GC'd, 144981 fizzled)
9378255000: cap 1: waking up thread 3 on cap 1
9378257000: cap 1: running thread 2885
9378259000: cap 1: stopping thread 2885 (thread finished)
9378268000: cap 1: running thread 3
9378354000: cap 1: stopping thread 3 (blocked on black hole owned by
thread 2884)
9378355000: cap 1: creating thread 2886
9378363000: cap 0: finished GC
}}}

From the user's point of view it is perhaps more accurate to look at
GC_START/GC_END, because that's when the caps really get back to the
mutator.

This difference between the GC and mutator measurements affects several other
statistics, and the discrepancy vs +RTS -s can build up to be quite large
on larger eventlogs.

Additionally, if we only look at GC_START/GC_END then it's not
entirely clear what the average or maximum pause time should mean, since
we have multiple capabilities. Is the max the max for any individual
cap, or from when the first cap emits GC_START 'til when the last cap
emits GC_END?

Which brings up another issue: some of the GC statistics rely on a GC
being global across all capabilities, but strictly speaking the current
GC events do not let us associate a GC_START/GC_END on one cap with that on
another. One core can issue a new REQUEST_SEQ_GC
before another has issued its GC_END from the previous GC. Add in the
fact that not all cores have to take part in a GC, and it's tricky to
associate the GCs on each core with a global GC.

We have added a GC_GLOBAL_SYNC event which is emitted strictly between
GC_START and GC_END on all cores and is used for associating GC across all cores.
If we could calculate stats in such a way that we do not
need to associate GC across
all cores then that would be better (and more flexible, e.g. to work with
your local GC branch).
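
Here is a minimal sketch of how GC_GLOBAL_SYNC can be used to group
per-cap pauses into global GCs. The Ev type is a simplified stand-in
invented for this example, not the real ghc-events types.

{{{
import Data.List (sortOn)
import qualified Data.Map.Strict as M

-- Simplified stand-ins for eventlog events (invented for this sketch).
data EvKind = GcStart | GcGlobalSync | GcEnd deriving (Eq, Show)
data Ev = Ev { evTime :: Int, evCap :: Int, evKind :: EvKind }

-- Walk the merged, time-sorted stream, counting GC_GLOBAL_SYNC events.
-- Because the sync is emitted strictly between GC_START and GC_END on
-- every participating cap, a cap's GC_END seen after n syncs have
-- occurred belongs to global GC number n - 1.
-- Result: for each global GC index, the (start, end) pause per cap.
globalGCs :: [Ev] -> M.Map Int (M.Map Int (Int, Int))
globalGCs = go 0 M.empty M.empty . sortOn evTime
  where
    go _ _    acc []                = acc
    go n open acc (Ev t c k : rest) = case k of
      GcGlobalSync -> go (n + 1) open acc rest
      GcStart      -> go n (M.insert c t open) acc rest
      GcEnd        -> case M.lookup c open of
        Just t0 -> go n (M.delete c open)
                        (M.insertWith M.union (n - 1)
                                      (M.singleton c (t0, t)) acc) rest
        Nothing -> go n open acc rest  -- unmatched GC_END: skip it
}}}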

==== An implemented partial solution for RequestParGC ====

We can try to ensure that RequestSeqGC and RequestParGC are emitted before their corresponding
GC_START events, by using the following tip from JaffaCake:
"actually you could try moving the interruptAllCapabilities() call in Schedule.c:1500 down below the traceEvent calls".
However, even with this implemented in GHC HEAD, the events can still
be emitted in arbitrary order, which complicates analysis,
especially for selected intervals (and it complicates the future
validation of events and makes it less strict).

Here are the results of implementing it:

{{{
18:13 < mikolaj> dcoutts, JaffaCake: I've tested the "moving the interruptAllCapabilities()
call below the traceEvent calls" tweak and it helps in 99-ish % of cases
18:13 < mikolaj> that is, the order of GC events is much, much better now
18:14 < mikolaj> but there are still very rare cases, especially when the par GC request was
from one HEC, and another, busy HEC is too slow to report finishing a previous GC
18:15 < mikolaj> so it does not help that much in validating eventlogs, but makes them
so much more readable, if a user wants to inspect the event messages one by one
}}}

==== More example eventlog fragments ====

The examples of event order issues below are from a system with 2 caps, on an eventlog
with lots of small GCs. They show how GC times measured by "+RTS -s",
that is from GC_START to the "GC stats" event, differ from the TS (ThreadScope) times.
The TS times are the average over all caps of the pause
between GC_START, when the caps stop mutating,
and GC_END, when they resume mutating. Most of the difference
probably stems from the thread sleep/wake delays, some end-of-GC bookkeeping
(including the time taken to emit GC and other events)
and related overheads.

It seems the TS times are more useful to the programmer, who is mostly
concerned with how much time is spent mutating and not with how the wasted time
is divided between the actual GC and the cap switching overhead.
Moreover, "+RTS -s" just adds the cap switching overhead to the MUT time,
hiding the fact that it's caused by GC.

The differences accumulate and affect the "GC total elapsed time",
"Avg pause" and "Max pause" columns of the GC table in "+RTS -s".
They also influence the "Task" table (not reproduced in TS).
They add up in the total "GC time (elapsed)" and consequently in
"MUT time (elapsed)", which is obtained by subtracting GC time from total time.
As a result, they influence "Alloc rate" and "Productivity", too.
The figures differ seriously between TS and "+RTS -s",
except on small eventlogs.
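
To spell out the dependency, the derived figures follow from the GC
elapsed time by simple arithmetic, so any discrepancy in the GC
measurement propagates. This is a sketch of the relationships described
above, not GHC's actual code:

{{{
-- How the derived +RTS -s figures depend on the measured GC elapsed
-- time (a sketch of the arithmetic, not GHC's actual code).
mutElapsed :: Double -> Double -> Double
mutElapsed totalElapsed gcElapsed = totalElapsed - gcElapsed

productivity :: Double -> Double -> Double
productivity totalElapsed gcElapsed =
  mutElapsed totalElapsed gcElapsed / totalElapsed

-- +RTS -s relates allocation to MUT time; elapsed time is used here
-- for simplicity.
allocRate :: Double -> Double -> Double -> Double
allocRate bytesAllocated totalElapsed gcElapsed =
  bytesAllocated / mutElapsed totalElapsed gcElapsed
}}}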

It seems that for all these figures the TS version is the right one.
In particular, if by "pause" we mean the time between
GC_START and GC_END, then by "Max pause" we mean neither
the "maximum over all GCs of (minimal pause over all caps)",
which is similar to +RTS -s (though not exactly the same, because
even the earliest GC_END usually comes a tick after "GC stats"),
nor the "maximum over all GCs of (maximal pause over all caps)", which
we could easily show in TS and which could make sense for a real-time
system with specialized, non-interchangeable caps,
but the "maximum over all GCs of (average pause over all caps)".
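
The three candidate definitions can be written down directly. A small
sketch, assuming the per-GC, per-cap pause lengths (e.g. as could be
extracted by something like globalGCs above) are already available:

{{{
-- The three candidate "Max pause" definitions, given one inner list of
-- per-cap pause lengths per global GC (representation invented here).
avgPause :: [Double] -> Double
avgPause xs = sum xs / fromIntegral (length xs)

maxOfMin, maxOfMax, maxOfAvg :: [[Double]] -> Double
maxOfMin = maximum . map minimum   -- roughly what +RTS -s reports
maxOfMax = maximum . map maximum   -- worst single cap per GC
maxOfAvg = maximum . map avgPause  -- the definition argued for here
}}}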

Here's an example demonstrating a GC_END
for the previous GC following the REQUEST_PAR_GC for the next GC on another cap.
Note that the early GC_END on cap 1 still comes 1000 nanoseconds after
the "GC stats" event. Even such small differences accumulate over all GCs:

{{{
9366536000: cap 1: GC stats for heap capset 0: generation 0, 77552
bytes copied, 226056 bytes slop, 696320 bytes fragmentation, 2 par
threads, 71736 bytes max par copied, 77520 bytes total par copied
9366537000: cap 1: finished GC
9366538000: cap 1: spark stats: 1910358 created, 1228256 converted,
1000 remaining (0 overflowed, 0 dud, 58306 GC'd, 144981 fizzled)
9366541000: cap 1: running thread 3
9367310000: cap 1: stopping thread 3 (heap overflow)
9367312000: cap 1: requesting parallel GC
9367316000: cap 0: finished GC
9367317000: cap 0: spark stats: 1071418 created, 1292489 converted, 0
remaining (0 overflowed, 0 dud, 41840 GC'd, 214904 fizzled)
9367318000: cap 0: starting GC
9368131000: cap 1: starting GC
}}}

Another example; this time there is a different event
("allocated on heap capset") on the late cap, at the same time
as "GC stats" on another cap and before its GC_END finally arrives.
Not sure how this can happen. The GC_END on the early cap comes here
a bit later than usual (2000 nanoseconds after the "GC stats" event), too:

{{{
836505000: cap 0: GC stats for heap capset 0: generation 0, 11584
bytes copied, 205232 bytes slop, 1007616 bytes fragmentation, 2 par
threads, 11472 bytes max par copied, 11536 bytes total par copied
836505000: cap 1: allocated on heap capset 0: 180717856 total bytes till now
836507000: cap 0: finished GC
836507000: cap 0: spark stats: 5893 created, 114894 converted, 228
remaining (0 overflowed, 0 dud, 13 GC'd, 22190 fizzled)
836509000: cap 0: waking up thread 3 on cap 0
836511000: cap 0: running thread 233
836514000: cap 0: stopping thread 233 (thread finished)
836521000: cap 1: finished GC
}}}

Finally, an example where GC_END is quite late on both caps
(3000 and 6000 nanoseconds after the "GC stats" event):

{{{
67582000: cap 1: GC stats for heap capset 0: generation 1, 58248 bytes
copied, 40096 bytes slop, 1912832 bytes fragmentation, 2 par threads,
51920 bytes max par copied, 58208 bytes total par copied
67583000: cap 1: live data in heap capset 0: 82784 bytes
67585000: cap 1: finished GC
67585000: cap 1: spark stats: 15101 created, 2570 converted, 508
remaining (0 overflowed, 0 dud, 879 GC'd, 12 fizzled)
67588000: cap 0: finished GC
}}}

==== Discussions that led to the implemented GC_START/GC_END/GC_GLOBAL_SYNC solution ====

Here are two IRC logs with the discussion that led to this solution.

{{{
[...]
}}}

Below is the second log.

{{{
12:38 < dcoutts> JaffaCake: we didn't plan on adding a per-HEC GC stat event, do you think that would be useful? what info would we get?
[...]
13:04 < mikolaj_> that's why I asked about moving GC_END, but perhaps wait until dcoutts write the paper and you finish overhauling stat gathering
}}}