Changes between Version 37 and Version 38 of RTSsummaryEvents

Show
Ignore:
Timestamp:
04/06/12 22:49:46 (14 months ago)
Author:
MikolajKonarski (IP: 95.160.111.162)
Comment:

Add some IRC logs from #ghc; perhaps that'd be enough docs for the patchset

Legend:

Unmodified
Added
Removed
Modified
  • RTSsummaryEvents

    v37 v38  
    299299If in future we allow multiple independent heaps in the same OS process (e.g. separate RTS instances) then this scheme would let us cope by making a separate capset. Similarly it'd cope with implementations like GdH which use a global heap spanning multiple OS processes. 
    300300 
    301 == Problems with the order of event in the eventlog == 
     301== Problems with the order of events in the eventlog == 
    302302 
    303303In the implemented version, we sidestep the issues with event order 
     
    311311Below is a detailed discussion of the original GC event order 
    312312and initial design choices considered. 
     313 
     314Here are two IRC logs with the discussion that led 
     315to the GC_START/GC_END/GC_GLOBAL_SYNC solution. 
     316There are a few more long IRC logs related to that, 
     317most of them on #ghc. 
     318 
     319{{{ 
     32017:19 < dcoutts> and yes, mikolaj is right, I did mean to ask if it wouldn't make sense to include sequential GCs when calculating the parallel GC work balance 
     32117:19 < dcoutts> I understand from your pov as implementer that you don't care, but doesn't it make sense from the user's pov 
     32217:20 < dcoutts> that is, over a whole prog run, we sum up the total bytes copied and the max bytes copied by any individual GC thread 
     32317:20 < dcoutts> +RTS -s only includes parallel GCs for this 
     32417:24 < mikolaj_> yeah, more generally, the caps can be added and deleted during runtime, so seq GCs are just a special case, when only one cap runs at all, or only one works in this GC 
     32517:25 < mikolaj_> which leads to a related quesion, which number of caps take as the divisor in "Parallel GC work balance" 
     32617:27 < dcoutts> mikolaj_: yeah I was thinking about that. If we do want to take into account changing numbers of caps then we might want to divide by the active number of caps at each point, and combine the results 
     32717:32 < JaffaCake> right, I'm not sure what to do about the "GC_END lag" 
     32817:33 < JaffaCake> if the thread is descheduled at the end of GC, is it still in the GC or not? 
     32917:33 < dcoutts> JaffaCake: from the user's pov I'd say yes 
     33017:33 < dcoutts> from the GC implementer's pov, perhaps not 
     33117:33 < JaffaCake> but if you have some threads in GC and some not, it makes no sense to ask what the wall-clock time spent in GC is 
     33217:33 < dcoutts> that's true 
     33317:34 < JaffaCake> this is the problem I had in the local GC, it doesn't measure wall-clock time spent in local GC 
     33417:34 < dcoutts> well, unless you take the sum over all hecs 
     33517:34 < dcoutts> right, what will wall clock time mean for local per-hec GC 
     33617:34 < JaffaCake> (and BTW, CPU time in parallel GC is next to useless, because a lot of cycles are spent spinning) 
     33717:35 < dcoutts> JaffaCake: that's also why I'd ideally like to calculate these things without having to associate GC events across all HECs 
     33817:35 < dcoutts> since that would no longer make sense for local per-HEC GC 
     33917:36 < JaffaCake> so just use the single GC_STATS event? 
     34017:36 < mikolaj_> neither will maxPause make sense for per-HEC GC 
     34117:36 < JaffaCake> right 
     34217:36 < dcoutts> mikolaj: unless it's the max pause on any HEC 
     34317:37 < dcoutts> rather than trying to measure for each GC, time between first HEC starting GC and last finishing 
     34417:37 < JaffaCake> hmm, that might be useful 
     34517:37 < dcoutts> "for any HEC, what was the longest time it paused" 
     34617:38 < JaffaCake> I think that might give artificially long pauses 
     34717:38 < dcoutts> why do you say it's artificial? 
     34817:38 < dcoutts> because you're blaming the OS? :-) 
     34917:39 < JaffaCake> if a HEC has nothing to do, then if it gets delayed a lot after GC that's not really a pause 
     35017:39 < JaffaCake> and yes, there's a question about whether OS-induced pauses count 
     35117:40 < JaffaCake> for me, I'd rather they weren't included 
     35217:40 < dcoutts> JaffaCake: hmm, you mean it can block for a long time, not emit a GC_END and then block due to being idle? 
     35317:40 < JaffaCake> yes 
     35417:40 < JaffaCake> but, I'm not sure why HECs would get descheduled at this point, unless it's just unlucky 
     35517:41 < JaffaCake> because we use spinlocks rather than mutexes 
     35617:41 < dcoutts> JaffaCake: for the "end of GC" sync point you mean? 
     35717:41 < JaffaCake> I suppose if there's other processes and we yield() 
     35817:41 < JaffaCake> yes 
     35917:42 < JaffaCake> so is there a problem with just taking the stats from the GC_STATS event? 
     36017:42 < JaffaCake> then you'll match what GHC does 
     36117:42 < dcoutts> JaffaCake: well two things, one is if it's really measuring what we want 
     36217:43 < dcoutts> and the other is that it is quite ghc specific, it'd be nicer to calculate these things using more general events 
     36317:43 < JaffaCake> why might it not be? 
     36417:44 < dcoutts> JaffaCake: because from the user's pov, the GC vs mutator ratio is really mutator vs other, they don't care so much who's fault it is GC or OS 
     36517:44 < JaffaCake> the problem with trying to be more general is you have to cope with GC that might be concurrent, and then various things don't make sense (e.g. wall-clock time) 
     36617:44 < dcoutts> right yes, I'd like to distinguish which stats make sense generally 
     36717:45 < dcoutts> and which can only make sense for stop-the-world 
     36817:46 < JaffaCake> but if you have mutator overlapping with GC, then you can't ask how much wall-clock time was spent in either 
     36917:46 < JaffaCake> I suppose you could give an average... 
     37017:46 < dcoutts> right, only elapsed time * ncpus 
     37117:47 < JaffaCake> but I don't think this is the most important issue, right? 
     37217:47 < JaffaCake> we don't have any other compilers or GC architectures on the radar 
     37317:47 < JaffaCake> if and when we do, then we could figure out what to do 
     37417:47 < JaffaCake> it's hard to generalise from one example 
     37517:48 < dcoutts> JaffaCake: true, we can just try and match exactly what +RTS -s does, and use the GC_STATS_GHC event 
     37617:48 < JaffaCake> that would be fine by me 
     37717:48 < dcoutts> I'm not convinced it's the "right thing" though it's the expedient thing 
     37817:49 < JaffaCake> I agree, but sometimes worse is better :) 
     37917:49 < dcoutts> mikolaj_: suppose we try to match exactly what +RTS -s gets, do we have any other remaining problems? 
     38017:49 < dcoutts> we can match GCs across all HECs by using the GC_STATS event 
     38117:49 < mikolaj_> "Parallel GC work balance" 
     38217:49 < JaffaCake> is there a problem with that? 
     38317:50 < mikolaj_> dcoutts: yes, we can and we'd need to recod the GC time in GC_STATS, instead of using GC_START and GC_END 
     38417:50 < dcoutts> mikolaj_: so we can get the same number as +RTS -s if we filter out the samples where n_gc_threads = 1 
     38517:50 < mikolaj_> recod/record 
     38617:50 < dcoutts> mikolaj: ah yes, we would 
     38717:50 < mikolaj_> dcoutts: oh, yes, if that's the general rule that we just stick to RTS -s then yes 
     38817:51 < dcoutts> JaffaCake: right we'd have to ignore GC_START/END completely really 
     38917:51 < dcoutts> and put GC elapsed time into the GC_STATS event 
     39017:52 < dcoutts> since even if we look at the GC_START/END for the "primary" GC thread that emits GC_STATS, then the times are different than that HEC's GC_START/END 
     39117:52 < JaffaCake> you could use GC_START/GC_END for just one HEC - the one that emitted the GC_STATS event (but that seems a bit horrible) 
     39217:52 < dcoutts> JaffaCake: right, except it will give us different numbers 
     39317:52 < JaffaCake> oh, does it? 
     39417:53 < JaffaCake> because the events are emitted a little earlier or later than the timestamps we use for stats? 
     39517:53 < dcoutts> because the event and the stats_startGC()/stats_endGC() happen at different points in the code from traceGcBegin()/End() 
     39617:53 < JaffaCake> if it really mattered we could make them the same 
     39717:53 < dcoutts> JaffaCake: right 
     39817:53 < mikolaj_> dcoutts: with some luck, we could draw the time recorded in GC_STATS, starting from GC_START and show the remaining area, till GC_END as an extra shceduling cost, outlines in different colour 
     39917:54 < JaffaCake> i.e. grab a timestamp and pass it to both stat_startGC() and traceGcBegin() 
     40017:54 < mikolaj_> so the duplication would not be entirely wasted 
     40117:54 < dcoutts> JaffaCake: mm 
     40217:54 < mikolaj_> but it depends on GC_START being mostly syncronized across HECS, which it almost is, I think 
     40317:54 < mikolaj_> it's GC_END that's out of sync a lot 
     40417:55 < JaffaCake> I really do want to see the ragged GC_STARTs, because it tells me how long it takes to synchronise all the HECs 
     40517:55 < dcoutts> mikolaj_: it's only out of sync on the HECs other than the primary one 
     40617:56 < mikolaj_> JaffaCake: oh, so they are not in sync? I guess I didn't look close enough 
     40717:56 < JaffaCake> funny if GC_END is more out of sync than GC_START, I don't remember seeing that before 
     40817:56 < JaffaCake> mikolaj_: they won't occur at exactly the same time, no 
     40917:57 < mikolaj_> JaffaCake: I may be wrong, or it may only be so in rare cases 
     41017:57 < dcoutts> JaffaCake: this is the thing mikolaj_ was seeing, that GC_END can be quite significantly delayed on the "other" HECs 
     41117:57 < JaffaCake> because we really do have to send a signal to all the cores 
     41217:57 < JaffaCake> dcoutts__: so that could be something we should look into 
     41317:57 < dcoutts> JaffaCake: I'd assumed that this was because we were using a mutex to hold the other HECs, but you say it's a spinlock so I cannot explain it. 
     41417:58 < mikolaj_> anyway, ignore me, I didn't look at GC_START at all, just thinking aloud how we can use in TS the extra data from dupliated GC times 
     41517:58 < JaffaCake> is it occurring when you're using all the cores?  the last core problem? 
     41617:58 < dcoutts> mikolaj_: how many cores do you have? 
     41717:58 < mikolaj_> JaffaCake: quite likely 
     41817:58 < mikolaj_> 2 
     41917:58 < JaffaCake> aaah 
     42017:58 < mikolaj_> but the logs shows different scenarios 
     42117:59 < JaffaCake> all kinds of bad things happen when one thread gets descheduled 
     42217:59 < dcoutts> JaffaCake: so in this sense we already have mutator and GC overlapping 
     42317:59 < mikolaj_> it's just a simple lag 
     42417:59 < mikolaj_> *not 
     42518:00 < JaffaCake> dcoutts: I don't think of it like that - there is a handshake at the end of GC, so the primary thread can only continue when it has received the ok from all the other HECs 
     42618:00 < JaffaCake> so if one of the other HECs goes for a walk for a while after that, it is in a "limbo" state, not still doing GC 
     42718:00 < dcoutts> right 
     42818:01 < dcoutts> again it's just from the user pov, it's doing something other than running the mutator 
     42918:01  * JaffaCake is out of time, must dash 
     43018:01 < dcoutts> ok 
     43118:01 < mikolaj_> thanks 
     43218:01 < dcoutts> JaffaCake: yes, thanks, we'll think about what to do 
     43318:01 < JaffaCake> dcoutts: in that case you would want to count all "descheduled" time as non-mutator 
     43418:02 < JaffaCake> but we don't have a way to measure that yet 
     43518:02 < dcoutts> aye 
     43618:03 < dcoutts> mikolaj_: so perhaps we should make the GC_START/END times on the primary GC thread match the times emitted for +RTS -s 
     43718:03 < dcoutts> mikolaj_: rather than duplicating the info in the GC_STATS 
     43818:03 < mikolaj_> dcoutts: works for me (probably) 
     439}}} 
     440 
     441{{{ 
     44212:38 < dcoutts> JaffaCake: we didn't plan on adding a per-HEC GC stat event, do you think that would be useful? what info would we get? 
     44312:39 < JaffaCake> CPU time in GC on that thread 
     44412:39 < JaffaCake> not particularly useful, but it could be done 
     44512:39 < dcoutts> as opposed to wall clock time between GC_START/END 
     44612:39 < JaffaCake> right - you could get wall clock time too, but that's even less useful I think 
     44712:39 < dcoutts> we get wall clock time already from GC_START/END 
     44812:40 < JaffaCake> right 
     44912:40 < mikolaj_> but it's mixed GC + scheduling overhead time 
     45012:40 < dcoutts> right 
     45112:40 < dcoutts> like we were discussing yesterday 
     45212:43 < JaffaCake> GC time does include scheduling overhead already 
     45312:43 < JaffaCake> it includes the time taken to sync with all the other HECs 
     45412:47 < mikolaj_> right, so only the amount of post-GC scheduling overhead atrributed to GC varies (between RTS stats and GC_START/END) 
     45512:48 < JaffaCake> I suppose so, but it's just random noise introduced by the OS 
     45612:49 < JaffaCake> OS context switches can strike at any time 
     45712:49 < JaffaCake> so if you take two timestamps in succession, there's a random chance that there will be a big delay between them 
     45812:50 < JaffaCake> the discrepancy you see with GC_END is nothing more than that, I think 
     45912:52  * mikolaj_ has a look at the log samples again 
     46013:01 < mikolaj_> yes, quite probably so, plus perhaps a tiny error from the very act of emitting events, etc. 
     46113:02 < mikolaj_> anyway, for the global GC stats, we just want to make sure we measure the same amount of OS noise between GC_START/END as RTS stats do, at least on the main GC HEC 
     46213:04 < mikolaj_> that's why I asked about moving GC_END, but perhaps wait until dcoutts write the paper and you finish overhauling stat gathering 
     463}}} 
    313464 
    314465==== Summary of the problems ==== 
     
    385536the future validation of events). 
    386537 
     538Here are the results of implementing it: 
     539 
     540{{{ 
     54118:13 < mikolaj> dcoutts, JaffaCake: I've tested the "moving the interruptAllCapabilities()  
     542call below the traceEvent calls" tweak and it helps in 99-ish % of cases 
     54318:13 < mikolaj> that is, the order of GC events is much, much better now 
     54418:14 < mikolaj> but there are still very rare cases, especially when the par GC request was  
     545from one HEC, and another, busy HEC is too slow to report finishing a previous GC 
     54618:15 < mikolaj> so it does not help that much in validating eventlogs, but makes them  
     547so much more readable, if a user wants to inspect the event messages one by one 
     548}}} 
     549 
    387550==== More example eventlog fragments ==== 
    388551