| | 313 | |
| | 314 | Here are two IRC logs with the discussion that led |
| | 315 | to the GC_START/GC_END/GC_GLOBAL_SYNC solution. |
| | 316 | There are a few more long IRC logs related to that, |
| | 317 | most of them on #ghc. |
| | 318 | |
| | 319 | {{{ |
| | 320 | 17:19 < dcoutts> and yes, mikolaj is right, I did mean to ask if it wouldn't make sense to include sequential GCs when calculating the parallel GC work balance |
| | 321 | 17:19 < dcoutts> I understand from your pov as implementer that you don't care, but doesn't it make sense from the user's pov |
| | 322 | 17:20 < dcoutts> that is, over a whole prog run, we sum up the total bytes copied and the max bytes copied by any individual GC thread |
| | 323 | 17:20 < dcoutts> +RTS -s only includes parallel GCs for this |
| | 324 | 17:24 < mikolaj_> yeah, more generally, the caps can be added and deleted during runtime, so seq GCs are just a special case, when only one cap runs at all, or only one works in this GC |
| | 325 | 17:25 < mikolaj_> which leads to a related quesion, which number of caps take as the divisor in "Parallel GC work balance" |
| | 326 | 17:27 < dcoutts> mikolaj_: yeah I was thinking about that. If we do want to take into account changing numbers of caps then we might want to divide by the active number of caps at each point, and combine the results |
| | 327 | 17:32 < JaffaCake> right, I'm not sure what to do about the "GC_END lag" |
| | 328 | 17:33 < JaffaCake> if the thread is descheduled at the end of GC, is it still in the GC or not? |
| | 329 | 17:33 < dcoutts> JaffaCake: from the user's pov I'd say yes |
| | 330 | 17:33 < dcoutts> from the GC implementer's pov, perhaps not |
| | 331 | 17:33 < JaffaCake> but if you have some threads in GC and some not, it makes no sense to ask what the wall-clock time spent in GC is |
| | 332 | 17:33 < dcoutts> that's true |
| | 333 | 17:34 < JaffaCake> this is the problem I had in the local GC, it doesn't measure wall-clock time spent in local GC |
| | 334 | 17:34 < dcoutts> well, unless you take the sum over all hecs |
| | 335 | 17:34 < dcoutts> right, what will wall clock time mean for local per-hec GC |
| | 336 | 17:34 < JaffaCake> (and BTW, CPU time in parallel GC is next to useless, because a lot of cycles are spent spinning) |
| | 337 | 17:35 < dcoutts> JaffaCake: that's also why I'd ideally like to calculate these things without having to associate GC events across all HECs |
| | 338 | 17:35 < dcoutts> since that would no longer make sense for local per-HEC GC |
| | 339 | 17:36 < JaffaCake> so just use the single GC_STATS event? |
| | 340 | 17:36 < mikolaj_> neither will maxPause make sense for per-HEC GC |
| | 341 | 17:36 < JaffaCake> right |
| | 342 | 17:36 < dcoutts> mikolaj: unless it's the max pause on any HEC |
| | 343 | 17:37 < dcoutts> rather than trying to measure for each GC, time between first HEC starting GC and last finishing |
| | 344 | 17:37 < JaffaCake> hmm, that might be useful |
| | 345 | 17:37 < dcoutts> "for any HEC, what was the longest time it paused" |
| | 346 | 17:38 < JaffaCake> I think that might give artificially long pauses |
| | 347 | 17:38 < dcoutts> why do you say it's artificial? |
| | 348 | 17:38 < dcoutts> because you're blaming the OS? :-) |
| | 349 | 17:39 < JaffaCake> if a HEC has nothing to do, then if it gets delayed a lot after GC that's not really a pause |
| | 350 | 17:39 < JaffaCake> and yes, there's a question about whether OS-induced pauses count |
| | 351 | 17:40 < JaffaCake> for me, I'd rather they weren't included |
| | 352 | 17:40 < dcoutts> JaffaCake: hmm, you mean it can block for a long time, not emit a GC_END and then block due to being idle? |
| | 353 | 17:40 < JaffaCake> yes |
| | 354 | 17:40 < JaffaCake> but, I'm not sure why HECs would get descheduled at this point, unless it's just unlucky |
| | 355 | 17:41 < JaffaCake> because we use spinlocks rather than mutexes |
| | 356 | 17:41 < dcoutts> JaffaCake: for the "end of GC" sync point you mean? |
| | 357 | 17:41 < JaffaCake> I suppose if there's other processes and we yield() |
| | 358 | 17:41 < JaffaCake> yes |
| | 359 | 17:42 < JaffaCake> so is there a problem with just taking the stats from the GC_STATS event? |
| | 360 | 17:42 < JaffaCake> then you'll match what GHC does |
| | 361 | 17:42 < dcoutts> JaffaCake: well two things, one is if it's really measuring what we want |
| | 362 | 17:43 < dcoutts> and the other is that it is quite ghc specific, it'd be nicer to calculate these things using more general events |
| | 363 | 17:43 < JaffaCake> why might it not be? |
| | 364 | 17:44 < dcoutts> JaffaCake: because from the user's pov, the GC vs mutator ratio is really mutator vs other, they don't care so much who's fault it is GC or OS |
| | 365 | 17:44 < JaffaCake> the problem with trying to be more general is you have to cope with GC that might be concurrent, and then various things don't make sense (e.g. wall-clock time) |
| | 366 | 17:44 < dcoutts> right yes, I'd like to distinguish which stats make sense generally |
| | 367 | 17:45 < dcoutts> and which can only make sense for stop-the-world |
| | 368 | 17:46 < JaffaCake> but if you have mutator overlapping with GC, then you can't ask how much wall-clock time was spent in either |
| | 369 | 17:46 < JaffaCake> I suppose you could give an average... |
| | 370 | 17:46 < dcoutts> right, only elapsed time * ncpus |
| | 371 | 17:47 < JaffaCake> but I don't think this is the most important issue, right? |
| | 372 | 17:47 < JaffaCake> we don't have any other compilers or GC architectures on the radar |
| | 373 | 17:47 < JaffaCake> if and when we do, then we could figure out what to do |
| | 374 | 17:47 < JaffaCake> it's hard to generalise from one example |
| | 375 | 17:48 < dcoutts> JaffaCake: true, we can just try and match exactly what +RTS -s does, and use the GC_STATS_GHC event |
| | 376 | 17:48 < JaffaCake> that would be fine by me |
| | 377 | 17:48 < dcoutts> I'm not convinced it's the "right thing" though it's the expedient thing |
| | 378 | 17:49 < JaffaCake> I agree, but sometimes worse is better :) |
| | 379 | 17:49 < dcoutts> mikolaj_: suppose we try to match exactly what +RTS -s gets, do we have any other remaining problems? |
| | 380 | 17:49 < dcoutts> we can match GCs across all HECs by using the GC_STATS event |
| | 381 | 17:49 < mikolaj_> "Parallel GC work balance" |
| | 382 | 17:49 < JaffaCake> is there a problem with that? |
| | 383 | 17:50 < mikolaj_> dcoutts: yes, we can and we'd need to recod the GC time in GC_STATS, instead of using GC_START and GC_END |
| | 384 | 17:50 < dcoutts> mikolaj_: so we can get the same number as +RTS -s if we filter out the samples where n_gc_threads = 1 |
| | 385 | 17:50 < mikolaj_> recod/record |
| | 386 | 17:50 < dcoutts> mikolaj: ah yes, we would |
| | 387 | 17:50 < mikolaj_> dcoutts: oh, yes, if that's the general rule that we just stick to RTS -s then yes |
| | 388 | 17:51 < dcoutts> JaffaCake: right we'd have to ignore GC_START/END completely really |
| | 389 | 17:51 < dcoutts> and put GC elapsed time into the GC_STATS event |
| | 390 | 17:52 < dcoutts> since even if we look at the GC_START/END for the "primary" GC thread that emits GC_STATS, then the times are different than that HEC's GC_START/END |
| | 391 | 17:52 < JaffaCake> you could use GC_START/GC_END for just one HEC - the one that emitted the GC_STATS event (but that seems a bit horrible) |
| | 392 | 17:52 < dcoutts> JaffaCake: right, except it will give us different numbers |
| | 393 | 17:52 < JaffaCake> oh, does it? |
| | 394 | 17:53 < JaffaCake> because the events are emitted a little earlier or later than the timestamps we use for stats? |
| | 395 | 17:53 < dcoutts> because the event and the stats_startGC()/stats_endGC() happen at different points in the code from traceGcBegin()/End() |
| | 396 | 17:53 < JaffaCake> if it really mattered we could make them the same |
| | 397 | 17:53 < dcoutts> JaffaCake: right |
| | 398 | 17:53 < mikolaj_> dcoutts: with some luck, we could draw the time recorded in GC_STATS, starting from GC_START and show the remaining area, till GC_END as an extra shceduling cost, outlines in different colour |
| | 399 | 17:54 < JaffaCake> i.e. grab a timestamp and pass it to both stat_startGC() and traceGcBegin() |
| | 400 | 17:54 < mikolaj_> so the duplication would not be entirely wasted |
| | 401 | 17:54 < dcoutts> JaffaCake: mm |
| | 402 | 17:54 < mikolaj_> but it depends on GC_START being mostly syncronized across HECS, which it almost is, I think |
| | 403 | 17:54 < mikolaj_> it's GC_END that's out of sync a lot |
| | 404 | 17:55 < JaffaCake> I really do want to see the ragged GC_STARTs, because it tells me how long it takes to synchronise all the HECs |
| | 405 | 17:55 < dcoutts> mikolaj_: it's only out of sync on the HECs other than the primary one |
| | 406 | 17:56 < mikolaj_> JaffaCake: oh, so they are not in sync? I guess I didn't look close enough |
| | 407 | 17:56 < JaffaCake> funny if GC_END is more out of sync than GC_START, I don't remember seeing that before |
| | 408 | 17:56 < JaffaCake> mikolaj_: they won't occur at exactly the same time, no |
| | 409 | 17:57 < mikolaj_> JaffaCake: I may be wrong, or it may only be so in rare cases |
| | 410 | 17:57 < dcoutts> JaffaCake: this is the thing mikolaj_ was seeing, that GC_END can be quite significantly delayed on the "other" HECs |
| | 411 | 17:57 < JaffaCake> because we really do have to send a signal to all the cores |
| | 412 | 17:57 < JaffaCake> dcoutts__: so that could be something we should look into |
| | 413 | 17:57 < dcoutts> JaffaCake: I'd assumed that this was because we were using a mutex to hold the other HECs, but you say it's a spinlock so I cannot explain it. |
| | 414 | 17:58 < mikolaj_> anyway, ignore me, I didn't look at GC_START at all, just thinking aloud how we can use in TS the extra data from dupliated GC times |
| | 415 | 17:58 < JaffaCake> is it occurring when you're using all the cores? the last core problem? |
| | 416 | 17:58 < dcoutts> mikolaj_: how many cores do you have? |
| | 417 | 17:58 < mikolaj_> JaffaCake: quite likely |
| | 418 | 17:58 < mikolaj_> 2 |
| | 419 | 17:58 < JaffaCake> aaah |
| | 420 | 17:58 < mikolaj_> but the logs shows different scenarios |
| | 421 | 17:59 < JaffaCake> all kinds of bad things happen when one thread gets descheduled |
| | 422 | 17:59 < dcoutts> JaffaCake: so in this sense we already have mutator and GC overlapping |
| | 423 | 17:59 < mikolaj_> it's just a simple lag |
| | 424 | 17:59 < mikolaj_> *not |
| | 425 | 18:00 < JaffaCake> dcoutts: I don't think of it like that - there is a handshake at the end of GC, so the primary thread can only continue when it has received the ok from all the other HECs |
| | 426 | 18:00 < JaffaCake> so if one of the other HECs goes for a walk for a while after that, it is in a "limbo" state, not still doing GC |
| | 427 | 18:00 < dcoutts> right |
| | 428 | 18:01 < dcoutts> again it's just from the user pov, it's doing something other than running the mutator |
| | 429 | 18:01 * JaffaCake is out of time, must dash |
| | 430 | 18:01 < dcoutts> ok |
| | 431 | 18:01 < mikolaj_> thanks |
| | 432 | 18:01 < dcoutts> JaffaCake: yes, thanks, we'll think about what to do |
| | 433 | 18:01 < JaffaCake> dcoutts: in that case you would want to count all "descheduled" time as non-mutator |
| | 434 | 18:02 < JaffaCake> but we don't have a way to measure that yet |
| | 435 | 18:02 < dcoutts> aye |
| | 436 | 18:03 < dcoutts> mikolaj_: so perhaps we should make the GC_START/END times on the primary GC thread match the times emitted for +RTS -s |
| | 437 | 18:03 < dcoutts> mikolaj_: rather than duplicating the info in the GC_STATS |
| | 438 | 18:03 < mikolaj_> dcoutts: works for me (probably) |
| | 439 | }}} |
| | 440 | |
| | 441 | {{{ |
| | 442 | 12:38 < dcoutts> JaffaCake: we didn't plan on adding a per-HEC GC stat event, do you think that would be useful? what info would we get? |
| | 443 | 12:39 < JaffaCake> CPU time in GC on that thread |
| | 444 | 12:39 < JaffaCake> not particularly useful, but it could be done |
| | 445 | 12:39 < dcoutts> as opposed to wall clock time between GC_START/END |
| | 446 | 12:39 < JaffaCake> right - you could get wall clock time too, but that's even less useful I think |
| | 447 | 12:39 < dcoutts> we get wall clock time already from GC_START/END |
| | 448 | 12:40 < JaffaCake> right |
| | 449 | 12:40 < mikolaj_> but it's mixed GC + scheduling overhead time |
| | 450 | 12:40 < dcoutts> right |
| | 451 | 12:40 < dcoutts> like we were discussing yesterday |
| | 452 | 12:43 < JaffaCake> GC time does include scheduling overhead already |
| | 453 | 12:43 < JaffaCake> it includes the time taken to sync with all the other HECs |
| | 454 | 12:47 < mikolaj_> right, so only the amount of post-GC scheduling overhead atrributed to GC varies (between RTS stats and GC_START/END) |
| | 455 | 12:48 < JaffaCake> I suppose so, but it's just random noise introduced by the OS |
| | 456 | 12:49 < JaffaCake> OS context switches can strike at any time |
| | 457 | 12:49 < JaffaCake> so if you take two timestamps in succession, there's a random chance that there will be a big delay between them |
| | 458 | 12:50 < JaffaCake> the discrepancy you see with GC_END is nothing more than that, I think |
| | 459 | 12:52 * mikolaj_ has a look at the log samples again |
| | 460 | 13:01 < mikolaj_> yes, quite probably so, plus perhaps a tiny error from the very act of emitting events, etc. |
| | 461 | 13:02 < mikolaj_> anyway, for the global GC stats, we just want to make sure we measure the same amount of OS noise between GC_START/END as RTS stats do, at least on the main GC HEC |
| | 462 | 13:04 < mikolaj_> that's why I asked about moving GC_END, but perhaps wait until dcoutts write the paper and you finish overhauling stat gathering |
| | 463 | }}} |