| 335 | | Below is a detailed discussion of the original GC event order |
| 336 | | and initial design choices considered. |
| | 335 | In the subsection there are detailed discussions of the original GC event order |
| | 336 | problems and about the initially considered design choices. |
| | 337 | |
| | 338 | ==== Summary of the problems ==== |
| | 339 | |
| | 340 | The main issue is how we measure the time spent doing GC. The +RTS -s |
| | 341 | measurement is taken from the stat_startGC() to stat_endGC() on the |
| | 342 | "primary" GC thread. In ThreadScope the main GC events we have access to |
| | 343 | are GC_START and GC_END on each capability. |
| | 344 | |
| | 345 | It turns out that these different measurements are often quite |
| | 346 | different. In particular there can often be a significant delay between |
| | 347 | when +RTS -s reports and when the other capabilities start the mutator |
| | 348 | again (and hence emit GC_END). Our guess is that this is because the |
| | 349 | other caps are suspended at the end of the GC while the primary GC |
| | 350 | thread does its stats etc, then when it releases them it can take a |
| | 351 | while for the other caps to be rescheduled by the OS. |
| | 352 | |
| | 353 | In this example, END_GC on one of the caps comes very late: |
| | 354 | |
| | 355 | {{{ |
| | 356 | 9378252000: cap 1: GC stats for heap capset 0: generation 0, 45160 |
| | 357 | bytes copied, 58296 bytes slop, 1757184 bytes fragmentation, 2 par |
| | 358 | threads, 45080 bytes max par copied, 45112 bytes total par copied |
| | 359 | 9378253000: cap 1: finished GC |
| | 360 | 9378254000: cap 1: spark stats: 1913071 created, 1229807 converted, |
| | 361 | 221 remaining (0 overflowed, 0 dud, 58923 GC'd, 144981 fizzled) |
| | 362 | 9378255000: cap 1: waking up thread 3 on cap 1 |
| | 363 | 9378257000: cap 1: running thread 2885 |
| | 364 | 9378259000: cap 1: stopping thread 2885 (thread finished) |
| | 365 | 9378268000: cap 1: running thread 3 |
| | 366 | 9378354000: cap 1: stopping thread 3 (blocked on black hole owned by |
| | 367 | thread 2884) |
| | 368 | 9378355000: cap 1: creating thread 2886 |
| | 369 | 9378363000: cap 0: finished GC |
| | 370 | }}} |
| | 371 | |
| | 372 | From the user's point of view it is perhaps more accurate to look at the |
| | 373 | GC_START/END because that's when the caps really get back to the |
| | 374 | mutator. |
| | 375 | |
| | 376 | This different GC vs mutator measurement affects several other |
| | 377 | statistics and the difference vs +RTS -s can build up to be quite large |
| | 378 | on larger eventlogs. |
| | 379 | |
| | 380 | Additionally, if we only look at the GC_START/GC_END then it's not |
| | 381 | entirely clear what the average or maximum pause time should mean, since |
| | 382 | we have multiple capabilities. Is the max the max for any individual |
| | 383 | cap, or from when the first cap does GC_START 'til when the last cap |
| | 384 | does GC_END. |
| | 385 | |
| | 386 | Which brings up another issue: some of the GC statistics rely on a GC |
| | 387 | being global across all capabilities, but strictly speaking the current |
| | 388 | GC events do not let us associate a GC_START/END on one cap with that on |
| | 389 | another. One core can issue a new REQUEST_SEQ_GC |
| | 390 | before another has issued it's GC_END from the previous GC. Add in the |
| | 391 | fact that not all cores have to take part in a GC and it's tricky to |
| | 392 | associate GCs on each core to a global GC. |
| | 393 | |
| | 394 | We have added a GC_GLOBAL_SYNC event which is emitted strictly between |
| | 395 | GC_START/GC_END on all cores, used for associating GC across all cores. |
| | 396 | If we could calculate stats in such a way that we do not |
| | 397 | need to associate GC across |
| | 398 | all cores then that would be better (and more flexible e.g. to work with |
| | 399 | your local GC branch). |
| | 400 | |
| | 401 | ==== An implemented partial solution for RequestParGC ==== |
| | 402 | |
| | 403 | We can try to ensure that RequestSeqGC and RequestParGC are emitted before their corresponding |
| | 404 | GC_START events, by using the following tip from JaffaCake: |
| | 405 | "actually you could try moving the interruptAllCapabilities() call in Schedule.c:1500 down below the traceEvent calls". |
| | 406 | However, even with this implemented in GHC HEAD, the events can still |
| | 407 | be emitted in arbitrary order and it complicates analysis, |
| | 408 | especially for selected intervals (and complicates and makes less strict |
| | 409 | the future validation of events). |
| | 410 | |
| | 411 | Here are the results of implementing it: |
| | 412 | |
| | 413 | {{{ |
| | 414 | 18:13 < mikolaj> dcoutts, JaffaCake: I've tested the "moving the interruptAllCapabilities() |
| | 415 | call below the traceEvent calls" tweak and it helps in 99-ish % of cases |
| | 416 | 18:13 < mikolaj> that is, the order of GC events is much, much better now |
| | 417 | 18:14 < mikolaj> but there are still very rare cases, especially when the par GC request was |
| | 418 | from one HEC, and another, busy HEC is too slow to report finishing a previous GC |
| | 419 | 18:15 < mikolaj> so it does not help that much in validating eventlogs, but makes them |
| | 420 | so much more readable, if a user wants to inspect the event messages one by one |
| | 421 | }}} |
| | 422 | |
| | 423 | ==== More example eventlog fragments ==== |
| | 424 | |
| | 425 | The example of the event order issues below are on a system with 2 caps, on an eventlog |
| | 426 | with lots of small GCs. They show how GC times measured by "+RTS -s", |
| | 427 | that is from START_GC to the "GC stats" event, are different from TS times. |
| | 428 | The TS times are the average over all caps of the pause |
| | 429 | between START_GC, when the caps stop mutating, |
| | 430 | and END_GC, when they resume mutating. (Most of) the difference |
| | 431 | stems probably from the thread sleep/wake delays, some ending GC bookkeeping |
| | 432 | (including the time taken to emit GC and other events) |
| | 433 | and related overheads. |
| | 434 | |
| | 435 | It seems the TS times are more useful to the programmer that is mostly |
| | 436 | concerned in how much time is spend mutating and not how the wasted time |
| | 437 | is divided between the actual GC and the cap switching overhead. |
| | 438 | Moreover "+RTS -s" just adds the cap switching overhead to the MUT time |
| | 439 | hiding the fact that it's caused by GC. |
| | 440 | |
| | 441 | The differences accumulate and affect the "GC total elapsed time", |
| | 442 | "Avg pause" and "Max pause" columns of the GC table in "+RTS -s". |
| | 443 | They also influence the "Task" table (not reproduced in TS). |
| | 444 | They add up in total "GC time (elapsed)" and consequently in |
| | 445 | "MUT time (elapsed)" that is obtained by subtracting GC time from total time. |
| | 446 | In the result, they influence "Alloc rate" and "Productivity", too. |
| | 447 | The figures are seriously different between TS and "+RTS -s", |
| | 448 | except on small eventlogs. |
| | 449 | |
| | 450 | It seems for all these figures the TS version is the right one. |
| | 451 | In particular, if by "pause" we mean the time between |
| | 452 | GC_START and GC_END, by "Max pause" we neither mean |
| | 453 | the "maximum over all GCs of (minimal pause over all caps)", |
| | 454 | similarly as in +RTS -s (but not exactly the same, because |
| | 455 | even the minimal GC_END is usually a tick after "GC stats"), |
| | 456 | nor "maximum over all GCs of (maximal pause over all caps)", which |
| | 457 | we could easily show in TS and which could make sense for a real-time |
| | 458 | system with specialized, non-interchangeable caps, |
| | 459 | but "maximum over all GCs of (average pause over all caps)". |
| | 460 | |
| | 461 | Here's an example demonstrating END_GC |
| | 462 | for the previous GC following REQUEST_PAR_GC for the next GC on another cap. |
| | 463 | Note that the early END_GC on cap 1 still comes 1000 nanoseconds after |
| | 464 | the "GC stats" event. Even such small differences accumulate over all GCs: |
| | 465 | |
| | 466 | {{{ |
| | 467 | 9366536000: cap 1: GC stats for heap capset 0: generation 0, 77552 |
| | 468 | bytes copied, 226056 bytes slop, 696320 bytes fragmentation, 2 par |
| | 469 | threads, 71736 bytes max par copied, 77520 bytes total par copied |
| | 470 | 9366537000: cap 1: finished GC |
| | 471 | 9366538000: cap 1: spark stats: 1910358 created, 1228256 converted, |
| | 472 | 1000 remaining (0 overflowed, 0 dud, 58306 GC'd, 144981 fizzled) |
| | 473 | 9366541000: cap 1: running thread 3 |
| | 474 | 9367310000: cap 1: stopping thread 3 (heap overflow) |
| | 475 | 9367312000: cap 1: requesting parallel GC |
| | 476 | 9367316000: cap 0: finished GC |
| | 477 | 9367317000: cap 0: spark stats: 1071418 created, 1292489 converted, 0 |
| | 478 | remaining (0 overflowed, 0 dud, 41840 GC'd, 214904 fizzled) |
| | 479 | 9367318000: cap 0: starting GC |
| | 480 | 9368131000: cap 1: starting GC |
| | 481 | }}} |
| | 482 | |
| | 483 | Another example, this time there's a different event |
| | 484 | ("allocated on heap capset") on the late cap, at the same time |
| | 485 | as "GC stats" an another cap and before END_GC finally arrives. |
| | 486 | Not sure how this can happen. The END_GC on the early cap comes here |
| | 487 | a bit later that usually (2000 nanoseconds after the "GC stats" event), too: |
| | 488 | |
| | 489 | {{{ |
| | 490 | 836505000: cap 0: GC stats for heap capset 0: generation 0, 11584 |
| | 491 | bytes copied, 205232 bytes slop, 1007616 bytes fragmentation, 2 par |
| | 492 | threads, 11472 bytes max par copied, 11536 bytes total par copied |
| | 493 | 836505000: cap 1: allocated on heap capset 0: 180717856 total bytes till now |
| | 494 | 836507000: cap 0: finished GC |
| | 495 | 836507000: cap 0: spark stats: 5893 created, 114894 converted, 228 |
| | 496 | remaining (0 overflowed, 0 dud, 13 GC'd, 22190 fizzled) |
| | 497 | 836509000: cap 0: waking up thread 3 on cap 0 |
| | 498 | 836511000: cap 0: running thread 233 |
| | 499 | 836514000: cap 0: stopping thread 233 (thread finished) |
| | 500 | 836521000: cap 1: finished GC |
| | 501 | }}} |
| | 502 | |
| | 503 | Finally an example where END_GC is quite late on both caps |
| | 504 | (3000 nanoseconds and 6000 nanoseconds): |
| | 505 | |
| | 506 | {{{ |
| | 507 | 67582000: cap 1: GC stats for heap capset 0: generation 1, 58248 bytes |
| | 508 | copied, 40096 bytes slop, 1912832 bytes fragmentation, 2 par threads, |
| | 509 | 51920 bytes max par copied, 58208 bytes total par copied |
| | 510 | 67583000: cap 1: live data in heap capset 0: 82784 bytes |
| | 511 | 67585000: cap 1: finished GC |
| | 512 | 67585000: cap 1: spark stats: 15101 created, 2570 converted, 508 |
| | 513 | remaining (0 overflowed, 0 dud, 879 GC'd, 12 fizzled) |
| | 514 | 67588000: cap 0: finished GC |
| | 515 | }}} |
| | 516 | |
| | 517 | |
| | 518 | ==== Discussions that led to the implemented GC_START/GC_END/GC_GLOBAL_SYNC solution ==== |
| 488 | | |
| 489 | | ==== Summary of the problems ==== |
| 490 | | |
| 491 | | The main issue is how we measure the time spent doing GC. The +RTS -s |
| 492 | | measurement is taken from the stat_startGC() to stat_endGC() on the |
| 493 | | "primary" GC thread. In ThreadScope the main GC events we have access to |
| 494 | | are GC_START and GC_END on each capability. |
| 495 | | |
| 496 | | It turns out that these different measurements are often quite |
| 497 | | different. In particular there can often be a significant delay between |
| 498 | | when +RTS -s reports and when the other capabilities start the mutator |
| 499 | | again (and hence emit GC_END). Our guess is that this is because the |
| 500 | | other caps are suspended at the end of the GC while the primary GC |
| 501 | | thread does its stats etc, then when it releases them it can take a |
| 502 | | while for the other caps to be rescheduled by the OS. |
| 503 | | |
| 504 | | In this example, END_GC on one of the caps comes very late: |
| 505 | | |
| 506 | | {{{ |
| 507 | | 9378252000: cap 1: GC stats for heap capset 0: generation 0, 45160 |
| 508 | | bytes copied, 58296 bytes slop, 1757184 bytes fragmentation, 2 par |
| 509 | | threads, 45080 bytes max par copied, 45112 bytes total par copied |
| 510 | | 9378253000: cap 1: finished GC |
| 511 | | 9378254000: cap 1: spark stats: 1913071 created, 1229807 converted, |
| 512 | | 221 remaining (0 overflowed, 0 dud, 58923 GC'd, 144981 fizzled) |
| 513 | | 9378255000: cap 1: waking up thread 3 on cap 1 |
| 514 | | 9378257000: cap 1: running thread 2885 |
| 515 | | 9378259000: cap 1: stopping thread 2885 (thread finished) |
| 516 | | 9378268000: cap 1: running thread 3 |
| 517 | | 9378354000: cap 1: stopping thread 3 (blocked on black hole owned by |
| 518 | | thread 2884) |
| 519 | | 9378355000: cap 1: creating thread 2886 |
| 520 | | 9378363000: cap 0: finished GC |
| 521 | | }}} |
| 522 | | |
| 523 | | From the user's point of view it is perhaps more accurate to look at the |
| 524 | | GC_START/END because that's when the caps really get back to the |
| 525 | | mutator. |
| 526 | | |
| 527 | | This different GC vs mutator measurement affects several other |
| 528 | | statistics and the difference vs +RTS -s can build up to be quite large |
| 529 | | on larger eventlogs. |
| 530 | | |
| 531 | | Additionally, if we only look at the GC_START/GC_END then it's not |
| 532 | | entirely clear what the average or maximum pause time should mean, since |
| 533 | | we have multiple capabilities. Is the max the max for any individual |
| 534 | | cap, or from when the first cap does GC_START 'til when the last cap |
| 535 | | does GC_END. |
| 536 | | |
| 537 | | Which brings up another issue: some of the GC statistics rely on a GC |
| 538 | | being global across all capabilities, but strictly speaking the current |
| 539 | | GC events do not let us associate a GC_START/END on one cap with that on |
| 540 | | another. One core can issue a new REQUEST_SEQ_GC |
| 541 | | before another has issued it's GC_END from the previous GC. Add in the |
| 542 | | fact that not all cores have to take part in a GC and it's tricky to |
| 543 | | associate GCs on each core to a global GC. |
| 544 | | |
| 545 | | We have added a GC_GLOBAL_SYNC event which is emitted strictly between |
| 546 | | GC_START/GC_END on all cores, used for associating GC across all cores. |
| 547 | | If we could calculate stats in such a way that we do not |
| 548 | | need to associate GC across |
| 549 | | all cores then that would be better (and more flexible e.g. to work with |
| 550 | | your local GC branch). |
| 551 | | |
| 552 | | ==== An implemented partial solution for RequestParGC ==== |
| 553 | | |
| 554 | | We can try to ensure that RequestSeqGC and RequestParGC are emitted before their corresponding |
| 555 | | GC_START events, by using the following tip from JaffaCake: |
| 556 | | "actually you could try moving the interruptAllCapabilities() call in Schedule.c:1500 down below the traceEvent calls". |
| 557 | | However, even with this implemented in GHC HEAD, the events can still |
| 558 | | be emitted in arbitrary order and it complicates analysis, |
| 559 | | especially for selected intervals (and complicates and makes less strict |
| 560 | | the future validation of events). |
| 561 | | |
| 562 | | Here are the results of implementing it: |
| 563 | | |
| 564 | | {{{ |
| 565 | | 18:13 < mikolaj> dcoutts, JaffaCake: I've tested the "moving the interruptAllCapabilities() |
| 566 | | call below the traceEvent calls" tweak and it helps in 99-ish % of cases |
| 567 | | 18:13 < mikolaj> that is, the order of GC events is much, much better now |
| 568 | | 18:14 < mikolaj> but there are still very rare cases, especially when the par GC request was |
| 569 | | from one HEC, and another, busy HEC is too slow to report finishing a previous GC |
| 570 | | 18:15 < mikolaj> so it does not help that much in validating eventlogs, but makes them |
| 571 | | so much more readable, if a user wants to inspect the event messages one by one |
| 572 | | }}} |
| 573 | | |
| 574 | | ==== More example eventlog fragments ==== |
| 575 | | |
| 576 | | The example of the event order issues below are on a system with 2 caps, on an eventlog |
| 577 | | with lots of small GCs. They show how GC times measured by "+RTS -s", |
| 578 | | that is from START_GC to the "GC stats" event, are different from TS times. |
| 579 | | The TS times are the average over all caps of the pause |
| 580 | | between START_GC, when the caps stop mutating, |
| 581 | | and END_GC, when they resume mutating. (Most of) the difference |
| 582 | | stems probably from the thread sleep/wake delays, some ending GC bookkeeping |
| 583 | | (including the time taken to emit GC and other events) |
| 584 | | and related overheads. |
| 585 | | |
| 586 | | It seems the TS times are more useful to the programmer that is mostly |
| 587 | | concerned in how much time is spend mutating and not how the wasted time |
| 588 | | is divided between the actual GC and the cap switching overhead. |
| 589 | | Moreover "+RTS -s" just adds the cap switching overhead to the MUT time |
| 590 | | hiding the fact that it's caused by GC. |
| 591 | | |
| 592 | | The differences accumulate and affect the "GC total elapsed time", |
| 593 | | "Avg pause" and "Max pause" columns of the GC table in "+RTS -s". |
| 594 | | They also influence the "Task" table (not reproduced in TS). |
| 595 | | They add up in total "GC time (elapsed)" and consequently in |
| 596 | | "MUT time (elapsed)" that is obtained by subtracting GC time from total time. |
| 597 | | In the result, they influence "Alloc rate" and "Productivity", too. |
| 598 | | The figures are seriously different between TS and "+RTS -s", |
| 599 | | except on small eventlogs. |
| 600 | | |
| 601 | | It seems for all these figures the TS version is the right one. |
| 602 | | In particular, if by "pause" we mean the time between |
| 603 | | GC_START and GC_END, by "Max pause" we neither mean |
| 604 | | the "maximum over all GCs of (minimal pause over all caps)", |
| 605 | | similarly as in +RTS -s (but not exactly the same, because |
| 606 | | even the minimal GC_END is usually a tick after "GC stats"), |
| 607 | | nor "maximum over all GCs of (maximal pause over all caps)", which |
| 608 | | we could easily show in TS and which could make sense for a real-time |
| 609 | | system with specialized, non-interchangeable caps, |
| 610 | | but "maximum over all GCs of (average pause over all caps)". |
| 611 | | |
| 612 | | Here's an example demonstrating END_GC |
| 613 | | for the previous GC following REQUEST_PAR_GC for the next GC on another cap. |
| 614 | | Note that the early END_GC on cap 1 still comes 1000 nanoseconds after |
| 615 | | the "GC stats" event. Even such small differences accumulate over all GCs: |
| 616 | | |
| 617 | | {{{ |
| 618 | | 9366536000: cap 1: GC stats for heap capset 0: generation 0, 77552 |
| 619 | | bytes copied, 226056 bytes slop, 696320 bytes fragmentation, 2 par |
| 620 | | threads, 71736 bytes max par copied, 77520 bytes total par copied |
| 621 | | 9366537000: cap 1: finished GC |
| 622 | | 9366538000: cap 1: spark stats: 1910358 created, 1228256 converted, |
| 623 | | 1000 remaining (0 overflowed, 0 dud, 58306 GC'd, 144981 fizzled) |
| 624 | | 9366541000: cap 1: running thread 3 |
| 625 | | 9367310000: cap 1: stopping thread 3 (heap overflow) |
| 626 | | 9367312000: cap 1: requesting parallel GC |
| 627 | | 9367316000: cap 0: finished GC |
| 628 | | 9367317000: cap 0: spark stats: 1071418 created, 1292489 converted, 0 |
| 629 | | remaining (0 overflowed, 0 dud, 41840 GC'd, 214904 fizzled) |
| 630 | | 9367318000: cap 0: starting GC |
| 631 | | 9368131000: cap 1: starting GC |
| 632 | | }}} |
| 633 | | |
| 634 | | Another example, this time there's a different event |
| 635 | | ("allocated on heap capset") on the late cap, at the same time |
| 636 | | as "GC stats" an another cap and before END_GC finally arrives. |
| 637 | | Not sure how this can happen. The END_GC on the early cap comes here |
| 638 | | a bit later that usually (2000 nanoseconds after the "GC stats" event), too: |
| 639 | | |
| 640 | | {{{ |
| 641 | | 836505000: cap 0: GC stats for heap capset 0: generation 0, 11584 |
| 642 | | bytes copied, 205232 bytes slop, 1007616 bytes fragmentation, 2 par |
| 643 | | threads, 11472 bytes max par copied, 11536 bytes total par copied |
| 644 | | 836505000: cap 1: allocated on heap capset 0: 180717856 total bytes till now |
| 645 | | 836507000: cap 0: finished GC |
| 646 | | 836507000: cap 0: spark stats: 5893 created, 114894 converted, 228 |
| 647 | | remaining (0 overflowed, 0 dud, 13 GC'd, 22190 fizzled) |
| 648 | | 836509000: cap 0: waking up thread 3 on cap 0 |
| 649 | | 836511000: cap 0: running thread 233 |
| 650 | | 836514000: cap 0: stopping thread 233 (thread finished) |
| 651 | | 836521000: cap 1: finished GC |
| 652 | | }}} |
| 653 | | |
| 654 | | Finally an example where END_GC is quite late on both caps |
| 655 | | (3000 nanoseconds and 6000 nanoseconds): |
| 656 | | |
| 657 | | {{{ |
| 658 | | 67582000: cap 1: GC stats for heap capset 0: generation 1, 58248 bytes |
| 659 | | copied, 40096 bytes slop, 1912832 bytes fragmentation, 2 par threads, |
| 660 | | 51920 bytes max par copied, 58208 bytes total par copied |
| 661 | | 67583000: cap 1: live data in heap capset 0: 82784 bytes |
| 662 | | 67585000: cap 1: finished GC |
| 663 | | 67585000: cap 1: spark stats: 15101 created, 2570 converted, 508 |
| 664 | | remaining (0 overflowed, 0 dud, 879 GC'd, 12 fizzled) |
| 665 | | 67588000: cap 0: finished GC |
| 666 | | }}} |