Opened 13 months ago

Last modified 8 months ago

#15418 infoneeded bug

Performance drop 60 times on non-profiling binary

Reported by: hth313 Owned by:
Priority: high Milestone: 8.10.1
Component: Runtime System Version: 8.4.3
Keywords: Cc:
Operating System: MacOS X Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #14414, #9599 Differential Rev(s):
Wiki Page:

Description

I have a rather large application that I have spent quite some time tuning, as performance is just too poor. I have come to a point where the profiler reports about 14 seconds on a particular example, which is about 5x slower than I would like to see.

However, building the same binary without profiling support results in the example taking 14 _minutes_. If I just touch a file and rebuild with profiling enabled, but without passing any profiling-related RTS options, it takes about 30 seconds (not 14 seconds, but that is probably due to the profiling overhead being present).

How can performance drop 60x when I basically just relink it?

From what I can see using "top" and memory profiling, memory consumption is quite stable over time.

Change History (20)

comment:1 Changed 13 months ago by osa1

How can performance drop 60x when I basically just relink it?

If you previously linked with the profiling libraries and runtime, relinking with the non-profiling libraries and runtime (or the other way around) can of course make a huge difference. In one case all library functions are profiled and you're using the profiling RTS; in the other they're not profiled and the RTS has less overhead.

I'm confused about what exactly you're doing. I understand that you have one program that you want to optimize, and you're trying different profiling settings. You have these three configurations:

  • Compiled without profiling and run
  • Compiled with profiling and run without profiling (i.e. no +RTS -p)
  • Compiled with profiling and run with profiling (+RTS -p)
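
The three configurations could be reproduced roughly like this (a sketch; `Main.hs`, the output names, and `-O2` are placeholders for the actual build setup):

```shell
# 1. Plain (non-profiling) build, run normally
ghc -O2 -rtsopts Main.hs -o prog
time ./prog

# 2. Profiling build, run without any profiling RTS options
ghc -O2 -prof -fprof-auto -rtsopts Main.hs -o prog-prof
time ./prog-prof

# 3. Same profiling build, run with time profiling (writes prog-prof.prof)
time ./prog-prof +RTS -p -RTS
```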

Can you report numbers for each of these?

comment:2 Changed 13 months ago by hth313

Normal build (report using time)

real	16m17.687s
user	16m1.916s
sys	0m10.150s

Profiling build (no +RTS -p)

real	0m26.963s
user	0m26.562s
sys	0m0.378s

Profiling build, +RTS -p

real	0m27.528s
user	0m27.121s
sys	0m0.471s

Peeking inside generated .prof

	total time  =       14.04 secs   (14037 ticks @ 1000 us, 1 processor)
	total alloc = 17,827,853,336 bytes  (excludes profiling overheads)

comment:3 Changed 13 months ago by osa1

comment:4 Changed 13 months ago by George

Can you provide a reproducible case?

comment:5 Changed 13 months ago by hth313

I can provide a binary with the needed input files for download. I can also try things on my side if I get some suggestions on what to do.

comment:6 Changed 13 months ago by George

Best to have the source as well, with details on how to produce a binary and how to run it, so people can try this on different GHC versions and on different platforms.

comment:7 Changed 13 months ago by George

Architecture: Unknown/Multiple → x86_64 (amd64)
Type of failure: None/Unknown → Runtime performance bug

comment:8 Changed 13 months ago by hth313

Code base is unfortunately not open source.

The problem is also present on at least GHC 8.2.2, as that is what I was using when I encountered it. I upgraded to 8.4.3, but unfortunately the problem was still there.

comment:9 Changed 13 months ago by George

I think without a reduced test case with source this will be very hard to fix.

It would be interesting to see if you can reproduce the issues in the related tickets above, as I don't believe others have been able to. Maybe you have the same problem as those tickets, in which case we would have source to work on.

Others will probably have suggestions about what you can do on your side. You might want to take a look at the performance debugging section of https://ghc.haskell.org/trac/ghc/wiki/Debugging

Last edited 13 months ago by George (previous) (diff)

comment:10 Changed 13 months ago by hth313

I tried #14414 but I cannot reproduce that problem.

Building my program on Linux (x86_64), I can confirm the problem I have encountered is also present there.

I also tried to build with -ticky, but I cannot say the output made me any wiser.

comment:11 Changed 13 months ago by hth313

Trying some low-tech profiling, I compiled with -debug and ran in lldb, then stopped it with Ctrl-C from time to time to see what it was up to. The following two backtraces are definitely the main offenders, especially the second one.

The first is the garbage collector in some way; I am not really sure what is going on in the second.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000010097d553 tcc6502`countBlocks(bd=0x0000004202400fc0) at BlockAlloc.c:782
    frame #1: 0x0000000100988cd6 tcc6502`genBlocks(gen=0x0000000110c07528) at Sanity.c:870
    frame #2: 0x0000000100988fff tcc6502`memInventory(show=false) at Sanity.c:900
    frame #3: 0x00000001009821b8 tcc6502`GarbageCollect(collect_gen=0, do_heap_census=false, gc_type=0, cap=0x0000000101dc4500, idle_cap=0x0000000000000000) at GC.c:297
    frame #4: 0x00000001009730b9 tcc6502`scheduleDoGC(pcap=0x00007ffeefbff770, task=0x0000000110c078a0, force_major=false) at Schedule.c:1809
    frame #5: 0x00000001009724d4 tcc6502`schedule(initialCapability=0x0000000101dc4500, task=0x0000000110c078a0) at Schedule.c:558
    frame #6: 0x00000001009737d8 tcc6502`scheduleWaitThread(tso=0x0000004200105388, ret=0x0000000000000000, pcap=0x00007ffeefbff870) at Schedule.c:2544
    frame #7: 0x000000010096a496 tcc6502`rts_evalLazyIO(cap=0x00007ffeefbff870, p=0x0000000101d74e68, ret=0x0000000000000000) at RtsAPI.c:530
    frame #8: 0x000000010096d448 tcc6502`hs_main(argc=7, argv=0x00007ffeefbff9c8, main_closure=0x0000000101d74e68, rts_config=RtsConfig @ 0x00007ffeefbff890) at RtsMain.c:72
    frame #9: 0x000000010063eec6 tcc6502`main + 166
    frame #10: 0x00007fff692ec015 libdyld.dylib`start + 1
    frame #11: 0x00007fff692ec015 libdyld.dylib`start + 1
    frame #0: 0x00000001001105c4 tcc6502`c5kzn_info + 36
tcc6502`c5kzn_info:
->  0x1001105c4 <+36>: leaq   0x888175(%rip), %rdx      ; stg_MUT_ARR_PTRS_DIRTY_info
    0x1001105cb <+43>: movq   %rdx, (%rbx)
    0x1001105ce <+46>: leaq   0x18(%rbx), %rdx
    0x1001105d2 <+50>: shrq   $0x7, %rax
Target 0: (tcc6502) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001001105c4 tcc6502`c5kzn_info + 36
    frame #1: 0x0000000109a19b60 libclang.dylib`llvm::TargetLibraryInfoImpl::StandardNames + 896
    frame #2: 0x0000000108a6440b libclang.dylib
    frame #3: 0x0000000101d3f8c1 tcc6502`TranslatorziTargetziMOS6502ziCompilerziRegister_RCzuY_closure + 1

comment:12 Changed 13 months ago by George

Comment 7 in #14414 says: "Note that the entirety of the anomalous time is in the garbage collector."

Is this true in your case? Can you add " +RTS -s" to the end of your run command line and show us the output in the different cases?

comment:13 Changed 13 months ago by hth313

Built without profiling:

2,608,765,195,440 bytes allocated in the heap
   6,315,679,080 bytes copied during GC
      80,444,376 bytes maximum residency (62 sample(s))
         484,392 bytes maximum slop
             234 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1993101 colls,     0 par   44.791s  46.358s     0.0000s    0.0044s
  Gen  1        62 colls,     0 par    2.675s   2.983s     0.0481s    0.2165s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time  792.519s  (798.495s elapsed)
  GC      time   47.466s  ( 49.341s elapsed)
  EXIT    time    0.000s  (  0.004s elapsed)
  Total   time  839.985s  (847.842s elapsed)

  %GC     time       5.7%  (5.8% elapsed)

  Alloc rate    3,291,739,430 bytes per MUT second

  Productivity  94.3% of total user, 94.2% of total elapsed


real	14m42.573s
user	13m59.992s
sys	0m6.137s

Built with profiling:

  31,046,636,816 bytes allocated in the heap
   9,289,234,088 bytes copied during GC
     204,185,744 bytes maximum residency (55 sample(s))
         958,320 bytes maximum slop
             554 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     29674 colls,     0 par    3.982s   4.041s     0.0001s    0.0055s
  Gen  1        55 colls,     0 par    3.537s   3.785s     0.0688s    0.2521s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time   20.312s  ( 20.502s elapsed)
  GC      time    7.518s  (  7.827s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   27.830s  ( 28.330s elapsed)

  %GC     time      27.0%  (27.6% elapsed)

  Alloc rate    1,528,524,987 bytes per MUT second

  Productivity  73.0% of total user, 72.4% of total elapsed


real	0m28.442s
user	0m27.839s
sys	0m0.461s

comment:14 Changed 13 months ago by bgamari

Milestone: 8.6.1 → 8.8.1

It doesn't look like we will be able to address this in 8.6.

comment:15 Changed 12 months ago by osa1

Status: new → infoneeded

It's very hard (if not impossible) to fix this without a reproducer. Marking as "infoneeded".

comment:16 Changed 12 months ago by hth313

Would it work if I provided the object files and a link command line, so that you could relink it and try different runtimes?

comment:17 Changed 12 months ago by osa1

To be honest, I don't think that would be helpful. Normally for this I'd start by comparing the Core outputs of the program with and without profiling flags. I think it's very likely that this will be enough to see why the profiled version is faster.

By the way, looking at the previous comments I realized that we didn't check different optimisation settings. One of the reasons profiling builds are sometimes faster is that profiling can inhibit some optimisations (as it changes the Core of the program), and some optimisations make some programs slower. Could you try building your program without profiling and with different optimisation settings? Please report numbers for:

  • -O0
  • -O1 (default when using cabal)
  • -O2

If you're using cabal, add these to ghc-options in your .cabal file. Depending on the numbers, you can then try disabling individual optimisation passes (see the user manual or man page), etc.
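
A minimal way to switch levels, assuming a cabal project (the section and executable names here are placeholders, not from the reporter's build):

```cabal
-- .cabal file sketch: rebuild and re-run once per optimisation level
executable my-program
  main-is:     Main.hs
  ghc-options: -O0 -rtsopts
  -- then change -O0 to -O1, and finally to -O2
```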

comment:18 Changed 12 months ago by hth313

Here are the numbers at different optimization levels (no profiling enabled).

-O0

15,092,192,977,712 bytes allocated in the heap
 108,592,404,512 bytes copied during GC
     186,703,440 bytes maximum residency (1432 sample(s))
         709,040 bytes maximum slop
             451 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     13967958 colls,     0 par   301.484s  311.887s     0.0000s    0.0044s
  Gen  1      1432 colls,     0 par   100.350s  100.731s     0.0703s    0.2282s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time  2260.896s  (2283.741s elapsed)
  GC      time  401.834s  (412.618s elapsed)
  EXIT    time    0.000s  (  0.002s elapsed)
  Total   time  2662.730s  (2696.363s elapsed)

  %GC     time      15.1%  (15.3% elapsed)

  Alloc rate    6,675,314,555 bytes per MUT second

  Productivity  84.9% of total user, 84.7% of total elapsed


real	44m56.321s
user	44m22.736s
sys	0m31.989s

-O1

2,608,631,901,600 bytes allocated in the heap
   6,337,722,720 bytes copied during GC
      80,513,992 bytes maximum residency (61 sample(s))
         434,080 bytes maximum slop
             235 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1993189 colls,     0 par   43.959s  45.610s     0.0000s    0.0021s
  Gen  1        61 colls,     0 par    2.810s   2.938s     0.0482s    0.1433s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time  835.425s  (843.204s elapsed)
  GC      time   46.769s  ( 48.548s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time  882.194s  (891.755s elapsed)

  %GC     time       5.3%  (5.4% elapsed)

  Alloc rate    3,122,521,195 bytes per MUT second

  Productivity  94.7% of total user, 94.6% of total elapsed


real	14m51.787s
user	14m42.200s
sys	0m7.372s

-O2

2,608,768,497,856 bytes allocated in the heap
   6,313,467,672 bytes copied during GC
      80,449,120 bytes maximum residency (62 sample(s))
         487,840 bytes maximum slop
             235 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     1993104 colls,     0 par   47.643s  49.406s     0.0000s    0.0017s
  Gen  1        62 colls,     0 par    2.746s   2.880s     0.0465s    0.1499s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time  801.664s  (809.891s elapsed)
  GC      time   50.389s  ( 52.286s elapsed)
  EXIT    time    0.000s  (  0.009s elapsed)
  Total   time  852.053s  (862.189s elapsed)

  %GC     time       5.9%  (6.1% elapsed)

  Alloc rate    3,254,192,077 bytes per MUT second

  Productivity  94.1% of total user, 93.9% of total elapsed


real	14m22.223s
user	14m12.060s
sys	0m7.559s
Last edited 12 months ago by hth313 (previous) (diff)

comment:19 Changed 12 months ago by osa1

OK, so the profiling build is faster than any of these. Because we can't see the source, you'll have to debug this yourself. Here's what I'd do next:

  • Add -ddump-simpl -ddump-to-file -dsuppress-uniques to ghc-options in your .cabal file and build without profiling. Copy the generated .dump-simpl files to another directory, then build again, this time with profiling (make sure to use the same optimisation settings in both!), and copy the .dump-simpl files to yet another directory. Do a directory diff (perhaps using kdiff3) and look at the differences. You should see lots of minor changes (like the extra scc expressions in the profiled version); those should not matter, so focus on larger changes. Looking at the results in comment:13, you should see some code in the non-profiled version that allocates more. One example of where this happens is when GHC unboxes strict arguments/fields, but not across the whole program, so you still need the boxed version of the value. In that case you re-box the value at some point, causing more allocation. There are other reasons too; it's hard to list them all.
  • Looking at the numbers in comment:13, the non-profiled version actually has lower maximum residency (80 MB vs. 204 MB) but allocates roughly 80x more in total, so the difference is in allocation rather than residency. Keep this in mind when comparing the Core outputs.
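
The Core-diffing workflow from the first bullet might look roughly like this (a sketch assuming a cabal-based build with a dist-newstyle directory; the directory names are placeholders):

```shell
# Vanilla build, keeping Core dumps (-ddump-to-file writes *.dump-simpl)
cabal build --ghc-options="-O2 -ddump-simpl -ddump-to-file -dsuppress-uniques"
mkdir -p core-vanilla
find dist-newstyle -name '*.dump-simpl' -exec cp {} core-vanilla \;

# Profiling build at the *same* optimisation level
cabal build --enable-profiling \
  --ghc-options="-O2 -ddump-simpl -ddump-to-file -dsuppress-uniques"
mkdir -p core-prof
find dist-newstyle -name '*.dump-simpl' -exec cp {} core-prof \;

# Compare; ignore scc noise, look for structural differences
diff -ru core-vanilla core-prof | less   # or: kdiff3 core-vanilla core-prof
```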

Failing all that, try to extract a minimal reproducer from your code base and share it :-)
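
The re-boxing effect mentioned in the first bullet can be sketched in a few lines of Haskell (a hypothetical illustration, not taken from the reporter's code): under -O, GHC will typically unbox the strict Int field of T, but any context that needs the value boxed again, such as storing it in a Just, forces a fresh Int allocation.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

-- A strict field that GHC may unbox under -O (e.g. via -funbox-strict-fields
-- or worker/wrapper transformation of strict arguments).
data T = T !Int

-- A consumer that can work entirely on the unboxed representation:
-- no re-boxing is needed in the optimised code.
sumT :: [T] -> Int
sumT = go 0
  where
    go !acc []         = acc
    go !acc (T x : xs) = go (acc + x) xs

-- Here the unboxed field must become a boxed Int again before it can be
-- stored inside the Just, so a fresh Int is allocated: the "re-boxing".
boxedAgain :: T -> Maybe Int
boxedAgain (T x) = Just x

main :: IO ()
main = print (sumT [T 1, T 2, T 3], boxedAgain (T 10))  -- prints (6,Just 10)
```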

comment:20 Changed 8 months ago by osa1

Milestone: 8.8.1 → 8.10.1

Bumping milestones of some high and highest priority tickets.
