Opened 8 years ago

Closed 7 years ago

#5505 closed bug (fixed)

Program runs faster with profiling than without

Reported by: simonpj Owned by:
Priority: high Milestone: 7.4.1
Component: Compiler Version: 7.6.1
Keywords: Cc: doaitse@…, scpmw@…, anton.nik@…, bgamari@…, tomberek@…
Operating System: Windows Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

This program, reported by Doaitse, runs nearly twice as fast when it is profiled than when it isn't. You need to install the Utrecht library uulib:

cabal install -p uulib

Now compile and run UULib.hs with and without -prof -auto-all. (It makes little difference whether or not you use -O.) When you run, here are the stats for GHC 7.0.3:

----------------  NOT PROFILED ----------------
bash-3.1$ ./uulib +RTS -s
c:\tmp\uulib.exe +RTS -s 
138000
      51,077,556 bytes allocated in the heap
      19,153,500 bytes copied during GC
       4,914,980 bytes maximum residency (5 sample(s))
         519,828 bytes maximum slop
              12 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    93 collections,     0 parallel,  4.71s,  4.70s elapsed
  Generation 1:     5 collections,     0 parallel,  2.73s,  2.76s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.05s  (  0.11s elapsed)
  GC    time    7.44s  (  7.45s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.49s  (  7.56s elapsed)

  %GC time      99.4%  (98.6% elapsed)

  Alloc rate    1,091,377,449 bytes per MUT second

  Productivity   0.6% of total user, 0.6% of total elapsed


------------- PROFILED ------------------------
bash-3.1$ ./uulib +RTS -s
c:\tmp\uulib.exe +RTS -s 
138000
      79,692,548 bytes allocated in the heap
      31,874,676 bytes copied during GC
       7,774,720 bytes maximum residency (5 sample(s))
         257,912 bytes maximum slop
              19 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   147 collections,     0 parallel,  2.65s,  2.66s elapsed
  Generation 1:     5 collections,     0 parallel,  1.33s,  1.34s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.09s  (  0.16s elapsed)
  GC    time    3.98s  (  3.99s elapsed)
  RP    time    0.00s  (  0.00s elapsed)
  PROF  time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.07s  (  4.15s elapsed)

  %GC time      97.7%  (96.1% elapsed)

  Alloc rate    851,407,014 bytes per MUT second

  Productivity   2.3% of total user, 2.3% of total elapsed

The profiled one:

  • Allocates more
  • But runs nearly twice as fast

Looking at the stats, though, you can see that in both cases 99% of execution time spent in the garbage collector. I don't think I've ever seen a program spend that much time in GC.

So I don't really know what's going on, but it's clearly a rather amazing program.

Attachments (2)

UULib.hs (642 bytes) - added by simonpj 8 years ago.
Main.hs (1.7 KB) - added by tomberek 7 years ago.

Download all attachments as: .zip

Change History (17)

Changed 8 years ago by simonpj

Attachment: UULib.hs added

comment:1 Changed 8 years ago by scpmw

It also seems to make no difference whether you use -auto-all or not, ruling out the annotations as the villain. This is most definitely some sort of garbage collection issue. ThreadScope shows extremely long garbage collection phases - over 0.5 seconds each towards the end.

When I turn my profiling code to look at the garbage collection phases, it gives me >98% time spent directly in evacuate, for whatever that's worth. If only I had a per-line profile now... :-)

comment:2 Changed 8 years ago by simonmar

Milestone: 7.4.1
Owner: set to simonmar
Priority: normalhigh

comment:3 Changed 8 years ago by simonmar

Fascinating bug. I'm going to have to sleep on this and see if a fix occurs to me.

For now, you can workaround it by adding +RTS -Z:

$ ./UULib +RTS -s -Z                
./UULib +RTS -s -Z 
138000
     110,831,992 bytes allocated in the heap
      51,956,344 bytes copied during GC
      17,880,048 bytes maximum residency (6 sample(s))
       3,212,328 bytes maximum slop
              34 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   190 collections,     0 parallel,  0.23s,  0.24s elapsed
  Generation 1:     6 collections,     0 parallel,  0.07s,  0.07s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.11s  (  0.11s elapsed)
  GC    time    0.30s  (  0.30s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.41s  (  0.41s elapsed)

Or alternatively avoid using unsafePerformIO (I presume you're using that somewhere in the uulib package?), or change it to unsafeDupablePerformIO.

comment:4 Changed 8 years ago by lelf

Cc: anton.nik@… added

comment:5 Changed 8 years ago by bgamari

Cc: bgamari@… added

My GGen (https://github.com/bgamari/GGen) program also exhibits this behavior.

               Memory usage       Runtime
Profiling         20MB              14.15s
No Profiling      13MB              27.86s

I'm currently looking at GGen commit 49dbdd994291e8ef75d0119fa28f84164c5fff77 with the input file found at http://www.thingiverse.com/download:31703. Pass the input file name on the command line.

comment:6 Changed 8 years ago by bgamari

Here are -sstderr logs for both the profiling and non-profiling cases. Note that in my case relatively little time is spent in GC,

With profiling

../../GGen/dist/build/GGen/GGen z_endstop_clamp_and_mount.stl +RTS -sstderr 
  10,355,015,584 bytes allocated in the heap
     196,136,736 bytes copied during GC
       7,760,328 bytes maximum residency (27 sample(s))
         479,656 bytes maximum slop
              20 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 19768 collections,     0 parallel,  0.44s,  0.47s elapsed
  Generation 1:    27 collections,     0 parallel,  0.15s,  0.16s elapsed

  INIT  time    0.01s  (  0.00s elapsed)
  MUT   time   13.56s  ( 15.97s elapsed)
  GC    time    0.59s  (  0.64s elapsed)
  RP    time    0.00s  (  0.00s elapsed)
  PROF  time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   14.15s  ( 16.60s elapsed)

  %GC time       4.2%  (3.8% elapsed)

  Alloc rate    763,501,989 bytes per MUT second

  Productivity  95.8% of total user, 81.6% of total elapsed

Without profiling

../../GGen/dist/build/GGen/GGen z_endstop_clamp_and_mount.stl +RTS -sstderr 
  18,234,515,824 bytes allocated in the heap
     128,446,200 bytes copied during GC
       5,211,504 bytes maximum residency (23 sample(s))
         325,792 bytes maximum slop
              13 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 34960 collections,     0 parallel,  0.62s,  0.71s elapsed
  Generation 1:    23 collections,     0 parallel,  0.10s,  0.10s elapsed

  INIT  time    0.01s  (  0.00s elapsed)
  MUT   time   27.14s  ( 31.47s elapsed)
  GC    time    0.72s  (  0.81s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   27.86s  ( 32.27s elapsed)

  %GC time       2.6%  (2.5% elapsed)

  Alloc rate    671,650,697 bytes per MUT second

  Productivity  97.4% of total user, 84.1% of total elapsed

comment:7 Changed 8 years ago by bgamari

Interestingly enough, adding some SPECIALIZE pragmas to a few hotspots (commit b08b2b3680548f47c71a90db20d8b1310392fec9) seems to fix this behavior (runtime of 10s with profiling, 5s without).

comment:8 in reply to:  7 Changed 8 years ago by simonmar

Replying to bgamari:

Interestingly enough, adding some SPECIALIZE pragmas to a few hotspots (commit b08b2b3680548f47c71a90db20d8b1310392fec9) seems to fix this behavior (runtime of 10s with profiling, 5s without).

I strongly suspect your bug is different to the original one reported in this ticket. Could you open a new ticket please? Also, I looked at your example, but it has rather a lot of dependencies (cairo, attoparsec, vector-spaces etc.) and not all of them work with GHC HEAD. Is it possible to make a more self-contained test case?

comment:9 Changed 8 years ago by bgamari

Given I have pretty much no idea what causes the behavior to begin with, I wouldn't know where to start in constructing a simple test case. Moreover, with the addition of the mentioned pragmas, the bug has effectively vanished.

comment:10 Changed 8 years ago by marlowsd@…

commit b04cc4c7b192a50885871d8ec71c25b6b6475d67

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Tue Nov 15 11:38:23 2011 +0000

    Avoid generating chains of indirections in stack squeezing (#5505)

 rts/ThreadPaused.c |  133 ++++++++++++++++++++++++++++-----------------------
 1 files changed, 73 insertions(+), 60 deletions(-)

comment:11 Changed 8 years ago by simonmar

Resolution: fixed
Status: newclosed

Results for the original UULib test program after the above patch:

> ./UULib +RTS -s
138000
     101,854,752 bytes allocated in the heap
      38,063,512 bytes copied during GC
       9,955,944 bytes maximum residency (5 sample(s))
       2,221,040 bytes maximum slop
              24 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       191 colls,     0 par    0.04s    0.04s     0.0002s    0.0011s
  Gen  1         5 colls,     0 par    0.05s    0.05s     0.0098s    0.0254s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.10s  (  0.10s elapsed)
  GC      time    0.09s  (  0.09s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.20s  (  0.19s elapsed)

  %GC     time      45.5%  (46.1% elapsed)

  Alloc rate    952,278,461 bytes per MUT second

  Productivity  54.3% of total user, 54.9% of total elapsed

comment:12 Changed 8 years ago by marlowsd@…

commit 9562f18769b18cd44290d14628dd8d9a45e7d898

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Tue Nov 15 15:44:53 2011 +0000

    further fixes to the #5505 fix.

 rts/ThreadPaused.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

comment:13 Changed 7 years ago by tomberek

Cc: tomberek@… added
Operating System: Unknown/MultipleWindows
Owner: simonmar deleted
Resolution: fixed
Status: closednew
Type of failure: None/UnknownRuntime performance bug
Version: 7.0.37.6.1

The Main.hs program (compiled with or without -O2) processes a StateT monad over ST running around a circular list. When run with profiling it is around 35% faster than without.

It has slightly different GC behavior, but it doesn't look significant enough to cause such a large difference.

Changed 7 years ago by tomberek

Attachment: Main.hs added

comment:14 Changed 7 years ago by tomberek

./Main.exe +RTS -s
222
CPU time:   1.87s
         172,636 bytes allocated in the heap
             980 bytes copied during GC
          41,860 bytes maximum residency (1 sample(s))
          19,580 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.87s  (  1.86s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.87s  (  1.86s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    92,219 bytes per MUT second

  Productivity 100.0% of total user, 100.6% of total elapsed


./Main.exe +RTS -s -h
222
CPU time:   1.39s
         172,892 bytes allocated in the heap
         271,180 bytes copied during GC
          78,104 bytes maximum residency (14 sample(s))
          20,200 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1        14 colls,     0 par    0.00s    0.00s     0.0002s    0.0003s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.39s  (  1.37s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.39s  (  1.38s elapsed)

  %GC     time       0.0%  (0.2% elapsed)

  Alloc rate    124,525 bytes per MUT second

  Productivity 100.0% of total user, 100.7% of total elapsed

comment:15 Changed 7 years ago by simonmar

difficulty: Unknown
Resolution: fixed
Status: newclosed

I've moved this to a new ticket, see #7429.

Note: See TracTickets for help on using tickets.