Opened 9 years ago

Closed 9 years ago

#4951 closed bug (fixed)

Performance regression 7.0.1 -> 7.0.1.20110201

Reported by: simonmar Owned by: simonpj
Priority: highest Milestone: 7.0.2
Component: Compiler Version: 7.0.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Submitted by Kazu Yamamoto on glasgow-haskell-users.

This program

import System.IO

n :: Int
n = 10000

main :: IO ()
main = withFile "/dev/null" WriteMode $ \h ->
    hPutStr h . foldr1 (++) . replicate n . replicate n $ 'a'

runs much more slowly with the current 7.0 branch than with either 7.0.1 or HEAD.

He also reported that the program runs faster interpreted with runghc than when compiled. I think this is due to #917 (full-laziness), so that's a separate issue. THe regression reported in this ticket is in compiled performance relative to 7.0.1 and HEAD.

I don't see any obvious differences in the simplified output, so the difference could be in library code. I'm using libraries compiled with -O2 in all cases.

Change History (12)

comment:1 Changed 9 years ago by simonmar

Status: newmerge

The difference turned out to be in the inner loop of hPutStr, nothing to do with fusion as we first thought.

Simon and I have verified that merging

Tue Dec 21 16:58:00 GMT 2010  simonpj@microsoft.com
  * Add a simple arity analyser

from HEAD into stable (it pulls in 2 dependencies) fixes the problem.

I'll do a nofib run with this change and compare against 7.0.1.

comment:2 Changed 9 years ago by simonmar

Status: mergenew

Don't merge yet - the patch causes a Core Lint failure in haskeline. We're investigating.

comment:3 Changed 9 years ago by simonmar

Owner: set to simonpj

comment:4 Changed 9 years ago by simonmar

nofib run today, comparing 7.0.1 and 7.0.1.20110210:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed
--------------------------------------------------------------------------------
            Min          +0.4%     -4.4%     -2.9%     -2.9%
            Max          +7.9%   +110.8%    +65.9%    +66.7%
 Geometric Mean          +3.1%    +10.3%     +3.6%     +3.5%

clearly this is quite bad, we shouldn't roll 7.0.2 yet.

comment:5 Changed 9 years ago by simonmar

And for your enjoyment, here's 7.0.1 vs. today's HEAD (20110210), with a few of the bad ones singled out:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed
--------------------------------------------------------------------------------
          power          +4.6%   +112.1%    +67.6%    +67.8%
         exp3_8          +2.6%    +53.7%    +40.9%    +40.9%
         primes          +2.6%    +64.6%      0.11      0.11
--------------------------------------------------------------------------------
            Min          -4.4%    -95.0%    -27.7%    -27.7%
            Max          +6.8%   +112.1%    +67.6%    +67.8%
 Geometric Mean          +2.1%     -3.4%     -0.7%     -0.9%

on the whole the HEAD is improved over 7.0.1 (and 7.0.1 was generally better than 6.12.3), but we ought to find out what's going on with exp3_8 and primes, both of which are in the imaginary set.

comment:6 Changed 9 years ago by simonmar

I pushed the following patches:

Wed Oct 27 11:56:30 PDT 2010  simonpj@microsoft.com
  * Yet another go at CoreArity
  
Fri Sep 24 08:56:00 PDT 2010  simonpj@microsoft.com
  * Just moving comments around

Tue Dec 21 08:58:00 PST 2010  simonpj@microsoft.com
  * Add a simple arity analyser
  
Tue Feb 15 04:02:28 PST 2011  simonpj@microsoft.com
  * Make exprIsCheap/exprIsExpandable understand about casts

And performance is much better, though there are still some regressions relative to 7.0.1:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed
--------------------------------------------------------------------------------
        circsim        +284.1%     -1.1%    +10.9%    +10.9%
           para        +296.2%    +25.5%    +35.3%    +35.3%
          power        +282.3%   +114.1%    +76.8%    +76.8%
         simple        +202.2%    +66.6%    +28.6%    +28.2%
           x2n1        +322.6%   +145.2%      0.07      0.07
         sphere        +265.7%    +40.1%      0.18      0.18
--------------------------------------------------------------------------------
            Min        +133.1%     -9.6%     -2.9%     -2.9%
            Max        +475.2%   +145.2%    +76.8%    +76.8%
 Geometric Mean        +309.6%     +3.0%     +4.5%     +4.5%

comment:7 Changed 9 years ago by simonmar

See also #4965

comment:8 Changed 9 years ago by simonpj

I'm very puzzled. I've been looking at imaginary/primes, a very simple benchmark. With GHC 6.12 -O I get

bash$ ./primes-612 4000 +RTS  -s 
37831
     457,709,320 bytes allocated in the heap
      82,966,936 bytes copied during GC
         378,968 bytes maximum residency (29 sample(s))
          89,944 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   844 collections,     0 parallel,  0.99s,  0.96s elapsed
  Generation 1:    29 collections,     0 parallel,  0.05s,  0.06s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.67s  (  0.69s elapsed)
  GC    time    1.04s  (  1.03s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.71s  (  1.72s elapsed)

With HEAD I get

bash$ ./primes 4000 +RTS -rprimes.ticky -s
37831
     718,383,496 bytes allocated in the heap
      79,051,592 bytes copied during GC
         357,648 bytes maximum residency (25 sample(s))
          88,384 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   848 collections,     0 parallel,  0.95s,  0.95s elapsed
  Generation 1:    25 collections,     0 parallel,  0.05s,  0.05s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.43s  (  1.43s elapsed)
  GC    time    1.00s  (  1.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.43s  (  2.43s elapsed)

Note the massive increase in allocation, and in mutator execution time.

BUT when I look at the -ddump-simpl code I see virtually the same code. (For HEAD I also used -funfolding-use-threshold=9 to make mod inline; the HEAD seems a tiny bit less keen to inline. With this flag the two are close to identical.)

Moreover, I compiled both with -ticky (including all the libraries). The allocation word counts from -ticky are practically the same for the two programs!

So I'm stumped. Somewhere a lot of time and allocation is happening, but -ticky isn't seeing it.

It's a really small program and (by the time we've done inlining) almost all the code is in Main (though it still calls GHC.List.filter).

So where is that time and allocation going? Somewhere in the RTS?

Simon

comment:9 Changed 9 years ago by simonmar

On my phone so apologies for the terse response. The accounting in head is slightly different. For now I think we should concentrate on the regressions between 7.0.1 and 7.0.2, worry about head regressions later.

comment:10 Changed 9 years ago by simonpj

Just adding Ian's comments: Hmm, profiling also doesn't see a difference in allocations:

-   total time  =        1.80 secs   (90 ticks @ 20 ms)
-   total alloc = 585,367,648 bytes  (excludes profiling overheads)
+   total time  =        1.96 secs   (98 ticks @ 20 ms)
+   total alloc = 585,342,752 bytes  (excludes profiling overheads)

The +RTS -Dg output is different. Here's the end of the 6.12.3 output:

-----------------------------------------------------------------
  Gen     Max  Mut-list  Step   Blocks    Large     Live     Slop
       Blocks     Bytes                 Objects
-----------------------------------------------------------------
    0     256         0     0        0        0        0        0
                            1        2        4    26272    22880
    1     256         0     0        1        0     1376     2720
-----------------------------------------------------------------
                                                   27648    25600
-----------------------------------------------------------------

and of the 7.x output:

----------------------------------------------------------
  Gen     Max  Mut-list  Blocks    Large     Live     Slop
       Blocks     Bytes          Objects
----------------------------------------------------------
    0     256         0      12        4    34064    15088
    1     256         3     260        1  1041232    23728
----------------------------------------------------------
                                          1075296    38816
----------------------------------------------------------

More live data sounds bad, but it's not obvious to me the two are necessarily linked.

I'm not convinced everything is being accounted for by ticky. For example, sm/Storage.c line 572 does TICK_ALLOC_HEAP_NOCTR(n);, but I can't see how the large object case immediately above is accounted for. Likewise sm/Storage.c line 652.

There's also a load of stuff commented out in Ticky.c, headed by:

/* krc: comment out some of this stuff temporarily */

I think it would be good to have a less fragile way of doing the accounting, both for ticky and CCS, e.g. a wrapper for allocGroup that took the counters to attribute the allocation to.

comment:11 Changed 9 years ago by simonpj

I believe I have cured these regressions between 7.0.1 and the 7.0 branch (will be 7.0.2). Specifically, in the compiler:

Thu Feb 24 09:47:44 PST 2011  simonpj@microsoft.com
  * Re-enable SPECIALISE INSTANCE pragmas
  
  This mirror a similar patch in HEAD. Lacking it was responsible
  for signficant performance regressions in numeric operations on
  Rationals.

    M ./compiler/typecheck/TcInstDcls.lhs -2 +2

This was a good catch, because it was wrong in HEAD too.

The others are just patches in the libraries that hadn't moved to the branch. In base

Wed Nov 17 02:05:10 PST 2010  simonpj@microsoft.com
  * Make (^) and (^^) INLINABLE
  
  This makes them perform well robustly
  (e.g. in test perf/should_run/MethSharing)
  rather than relying on a rather delicate
  let-floating.  See Note [Inlining (^) in Real.lhs

    M ./GHC/Real.lhs +13

and in ghc-prim:

Tue Oct 26 01:29:55 PDT 2010  simonpj@microsoft.com
  * Add a rewrite rule for toInt# so literals work right
  
  See the comments with toInt#, but the key point is
  that we want (fromInteger 1)::Int to yield (I# 1)!

    M ./GHC/Integer.lhs +6

Result seems good

            Min         +10.2%    -92.8%    -32.4%    -32.2%
            Max         +12.6%    +23.4%     +4.2%     +4.2%
 Geometric Mean         +11.8%     -3.5%     -5.7%     -5.6%

The +24% is in sphere, and is caused by #4979; it's a code issue not an optimisation one.

Generally perf is signficantly up, which is good.

Ian can you confirm. No patches to push -- I have done that already.

Simon

comment:12 Changed 9 years ago by igloo

Resolution: fixed
Status: newclosed

Looks good, thanks!

Note: See TracTickets for help on using tickets.