Opened 8 years ago

Closed 7 years ago

#5888 closed bug (fixed)

Performance regression in 7.4.1 compared to 6.12.3

Reported by: nickie Owned by: igloo
Priority: highest Milestone: 7.6.1
Component: Compiler Version: 7.4.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Unless I'm doing something really stupid, there seems to be a significant slowdown for some programs between versions 6 and 7.

The bottom line is that the attached program runs in 6.79 sec. with 7.4.1 versus 4.72 sec. in 6.12.3. The script is meant to reproduce the problem in any x86 or x86_64 machine running linux.

The test machine was an Intel Xeon E7340 2.40GHz, 4x4 cores, GNU/Linux 2.6.32-5 (x86_64), but it can be reproduced in other machines too (i386) with similar results.

Checking GHC versions
The Glorious Glasgow Haskell Compilation System, version 6.12.3
The Glorious Glasgow Haskell Compilation System, version 7.4.1

Compiling with GHC 6.12.3
/var/tmp/nickie/ghc-bench/usr/bin/ghc-6.12.3 -o ntak6 --make ntak.hs -O2 -optc-O3
[1 of 1] Compiling Main             ( ntak.hs, ntak.o )
Linking ntak6 ...

Compiling with GHC 7.4.1
/var/tmp/nickie/ghc-bench/usr/bin/ghc-7.4.1 -o ntak7 --make ntak.hs -O2 -optc-O3 -rtsopts=all
[1 of 1] Compiling Main             ( ntak.hs, ntak.o )
Linking ntak7 ...

Running with GHC 6.12.3
./ntak6 +RTS -sstderr 
9, hs time = 4.748408 sec
   4,738,681,488 bytes allocated in the heap
      11,586,848 bytes copied during GC
          33,456 bytes maximum residency (1 sample(s))
         321,592 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  9038 collections,     0 parallel,  0.06s,  0.07s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.66s  (  4.68s elapsed)
  GC    time    0.06s  (  0.07s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.73s  (  4.75s elapsed)

  %GC time       1.3%  (1.5% elapsed)

  Alloc rate    1,015,078,645 bytes per MUT second

  Productivity  98.6% of total user, 98.2% of total elapsed

4.72 user 0.02 system 0:04.75 elapsed

Running with GHC 7.4.1
9, hs time = 6.82757 sec
   5,529,772,976 bytes allocated in the heap
      13,357,408 bytes copied during GC
          62,904 bytes maximum residency (2 sample(s))
          23,080 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10583 colls,     0 par    0.07s    0.08s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    6.72s  (  6.75s elapsed)
  GC      time    0.07s  (  0.08s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    6.79s  (  6.83s elapsed)

  %GC     time       1.1%  (1.2% elapsed)

  Alloc rate    822,831,578 bytes per MUT second

  Productivity  98.9% of total user, 98.4% of total elapsed

6.79 user 0.04 system 0:06.82 elapsed

Attachments (1)

ghc-bench.tar.gz (1.1 KB) - added by nickie 8 years ago.
Program and script to reproduce the problem

Download all attachments as: .zip

Change History (7)

Changed 8 years ago by nickie

Attachment: ghc-bench.tar.gz added

Program and script to reproduce the problem

comment:1 Changed 8 years ago by sanketr

Have you measured the actual system time? This might be related to #5749. MUT time was inaccurate in 6.12.3, and was fixed in 7.0. It didn't include the time spent in kernel, in 6.12.3.

comment:2 Changed 8 years ago by nickie

I was not aware of this, but it's not the issue here. The script I've attached also measures user/system/elapsed time using /usr/bin/time and it shows a clear difference:

Running with GHC 6.12.3
4.72 user 0.02 system 0:04.75 elapsed

Running with GHC 7.4.1
6.79 user 0.04 system 0:06.82 elapsed

comment:3 Changed 8 years ago by simonmar

difficulty: Unknown
Milestone: 7.4.2
Priority: normalhigh

This is more fallout from the change to the representation of Integer literals in 7.4.1. The important bit of code looks like this:

shuffle h x y z n =
   if n `mod` 3 == 0 then
      ntak shuffle h (n+3) (x-1) y z
   else if n `mod` 3 == 1 then
      ntak shuffle h (n+2) (y-1) z x
   else
      ntak shuffle h (n+1) (z-1) x y

In GHC 7.0.3, n `mod` 3 == 1 was compiled to

        case ww2_a1es of _ {
          GHC.Integer.Type.S# i_dIx ->
            case i_dIx of _ { __DEFAULT -> $w$j3_s1gG; 1 -> $w$j2_s1gI };
          GHC.Integer.Type.J# s_dIK d_dIL ->
            case {__pkg_ccall_GC integer-gmp integer_cmm_cmpIntegerIntzh ...

Note how equality has been inlined, and the box around the constant 1 has been optimised away. In 7.4.1 we get

        case GHC.Integer.Type.eqInteger ww5_X11T lvl1_r2i7 of _ {
          GHC.Types.False ->
            poly_ntak_r2ia
              @ GHC.Integer.Type.Integer
              shuffle_r2ic
              h_akF
              (GHC.Integer.Type.plusInteger n_akJ lvl1_r2i7)
              (GHC.Integer.Type.minusInteger z_akI lvl1_r2i7)
              x_akG
              y_akH;
          GHC.Types.True ->

equality has not been inlined, and the constant 1 is a top-level S# (lvl1_r2i7).

We could fix this by defining something like

eqIntegerInt# :: Integer -> Int# -> Bool

and having builtin RULES for equality where one argument is a constant, but then the tricky bit there is how to get hold of an Id for eqIntegerInt#. Ideas anyone?

comment:4 Changed 8 years ago by simonpj

Notes following a brief discussion with Simon M

  • It is probably a Good Plan to make the simplifier monad carry the global symbol table, just as the desugarer monad does, so that rules can readily grab these Ids. That would be generally useful.
  • Maybe it's worth having Integer/Int versions of all the Integer operations. Or not. It's a big surprise to me that the cost of unboxing that S# is 20% of execution time; and also allocation is higher and I don't understand why.

comment:5 Changed 7 years ago by simonpj

Milestone: 7.4.27.6.1
Owner: set to igloo
Priority: highhighest

Ian will do this for 7.6.1.

  • CorePrep: initialise the CorePrepEnv with a single (!) binding for mkIntegegerId or whatever it is.
  • Remove Id from LitInteger.
  • The ru_try field of BuiltInRule gets a new field Name -> Id, where we can look up known-key things (only!).
  • Then add rules for Integer/Int equality etc. Use commutativity to avoid exploding the number of primops.

comment:6 Changed 7 years ago by igloo

Resolution: fixed
Status: newclosed

The problem was that mod was calling divModInteger, which returns a pair. As divModInteger wasn't being inlined any more, we were always allocating the pair, and then throwing away the unwanted value.

I've now added divInteger and modInteger functions.

Note: See TracTickets for help on using tickets.