Opened 12 years ago

Closed 12 years ago

#2252 closed bug (invalid)

Extreme performance degradation on minor code change

Reported by: simona Owned by:
Priority: normal Milestone:
Component: Compiler Version: 6.8.2
Keywords: Cc:
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

The attached code runs two similar calculations (call them a and b) 1000 times. If I calculate a, then b, and repeat that pair 1000 times, the code runs in:

time ./tests/BadPerform 

real    0m40.429s
user    0m40.360s
sys     0m0.054s

If I run a 1000 times, then b 1000 times, the code runs in:

time ./tests/GoodPerform 

real    0m0.083s
user    0m0.081s
sys     0m0.001s

These are the times for ghc 6.6.1. For ghc 6.8.2, the first time reduces to 34s and the second time is identical.

Reproducing this bug is a bit involved: My program links to glpk which in turn uses gmp. Thus, you would have to:

  • install GLPK somewhere
  • extract the attached sparsePolyhedra-0.1.0 cabal package
  • create a local copy of libgmp and libglpk in which all symbols containing gmp are renamed to mygmp
    • cd to sparsePolyhedra-0.1.0
    • in genLibs.sh change the paths containing /users/absint/simona/local/lib to where you installed GMP and GLPK
    • run genLibs.sh which will create local copies of
  • in sparsePolyhedra.cabal change the path /users/absint/simona/current/source/sparseDomain to the current directory
  • build and register the cabal package
  • compile the attached BadPerform.hs and GoodPerform.hs

I guess this is quite complicated. I could send you the intermediate Core of these two files instead, if that helps.

Attachments (9)

sparsePolyhedra-0.1.0.tar.gz (33.3 KB) - added by simona 12 years ago.
The cabal package
BadPerform.2.hs (1.3 KB) - added by simona 12 years ago.
simplifier output for BadPerform.hs
badSimplOpt.txt (14.4 KB) - added by simona 12 years ago.
simplifier output for BadPerform.hs (with -O)
goodSimplOpt.txt (15.3 KB) - added by simona 12 years ago.
simplifier output for GoodPerform.hs (with -O)
badSimpl.txt (15.7 KB) - added by simona 12 years ago.
simplifier output for GoodPerform.hs
goodSimpl.txt (14.9 KB) - added by simona 12 years ago.
simplifier output for GoodPerform.hs
BadPerform.hs (1.4 KB) - added by simona 12 years ago.
The file where ghc optimises in a funny way.
GoodPerform.hs (1.4 KB) - added by simona 12 years ago.
The performance how it should be.
Fake.hs (1.5 KB) - added by simona 12 years ago.
A replacement for the big cabal library.

Download all attachments as: .zip

Change History (17)

Changed 12 years ago by simona

The cabal package

comment:1 Changed 12 years ago by simonpj

difficulty: Unknown

That is odd, I grant you.

However your code repeats the same computation 1000 times. That computation is in the I/O monad, but it does no I/O (or at least only outside the loop). Hence a clever optimiser might spot that it must give the same result every time, and compute that result just once.

I bet that is what is happening here. The "clever optimiser" may be somehow put off the scent by the minor change in structure of your program -- without more info I couldn't say exactly why.

If I'm right, then it's arguably bad that a minor change could make such a big performance difference. But repeating an identical computation 1000 times is equally arguably not something that a compiler should be robustly optimising for.

Of course I could be on the wrong track. Try -ddump-simpl and attach the output from that. You might like to look in that output for top-level definitions that look like lvl_*; these are the ones floated out of loops. Also try -fno-full-laziness to switch off the full-laziness transformation, and see if that makes GoodPerform behave the same as BadPerform.

Simon
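The sharing Simon describes can be sketched with a minimal, hypothetical example (the names `expensive` and the loop body below are illustrative, not the ticket's code): a pure thunk inside a repeated IO action is built once and therefore evaluated at most once, however many times the action runs.

```haskell
import Control.Monad (replicateM_)

-- Loop-invariant, effect-free work: a candidate for being
-- computed just once, as comment:1 describes.
expensive :: Integer
expensive = sum [i * j | i <- [1 .. 1000], j <- [1 .. 1000]]

main :: IO ()
main = do
  -- Syntactically the work is "redone" 1000 times, but the shared
  -- thunk `expensive` is forced on the first iteration only.
  replicateM_ 1000 (expensive `seq` return ())
  print expensive
```

A change in program structure that moves such an expression under a lambda (so it is rebuilt per call) is exactly the kind of "minor change" that could produce the observed difference.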

comment:2 Changed 12 years ago by simona

Gosh, I hope this isn't happening, since I used the timing results of the good code in a paper.

Adding -fno-full-laziness to the compilation of GoodPerform.hs changes nothing (thankfully). Note also that the good performer is less likely to benefit from the full-laziness optimization, since it first runs one loop 1000 times, then another loop 1000 times. The bad one has more potential for this optimization, since it runs two nearly identical calculations 1000 times. It might be, of course, that the loops in the good code are simple enough for optimizations to kick in, whereas they are too big in the bad case. However, I have run other, more complex loops that also perform well. I'm still assuming that it is the bad code that triggers some bad behavior.

There is one side-effect in each loop iteration which tests some invariant on the result. There are quite a few computations that call the external library but which are pure, so I've wrapped them in unsafePerformIO. I don't know if that matters.

Also, I should mention that both programs run in constant space (around 20MB virtual space), hence it's not a space leak that keeps back the tighter loop in BadPerform.hs. I compiled the code without any -O flags and I attach the two simplifier outputs. There are no lvl_ strings in either file.

comment:3 Changed 12 years ago by simonpj

Thanks. More info: what command line arguments are you using when compiling your program? In your -ddump-simpl I see invocations of GHC.Base.$, which suggests you aren't using -O. That never occurred to me. Do you get the same performance differences with -O?

Of course, you still should not get these huge differences even without -O, but I'm interested because it'll give clues about where the problem is.

The Core code looks ok to me, so I'm still very puzzled about why you would see big perf differences. I wonder if you can make this easier to reproduce, and at the same time give more clues. For example, suppose you trim out most of the functions the fixpoint calls, and leave just some reasonably-expensive thing (some random quadratically-expensive algorithm, say). Does the effect still happen? Or is it somehow connected to the code that is called (meetAsym, equalityToInequalities, etc)? And if so, can you narrow down which code is the culprit, by stubbing out other parts?

Simon

comment:4 Changed 12 years ago by simona

I've simplified the programs. This is what they do now:

aconit:~/current/source/sparseDomain:1821$ rm -f tests/BadPerform.o && ghc-6.8.2 --make tests/BadPerform.hs -ddump-simpl > badSimpl.txt && time tests/BadPerform
[1 of 1] Compiling Main             ( tests/BadPerform.hs, tests/BadPerform.o )
Linking tests/BadPerform ...

real    0m9.585s
user    0m9.551s
sys     0m0.033s
aconit:~/current/source/sparseDomain:1822$ rm -f tests/GoodPerform.o && ghc-6.8.2 --make tests/GoodPerform.hs -ddump-simpl > goodSimpl.txt && time tests/GoodPerform
[1 of 1] Compiling Main             ( tests/GoodPerform.hs, tests/GoodPerform.o )
Linking tests/GoodPerform ...

real    0m0.015s
user    0m0.012s
sys     0m0.002s

There are some fishy, non-improving optimizations going on. As you suggested, I recompiled with optimizations:

aconit:~/current/source/sparseDomain:1825$ rm -f tests/BadPerform.o && ghc-6.8.2 --make tests/BadPerform.hs -O -ddump-simpl > badSimplOpt.txt && time tests/BadPerform
[1 of 1] Compiling Main             ( tests/BadPerform.hs, tests/BadPerform.o )
Linking tests/BadPerform ...

real    0m19.101s
user    0m19.044s
sys     0m0.054s
aconit:~/current/source/sparseDomain:1826$ rm -f tests/GoodPerform.o && ghc-6.8.2 --make tests/GoodPerform.hs -O -ddump-simpl > goodSimplOpt.txt && time tests/GoodPerform
[1 of 1] Compiling Main             ( tests/GoodPerform.hs, tests/GoodPerform.o )
Linking tests/GoodPerform ...

real    0m18.974s
user    0m18.914s
sys     0m0.056s

I have also tried to radically simplify the fixpoint function, which then returns immediately. It is so simple that I could build a cabal package that doesn't link to the simplex solver, and which would therefore be easy for you to build. However, I'm not sure you could still see the difference: repeating the good case one million times takes 0.4 seconds; repeating the bad case one million times (which does twice the amount of work in each iteration) takes 1.4 seconds. Repeating the good case two million times takes 0.8 seconds, as expected.

I hope the attached simplifier outputs are informative enough. Note that I've updated the source files too!

Changed 12 years ago by simona

Attachment: BadPerform.2.hs added

simplifier output for BadPerform.hs

Changed 12 years ago by simona

Attachment: badSimplOpt.txt added

simplifier output for BadPerform.hs (with -O)

Changed 12 years ago by simona

Attachment: goodSimplOpt.txt added

simplifier output for GoodPerform.hs (with -O)

comment:5 Changed 12 years ago by simonpj

Crumbs! So switching on -O makes GoodPerform run slowly too!

Can you try with -fno-state-hack?

How fast do you expect this to run? For example, if you do just one of the fixpoints 1000 times, is that fast? Is it supposed to be fast? I'm trying to ask whether GoodPerform is going bizarrely fast or BadPerform is going bizarrely slowly.

Well done cutting down fixpoint. It'd be a huge help if you could do so some more, so that GLPK, gmp etc. were not needed. If it runs too fast, just do something expensive, like minimum [i+j | i <- [1..n], j <- [1..n]].

Simon

comment:6 Changed 12 years ago by simona

I think (and hope) that BadPerform is going bizarrely slowly. BadPerform runs the same loop twice and should thus be only twice as slow as GoodPerform, not 800 times as slow. I ran more complex examples a million times and got something in the range of 4-8s, which means that each fixpoint computation (involving 8 iterations around the loop) takes 4-8us. That is actually quite fast, since it would mean that some complicated list operations and several calls to simplex are done in a few thousand instructions on this 2.X GHz machine. But the numbers did seem to make sense in relation to each other.

I tried to add the -fno-state-hack flag but that didn't change anything.

I copied stubs of the relevant functions into a new module called Fake.hs -- I tried putting the functions directly into BadPerform.hs, but then it went as fast as GoodPerform.hs. I made the fixpoint function go slower by adding the test minimum [i+j | i <- [1..100], j <- [1..100]] < 3, as you suggested.

The results:

1858$ rm -f tests/GoodPerform.o && ghc-6.8.2 --make GoodPerform.hs && time GoodPerform
real    0m0.008s
user    0m0.004s
sys     0m0.004s
1859$ rm -f tests/BadPerform.o && ghc-6.8.2 --make BadPerform.hs && time BadPerform
Linking BadPerform ...

real    0m2.923s
user    0m2.911s
sys     0m0.010s

I will update the two source files and their simplifier dump. I'll also add Fake.hs.

Thanks for looking into this.

Changed 12 years ago by simona

Attachment: badSimpl.txt added

simplifier output for GoodPerform.hs

Changed 12 years ago by simona

Attachment: goodSimpl.txt added

simplifier output for GoodPerform.hs

Changed 12 years ago by simona

Attachment: BadPerform.hs added

The file where ghc optimises in a funny way.

Changed 12 years ago by simona

Attachment: GoodPerform.hs added

The performance how it should be.

Changed 12 years ago by simona

Attachment: Fake.hs added

A replacement for the big cabal library.

comment:7 Changed 12 years ago by simonpj

You made a mistake in the new GoodPerform.hs; you wrote

  replicateM_ 1000 $ do
    --res1 <- fixpoint initial bottom bottom vn 1
    --unless (res1 `subset` res1) $ putStrLn "something's wrong"
    res2 <- fixpoint initial bottom bottom vn 1
    unless (res2 `subset` res2) $ putStrLn "something's wrong"

but you meant to use two calls to replicateM_, didn't you?

Anyway, I know what's going on. Here is the crucial fragment of the two forms, after desugaring:

Bad:
  replicateM_ 1000 (f 10 >>= (\res1 -> f 20 >>= (\res2 -> return ())))

Good: 
  replicateM_ 1000 (f 10) >> replicateM_ 1000 (f 20)

The f is the fixpoint function; the 10 and 20 are the constant args to those two calls. Just look at it! The key point is that in both cases, the subexpression (f 10) is outside any lambdas, and hence is executed just once. It's just as I said, although obscured by clutter: the computation is unaffected by IO state, so there is no need for it to be repeated each time.

Furthermore, in the good case, the call (f 20) is also outside any lambdas, and hence is computed only once. But in the bad case, the call (f 20) is inside the (\res1 -> ...) abstraction, and so is computed once for each call of (\res1 -> ...); that is, 1000 times.

So that's why there's the big difference. It is a little puzzling, but it becomes much clearer when you desugar the do-notation.

The moral, as often, is that if you put a constant expression inside a loop, GHC will often compute it just once. But this occurs much more often in benchmarks than in real programs.
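The two desugared shapes above can be written out as a small self-contained program, with a stub f standing in for the ticket's fixpoint (all names here are illustrative):

```haskell
import Control.Monad (replicateM_)

-- Stub for the fixpoint computation.
f :: Int -> IO Int
f x = return (sum [1 .. x])

-- Bad shape: (f 20) sits under the \res1 lambda, so it cannot be
-- floated out of the loop; it is rebuilt on every iteration.
bad :: IO ()
bad = replicateM_ 1000 (f 10 >>= \res1 -> f 20 >>= \res2 -> return ())

-- Good shape: each call is loop-invariant on its own, so the pure
-- work inside both actions can be shared and done just once.
good :: IO ()
good = replicateM_ 1000 (f 10) >> replicateM_ 1000 (f 20)

main :: IO ()
main = bad >> good >> putStrLn "done"
```

With a genuinely expensive f, the bad shape pays for (f 20) a thousand times while the good shape pays for each call once, which matches the observed 800x gap.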

Simon

comment:8 Changed 12 years ago by simona

Resolution: invalid
Status: new → closed

Ok, I think I get it. I've added a call to getArgs, which makes both loops run quite slowly. The new timings are unfortunately not in my favour anymore (I'm comparing two different implementations), so I have to think of a way to explain that at the conference...

Thanks for taking the time to look into this. Axel.

Note: See TracTickets for help on using tickets.