Opened 12 years ago
Closed 12 years ago
#2252 closed bug (invalid)
Extreme performance degradation on minor code change
| Reported by: | simona | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | Compiler | Version: | 6.8.2 |
| Keywords: | | Cc: | |
| Operating System: | Linux | Architecture: | x86_64 (amd64) |
| Type of failure: | None/Unknown | Test Case: | |
| Blocked By: | | Blocking: | |
| Related Tickets: | | Differential Rev(s): | |
| Wiki Page: | | | |
Description
The attached code runs two similar calculations (call them a and b) 1000 times. If I calculate a and then b and then do the same 1000 times again the code runs in:
time ./tests/BadPerform
real	0m40.429s
user	0m40.360s
sys	0m0.054s
If I run a 1000 times, then b 1000 times, the code runs in:
time ./tests/GoodPerform
real	0m0.083s
user	0m0.081s
sys	0m0.001s
These are the times for ghc 6.6.1. For ghc 6.8.2, the first time reduces to 34s and the second time is identical.
Reproducing this bug is a bit involved: my program links to glpk, which in turn uses gmp. Thus, you would have to:
- install GLPK somewhere
- extract the attached sparsePolyhedra-0.1.0 cabal package
- create a local copy of libgmp and libglpk in which all symbols containing gmp are renamed to mygmp
- cd to sparsePolyhedra-0.1.0
- in genLibs.sh, change the paths containing /users/absint/simona/local/lib to where you installed GMP and GLPK
- run genLibs.sh, which will create the local library copies
- in sparsePolyhedra.cabal, change the path /users/absint/simona/current/source/sparseDomain to the current directory
- build and register the cabal package
- compile the attached BadPerform.hs and GoodPerform.hs
I guess this is quite complicated. I could send you the intermediate Core of these two files instead, if that helps.
Attachments (9)
Change History (17)
Changed 12 years ago by
| Attachment: | sparsePolyhedra-0.1.0.tar.gz added |
|---|---|
comment:1 Changed 12 years ago by
| difficulty: | → Unknown |
|---|---|
That is odd, I grant you.
However your code repeats the same computation 1000 times. That computation is in the I/O monad, but it does no I/O (or at least only outside the loop). Hence a clever optimiser might spot that it must give the same result every time, and compute that result just once.
I bet that is what is happening here. The "clever optimiser" may be somehow put off the scent by the minor change in structure of your program -- without more info I couldn't say exactly why.
If I'm right, then it's arguably bad that a minor change could make such a big performance difference. But repeating an identical computation 1000 times is equally arguably not something that a compiler should be robustly optimising for.
Of course I could be on the wrong track. Try -ddump-simpl and attach the output from that. You might like to look into that output for top-level definitions that look like lvl_*; these are the ones floated out of loops. Also try -fno-full-laziness to switch off the full laziness transformation, and see if that makes Good perform the same as Bad.
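For readers unfamiliar with the transformation Simon mentions, here is a minimal sketch (not from the ticket) of the kind of program full laziness targets: a loop body containing a subexpression that does not depend on the loop, which GHC may float out to a top-level lvl_* binding so it is evaluated only once. Whether the float actually happens depends on the optimisation flags.

```haskell
import Control.Monad (replicateM_)

-- `expensive` does not depend on anything bound inside the loop, so
-- full laziness may float it out of the replicated action; it is then
-- evaluated once and its value shared across all five iterations.
main :: IO ()
main = replicateM_ 5 (print expensive)
  where
    expensive :: Integer
    expensive = sum [1 .. 1000000]
```

With -ddump-simpl and -O you can look for the floated binding in the simplifier output, exactly as suggested above.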
Simon
comment:2 Changed 12 years ago by
Gosh I hope this isn't happening since I used the timing results of the good code in a paper.
Adding -fno-full-laziness to the compilation of GoodPerform.hs changes nothing (thankfully). Note also that the good performer is less likely to benefit from laziness optimization, since it first runs one loop 1000 times, then another loop 1000 times. The bad one has the potential for optimization since it runs two nearly identical calculations 1000 times. It might be possible, of course, that the loops in the good code are simple enough for optimizations to kick in, whereas they are too big in the bad case. However, I ran other, more complex loops that have good performance, too. I'm still assuming that it is the bad code that triggers some bad behavior.
There is one side effect in each loop iteration, which tests some invariant on the result. There are quite a few computations that call the external library but which are pure, so I've wrapped them in unsafePerformIO. I don't know if that matters.
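The unsafePerformIO pattern the reporter describes looks roughly like the following sketch. `pretendForeignCall` is an assumed stand-in, not the actual glpk binding, so the example is self-contained and runnable.

```haskell
import System.IO.Unsafe (unsafePerformIO)

-- Stand-in for a foreign call that is observably pure: no visible side
-- effects, and the same result for the same argument every time.
pretendForeignCall :: Int -> IO Int
pretendForeignCall n = return (2 * n)

-- Pure-looking wrapper in the style the reporter describes. This is
-- only sound when the wrapped call really is pure; it also gives the
-- optimiser licence to share or float the call like any pure value.
solve :: Int -> Int
solve n = unsafePerformIO (pretendForeignCall n)

main :: IO ()
main = print (solve 21)
```

The last point is why such wrappers can interact with the sharing behaviour discussed in this ticket: once a call looks pure, GHC is free to compute it once and reuse the result.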
Also, I should mention that both programs run in constant space (around 20MB of virtual memory), hence it's not a space leak that holds back the tighter loop in BadPerform.hs. I compiled the code without any -O flags, and I attach the two simplifier outputs. There are no lvl_ strings in either file.
comment:3 Changed 12 years ago by
Thanks. More info: what command line arguments are you using when compiling your program? In your -ddump-simpl output I see invocations of GHC.Base.$, which suggests you aren't using -O. That never occurred to me. Do you get the same performance differences with -O?
Of course, you still should not get these huge differences even without -O, but I'm interested because it'll give clues about where the problem is.
The Core code looks ok to me, so I'm still very puzzled about why you would see big perf differences. I wonder if you can make this easier to reproduce, and at the same time give more clues. For example, suppose you trim out most of the functions that fixpoint calls, and leave just some reasonably-expensive thing (some random quadratically-expensive algorithm, say). Does the effect still happen? Or is it somehow connected to the code that is called (meetAsym, equalityToInequalities, etc.)? And if so, can you narrow down which code is the culprit, by stubbing out other parts?
Simon
comment:4 Changed 12 years ago by
I've simplified the programs. This is what they do now:
aconit:~/current/source/sparseDomain:1821$ rm -f tests/BadPerform.o && ghc-6.8.2 --make tests/BadPerform.hs -ddump-simpl > badSimpl.txt && time tests/BadPerform
[1 of 1] Compiling Main ( tests/BadPerform.hs, tests/BadPerform.o )
Linking tests/BadPerform ...
real	0m9.585s
user	0m9.551s
sys	0m0.033s
aconit:~/current/source/sparseDomain:1822$ rm -f tests/GoodPerform.o && ghc-6.8.2 --make tests/GoodPerform.hs -ddump-simpl > goodSimpl.txt && time tests/GoodPerform
[1 of 1] Compiling Main ( tests/GoodPerform.hs, tests/GoodPerform.o )
Linking tests/GoodPerform ...
real	0m0.015s
user	0m0.012s
sys	0m0.002s
There are some fishy non-improving optimizations going on. As you suggested, I re-compiled with optimizations:
aconit:~/current/source/sparseDomain:1825$ rm -f tests/BadPerform.o && ghc-6.8.2 --make tests/BadPerform.hs -O -ddump-simpl > badSimplOpt.txt && time tests/BadPerform
[1 of 1] Compiling Main ( tests/BadPerform.hs, tests/BadPerform.o )
Linking tests/BadPerform ...
real	0m19.101s
user	0m19.044s
sys	0m0.054s
aconit:~/current/source/sparseDomain:1826$ rm -f tests/GoodPerform.o && ghc-6.8.2 --make tests/GoodPerform.hs -O -ddump-simpl > goodSimplOpt.txt && time tests/GoodPerform
[1 of 1] Compiling Main ( tests/GoodPerform.hs, tests/GoodPerform.o )
Linking tests/GoodPerform ...
real	0m18.974s
user	0m18.914s
sys	0m0.056s
I have also tried to radically simplify the fixpoint function, so that it returns immediately. It is so simple that I could build a cabal package that doesn't link to the simplex solver, and which would therefore be easy for you to build. However, I'm not sure if you can see the difference anymore: repeating the good case one million times takes 0.4 seconds; repeating the bad case one million times (which does twice the amount of work in each iteration) takes 1.4 seconds. Repeating the good case two million times takes 0.8 seconds, as expected.
I hope the attached simplifier outputs are informative enough. Note that I've updated the source files too!
Changed 12 years ago by
| Attachment: | badSimplOpt.txt added |
|---|---|
simplifier output for BadPerform.hs (with -O)
Changed 12 years ago by
| Attachment: | goodSimplOpt.txt added |
|---|---|
simplifier output for GoodPerform.hs (with -O)
comment:5 Changed 12 years ago by
Crumbs! So switching on -O makes GoodPerform run slowly too!
Can you try with -fno-state-hack?
How fast do you expect this to run? For example, if you do just one of the fixpoints 1000 times, is that fast? Is it supposed to be fast? I'm trying to ask whether GoodPerform is going bizarrely fast or BadPerform is going bizarrely slowly.
Well done cutting down fixpoint. It'd be a huge help if you could do so some more, so that GLPK, gmp etc were not needed. If it runs too fast, just do something expensive, like minimum [i+j | i <- [1..n], j <- [1..n]].
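Simon's suggested filler, spelled out as a runnable sketch (note the list form needs `minimum` rather than `min`; `n` is a knob you would tune to make the stub as expensive as you like):

```haskell
-- Quadratically-expensive pure filler: builds an n*n element list and
-- folds over it; the answer is always 2 (i = j = 1), but the work grows
-- with n^2, which is the point when stubbing out a real computation.
expensive :: Int -> Int
expensive n = minimum [i + j | i <- [1 .. n], j <- [1 .. n]]

main :: IO ()
main = print (expensive 100)
```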
Simon
comment:6 Changed 12 years ago by
I think (and hope) that BadPerform is going bizarrely slowly. BadPerform runs the same loop twice and should thus only be twice as slow as GoodPerform, not 800 times as slow. I ran more complex examples a million times and got something in the range of 4-8s, which means that each fixpoint computation (involving 8 iterations around the loop) takes 4-8us. That is actually quite fast, since it would mean that some complicated list operations and several calls to simplex are done in a few thousand instructions on this 2.X GHz machine. But the numbers did seem to make sense in relation to each other.
I tried adding the -fno-state-hack flag, but that didn't change anything.
I copied stubs of the relevant functions into a new module called Fake.hs -- I tried putting the functions directly into BadPerform.hs, but then it went as fast as GoodPerform.hs. I made the fixpoint function go slower by adding the test minimum [i+j | i <- [1..100], j <- [1..100]] < 3, as you suggested.
The results:
1858$ rm -f tests/GoodPerform.o && ghc-6.8.2 --make GoodPerform.hs && time GoodPerform
real	0m0.008s
user	0m0.004s
sys	0m0.004s
1859$ rm -f tests/BadPerform.o && ghc-6.8.2 --make BadPerform.hs && time BadPerform
Linking BadPerform ...
real	0m2.923s
user	0m2.911s
sys	0m0.010s
I will update the two source files and their simplifier dumps. I'll also add Fake.hs.
Thanks for looking into this.
Changed 12 years ago by
| Attachment: | BadPerform.hs added |
|---|---|
The file where ghc optimises in a funny way.
comment:7 Changed 12 years ago by
You made a mistake in the new GoodPerform.hs; you wrote

replicateM_ 1000 $ do
  --res1 <- fixpoint initial bottom bottom vn 1
  --unless (res1 `subset` res1) $ putStrLn "something's wrong"
  res2 <- fixpoint initial bottom bottom vn 1
  unless (res2 `subset` res2) $ putStrLn "something's wrong"
but you meant to use two calls to replicateM_, didn't you?
Anyway, I know what's going on. Here is the crucial fragment of the two forms, after desugaring:
Bad:  replicateM_ 1000 (f 10 >>= (\res1 -> f 20 >>= (\res2 -> return ())))
Good: replicateM_ 1000 (f 10) >> replicateM_ 1000 (f 20)
The f is the fixpoint function; the 10 and 20 are the constant args to those two calls. Just look at it! The key point is that in both cases, the subexpression (f 10) is outside any lambdas, and hence is executed just once. It's just as I said, although obscured by clutter: the computation is unaffected by IO state, so there is no need for it to be repeated each time.
Furthermore, in the good case, the call (f 20) is also outside any lambdas, and hence is computed only once. But in the bad case, the call (f 20) is inside the (\res1 -> ...) abstraction, and so is computed once for each call of (\res1 -> ...); that is, 1000 times.
So that's why there's the big difference. It is a little puzzling, but it becomes much clearer when you desugar the do-notation.
The moral, as often, is that if you put a constant expression inside a loop, GHC will often compute it just once. But this occurs much more often in benchmarks than in real programs.
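The two shapes Simon contrasts can be written out as a self-contained sketch; `f` is a placeholder for the ticket's fixpoint call, not the real computation, and the comments record where each call sits relative to the lambdas:

```haskell
import Control.Monad (replicateM_)

-- Placeholder for the ticket's fixpoint call.
f :: Int -> IO Int
f n = return (n * n)

-- Bad shape: (f 20) sits under the (\res1 -> ...) lambda, so even if
-- (f 10) gets floated out and shared, (f 20) is re-run every iteration.
bad :: IO ()
bad = replicateM_ 1000 (f 10 >>= \_res1 -> f 20 >>= \_res2 -> return ())

-- Good shape: each call is the entire body of its own replicateM_,
-- outside any lambda, so full laziness may compute each just once.
good :: IO ()
good = replicateM_ 1000 (f 10) >> replicateM_ 1000 (f 20)

main :: IO ()
main = bad >> good >> putStrLn "done"
```

Both shapes produce the same results; the 800x timing gap in the ticket comes purely from where (f 20) sits relative to the lambda that the do-block desugars into.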
Simon
comment:8 Changed 12 years ago by
| Resolution: | → invalid |
|---|---|
| Status: | new → closed |
Ok, I think I get it. I've added a call to getArgs, which makes both loops run quite slowly. The new timings are unfortunately not in my favour anymore (I'm comparing two different implementations), so I have to think of a way to explain that at the conference...
Thanks for taking the time to look into this. Axel.