Opened 5 years ago

Last modified 3 years ago

#9646 new bug

Simplifer non-determinism leading to 8 fold difference in run time performance

Reported by: erikd Owned by:
Priority: normal Milestone:
Component: Test Suite Version: 7.8.3
Keywords: Cc: gidyn
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case: testsuite/tests/simplCore/T9646
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D2009
Wiki Page:

Description (last modified by erikd)

I've been chasing a weird performance bug for about a week now. I've reduced the testcase to the following git repo:

https://github.com/erikd/ghc-perfbug-test

The Readme.md file in that repo contains an explanation of how to reproduce the issue.

The benchmark program in the repo uses Criterion to test the performance of one particular function. Running the program in a particular way results into two significanlty different results.

The commands are:

make clean bench-integer.html
cp -f bench-integer.html bench-integer-slow.html
touch New3/GHC/Integer/Natural.hs && make bench-integer.html

The results for one function in the file New3/GHC/Integer/Natural.hs is 8 times slower in the first test that the second while the other test function always executes at about the speed.

I haven't been able to test this with ghc-7.6.3 because the code uses isTrue# which isn't available in 7.6.3.

Updated explanation:

Compiling the whole project results in the function under test running 4 times slower than the comparison. Touching the file containing the function under test and rebuilding the benchmark program results in the function under test runing 2 times faster than the comparison function.

Dumping intermediate files showed that the output of the simplifier was different enough to create this 8 fold difference in performance.

The simplifier should be deterministic. Same input file should result in the same simplifier output, preferably the output that results in 8 fold better performance.

Attachments (2)

fast-Natural.dump-simpl (6.3 KB) - added by erikd 5 years ago.
The fast version of the .dump-simpl code.
slow-Natural.dump-simpl (10.6 KB) - added by erikd 5 years ago.
The slow version.

Download all attachments as: .zip

Change History (30)

comment:1 Changed 5 years ago by jstolarek

isTrue# is an alias to tagToEnum#, so you can convert your code to GHC 7.6.

EDIT:

I'm getting this:

[killy@GLaDOS : /dane/download/ghc-perfbug-test] make clean bench-integer.html
sed "s/NewX/New1/" Check/NewX.template.hs > Check/New1.hs
sed "s/NewX/New3/" Check/NewX.template.hs > Check/New3.hs
sed "s/BenchX/Bench1/;s/NewX/New1/" Check/BenchX.template.hs > Check/Bench1.hs
sed "s/BenchX/Bench3/;s/NewX/New3/" Check/BenchX.template.hs > Check/Bench3.hs
ghc -Wall -Werror -fwarn-tabs -fPIC -ddump-to-file -ddump-prep -ddump-cmm -ddump-opt-cmm -ddump-stg -ddump-asm -O3 -XCPP -XMagicHash -XUnboxedTuples -XUnliftedFFITypes --make bench-integer.hs -o bench-integer

bench-integer.hs:5:18:
    Could not find module ‘Criterion.Main’
    Use -v to see a list of the files searched for.
make: *** [bench-integer] Błąd 1
Last edited 5 years ago by jstolarek (previous) (diff)

comment:2 Changed 5 years ago by dfeuer

I don't know what's going on, but I'm curious whether there might be some sort of interference from make or associated OS or CPU stuff. Have you tried running the benchmarks separately from the compilation process? Another thing to experiment with is reversing the order of the Criterion benchmarks to see if the slowness follows the position in the benchmark list.

comment:3 Changed 5 years ago by erikd

Architecture: Unknown/Multiplex86_64 (amd64)
Operating System: Unknown/MultipleLinux

@jstolarek : It looks you you do not have the criterion library installed.

@dferuer :

  • Changing the order of the benchmarks makes no difference, the New3* runs still exhibit the same behaviour.
  • The make program is not a factor. I can take the commands printed using make -n and put them in a shell script for the same result.
  • Running the benchmarking program separately from the compile process gives almost identical results.

I've also assured myself that the ghc command:

ghc -Wall -Werror -fwarn-tabs -fPIC
    -ddump-to-file -ddump-prep -ddump-cmm -ddump-opt-cmm -ddump-stg -ddump-asm
    -O3 -XCPP -XMagicHash -XUnboxedTuples -XUnliftedFFITypes
    --make bench-integer.hs -o bench-integer

is the same in both cases. This is to be expected as make really just calls this ghc --make command.

Behaviour is also the same with -O2 and -O1. I tried compiling with -O0 which gave radically different results but there were no interesting conclusions to be drawn.

comment:4 in reply to:  3 Changed 5 years ago by jstolarek

Replying to erikd:

@jstolarek : It looks you you do not have the criterion library installed.

I have criterion-1.0.1.0 installed in a cabal sandbox. Perhaps your make script does not recognize sandboxes?

comment:5 Changed 5 years ago by erikd

@jstolarek : No, my Makefile does not recognize cabal sandboxes.

Changed 5 years ago by erikd

Attachment: fast-Natural.dump-simpl added

The fast version of the .dump-simpl code.

Changed 5 years ago by erikd

Attachment: slow-Natural.dump-simpl added

The slow version.

comment:6 Changed 5 years ago by erikd

I added a a number of -ddump-* flags and reduced the New3/GHC/InteterNatural.hs file to contain nothing but the timeNatural function that we're interested in.

What I found was:

  • Desugar stage output (file Natural.dump-ds) output is identical.
  • Simplifier stage (file Natural.dump-simpl) is different.

I've added the fast and slow versions of the .dump-simpl outputs.

The slow version of the code seems to have a lot more let bindings that the fast version. The fast version is more likely to use case expressions.

Last edited 5 years ago by erikd (previous) (diff)

comment:7 Changed 5 years ago by erikd

Summary: Strange performance bugSimplifer non-determinism leading to 8 fold difference in run time performance

comment:8 Changed 5 years ago by erikd

Description: modified (diff)

comment:9 Changed 5 years ago by erikd

Amos Robinson has helped me make some progress on this.

The problem is with the StrictPrim monad (defined in the file Common/GHC/Ineteger/StrictPrim.hs of the ghc-perfbug-test project) which basically an ST monad with a bunch of explicit bang patterns to force strict evaluation.

The big difference between the slow and the fast version is that in the slow version of the .dump-simpl output of these inner loops have an extra lambda eg on line 39 of slow-Natural.dump-simpl:

(\ eta -> (# eta, vx #)) `cast` ...;

The eta variable in this case is the state token for the StrictPrim monad.

Furthermore, Amos Robinson was able to re-write the inner loop from:

    innerLoop2 !xi !yi !carryhi !carrylo !sum
        | yi < n2 = do
            x <- indexWordArrayM arr1 xi
            y <- indexWordArrayM arr2 yi
            let (# !cry0, !prod #) = timesWord2 x y
                (# !cry1, !sum1 #) = plusWord2 prod sum
                (# !tcryhi, !crylo #) = plusWord2C carrylo cry0 cry1
                !cryhi = plusWord carryhi tcryhi
            innerLoop2 (xi - 1) (yi + 1) cryhi crylo sum1
        | otherwise = do return $! (carryhi, carrylo, sum)

to

    innerLoop2 !xi !yi !carryhi !carrylo !sum =
        StrictPrim $ \s ->
            if yi < n2
                then
                    let StrictPrim go = do
                        x <- indexWordArrayM arr1 xi
                        y <- indexWordArrayM arr2 yi
                        let (# !cry0,   !prod  #) = timesWord2 x y
                            (# !cry1,   !sum1  #) = plusWord2  prod sum
                            (# !tcryhi, !crylo #) = plusWord2C carrylo cry0 cry1
                            !cryhi                = plusWord   carryhi tcryhi
                        innerLoop2 (xi - 1) (yi + 1) cryhi crylo sum1
                    in go s
                else (# s, (carryhi, carrylo, sum) #)

and this new formulation *always* compiles to run fast.

After some discussion with Amos, we came to the following conclusions:

  1. The simplifier should be deterministic. The same input file, with the same compiler flags should result in the same output (preferably the output that performs better).
  1. It is not un-reasonable to expect the compiler to apply the transformation that Amos appied manually.
Last edited 5 years ago by erikd (previous) (diff)

comment:10 Changed 5 years ago by simonpj

I spent a few minutes earlier today, and indeed I'm sure that difference in eta expansion is the cause of the trouble. What I do not understand is why it happens.

Would you be prepared to cut down the example still more? Dump the Criterion wrapping. Just leave enough to reproduce the strange difference in Core generated in the wot different ways. Strip away everything else; eg we don't actually need to run the program any more.

Thanks

Simon

comment:11 Changed 5 years ago by gidyn

Cc: gidyn added

comment:12 Changed 5 years ago by erikd

​The repo at https://github.com/erikd/ghc-perfbug-test has been stripped back as far as possible and no longer needs either criterion or hspec.

The code has been tested with ghc-7.8.3 and ghc-7.9.20140929 and both display the same problem.

Cloning the repo, cd-ing into it and running make will build the code and then print out:

    Fast version etas : 0
    Slow version etas : 18

which is simply the count of the occurence of the identifier eta in the twp -ddump-simpl outputs generated from the same file.

When this bug is fixed, both values should be zero.

comment:13 Changed 5 years ago by thomie

Milestone: 7.10.1
Resolution: fixed
Status: newclosed

I just ran make in a clone of the testrepo from comment:12, after installing primitive-0.5.4 in a cabal sandbox.

Results with GHC 7.8.3:

    Fast version etas : 0
    Slow version etas : 18

Results with GHC 7.9.20141108:

    Fast version etas : 0
    Slow version etas : 0

So this bug must have been fixed somewhere along the way, although I don't know in which commit. Feel free to reopen if you think this answer is unsatisfactory, or hang tight till the 7.10 release.

comment:14 Changed 5 years ago by jstolarek

It would be great if we could have a test for this so we're sure this does not regress again. I looked at the repo and the code to reproduce this is pretty big. I wonder if it could be reduced further to include it in the testsuite.

comment:15 in reply to:  14 Changed 4 years ago by thomie

Resolution: fixed
Status: closednew

Replying to jstolarek:

It would be great if we could have a test for this so we're sure this does not regress again. I looked at the repo and the code to reproduce this is pretty big. I wonder if it could be reduced further to include it in the testsuite.

Good idea. Reopening.

comment:16 Changed 4 years ago by thoughtpolice

Milestone: 7.10.17.10.3

Moving to 7.10.3 - would be nice to fix.

comment:17 Changed 4 years ago by simonpj

Moving to 7.10.3 - would be nice to fix.

I don't understand. It's apparently fixed in 7.9 and so should be fixed in 7.10.2, right? What we need is a test case so we will know if it regresses.

comment:18 Changed 4 years ago by thomie

Component: CompilerTest Suite

Bug is fixed. Just needs a test.

comment:19 Changed 4 years ago by bgamari

Milestone: 7.10.38.0.1

Still needs a test but this won't happen for 7.10.3.

comment:20 Changed 4 years ago by erikd

Really close to having this pared down enough to add to the test suite. Since this issue was actually fixed before the 7.10.2 release its not a problem that the test doesn't get added to 7.10.3.

comment:21 Changed 4 years ago by thomie

Milestone: 8.0.1

comment:22 Changed 4 years ago by erikd

Differential Rev(s): Phab:D2009
Status: newpatch

comment:23 Changed 3 years ago by Erik de Castro Lopo <erikd@…>

In 220a0b93/ghc:

Add test for #9646

Test Plan: Test that it passes git HEAD and fails with GHC 7.8.

Reviewers: bgamari, hvr, austin, goldfire, thomie

Differential Revision: https://phabricator.haskell.org/D2009

GHC Trac Issues: #9646

comment:24 Changed 3 years ago by erikd

Test Case: testsuite/tests/simplCore/T9646

comment:25 Changed 3 years ago by erikd

Resolution: fixed
Status: patchclosed

Since the test has new been commited to master, this can be closed.

comment:26 Changed 3 years ago by Ben Gamari <ben@…>

In 08d254bc/ghc:

Fix T9646

For #9646.

comment:27 Changed 3 years ago by thomie

Resolution: fixed
Status: closednew

This test is failing for WAY=ghci. This is the output, after adding -fobject-code (because of the use of unboxed tuples):

--- ./T9646.run/T9646.stdout.normalised	2016-06-19 16:49:06.204443094 +0200
+++ ./T9646.run/T9646.run.stdout.normalised	2016-06-19 16:49:06.204443094 +0200
@@ -1,3 +1,96 @@
+
+==================== Desugared ====================
+bindIO
+  ((\ @ a ->
+      let {
+        $dGHCiSandboxIO
+        $dGHCiSandboxIO = $fGHCiSandboxIOIO } in
+      ghciStepIO $dGHCiSandboxIO)
+     (hSetBuffering stdout LineBuffering))
+  (\ it -> returnIO (build (\ @ a c n -> c (unsafeCoerce# it) n)))
+
+
+
+==================== Simplifier statistics ====================
+Total ticks:     12
+
+2 PreInlineUnconditionally
+  1 tpl
+  1 $dGHCiSandboxIO
+1 UnfoldingDone 1 unsafeCoerce#
+1 RuleFired 1 Class op ghciStepIO
+1 EtaExpansion 1 it
+7 BetaReduction
+  1 k0
+  1 k1
+  1 a
+  1 b
+  1 tpl
+  1 a
+  1 a
+
+
+
+==================== Simplified expression ====================
+bindIO
+  (id (hSetBuffering stdout LineBuffering))
+  ((\ it eta ->
+      ((returnIO (build (\ @ a c n -> c it n))) `cast` ...) eta)
+   `cast` ...)
+
+
+
+==================== Desugared ====================
+let {
+  $dMonad
+  $dMonad = $fMonadIO } in
+let {
+  $dMonad
+  $dMonad = $dMonad } in
+bindIO
+  ((\ @ a ->
+      let {
+        $dGHCiSandboxIO
+        $dGHCiSandboxIO = $fGHCiSandboxIOIO } in
+      ghciStepIO $dGHCiSandboxIO)
+     (>> $dMonad (runIOFastExit main) (return $dMonad ())))
+  (\ it -> returnIO (build (\ @ a c n -> c (unsafeCoerce# it) n)))
+
+
+
+==================== Simplifier statistics ====================
+Total ticks:     16
+
+3 PreInlineUnconditionally
+  1 tpl
+  1 $dGHCiSandboxIO
+  1 $dMonad
+1 PostInlineUnconditionally 1 $dMonad
+1 UnfoldingDone 1 unsafeCoerce#
+3 RuleFired
+  1 Class op >>
+  1 Class op ghciStepIO
+  1 Class op return
+1 EtaExpansion 1 it
+7 BetaReduction
+  1 k0
+  1 k1
+  1 a
+  1 b
+  1 tpl
+  1 a
+  1 a
+
+
+
+==================== Simplified expression ====================
+bindIO
+  (id (thenIO (runIOFastExit main) ($fMonadIO_$creturn ())))
+  ((\ it eta ->
+      ((returnIO (build (\ @ a c n -> c it n))) `cast` ...) eta)
+   `cast` ...)
+
+
 1234
 2345
 2893730
*** unexpected failure for T9646(ghci)

comment:28 Changed 3 years ago by Thomas Miedema <thomasmiedema@…>

In 1d938aa/ghc:

Testsuite: mark tests expect broken

* CgStaticPointers, GcStaticPointers, ListStaticPointers,
  TcStaticPointers01, TcStaticPointers02:  #12207
* T11535: #12210
* ffi017/ffi021: #12209
* T11108: #11108
* T9646: #9646
Note: See TracTickets for help on using tickets.