Opened 17 months ago

Last modified 8 months ago

#14980 new bug

Runtime performance regression with binary operations on vectors

Reported by: ttylec Owned by: bgamari
Priority: high Milestone: 8.10.1
Component: Compiler Version: 8.2.2
Keywords: vector bitwise operations Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Let me briefly explain our use case: we have a machine learning tool written in Haskell. It builds association rules for a database; in plain English, predicates on data rows. We need to score them, so we need to check how many rows are "matched" by each predicate.

In order to optimize performance, our code uses two main representations for the data: one is "human readable", where values are stored as real numbers; the second is a binarized representation for categorical data. The latter is actually a family of representations, since we pack bits into tuples of Word64 and use bitwise operations to implement the logic.
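To make the two representations concrete, here is a minimal sketch. The names (countMatches, countMatchesPacked) are illustrative, not taken from the attached sources, and plain lists stand in for the unboxed vectors the attachment uses:

```haskell
import Data.Bits ((.&.))
import Data.List (foldl')
import Data.Word (Word64)

-- "Human readable" representation: the predicate inspects real values.
countMatches :: (Double -> Bool) -> [Double] -> Int
countMatches p = length . filter p

-- Binarized representation: up to 64 categorical columns packed into a
-- Word64 per row. A predicate is a bit mask; a row matches when every
-- bit of the mask is set in the row.
countMatchesPacked :: Word64 -> [Word64] -> Int
countMatchesPacked mask = foldl' step 0
  where
    step n row = if row .&. mask == mask then n + 1 else n
```

With the packed form, testing a predicate against a row is a single AND plus a comparison instead of one comparison per column.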

Simplified but representative code is attached.

In GHC 8.0.2 the binary representation is faster than the "human readable" one by an order of magnitude:

➜  ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time                 342.6 μs   (338.3 μs .. 348.0 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 339.3 μs   (337.5 μs .. 342.0 μs)
std dev              7.203 μs   (5.596 μs .. 10.07 μs)
variance introduced by outliers: 13% (moderately inflated)

benchmarking 256 columns/binary packed
time                 32.07 μs   (29.69 μs .. 34.14 μs)
                     0.982 R²   (0.976 R² .. 0.997 R²)
mean                 29.97 μs   (29.41 μs .. 31.03 μs)
std dev              2.428 μs   (1.526 μs .. 3.750 μs)
variance introduced by outliers: 78% (severely inflated)

In GHC 8.2.2 (and later) the binary representation performs no better than the "human readable" one, so there is no performance gain:

➜  ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time                 442.4 μs   (406.7 μs .. 474.5 μs)
                     0.978 R²   (0.969 R² .. 0.993 R²)
mean                 399.3 μs   (391.3 μs .. 415.1 μs)
std dev              34.73 μs   (20.36 μs .. 53.29 μs)
variance introduced by outliers: 71% (severely inflated)

benchmarking 256 columns/binary packed
time                 378.6 μs   (295.8 μs .. 488.0 μs)
                     0.637 R²   (0.492 R² .. 0.780 R²)
mean                 568.1 μs   (437.1 μs .. 747.6 μs)
std dev              308.7 μs   (233.6 μs .. 386.1 μs)
variance introduced by outliers: 98% (severely inflated)

However, for certain compilation paths we can still get a speedup similar to GHC 8.0.2. The examples above used a 4-tuple of Word64 as the binary representation. The following program runs two tests: one on a plain Word64 and one on a 4-tuple of Word64; the only difference from the previous program is the added Word64 case:
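For reference, the 4-tuple representation used for the 256-column case can be sketched like this (the type and function names are hypothetical, not from the attachment):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word64)

-- 256 boolean columns packed as four 64-bit words.
type Packed4 = (Word64, Word64, Word64, Word64)

-- A row matches a mask when every mask bit is set in the row, checked
-- word by word with bitwise AND.
matches4 :: Packed4 -> Packed4 -> Bool
matches4 (m0, m1, m2, m3) (r0, r1, r2, r3) =
     r0 .&. m0 == m0 && r1 .&. m1 == m1
  && r2 .&. m2 == m2 && r3 .&. m3 == m3
```

The inner loop has to keep the four words of the tuple unboxed for this to run at full speed, which is presumably where the compilers diverge.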

➜  ghc-bug stack exec performance-bug-pair-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time                 337.6 μs   (336.1 μs .. 339.3 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 349.6 μs   (344.4 μs .. 359.7 μs)
std dev              23.22 μs   (15.39 μs .. 38.22 μs)
variance introduced by outliers: 60% (severely inflated)

benchmarking 64 columns/binary packed
time                 21.66 μs   (21.53 μs .. 21.79 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 21.68 μs   (21.53 μs .. 21.89 μs)
std dev              613.2 ns   (466.0 ns .. 806.0 ns)
variance introduced by outliers: 30% (moderately inflated)

benchmarking 256 columns/raw unbox vectors
time                 344.5 μs   (341.6 μs .. 348.2 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 345.1 μs   (342.5 μs .. 349.3 μs)
std dev              10.66 μs   (7.997 μs .. 16.34 μs)
variance introduced by outliers: 25% (moderately inflated)

benchmarking 256 columns/binary packed
time                 28.04 μs   (27.70 μs .. 28.46 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 28.05 μs   (27.85 μs .. 28.30 μs)
std dev              758.2 ns   (628.2 ns .. 907.6 ns)
variance introduced by outliers: 27% (moderately inflated)

I made a variant of the code with a simpler accumulating function (our real code accumulates a pair of integers; the simplified accumulator works on a single integer). In that case GHC 8.2.2 loses the speed-up with 4-tuples:
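The accumulator difference can be sketched as follows (hypothetical names; the attached code folds over unboxed vectors, while plain lists keep this sketch dependency-free):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bits ((.&.), popCount)
import Data.List (foldl')
import Data.Word (Word64)

-- Simplified accumulator: a single running count of matching bits.
scoreSimple :: Word64 -> [Word64] -> Int
scoreSimple mask = foldl' (\ !n w -> n + popCount (w .&. mask)) 0

-- Pair accumulator, closer to the original code (e.g. matched bits vs.
-- total bits): GHC must unbox the pair in the loop state to stay fast.
scorePair :: Word64 -> [Word64] -> (Int, Int)
scorePair mask = foldl' step (0, 0)
  where
    step (!m, !t) w = (m + popCount (w .&. mask), t + popCount w)
```

The interesting point is that the pair-shaped loop state gives the optimizer strictly more unboxing work to do than the single Int, so the two variants can plausibly take different code paths in the compiler.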

➜  ghc-bug stack exec performance-bug-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time                 333.8 μs   (333.0 μs .. 335.1 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 333.6 μs   (332.4 μs .. 335.8 μs)
std dev              5.651 μs   (3.233 μs .. 9.582 μs)

benchmarking 64 columns/binary packed
time                 39.06 μs   (38.98 μs .. 39.14 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 38.94 μs   (38.83 μs .. 39.14 μs)
std dev              495.0 ns   (310.2 ns .. 782.1 ns)

benchmarking 256 columns/raw unbox vectors
time                 336.9 μs   (336.2 μs .. 337.9 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 337.5 μs   (336.2 μs .. 339.8 μs)
std dev              5.757 μs   (2.946 μs .. 8.979 μs)

benchmarking 256 columns/binary packed
time                 239.8 μs   (237.6 μs .. 243.0 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 251.4 μs   (247.9 μs .. 259.8 μs)
std dev              11.50 μs   (5.662 μs .. 19.79 μs)
variance introduced by outliers: 37% (moderately inflated)

In GHC 8.0.2 we get a speed-up in both cases.

What may be related: using random-fu to generate random numbers seems to produce broken code on GHC 8.2.2 (the performance-bug-rfu.hs source).

Short description of attached sources:

  • performance-bug-pair-2.hs: using two binary representations
  • performance-bug-pair-1.hs: using one binary representation (one case commented out)
  • performance-bug-1.hs: using one binary representation with a simplified accumulator
  • performance-bug-2.hs: using two binary representations with a simplified accumulator
  • performance-bug-rfu.hs: using random-fu to generate data (optional)
  • stack-8.0.yaml: stack file for GHC-8.0.2
  • stack-8.2.yaml: stack file for GHC-8.2.2
  • stack-nightly.yaml: stack file for GHC-8.4
  • performance-bug.cabal: to be able to stack build everything.

Attachments (1)

ghc-bug.tar.gz (3.0 KB) - added by ttylec 17 months ago.
source code

Change History (30)

Changed 17 months ago by ttylec

Attachment: ghc-bug.tar.gz added

source code

comment:1 Changed 17 months ago by ttylec

As a supplement: it is related to optimization. I usually use -O2, but the effect is also present with -O1. With -O0 on GHC 8.0.2 there is no speed-up in the binarized version. A program compiled with GHC 8.2.2 with -O0 crashes(!)

comment:2 Changed 17 months ago by bgamari

Priority: normal → high

Thank you for the very complete ticket and repro!

We'll have a look.

comment:3 Changed 17 months ago by ttylec

It came to my mind that different stack snapshots could bring in different vector versions along with GHC. But that's not the case: both have 0.12.0.1.

Please let me know if I can be of any assistance. I tried to debug it further, but I have no CS background (I am a mathematical physicist), so you will need to treat me as a newbie.

comment:4 Changed 17 months ago by bgamari

Milestone: 8.6.1
Owner: set to bgamari

comment:5 Changed 14 months ago by bgamari

Milestone: 8.6.1 → 8.8.1

This won't be fixed in 8.6. Bumping to 8.8.

comment:6 Changed 14 months ago by tdammers

Hmm, I spent all day trying to reproduce this, but I'm getting exactly the results you'd want. I built the package with cabal, once using ghc 8.0.2, and once using 8.2.2 (both installed from linux-x86_64 release bundles downloaded directly from haskell.org), and I get the same behavior across both compilers, with performance metrics similar to what you're seeing in the 8.0.2 case.

Output from performance-bug-pair1, compiled with GHC 8.2.2:

tobias@zoidberg:~/well-typed/devel/ghc-T14980/ghc-bug/ > performance-bug-pair-1                                  
"Generated"
benchmarking 256 columns/raw unbox vectors
time                 478.6 μs   (477.8 μs .. 479.9 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 476.4 μs   (476.0 μs .. 477.0 μs)
std dev              1.501 μs   (1.076 μs .. 2.501 μs)

benchmarking 256 columns/binary packed
time                 275.2 μs   (271.7 μs .. 278.6 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 280.2 μs   (276.6 μs .. 283.2 μs)
std dev              6.706 μs   (5.090 μs .. 9.040 μs)
variance introduced by outliers: 13% (moderately inflated)


And with 8.0.2:

"Generated"
benchmarking 256 columns/raw unbox vectors
time                 476.9 μs   (474.4 μs .. 480.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 473.5 μs   (472.5 μs .. 475.1 μs)
std dev              4.097 μs   (2.212 μs .. 7.591 μs)

benchmarking 256 columns/binary packed
time                 291.3 μs   (286.8 μs .. 295.4 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 297.4 μs   (293.6 μs .. 300.1 μs)
std dev              6.918 μs   (4.971 μs .. 9.873 μs)
variance introduced by outliers: 12% (moderately inflated)

So there must be something about my setup that makes the bug disappear.

comment:7 Changed 14 months ago by ttylec

Hm... it's not exactly that the bug disappeared: you observed no speedup with the "binary packed" version in the first place. Notice that in my benchmarks the "binary packed" version is an order of magnitude faster than the "unbox vectors" one, and the bug is about losing that speed-up when compiling with 8.2.2 (and later).

In your case there was no speed-up to begin with. May I ask you to also check stack exec performance-bug-pair-2 and stack exec performance-bug-2? I am also curious what machine/system you tested on. Oh, and obviously optimization must be enabled (in case you didn't stack build it).

comment:8 in reply to:  7 Changed 14 months ago by tdammers

Replying to ttylec:

Hm... it's not exactly that the bug disappeared: you observed no speedup with the "binary packed" version in the first place. Notice that in my benchmarks the "binary packed" version is an order of magnitude faster than the "unbox vectors" one, and the bug is about losing that speed-up when compiling with 8.2.2 (and later).

Indeed, I noticed.

In your case there was no speed-up to begin with. May I ask you to also check stack exec performance-bug-pair-2 and stack exec performance-bug-2?

Stack failed on a cassava issue; I'm not exactly sure how to fix it, so I tried cabal first. If stack vs. cabal is the problem, however, then most likely it's a library problem rather than a GHC bug. I'll see if I can sort it out, so that maybe I can reproduce your results. However, I don't think optimization settings are the problem: these are specified in the .cabal file, not stack.yaml, I have not overridden anything, and I verified that -O2 actually gets passed to GHC. I have even compiled the programs manually, merely pointing GHC at the cabal sandbox's package cache, and got the same results regardless.

Results for performance-bug-2 are similar.

8.0.2:

"Generated"
benchmarking 64 columns/raw unbox vectors
time                 445.2 μs   (445.1 μs .. 445.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 443.4 μs   (442.9 μs .. 443.8 μs)
std dev              1.366 μs   (1.077 μs .. 1.655 μs)

benchmarking 64 columns/binary packed
time                 51.16 μs   (51.09 μs .. 51.24 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 50.95 μs   (50.90 μs .. 51.01 μs)
std dev              204.4 ns   (159.7 ns .. 264.0 ns)

benchmarking 256 columns/raw unbox vectors
time                 443.9 μs   (443.6 μs .. 444.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 442.7 μs   (442.1 μs .. 444.1 μs)
std dev              2.711 μs   (1.414 μs .. 5.048 μs)

benchmarking 256 columns/binary packed
time                 260.4 μs   (255.3 μs .. 266.2 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 266.6 μs   (263.1 μs .. 271.5 μs)
std dev              9.366 μs   (6.649 μs .. 13.29 μs)
variance introduced by outliers: 24% (moderately inflated)

8.2.2:

"Generated"
benchmarking 64 columns/raw unbox vectors
time                 445.0 μs   (444.7 μs .. 445.2 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 444.0 μs   (443.2 μs .. 447.0 μs)
std dev              4.654 μs   (1.118 μs .. 9.693 μs)

benchmarking 64 columns/binary packed
time                 51.13 μs   (51.11 μs .. 51.15 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 50.90 μs   (50.86 μs .. 50.94 μs)
std dev              146.1 ns   (122.9 ns .. 181.2 ns)

benchmarking 256 columns/raw unbox vectors
time                 440.4 μs   (440.1 μs .. 440.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 437.9 μs   (437.3 μs .. 438.4 μs)
std dev              1.797 μs   (1.576 μs .. 2.095 μs)

benchmarking 256 columns/binary packed
time                 289.5 μs   (285.1 μs .. 294.3 μs)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 295.6 μs   (292.0 μs .. 299.8 μs)
std dev              8.814 μs   (6.656 μs .. 11.69 μs)
variance introduced by outliers: 19% (moderately inflated)

I am curious what machine/system you tested on. Oh, and obviously optimization must be enabled (in case you didn't stack build it).

Debian 9, x86_64. Intel i5 CPU, 4 GB RAM, official GHC release builds.

I have a few possible explanations as to why we're seeing these differences:

  • Stack may be pulling in other GHC versions than the release bundles
  • Stack may be pulling in different version of some crucial library
  • Whatever platform you run on might trigger different code paths in GHC

I'll investigate further.

comment:9 Changed 14 months ago by tdammers

OK, some more experimenting.

Changed the .cabal file to force vector to 0.12.0.1 (the latest version on hackage); no difference, results are still more or less the same on both compilers.

Also managed to get things built with stack; the 8.2 results are also very similar:

"Generated"
benchmarking 64 columns/raw unbox vectors
time                 443.6 μs   (443.2 μs .. 443.9 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 441.6 μs   (441.2 μs .. 442.0 μs)
std dev              1.364 μs   (1.092 μs .. 1.738 μs)

benchmarking 64 columns/binary packed
time                 51.35 μs   (51.29 μs .. 51.44 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 51.07 μs   (51.00 μs .. 51.18 μs)
std dev              265.9 ns   (176.1 ns .. 453.5 ns)

benchmarking 256 columns/raw unbox vectors
time                 446.1 μs   (445.5 μs .. 447.2 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 443.9 μs   (443.4 μs .. 444.6 μs)
std dev              1.883 μs   (1.337 μs .. 3.374 μs)

benchmarking 256 columns/binary packed
time                 291.2 μs   (286.2 μs .. 296.9 μs)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 298.3 μs   (294.2 μs .. 301.2 μs)
std dev              7.891 μs   (5.688 μs .. 10.20 μs)
variance introduced by outliers: 15% (moderately inflated)

Stack-installed, with 8.0.2:

"Generated"
benchmarking 64 columns/raw unbox vectors
time                 412.3 μs   (411.7 μs .. 413.1 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 412.4 μs   (412.0 μs .. 413.1 μs)
std dev              1.643 μs   (1.124 μs .. 2.815 μs)

benchmarking 64 columns/binary packed
time                 21.89 μs   (21.87 μs .. 21.90 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 21.90 μs   (21.89 μs .. 21.91 μs)
std dev              38.13 ns   (26.94 ns .. 54.15 ns)

benchmarking 256 columns/raw unbox vectors
time                 417.6 μs   (416.0 μs .. 419.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 417.2 μs   (416.5 μs .. 419.1 μs)
std dev              3.403 μs   (935.8 ns .. 6.896 μs)

benchmarking 256 columns/binary packed
time                 29.56 μs   (29.51 μs .. 29.63 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.52 μs   (29.51 μs .. 29.56 μs)
std dev              71.00 ns   (36.37 ns .. 131.9 ns)

So apparently compiling with 8.0.2 through cabal gives different results than compiling through stack. Strange.

comment:10 Changed 14 months ago by tdammers

Even more strangely, I am now getting 8.0.2-like results using 8.2.2:

"Generated"
benchmarking 64 columns/raw unbox vectors
time                 417.9 μs   (417.5 μs .. 418.3 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 418.3 μs   (417.8 μs .. 419.6 μs)
std dev              2.364 μs   (827.2 ns .. 4.744 μs)

benchmarking 64 columns/binary packed
time                 22.95 μs   (22.90 μs .. 23.01 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 22.94 μs   (22.91 μs .. 22.97 μs)
std dev              105.9 ns   (75.87 ns .. 146.9 ns)

benchmarking 256 columns/raw unbox vectors
time                 432.7 μs   (422.4 μs .. 443.9 μs)
                     0.997 R²   (0.995 R² .. 0.999 R²)
mean                 423.4 μs   (420.3 μs .. 428.6 μs)
std dev              13.41 μs   (7.712 μs .. 19.14 μs)
variance introduced by outliers: 25% (moderately inflated)

benchmarking 256 columns/binary packed
time                 29.75 μs   (29.51 μs .. 30.16 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 29.64 μs   (29.57 μs .. 29.86 μs)
std dev              395.1 ns   (120.1 ns .. 801.2 ns)

I'm starting to suspect that the decisive factor here is not GHC itself, but rather something else. Quite possibly, library versions play a role - the cabal builds for 8.0.2 and 8.2.2 use the same versions for all libraries (except base), while the stack builds differ in a few of them.

Another possible explanation could be that while I'm forcing the compiler itself when compiling with cabal (--with-compiler / --with-ghc), I am not doing the same for any of the other tools cabal may be using, and those might not resolve in a compiler-specific way but rather just use whatever is installed system-wide (which happens to be from GHC 8.0.2 right now, but was 8.4.2 when I first started profiling).

comment:11 Changed 14 months ago by simonpj

Perhaps the difference between stack and cabal comes from the two making different choices about which library versions they use.

comment:12 Changed 14 months ago by tdammers

I have written a script to analyze the library selections in the 4 different build scenarios. Cabal with GHC 8.0.2 and cabal with GHC 8.2.2 both pull in the exact same libraries (except for base, obv.); so I only compared cabal with GHC 8.0.2 (good1), stack LTS 9.21 (good2), and stack LTS 11.1 (bad).

I deleted ~/.stack, made clean builds for each of the three, and verified that the good ones show the desired behavior and the bad one regresses. Then I listed the packages that got installed (from .stack/snapshots/... and .cabal-sandbox/.../...packages.conf.d) and compared the three lists. I threw out all packages that met any of the following criteria:

  • the package is not in all 3 lists (meaning that it's not a direct dependency, and the difference must also be reflected in some other package choice)
  • the package version for the bad list matches one of the good lists exactly (meaning that it cannot be the culprit)

The remaining list is this (columns giving the installed versions for the good1, good2 and bad lists in this order):

                         aeson:      1.4.0.0 |      1.1.2.0 |      1.2.4.0
                     criterion:      1.4.1.0 |      1.1.4.0 |      1.3.0.0
                     primitive:      0.6.4.0 |      0.6.2.0 |      0.6.3.0
                    scientific:      0.3.6.2 |      0.3.5.2 |      0.3.5.3
                  ansi-terminal:      0.8.0.4 |      0.6.3.1 |      0.8.0.2

This is interesting in that each bad version lies between the corresponding good2 and good1 versions. So if any of these causes the badness, the problem must be a regression that was introduced after the good2 version and fixed again before the good1 version.

Another remarkable fact is that Cabal installs the exact same versions of all the libraries for GHC 8.0.2 and GHC 8.2.2, despite base versions being different - this could mean one of two things:

a) Cabal fails to detect the correct version of base (using the system-wide GHC install instead of the one demanded by --with-compiler), so we end up resolving libraries for 8.0.2, and it just happens to compile cleanly with 8.2.2 anyway; or

b) All the required libraries happen to have bounds on base that are compatible with both 8.0.2 and 8.2.2, so Cabal's resolver ends up producing the exact same solution for both compilers.

a) would be a bug, b) would be a lucky coincidence.

comment:13 Changed 14 months ago by simonpj

If the badness is only for a particular stack LTS, I wonder if the badness will go away in the next stack LTS and we can declare this ticket moot?

I'd suggest the next thing to investigate: do a profiled build and see where the factor of 100 is going. It's such a huge change it should be blindingly obvious :-).

comment:14 Changed 14 months ago by tdammers

Profiled builds don't seem to reveal much, because as soon as I enable profiling, the performance difference between LTS 9 / GHC 8.0.2 and LTS 11 / GHC 8.2.2 disappears almost entirely - in fact, a profiling build from LTS 9 is *slower* than one from LTS 11.

Profile output from LTS 9:

        Tue Jun 26 16:12 2018 Time and Allocation Profiling Report  (Final)

           performance-bug-2 +RTS -p -RTS

        total time  =       28.28 secs   (28277 ticks @ 1000 us, 1 processor)
        total alloc = 24,003,734,896 bytes  (excludes profiling overheads)

COST CENTRE       MODULE                            SRC                                                    %time %alloc

>>=               Data.Vector.Fusion.Util           Data/Vector/Fusion/Util.hs:36:3-18                      12.4   19.9
compile'.\        Main                              performance-bug-2.hs:40:22-33                           11.4    4.3
basicUnsafeIndexM Data.Vector.Primitive             Data/Vector/Primitive.hs:222:3-75                       10.9   16.1
matchPacked       Main                              performance-bug-2.hs:(108,1)-(111,27)                    8.8   11.0
uniform           System.Random.MWC                 System/Random/MWC.hs:217:5-33                            8.7   15.0
basicUnsafeIndexM Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:266:841-899                  6.6    0.0
matchPacked.go    Main                              performance-bug-2.hs:(110,5)-(111,27)                    3.7    0.4
>>=               Data.Vector.Fusion.Util           Data/Vector/Fusion/Util.hs:50:3-19                       3.3    2.6
getOverhead       Criterion.Monad                   Criterion/Monad.hs:(47,1)-(56,12)                        3.3    0.0
basicUnsafeWrite  Data.Vector.Primitive.Mutable     Data/Vector/Primitive/Mutable.hs:115:3-69                2.7    2.3
basicUnsafeIndexM Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:345:3-73                     2.6   11.2
matched.matches   Main                              performance-bug-2.hs:28:5-27                             2.4    0.0
basicUnsafeIndexM Data.Vector.Unboxed.Base          internal/unbox-tuple-instances:(452,3)-(458,29)          2.2    0.0
compile'          Main                              performance-bug-2.hs:40:1-33                             1.7    0.0
compile.cc        Main                              performance-bug-2.hs:37:5-28                             1.6    0.0
getGCStats        Criterion.Measurement             Criterion/Measurement.hs:(46,1)-(48,16)                  1.5    0.0
matched           Main                              performance-bug-2.hs:(23,1)-(29,23)                      1.5    2.4
basicLength       Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:343:3-42                     1.1    0.0
fmap              Data.Vector.Fusion.Stream.Monadic Data/Vector/Fusion/Stream/Monadic.hs:(133,3)-(135,20)    0.7    1.0
toPacked4.inst    Main                              performance-bug-2.hs:88:5-26                             0.6    1.1
pack64bit         Main                              performance-bug-2.hs:(59,1)-(62,23)                      0.6    1.8

And LTS 11:

        Tue Jun 26 14:26 2018 Time and Allocation Profiling Report  (Final)

           performance-bug-2 +RTS -p -RTS

        total time  =       23.79 secs   (23791 ticks @ 1000 us, 1 processor)
        total alloc = 17,329,594,624 bytes  (excludes profiling overheads)

COST CENTRE       MODULE                            SRC                                                    %time %alloc

>>=               Data.Vector.Fusion.Util           Data/Vector/Fusion/Util.hs:36:3-18                      19.4   26.9
basicUnsafeIndexM Data.Vector.Primitive             Data/Vector/Primitive.hs:222:3-75                       14.8   18.3
compile'.\        Main                              performance-bug-2.hs:40:22-33                           13.0    5.0
matchPacked       Main                              performance-bug-2.hs:(108,1)-(111,27)                    7.2    8.6
>>=               Data.Vector.Fusion.Util           Data/Vector/Fusion/Util.hs:50:3-19                       4.2    3.0
basicUnsafeIndexM Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:266:841-899                  3.9    0.0
matchPacked.go    Main                              performance-bug-2.hs:(110,5)-(111,27)                    3.4    0.4
primitive         Control.Monad.Primitive           Control/Monad/Primitive.hs:178:3-16                      3.2    0.2
basicUnsafeIndexM Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:345:3-73                     2.9   13.2
matched.matches   Main                              performance-bug-2.hs:28:5-27                             2.7    0.0
uniform           System.Random.MWC                 System/Random/MWC.hs:217:5-33                            2.0    0.9
matched           Main                              performance-bug-2.hs:(23,1)-(29,23)                      2.0    2.8
basicUnsafeIndexM Data.Vector.Unboxed.Base          internal/unbox-tuple-instances:(452,3)-(458,29)          1.8    0.0
compile'          Main                              performance-bug-2.hs:40:1-33                             1.7    0.0
basicUnsafeWrite  Data.Vector.Primitive.Mutable     Data/Vector/Primitive/Mutable.hs:115:3-69                1.5    4.2
compile.cc        Main                              performance-bug-2.hs:37:5-28                             1.5    0.0
basicLength       Data.Vector.Unboxed.Base          Data/Vector/Unboxed/Base.hs:343:3-42                     1.2    0.0
toPacked4.inst    Main                              performance-bug-2.hs:88:5-26                             0.9    1.5
basicUnsafeRead   Data.Vector.Primitive.Mutable     Data/Vector/Primitive/Mutable.hs:112:3-63                0.7    1.7
fmap              Data.Vector.Fusion.Stream.Monadic Data/Vector/Fusion/Stream/Monadic.hs:(133,3)-(135,20)    0.7    1.1
take64.(...)      Main                              performance-bug-2.hs:66:17-38                            0.6    2.3

And the main culprits are roughly the same, give or take ordering.

Worth noting: the top cost centres are from the program itself, and from the vector package, but both builds use the same version of that library.

Last edited 14 months ago by tdammers

comment:15 in reply to:  13 Changed 14 months ago by tdammers

Replying to simonpj:

If the badness is only for a particular stack LTS, I wonder if the badness will go away in the next stack LTS and we can declare this ticket moot?

That would of course be a "beautiful" solution. I could try pinning it down to a specific library and forcing that library to the newest version in the LTS 11.1 build, to see if that makes any difference.

Another thing to try would be to create a cabal.config with the exact library versions that stack uses; if that reproduces the badness, then it's the libraries, but if it doesn't, then the most likely explanation is that cabal does something wrong.

Yet another thing would be to set up two VMs, each of them having only one GHC toolchain installed, one for 8.0.2 and one for 8.2.2, and see if *that* has us reproduce the observed behavior.

comment:16 Changed 14 months ago by ttylec

Indeed (sorry I didn't write earlier): enabling profiling kills the performance boost.

I suspect that the order-of-magnitude speedup is due to some really low-level optimization. When the analogous code runs as part of a bigger program, the unboxed vector version scales linearly across cores up to about 20 cores, while the binary packed version does not scale linearly even with 2 cores.

Last edited 14 months ago by ttylec

comment:17 Changed 14 months ago by ttylec

Moreover, I tried to reproduce my previous results (I had planned to switch from criterion to hs-gauge to lighten the dependencies) and now I cannot reproduce them on my machine (?!). I consistently get the order-of-magnitude speedup with ghc-8.0.2, ghc-8.2.2 (from lts-11.1 and lts-11.15), as well as ghc-8.4.3 (nightly snapshot).

I am trying to figure out what changed... and the only differences I can find are that I installed the newest Ubuntu in the meantime and that I have a newer version of the stack tool. This is totally confusing. GHC is installed by stack from binaries too... but I would assume that packages tagged with a version number don't change over time.

BTW, the times I get currently:

(zettafox) ➜  ghc-bug stack exec performance-bug-pair-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time                 377.3 μs   (372.6 μs .. 383.3 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 379.4 μs   (375.5 μs .. 387.9 μs)
std dev              19.02 μs   (12.01 μs .. 30.94 μs)
variance introduced by outliers: 46% (moderately inflated)

benchmarking 64 columns/binary packed
time                 21.32 μs   (21.20 μs .. 21.45 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 21.21 μs   (21.10 μs .. 21.35 μs)
std dev              426.0 ns   (334.1 ns .. 567.0 ns)
variance introduced by outliers: 18% (moderately inflated)

benchmarking 256 columns/raw unbox vectors
time                 373.3 μs   (366.7 μs .. 381.9 μs)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 368.1 μs   (365.6 μs .. 374.9 μs)
std dev              12.72 μs   (8.188 μs .. 22.90 μs)
variance introduced by outliers: 29% (moderately inflated)

benchmarking 256 columns/binary packed
time                 27.32 μs   (27.14 μs .. 27.53 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 27.30 μs   (27.16 μs .. 27.49 μs)
std dev              535.0 ns   (432.6 ns .. 704.7 ns)
variance introduced by outliers: 17% (moderately inflated)

comment:18 Changed 14 months ago by tdammers

One thing to try would be to delete ~/.stack and the project's .stack-work directory between builds. This should give you a clean slate.

comment:19 Changed 14 months ago by tdammers

Speaking of clean slates: I just noticed that $PATH can mess up the testing, because stack defaults to installing binaries into ~/.local/bin/, while cabal installs either into ~/.cabal/bin (if no sandbox is used) or into ./.cabal-sandbox/bin (if a sandbox is used). So you have to be careful to make sure you're running the correct binary; just relying on $PATH might explain why switching between configurations or build tools appears to leave some residue.

comment:20 Changed 14 months ago by tdammers

Another data point: I dumped stack's cabal.config for the LTS 11.1 release into my cabal project and compiled with --with-compiler=/usr/local/bin/ghc-8.2.2, and the result I'm getting is this:

"Generated"
benchmarking 64 columns/raw unbox vectors
time                 460.0 μs   (447.0 μs .. 473.6 μs)
                     0.995 R²   (0.992 R² .. 0.997 R²)
mean                 446.4 μs   (440.2 μs .. 455.1 μs)
std dev              24.42 μs   (17.29 μs .. 31.22 μs)
variance introduced by outliers: 48% (moderately inflated)

benchmarking 64 columns/binary packed
time                 52.25 μs   (51.66 μs .. 53.04 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 52.60 μs   (51.99 μs .. 53.67 μs)
std dev              2.665 μs   (1.919 μs .. 4.073 μs)
variance introduced by outliers: 55% (severely inflated)

benchmarking 256 columns/raw unbox vectors
time                 439.9 μs   (434.4 μs .. 447.4 μs)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 439.0 μs   (435.0 μs .. 446.4 μs)
std dev              17.95 μs   (10.79 μs .. 27.38 μs)
variance introduced by outliers: 35% (moderately inflated)

benchmarking 256 columns/binary packed
time                 304.7 μs   (288.7 μs .. 330.4 μs)
                     0.965 R²   (0.940 R² .. 0.998 R²)
mean                 324.1 μs   (302.9 μs .. 364.4 μs)
std dev              62.33 μs   (26.19 μs .. 97.61 μs)
variance introduced by outliers: 91% (severely inflated)

However, I'm getting the same bad behavior when I pin libraries to LTS 11.15, and also when I don't pin libraries at all. So it's not the libraries.

comment:21 Changed 14 months ago by tdammers

Performed an evil experiment: I put the cabal.config from LTS-11.1 into the project, then compiled with cabal, forcing the compiler to 8.0.2. Behavior is "good" (i.e., the 4th benchmark is in the 20-40 μs range). I also get the "good" behavior when compiling with 8.0.2 but without pinning.

So it's definitely not the libraries.

comment:22 in reply to:  20 Changed 14 months ago by ttylec

Replying to tdammers:

This is not totally "bad" behavior. This:

"Generated"
benchmarking 64 columns/raw unbox vectors
time                 460.0 μs   (447.0 μs .. 473.6 μs)
                     0.995 R²   (0.992 R² .. 0.997 R²)
mean                 446.4 μs   (440.2 μs .. 455.1 μs)
std dev              24.42 μs   (17.29 μs .. 31.22 μs)
variance introduced by outliers: 48% (moderately inflated)

benchmarking 64 columns/binary packed
time                 52.25 μs   (51.66 μs .. 53.04 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 52.60 μs   (51.99 μs .. 53.67 μs)
std dev              2.665 μs   (1.919 μs .. 4.073 μs)
variance introduced by outliers: 55% (severely inflated) 

this is "good": we have a significant speedup. EDIT: sorry, in the meantime I forgot that we always had a speedup in the 64-column case. But this:

benchmarking 256 columns/raw unbox vectors
time                 439.9 μs   (434.4 μs .. 447.4 μs)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 439.0 μs   (435.0 μs .. 446.4 μs)
std dev              17.95 μs   (10.79 μs .. 27.38 μs)
variance introduced by outliers: 35% (moderately inflated)

benchmarking 256 columns/binary packed
time                 304.7 μs   (288.7 μs .. 330.4 μs)
                     0.965 R²   (0.940 R² .. 0.998 R²)
mean                 324.1 μs   (302.9 μs .. 364.4 μs)
std dev              62.33 μs   (26.19 μs .. 97.61 μs)
variance introduced by outliers: 91% (severely inflated)

is "bad". However, what I observed with the full code of our project is that the speed-up is lost once we exceed a specific number of columns... and that number is platform-specific (AMD performs worst; Intel is usually good, but the i5 CPU in a MacBook Pro seems to do better than the i7 in a Lenovo running Ubuntu).

But since on the same platform you get different results for 256 columns while still having a speedup with 64 columns, it makes me wonder: could the system kernel and/or libraries be affecting this?

As for me not being able to reproduce my original report: I did try removing ~/.stack and ~/.stack-work, both with split-objs in the stack config and without. I still can't get "bad" results on my current hardware/software.

I am using only stack; I don't have a system-wide GHC.

I will try to do more tests on a different machine (with Debian 9), and try the MacBook at home too.

But to sum up what we know so far: the libraries are ruled out, and the compiler version seems to be ruled out too. What's left? The GHC binary package, the OS kernel, system libs? Does any of that make sense?

Last edited 14 months ago by ttylec

comment:23 Changed 14 months ago by tdammers

So, to summarize the results I have so far on my own machine (i5, debian):

  • The 64-column version sees a massive improvement throughout, on any compiler and with any set of libraries; however, when the 256-column version performs badly, the 64-column version is slower by a factor of about 2 (~50 μs vs. ~25 μs).
  • Whether the improvement is seen does not seem to depend on the libraries: the same set of libraries, when forced with cabal, will sometimes show the improvement and sometimes not.
  • I have seen both 8.0.2 and 8.2.2 produce "good" and "bad" behavior; however, some of that may be due to accidentally running the wrong binary. The test runs in which I made absolutely sure I was running the correct binaries, and started with a clean slate, do show that 8.0.2 consistently produces "good" code, while 8.2.2 consistently produces "bad" code.

So that leaves us with relatively obscure things.

One thing to check would be which exact binaries stack pulls in. The GHC versions may match, but the build configuration might not.

Kernel and system libs were all identical between my attempts, so I doubt those play a role.
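
A sketch of the binary check mentioned above (a guarded sketch: the guards simply skip tools that are not installed on the machine at hand):

```shell
# Sketch: confirm which GHC binary each build tool actually runs.
# Guards skip tools that are not installed on this machine.
if command -v stack >/dev/null 2>&1; then
  stack path --compiler-exe        # the exact GHC binary stack invokes
fi
if command -v ghc >/dev/null 2>&1; then
  # Build-configuration details of the GHC on $PATH, if any
  ghc --info | grep 'Project version\|Build platform'
fi
checked=done
```

Comparing the `ghc --info` output of the two installations would show whether the build configurations differ even when the version numbers match.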

comment:24 Changed 14 months ago by simonpj

This smells to me like a crucial fusion rule not firing. You could try compiling with -ddump-rule-firings and see if the "good" and "bad" output differs.
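
One way to capture those dumps for comparison (a sketch; the compiler names and the Main.hs path are assumptions, and -ddump-rule-firings writes its log to stderr):

```shell
# Sketch (assumed compiler names and source path): capture rule
# firings from both compilers for later diffing. The dump goes to
# stderr, so redirect stream 2. Guards skip missing compilers.
for g in ghc-8.0.2 ghc-8.2.2; do
  if command -v "$g" >/dev/null 2>&1; then
    "$g" -O2 -fforce-recomp -ddump-rule-firings Main.hs 2> "rule-firings-$g"
  fi
done
dumped=attempted
```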

comment:25 Changed 14 months ago by tdammers

Ah yes, great. I had already suspected fusion rules, but hadn't thought of -ddump-rule-firings. I'll give it a spin.

comment:26 Changed 14 months ago by tdammers

OK, for a very rough first impression of RULES that fire in either case, I did this:

tobias@zoidberg:~/well-typed/devel/ghc-T14980/ > diff <(grep '^Rule' ghc-bug-8.0.2/rule-firings | sort -u) <(grep '^Rule' ghc-bug-8.2.2/rule-firings | sed -e 's/\s(.*)$//' | sort -u) -u | grep '^[+-]'
--- /proc/self/fd/13    2018-06-28 12:26:37.103065551 +0200
+++ /proc/self/fd/14    2018-06-28 12:26:37.103065551 +0200
-Rule fired: >=#
+Rule fired: ==#
+Rule fired: >#
-Rule fired: Class op <
-Rule fired: Class op <*>
-Rule fired: Class op >=
-Rule fired: Class op fmap
+Rule fired: Class op liftA2
-Rule fired: Class op $p1Applicative
-Rule fired: eftIntList
+Rule fired: divInt#
+Rule fired: foldr2/left
-Rule fired: narrow32Word#
-Rule fired: narrow8Word#
-Rule fired: or#
-Rule fired: SC:foldlM_loop0
-Rule fired: SC:foldlM_loop1
+Rule fired: SC:go0
-Rule fired: SC:$j1
+Rule fired: SC:$wfoldlM_loop0
-Rule fired: seq of cast
+Rule fired: stream/unstream [Vector]
+Rule fired: uncheckedShiftL#

That is, I compiled the performance-bug-2 program with both compilers using -ddump-rule-firings, and then filtered the dump as follows:

  1. Retain only the actual rule firings
  2. Sort by rule name
  3. Remove duplicates
  4. Strip the extra information that is only present in 8.2.2 output
  5. Diff
  6. Retain only + and - lines.

+ lines are rules that only fire on 8.2.2, - lines fire only on 8.0.2.

I don't know which of these makes the decisive difference, but I suspect it is one of the built-in rules.
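
The normalization pipeline above can be demonstrated on synthetic input; here is a self-contained sketch (the rule names in the sample files are illustrative, not taken from the actual dumps):

```shell
# Two tiny synthetic dump files standing in for the real
# -ddump-rule-firings output of the two compilers.
printf 'Rule fired: or#\nRule fired: or#\nRule fired: seq of cast\n' > firings-old
printf 'Rule fired: divInt# (GHC.Base)\nRule fired: seq of cast (GHC.Base)\n' > firings-new
# Same steps as above: keep firings, strip the 8.2.2-style
# "(Module)" suffix, dedupe, diff, and keep only the +/- rule lines.
grep '^Rule' firings-old | sort -u > old.norm
grep '^Rule' firings-new | sed -e 's/ (.*)$//' | sort -u > new.norm
out=$(diff -u old.norm new.norm | grep '^[+-]Rule')
printf '%s\n' "$out"
```

With this input, the pipeline reports `or#` as firing only in the old dump and `divInt#` as firing only in the new one, which is exactly the shape of the comparison above.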

comment:27 Changed 14 months ago by bgamari

While I've not looked at the program, I would guess eftIntList. That rule allows us to deforest a consumed list produced by enumFromTo.

comment:28 Changed 10 months ago by ttylec

Sorry for long inactivity.

I had some time to do more tests recently, based on your suggestions.

To focus attention, I tried to investigate the difference between:

  • performance-bug-pair-1: 256 columns
  • performance-bug-pair-2: 64 and 256 columns; the same code as performance-bug-pair-1, but with the 64-column check uncommented

On ghc-8.2.2 and ghc-8.0.2.

  • on ghc-8.2.2, performance-bug-pair-2 behaves as expected -- we have a speedup for both 64 and 256 columns -- while performance-bug-pair-1 shows no speed-up.
  • on ghc-8.0.2, I see a speedup in both cases.

Based on @tdammers' example, I diffed the fired rules between performance-bug-pair-1 and performance-bug-pair-2, using an analogous diff expression:

➜  ghc-bug diff <(grep '^Rule' rules-firing1-1 | sort -u) <(grep '^Rule' rules-firing-2 | sort -u) -u | grep '^[+-]'

--- /proc/self/fd/13    2018-10-23 15:49:27.063541325 +0200
+++ /proc/self/fd/14    2018-10-23 15:49:27.067541342 +0200
-Rule fired: eftIntList (GHC.Enum)
-Rule fired: stream/unstream [Vector] (Data.Vector.Generic)
+Rule fired: SPEC matchPacked @ BPack4 (Main)
+Rule fired: SPEC matchPacked @ BPack (Main)

- lines are rules that fired only for performance-bug-pair-1 (the bad one) and + lines are rules that fired only for performance-bug-pair-2.

It's not about specialization, because I even tried to compile performance-bug-pair-1 with an explicit monomorphic type.

The only rule common to both comparisons that fired in the bad case and did not fire in the good one is Rule fired: stream/unstream [Vector] (Data.Vector.Generic).

Not that I understand it... but it seems that firing that rule interferes with some GHC optimizations?

comment:29 Changed 8 months ago by osa1

Milestone: 8.8.1 → 8.10.1

Bumping milestones of some high and highest priority tickets.
