Opened 11 months ago

Last modified 9 months ago

#16004 new bug

Vector performance regression in GHC 8.6

Reported by: guibou Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.6.2
Keywords: Cc: alpmestan, simonpj, maoe, nh2
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Hello.

With the following code, I can observe a performance regression between ghc 8.4 and 8.6:

{-# LANGUAGE ScopedTypeVariables #-}
module Main where
import qualified Data.Vector.Unboxed.Mutable as Vector
import qualified Data.Vector.Unboxed as VectorU

import Data.Foldable (for_)

main :: IO ()
main = do
  let n = 1000

  let vUnmutable :: VectorU.Vector Double = VectorU.generate (n * n) (\i -> fromIntegral i)
  v :: Vector.IOVector Double <- VectorU.unsafeThaw vUnmutable

  for_ [0..(n - 1)] $ \k -> do
    for_ [0..(n - 1)] $ \i -> do
      for_ [0..(n - 1)] $ \j -> do
        a <- Vector.unsafeRead v (i * n + k)
        b <- Vector.unsafeRead v (k * n + j)
        c <- Vector.unsafeRead v (i * n + j)
        Vector.unsafeWrite v (i * n + j) (min (a + b) c)

Built with -O2 and with / without -fllvm. I'm using vector-0.12.0.1. Here are the timing results:

GHC 8.2.2

no llvm: 1.7s llvm: 1.0s

GHC 8.4.4

no llvm: 1.6s llvm: 0.9s

GHC 8.6.2

no llvm: 4.8s llvm: 4.3s

I'm using the following bash + nix script to gather theses timings:

nix-shell -p 'haskell.packages.ghc822.ghcWithPackages(p: [p.vector])' --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp; time ./FloydBench"
nix-shell -p 'haskell.packages.ghc822.ghcWithPackages(p: [p.vector])' -p llvm_39 --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp -fllvm; time ./FloydBench"
nix-shell -p 'haskell.packages.ghc844.ghcWithPackages(p: [p.vector])' --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp; time ./FloydBench"
nix-shell -p 'haskell.packages.ghc844.ghcWithPackages(p: [p.vector])' -p llvm_5 --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp -fllvm; time ./FloydBench"
nix-shell -p 'haskell.packages.ghc862.ghcWithPackages(p: [p.vector])' --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp; time ./FloydBench"
nix-shell -p 'haskell.packages.ghc862.ghcWithPackages(p: [p.vector])' -p llvm_6 --run "ghc-pkg list | grep vector; ghc -O2 FloydBench.hs -Wall -fforce-recomp -fllvm; time ./FloydBench"

Change History (11)

comment:1 Changed 11 months ago by alpmestan

Cc: alpmestan added

comment:2 Changed 11 months ago by carter

Ouch! Thanks for reporting this.

comment:3 Changed 11 months ago by RyanGlScott

Cc: simonpj added

This regression was introduced in commit 3d38e8284b7382844f9862e8d8afbae9c7248b09 (Do not unpack class dictionaries with INLINABLE).

comment:4 Changed 11 months ago by AndreasK

Reproduced with 8.4.3 and 8.6.1

For the original code using -fno-full-laziness performance is almost the same for 8.4 and 8.6, and what little difference there is probably comes from using a different branch order at the Cmm level.

$ ~/bench-exe.exe ./test-8.6-nofloat.exe -- ./test-8.4.exe
benchmarking execute: ./test-8.6-nofloat.exe
time                 1.632 s    (1.352 s .. 1.859 s)
                     0.997 R²   (0.988 R² .. 1.000 R²)
mean                 1.629 s    (1.600 s .. 1.658 s)
std dev              49.03 ms   (0.0 s .. 49.85 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking execute: ./test-8.4.exe
time                 1.646 s    (1.493 s .. 1.863 s)
                     0.998 R²   (0.994 R² .. NaN R²)
mean                 1.597 s    (1.560 s .. 1.622 s)
std dev              37.88 ms   (0.0 s .. 43.65 ms)
variance introduced by outliers: 19% (moderately inflated)

The difference between full-laziness not seems to be that with full-laziness we float out the creation of the [0..n] list, instead of transforming the code into a simple loop as intended.

So we end up with this inner loop that passes around the list explicitly. I assume deforestation fails here?

joinrec {
  go2_s7mc
  go2_s7mc ds3_X6W3 eta2_X3r
    = case ds3_X6W3 of {
        [] -> jump exit2_XF eta2_X3r;
        : y3_X6Yn ys2_X6Yq ->
          case readDoubleArray#
                  (ipv1_a6zy `cast` <Co:50>)
                  (+# (*# x1_a5F9 1000#) x_a69v)
                  (eta2_X3r `cast` <Co:14>)
          of
          { (# ipv4_X6am, ipv5_X6ao #) ->
          case y3_X6Yn of { I# y4_X6c8 ->
          case writeDoubleArray#
                  (ipv1_a6zy `cast` <Co:50>)
                  (+# (*# x1_a5F9 1000#) y4_X6c8)
                  ipv5_X6ao
                  ipv4_X6am
          of s'#1_X6b3
          { __DEFAULT ->
          jump go2_s7mc ys2_X6Yq (s'#1_X6b3 `cast` <Co:13>)
          }
          }
          }
      }; }

Not an expert in the deforestation machinery so leaving that for someone else.

Last edited 11 months ago by AndreasK (previous) (diff)

comment:5 Changed 11 months ago by simonpj

with full-laziness we float out the creation of the [0..n] list,

This particular thing is quite delicate. Consider

f xs = map (\x -> foo x [1..100]) xs

Should we float that [1..100] out to top level, and share it among all the calls that map makes? Or should we reconstruct it on every call. In the latter case we might be able to fuse it with the consumer (perhaps we inline foo). But even if we fuse away the list we might still allocate heap objects for I# 1, I# 2, ... I# 100, for each element of xs rather sharing those values between all those calls. Or, perhaps, we get to fuse away those I# boxes too! It all depends on foo.

I don't really have a good answer here. Sometimes full laziness is a win, sometimes not.

There is more discussion in #7206.

I'd love someone to work on this a bit. Somehow we should do better than we are doing.

PS: I have no idea why the particular commit that made this test worse did make it worse; that might be worth investigation.

Another thing that would be v useful is to distil the example into a standalone test case. vector is a complicated library, but David's analysis above isolates the issue well. Maybe someone can turn that into a repro case?

comment:6 Changed 11 months ago by carter

Simon,I think you mean Andreas, not David ;)

comment:7 Changed 11 months ago by guibou

Simon: Is the following example without vector interesting as a standalone test case? :

import Data.Foldable (for_)

main :: IO ()
main = do
  let n = 1000

  for_ [0 :: Int ..(n - 1)] $ \_k -> do
    for_ [0 :: Int ..(n - 1)] $ \_i -> do
      for_ [0 :: Int ..(n - 1)] $ \_j -> do
        pure ()

Timings:

ghc 8.2.2: 0.667s ghc 8.4.4: 0.357s ghc 8.6.2: 1.007s

Note that I have only included the results without using -fllvm because apparently the llvm backend is "smart" enough to understand that the loops can be removed and results in a runtime of 0.001s.

Version 1, edited 11 months ago by guibou (previous) (next) (diff)

comment:8 Changed 11 months ago by simonpj

Yes, that's great. I think we know why this is happening too.

Can someone try the patch in #7206, specifically wip/cheap-build-osa1, and see if it nails this case? And perhaps do a new nofib run.

The conclusion on #7206 was that we should probably accept the patch -- and if it cures this vector problem that'd be a strong positive reason to do so.

comment:9 Changed 10 months ago by bgamari

Milestone: 8.6.3

Ticket retargeted after milestone closed

comment:10 Changed 9 months ago by maoe

Cc: maoe added

comment:11 Changed 9 months ago by nh2

Cc: nh2 added
Note: See TracTickets for help on using tickets.