Opened 8 years ago

Closed 7 years ago

#5623 closed bug (fixed)

GHC 7.2.1 Performance Regression: Vector

Reported by: dterei Owned by: simonpj
Priority: high Milestone: 7.4.2
Component: Compiler Version: 7.3
Keywords: Cc: jwlato@…, wren@…, v.dijk.bas@…, pho@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: simplCore/should_compile/T5623
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

This program shows a severe performance drop under GHC 7.2.1 compared to GHC 7.0.4. I've tested with GHC HEAD and the performance drop is still there.

{-# LANGUAGE BangPatterns #-}

{-
    ghc 6.12.1 -O2
    1.752
-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types

-- Define a 4 element vector type
data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat

------------------------------------------------------------------------

-- Ensure we can store it in an array
instance Storable Vec4 where
  sizeOf _ = sizeOf (undefined :: CFloat) * 4
  alignment _ = alignment (undefined :: CFloat)

  {-# INLINE peek #-}
  peek p = do
             a <- peekElemOff q 0
             b <- peekElemOff q 1
             c <- peekElemOff q 2
             d <- peekElemOff q 3
             return (Vec4 a b c d)
    where
      q = castPtr p

  {-# INLINE poke #-}
  poke p (Vec4 a b c d) = do
                            pokeElemOff q 0 a
                            pokeElemOff q 1 b
                            pokeElemOff q 2 c
                            pokeElemOff q 3 d
    where
      q = castPtr p

------------------------------------------------------------------------

a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6

add :: Vec4 -> Vec4 -> Vec4
{-# INLINE add #-}
add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')

mult :: Vec4 -> Vec4 -> Vec4
{-# INLINE mult #-}
mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')

vsum :: Vec4 -> CFloat
{-# INLINE vsum #-}
vsum (Vec4 a b c d) = a+b+c+d

multList :: Int -> Vector Vec4 -> Vector Vec4
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.map (\v -> add (mult v m) a) src

main = do
    print $ Data.Vector.Storable.sum
          $ Data.Vector.Storable.map vsum
          $ multList repCount
          $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)

repCount, arraySize :: Int
repCount = 10000
arraySize = 20000

Timings on my machine:

 * GHC 7.0.3 (-fasm -O2): 1.481s
 * GHC 7.2.1 (-fasm -O2): 2.050s
 * GHC HEAD  (-fasm -O2): 2.051s

The bug occurs with the llvm backend as well, but just test with '-fasm' as this test case came from a different performance bug specific to the llvm backend I was tracking (#4223).

Change History (14)

comment:1 Changed 8 years ago by rl

This seems to be due to a rather unfortunate duplication of shared computations. Here is a simplified example:

module Foo where

import qualified Data.Vector.Storable as V
import Foreign
import Control.Monad

data Vec2 = Vec2 {-# UNPACK #-} !Float
                 {-# UNPACK #-} !Float

instance Storable Vec2 where
  sizeOf _ = sizeOf (undefined :: Float) * 2
  alignment _ = alignment (undefined :: Float)

  {-# INLINE peek #-}
  peek p = liftM2 Vec2 (peekElemOff q 0) (peekElemOff q 1)
    where
      q = castPtr p

  {-# INLINE poke #-}
  poke p (Vec2 a b) = pokeElemOff q 0 a >> pokeElemOff q 1 b
    where
      q = castPtr p


vsum :: Vec2 -> Float
{-# INLINE vsum #-}
vsum (Vec2 a b) = a+b

xsum :: V.Vector Vec2 -> Float
xsum = V.sum . V.map vsum

Here is the code that 7.0.4 generates for xsum:

$s$wfoldlM'_loop_s18Q :: GHC.Prim.Int# -> GHC.Prim.Float# -> GHC.Prim.Float#
$s$wfoldlM'_loop_s18Q =
  \ (sc_s18K :: GHC.Prim.Int#) (sc1_s18L :: GHC.Prim.Float#) ->
    case GHC.Prim.>=# sc_s18K ww_s187 of _ {
      GHC.Bool.False ->
        let {
          a2_s16J [Dmd=Just L] :: GHC.Prim.Addr#
          a2_s16J = GHC.Prim.plusAddr# ww1_s188 (GHC.Prim.*# sc_s18K 8) } in
        case GHC.Prim.readFloatOffAddr# @ GHC.Prim.RealWorld a2_s16J 0 GHC.Prim.realWorld#
        of _ { (# s2_aOP, x_aOQ #) ->
        case GHC.Prim.readFloatOffAddr# @ GHC.Prim.RealWorld a2_s16J 1 s2_aOP
        of _ { (# s1_XPG, x1_XPI #) ->
        case GHC.Prim.touch# @ GHC.ForeignPtr.ForeignPtrContents ww2_s189 s1_XPG
        of _ { __DEFAULT ->
        $s$wfoldlM'_loop_s18Q
          (GHC.Prim.+# sc_s18K 1)
          (GHC.Prim.plusFloat# sc1_s18L (GHC.Prim.plusFloat# x_aOQ x1_XPI))
        }
        }
        };
      GHC.Bool.True -> sc1_s18L
    };

And this is with the HEAD:

$s$wfoldlM'_loop_sZQ :: GHC.Prim.Int# -> GHC.Prim.Float# -> GHC.Prim.Float#
$s$wfoldlM'_loop_sZQ =
  \ (sc_sZK :: GHC.Prim.Int#) (sc1_sZL :: GHC.Prim.Float#) ->
    case GHC.Prim.>=# sc_sZK ww_sZq of _ {
      GHC.Types.False ->
        let {
          a2_sXN [Dmd=Just L] :: GHC.Prim.Int#
          a2_sXN = GHC.Prim.*# sc_sZK 8 } in
        case GHC.Prim.readFloatOffAddr#
               @ GHC.Prim.RealWorld
               (GHC.Prim.plusAddr# ww1_sZr a2_sXN)
               0
               GHC.Prim.realWorld#
        of _ { (# s2_aLR, x_aLS #) ->
        case GHC.Prim.readFloatOffAddr#
               @ GHC.Prim.RealWorld
               (GHC.Prim.plusAddr# ww1_sZr a2_sXN)
               1
               s2_aLR
        of _ { (# s1_XNk, x1_XNm #) ->
        case GHC.Prim.touch# @ GHC.ForeignPtr.ForeignPtrContents ww2_sZs s1_XNk
        of _ { __DEFAULT ->
        $s$wfoldlM'_loop_sZQ
          (GHC.Prim.+# sc_sZK 1)
          (GHC.Prim.plusFloat# sc1_sZL (GHC.Prim.plusFloat# x_aLS x1_XNm))
        }
        }
        };
      GHC.Types.True -> sc1_sZL
    };

Note how the HEAD duplicates the plusAddr# computation. This happens several times in the original example and is probably a bug in the simplifier.

comment:2 Changed 8 years ago by rl

Come to think of it, here is a much simpler example:

foo :: Ptr Float -> IO Float
foo p = liftM2 (+) (peekElemOff q 0) (peekElemOff q 1)
  where
    q = p `plusPtr` 4

Again, the HEAD computes q twice. This looks like a rather bad bug to me, I hope this will be fixed in time for 7.4!

comment:3 Changed 8 years ago by rl

Just to spam a little more, it seems that the HEAD happily duplicates all computations on unboxed types. It even duplicates x+x in this example:

foo :: Float -> Float
foo x = let y = x+x in y+y

comment:4 Changed 8 years ago by igloo

Milestone: 7.4.1
Owner: set to simonpj
Priority: normalhigh

comment:5 Changed 8 years ago by simonmar

Probably a result of the tuning we did to make primops look cheaper (see #4978), though we shouldn't be just duplicating them willy-nilly.

comment:6 Changed 8 years ago by simonpj@…

commit 479504030370947ff3e8d62adb193dd492cf5725

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Nov 11 22:04:20 2011 +0000

    Make certainlyWillInline more conservative, so that it is never true of thunks.  Otherwise the worker-wrapper phase can make a thunk into an unconditionally inline UnfWhen thing, which is Very Bad Thing.  Shown up by Trac #5623.
    
    See Note [certainlyWillInline: be caseful of thunks].

 compiler/coreSyn/CoreUnfold.lhs |   28 +++++++++++++++++++++-------
 1 files changed, 21 insertions(+), 7 deletions(-)

comment:7 Changed 8 years ago by rl

Alas, the patch doesn't fix this example (nor the original one):

foo :: Ptr Float -> IO Float
foo p = liftM2 (+) (peekElemOff q 0) (peekElemOff q 1)
  where
    q = p `plusPtr` 4

comment:8 Changed 8 years ago by jwlato

Cc: jwlato@… added

comment:9 Changed 8 years ago by WrenThornton

Cc: wren@… added

comment:10 Changed 8 years ago by basvandijk

Cc: v.dijk.bas@… added

comment:11 Changed 8 years ago by igloo

Milestone: 7.4.17.4.2

comment:12 Changed 7 years ago by PHO

Cc: pho@… added

comment:13 Changed 7 years ago by simonpj@…

commit 2112f43c466935818a371c53c706608cfa069d01

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Wed May 9 11:06:44 2012 +0100

    Be a little less aggressive about inlining (fixes Trac #5623)
    
    When inlining, we are making a copy of the expression, so we have to
    be careful about duplicating work.  Previously we were using
    exprIsCheap for that, but it is willing to duplicate a cheap primop --
    and that is terribly bad if it happens inside some inner array loop
    (Trac #5623).  So now we use a new function exprIsWorkFree.  Even
    then there is some wiggle room:
       see Note [exprIsWorkFree] in CoreUtils
    
    This commit does make wheel-sieve1 allocate a lot more, but we decided
    that's just tough; it's more important for inlining to be robust
    about not duplicating work.

 compiler/coreSyn/CoreSyn.lhs    |   12 ++++----
 compiler/coreSyn/CoreUnfold.lhs |   60 ++++++++++++++++++------------------
 compiler/coreSyn/CoreUtils.lhs  |   64 ++++++++++++++++++++++++++++++++++++++-
 compiler/coreSyn/PprCore.lhs    |    4 +-
 4 files changed, 101 insertions(+), 39 deletions(-)

comment:14 Changed 7 years ago by simonpj

difficulty: Unknown
Resolution: fixed
Status: newclosed
Test Case: simplCore/should_compile/T5623

I've added a test for Roman's plusPtr example. I'm not sure how to robustly test for the original performance regression, but Roman's example is one way to see if we start duplicating the primop again.

Note: See TracTickets for help on using tickets.