#6166 closed bug (fixed)
Performance regression in mwc-random since 7.0.x
Reported by: | bos | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | 8.2.1 |
Component: | Compiler | Version: | 7.4.2 |
Keywords: | Cc: | pho@…, dima@…, bgamari@… | |
Operating System: | Unknown/Multiple | Architecture: | x86_64 (amd64) |
Type of failure: | Runtime performance bug | Test Case: | |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Rev(s): | ||
Wiki Page: |
Description (last modified by )
I've had a report that the performance of the mwc-random package has regressed seriously after upgrading from GHC 7.0 to 7.4. It turns out that 7.2 also has the regression.
Here's a sample program.
import qualified Data.Vector.Unboxed as U import qualified System.Random.MWC as R import System.Random.MWC.Distributions (standard) count = 1000 * 1000 fast gen = standard gen slow gen = standard gen >>= return -- Edit this to choose fast or slow. which gen = slow gen main = do gen <- R.create v <- U.replicateM count (which gen) print (U.last v)
With GHC 7.0.3 -O2, this runs in 0.294 sec, regardless of whether fast
or slow
is used.
Under 7.4, fast
runs in 0.062 sec (a nice speedup!), but slow
now takes 9.2 sec (yikes!).
Roman suggested compiling the slow
version with -fno-state-hack
, which brings performance back up to 0.062 sec.
Change History (23)
comment:1 Changed 7 years ago by
difficulty: | → Unknown |
---|---|
Milestone: | → 7.6.1 |
Priority: | normal → high |
comment:2 Changed 7 years ago by
Cc: | pho@… added |
---|
comment:3 Changed 7 years ago by
Cc: | dima@… added |
---|
comment:4 Changed 7 years ago by
Cc: | bgamari@… added |
---|
comment:6 Changed 7 years ago by
Milestone: | 7.6.1 → 7.6.2 |
---|
comment:7 Changed 6 years ago by
Bug is still present in 7.6.3. I've made a reduced test case with stripped-down standard inlined. Note that adding or removing return in main loop have no effect. Something interesting is going on with blocks. Replacing f with constant or removing cons all makes bug go away. Simplifying go function changes runtime drastically.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Control.Monad.Primitive (PrimMonad, PrimState) import Data.Word import Data.Bits import Control.Monad import System.Random.MWC main :: IO () main = do g <- create replicateM_ (200*1000) $ standard g standard :: PrimMonad m => Gen (PrimState m) -> m Double {-# INLINE standard #-} standard gen = do u <- (subtract 1 . (*2)) `liftM` uniform gen ri <- uniform gen let i = fromIntegral ((ri :: Word32) .&. 127) bi = (I.!) blocks i return $! u * bi where blocks = I.cons r -- Removing cons $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v = 9.91256303526217e-3 r = 3.442619855899 f = exp (-0.5 * r * r) -- Replacing with constant make bug go away! -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
comment:8 Changed 6 years ago by
Thank you for a stripped-down case. Can you explain exactly how to demonstrate the bug with this test program? Ie "Try X and program runs in 2s; make trivial change to Y and it takes 10s". Or whatever.
Does it need mwc-random
in all its glory, or would it be possible to make a standalone test case?
Simon
comment:9 Changed 6 years ago by
I've slightly simplified test case. I've tried to replace call to uniform with mock function but to avail. It's certainly possible to add only relevant parts of mwc-random. Only small part is actually used
Test case is slow (~100x) version of program. It's quite fragile. Small changes can return program to normal performance. Known methods: replace definition of f
with constant (marked as fast), float blocks
to the top level.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Data.Word import Data.Bits import Control.Monad import System.Random.MWC main :: IO () main = do g <- create replicateM_ (100*1000) $ standard g standard :: GenIO -> IO Double {-# INLINE standard #-} standard gen = do ri <- uniform gen return $! blocks ! fromIntegral ((ri :: Word32) .&. 127) where blocks :: I.Vector Double blocks = I.cons r -- Removing cons $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v,r,f :: Double v = 9.91256303526217e-3 r = 3.442619855899 -- f = 2.669629083880923e-3 -- FAST f = exp (-0.5 * r * r) -- SLOW -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
comment:10 Changed 6 years ago by
I've been able to remove all stuff from mwc-random. Here is test case. Again it's slow version.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Control.Monad main :: IO () main = replicateM_ (100*1000) (return $! standard) standard :: Double {-# INLINE standard #-} standard = blocks ! 0 where blocks :: I.Vector Double blocks = I.cons r $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v,r,f :: Double v = 9.91256303526217e-3 r = 3.442619855899 -- f = 2.669629083880923e-3 -- FAST f = exp (-0.5 * r * r) -- SLOW -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
Couple of observations
- Replacing
f
with constant restores run time to normal. AFAIR GHC cannot constant foldexp
and similar functions. So it may matter
- Floating
block
to top level or removingI.cons
restores run time too.
- Simplifying
go
function changes run time. Removingsqrt
orlog
reduce rim time. It looks likeblocks
is reevaluated every timestandard
is evaluated.
comment:11 Changed 6 years ago by
Another simplification
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Control.Monad main :: IO () main = replicateM_ (200*1000) (return $! standard) standard :: Double -- Removing or replacing with NOINLINE returns perfomance to normal {-# INLINE standard #-} standard = blocks ! 0 where blocks :: I.Vector Double blocks = I.cons 0.123 $ I.unfoldrN 130 go (T f) where go q@(T a) = Just (log (exp a), q) {-# NOINLINE blocks #-} r,f :: Double r = 3.442619855899 -- replacing f with constant return perfomance to normal -- f = 2.669629083880923e-3 f = exp (-0.5 * r * r) -- replacing data with newtype returns performance to normal data T = T {-# UNPACK #-} !Double
Problem is visible at the core level. Code is compiled down to the something similar to following pseudocode:
loop i = if i /= 0 then evaluate (blocks ! 0) >> loop (i-1) else return ()
blocks array is inlied despite being marked as NOINLINE and is evaluated on each iteration so performance is abysmal. When small chages to the program are made it's not inlined and evaluated only once.
comment:12 Changed 5 years ago by
Priority: | high → normal |
---|
Lowering priority (these tickets are assigned to older versions, so they're getting bumped as they've been around for a while).
comment:14 Changed 5 years ago by
Milestone: | 7.10.1 → 7.12.1 |
---|
Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
comment:16 Changed 4 years ago by
Description: | modified (diff) |
---|
comment:17 Changed 4 years ago by
Priority: | normal → high |
---|
This looks like a pretty serious performance regression, bumping priority.
comment:18 Changed 4 years ago by
Does same regression present in latest version of mwc-random? AFAIR performance regression appears because GHC sometimes chose to inline lookup table in where block. It's marked as NOINLINE. It thus was recalculated on each call and caused slowdown. I moved it at top level and problem disapeared.
comment:19 Changed 3 years ago by
Milestone: | 8.0.1 → 8.2.1 |
---|
Nothing will be happening on this front for 8.0.1.
comment:21 Changed 3 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
The original version of mwc-random and the minimized test cases in the comments are a bit fragile, in that they rely crucially on blocks
being floated out of the IO action which contains it, ideally to top level. I'm not sure why, but GHC 7.8.4 isn't doing this floating out in the versions marked SLOW.
For better or worse, GHC 7.10.1 does float out blocks
to top level in the SLOW versions, and I also checked that it produces efficient code for the original version mwc-random-0.13.1.0 that prompted this report. Without a way to reproduce this in a recent version, I'm going to assert that the underlying issue was probably fixed between 7.8 and 7.10.
comment:22 Changed 3 years ago by
There's a little-known file nofib/Simon-nofib-notes
that contains per-benchmark commentary. Could you add the info above to it?
The notes concentrate on aspects of nofib that have proved fragile, and are supposed to stop you reinventing the wheel when you are investigating a nofib perf blip.
Thanks
Simon
comment:23 Changed 3 years ago by
Ah, good to know, but mwc-random isn't a nofib program, even though it sounds like it could be. It's a package on hackage for generating random numbers. (You may be thinking of the k-nucleotide issue, which is a nofib program.)
We should look into this:
-fno-state-hack
should never make something faster; the whole point of the state hack is to improve performance.