#7429 closed bug (wontfix)
Unexplained performance boost with +RTS -h
Reported by: | simonmar | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | 7.8.1 |
Component: | Compiler | Version: | 7.6.1 |
Keywords: | Cc: | pho@…, tomberek@… | |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | Runtime performance bug | Test Case: | |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Rev(s): | ||
Wiki Page: |
Description
In #5505 the following program was reported:
{-# LANGUAGE RankNTypes ,RecursiveDo ,TypeFamilies ,GeneralizedNewtypeDeriving #-} import System.TimeIt import qualified Data.Vector as V import Control.Monad.State.Strict import Control.Monad.ST.Strict import Data.STRef -- STO Monad is just StateT with an Int stacked on top of a ST monad. newtype STO s a = STO { unSTO :: StateT Int (ST s) a } deriving (Functor, Monad, MonadFix) runSTO :: (forall s. STO s a) -> a runSTO x = runST (evalStateT (unSTO x) 0) data CircList s = ConsM {value :: {-# UNPACK #-} !Int ,cons :: {-# UNPACK #-} !(STRef s (CircList s )) } -- | Defines a circular list of length ns clist ns = do rec allItems <- let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns) in V.generateM ns next return $ (V.!) allItems 0 where nextCons v n = do n' <- switchSTorSTO $ newSTRef $ n return $ ConsM v n' -- | Take one step in the CircList oneStep (ConsM v c) = switchSTorSTO $ readSTRef c -- | I tie a circular list of size 100 and step through it n times. main :: IO () main = do let n = 333222111 :: Int timeIt . print $ runSTO ( clist 1001 >>= liftM value . iterateM n oneStep) --timeIt . print $ runST ( clist 1001 >>= liftM value . iterateM n oneStep) -- **************************** TO SWITCH TO ST: switch between the two sets of definitions above and below this line --switchSTorSTO = id switchSTorSTO = STO . lift iterateM :: (Monad m) => Int -> (b -> m b) -> b -> m b iterateM n f c = go n c where go 0 b = return b go ns b = f b >>= go (ns-1)
Which on my system, when compiled for profiling, goes more than twice as fast with +RTS -h
. The following is some output from perf stat
with both versions:
> perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505 222 CPU time: 5.34s Performance counter stats for './5505': 10,676,916,633 cycles # 0.000 GHz [75.00%] 11,023,609,154 instructions # 1.03 insns per cycle # 0.60 stalled cycles per insn [75.01%] 462,115,593 L1-dcache-load-misses [75.01%] 6,661,087,690 stalled-cycles-backend # 62.39% backend cycles idle [75.01%] 5.346786849 seconds time elapsed
and for the fast version:
> perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505 +RTS -h 222 CPU time: 2.38s Performance counter stats for './5505 +RTS -h': 4,766,694,528 cycles # 0.000 GHz [75.02%] 10,691,687,462 instructions # 2.24 insns per cycle # 0.06 stalled cycles per insn [75.02%] 18,763,412 L1-dcache-load-misses [75.02%] 673,890,507 stalled-cycles-backend # 14.14% backend cycles idle [74.99%] 2.399953273 seconds time elapsed
There's a huge difference in the number of L1-dcache misses, and the slow version is stalled for 62% of the time.
Why is this? I'm not exactly sure. The program doesn't allocate anything except at the beginning. The effect of -h
is to cause a major GC to happen every 0.1s or so, which copies some data, which should tend to cause more cache misses, rather than fewer.
I don't know what's going on, so I'm leaving the details in this ticket for now.
Attachments (3)
Change History (14)
Changed 7 years ago by
comment:1 Changed 7 years ago by
Cc: | pho@… added |
---|
comment:2 Changed 7 years ago by
Replying to simonmar:
{-# LANGUAGE RankNTypes ,RecursiveDo ,TypeFamilies ,GeneralizedNewtypeDeriving #-} import System.TimeIt import Control.Monad.State.Strict import Control.Monad.ST.Strict import Data.STRef -- STO Monad is just StateT with an Int stacked on top of a ST monad. newtype STO s a = STO { unSTO :: StateT Int (ST s) a } deriving (Functor, Monad) runSTO :: (forall s. STO s a) -> a runSTO x = runST (evalStateT (unSTO x) 0) data CircList s = ConsM {value :: {-# UNPACK #-} !Int ,cons :: {-# UNPACK #-} !(STRef s (CircList s )) } -- | Defines a circular list of length 2 twoItemList :: forall s. ST s (CircList s) twoItemList = do rec item1 <- liftM (ConsM 0) $ newSTRef item2 item2 <- liftM (ConsM 1) $ newSTRef item1 return item1 -- | I tie a circular list of size 2 and step through it n times. main :: IO () main = do let n = 633222111 :: Int let takeOneStep = switchSTorSTO . readSTRef . cons runLine $ switchSTorSTO twoItemList >>= liftM value . iterateM n takeOneStep --switchSTorSTO = id --runLine = timeIt . print . runST -- **************************** TO SWITCH TO ST: switch between the two sets of definitions above and below this line switchSTorSTO = STO . lift runLine = timeIt . print . runSTO -- I find myself writing this function a lot, did I miss some basic Monad helper? iterateM :: (Monad m) => Int -> (b -> m b) -> b -> m b iterateM n f c = go n c where go 0 b = return b go ns b = f b >>= go (ns-1)
Here is a simpler version (also attached as Main.hs) that uses a circular list of length two. The odd thing is that this runs around 3 seconds when I only use the ST monad, or the STO without profiling. But the STO with profiling makes it FASTER to 2.6 seconds. The two versions seem to have the same GC stats as before. So then why not just use profiling ALL the time? I don't know if this difference exists for large 'project sized' uses of ST, but it seems a monad transformer on top of it along with profiling helps it go faster.
The STO version produces this core, which looks pretty good to me:
Rec { Main.$wa [Occ=LoopBreaker] :: forall s_XF2. GHC.Prim.Int# -> Main.CircList s_XF2 -> GHC.Types.Int -> GHC.Prim.State# s_XF2 -> (# GHC.Prim.State# s_XF2, (Main.CircList s_XF2, GHC.Types.Int) #) [GblId, Arity=4, Caf=NoCafRefs, Str=DmdType LLLL] Main.$wa = \ (@ s_XF2) (ww_s1na :: GHC.Prim.Int#) (w_s1nc :: Main.CircList s_XF2) (w1_s1nd :: GHC.Types.Int) (w2_s1ne :: GHC.Prim.State# s_XF2) -> case ww_s1na of ds_XFh { __DEFAULT -> case w_s1nc of _ { Main.ConsM rb_dFC rb1_dFD -> case GHC.Prim.readMutVar# @ s_XF2 @ (Main.CircList s_XF2) rb1_dFD w2_s1ne of _ { (# ipv_aJ9, ipv1_aJa #) -> Main.$wa @ s_XF2 (GHC.Prim.-# ds_XFh 1) ipv1_aJa w1_s1nd ipv_aJ9 } }; 0 -> (# w2_s1ne, (w_s1nc, w1_s1nd) #) } end Rec }
The effect disappears when you switch to using just ST, but the core looks almost identical other than the threading of the original Int in the StateT. Here is the core produced for the ST version:
Rec { Main.$wpoly_a [Occ=LoopBreaker] :: forall s_aDw. GHC.Prim.Int# -> Main.CircList s_aDw -> GHC.Prim.State# s_aDw -> (# GHC.Prim.State# s_aDw, Main.CircList s_aDw #) [GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL] Main.$wpoly_a = \ (@ s_aDw) (ww_s1la :: GHC.Prim.Int#) (w_s1lc :: Main.CircList s_aDw) (w1_s1ld :: GHC.Prim.State# s_aDw) -> case ww_s1la of ds_XFl { __DEFAULT -> case w_s1lc of _ { Main.ConsM rb_dFI rb1_dFJ -> case GHC.Prim.readMutVar# @ s_aDw @ (Main.CircList s_aDw) rb1_dFJ w1_s1ld of _ { (# ipv_aIP, ipv1_aIQ #) -> Main.$wpoly_a @ s_aDw (GHC.Prim.-# ds_XFl 1) ipv1_aIQ ipv_aIP } }; 0 -> (# w1_s1ld, w_s1lc #) } end Rec }
comment:3 Changed 7 years ago by
Some code golf at #haskell produced a new version of iterateM:
iterateM2 :: Monad m => Int -> (b -> m b) -> b -> m b iterateM2 = (fmap.fmap) (foldr (>=>) return) replicate
This produces slightly different CORE as follows:
Rec { Main.main6 [Occ=LoopBreaker] :: forall s_atH. GHC.Prim.Int# -> Main.CircList s_atH -> GHC.Types.Int -> GHC.Prim.State# s_atH -> (# GHC.Prim.State# s_atH, (Main.CircList s_atH, GHC.Types.Int) #) [GblId, Arity=4, Caf=NoCafRefs, Str=DmdType LSLL] Main.main6 = \ (@ s_atH) (m_a16D :: GHC.Prim.Int#) (eta_B3 :: Main.CircList s_atH) (eta1_X14 :: GHC.Types.Int) (eta2_X15 :: GHC.Prim.State# s_atH) -> case GHC.Prim.<=# m_a16D 1 of _ { GHC.Types.False -> case eta_B3 of _ { Main.ConsM rb_dE3 rb1_dE4 -> case GHC.Prim.readMutVar# @ s_atH @ (Main.CircList s_atH) rb1_dE4 eta2_X15 of _ { (# ipv_aEY, ipv1_aEZ #) -> Main.main6 @ s_atH (GHC.Prim.-# m_a16D 1) ipv1_aEZ eta1_X14 ipv_aEY } }; GHC.Types.True -> case eta_B3 of _ { Main.ConsM rb_dE3 rb1_dE4 -> case GHC.Prim.readMutVar# @ s_atH @ (Main.CircList s_atH) rb1_dE4 eta2_X15 of _ { (# ipv_aEY, ipv1_aEZ #) -> (# ipv_aEY, (ipv1_aEZ, eta1_X14) #) } } } end Rec }
This destroys the effect, ie: it only produces the "slow" version. The iterateM as is, WITH StateT over ST WITH (+RTS -s -h) is the only way I have been able to produce the "fast" version.
And to clarify the effect:
- I compile with -O2 -rtsopts.
- I run with +RTS -s
- Without recompiling or anything else I run +RTS -s -h
#3 is faster than #2 or ANYTHING else, including no RTS, no -rtsopts. It still works with -O1 I end up thrashing the system when I compile without an -Ox flag
comment:5 Changed 7 years ago by
comment:6 Changed 7 years ago by
I suspect it's a weird caching effect. The program creates a small data structure and then repeatedly traverses it (if I'm understanding correctly). After a GC the program performs much better, which suggests that the GC has rearranged the data in memory in a more cache-friendly way (not deliberately, I should point out). Or perhaps the data rearrangement has avoided some cache conflicts, but that seems unlikely - L1 caches are usually very associative.
You should be able to get the same effect by adding a performGC after allocating the data structure but before traversing it.
comment:7 follow-up: 8 Changed 7 years ago by
Cc: | tomberek@… added |
---|
I'm not sure how to insert an IO () into the ST/STO monad, either that sounds "unsafe" or I lift everything to IO? I tried using:
lift $ unsafeIOToST performGC
in Main2.hs
This fixes the problem when the circular list is of size 2 (using function twoItemList). The problem creeps back in when using threeItemList or when the list is produced by nlist. Perhaps my generation of the circular list is suspect. Perhaps too lazy? The inner loop in Core is unchanged from previous versions.
Regarding "small data structure": This effect is actually more pronounced the larger the circular list is. I've tested this up to n = 2 million and it seems that the GC overhead of profiling catches up around 1 million, but the non-GC time difference becomes very pronounced.
So the "GC fix" works for an explicitly constructed circular list of length 2, but nothing else. And still, working in only ST is always slower than in a superfluous StateT over ST.
comment:8 Changed 7 years ago by
Replying to tomberek: correction, the GC fix doesn't work for twoItemList either (it's late, i got confused)
comment:9 follow-up: 10 Changed 7 years ago by
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Aha - of course I was forgetting something, the GC will also eliminate the indirections, and that seems to explain the difference at least in Main2.hs
.
In the original version, the data structure with the indirections is enough to push it over the size of the L1 cache, so all the misses are capacity misses (on my machine anyway). In Main2.hs
, the difference is all due to having to traverse the extra indirections in the data structure.
Mystery solved - and I don't think there's anything we can do here. If it is important to make this case go fast, you might try to rewrite the program so that it creates the data structure strictly, which will eliminate the indirections at creation time.
comment:10 Changed 7 years ago by
Ok, then I'd like to know what I can do to create the data structure strictly. I've got the bang patterns on the CircList,
data CircList s = ConsM {value :: {-# UNPACK #-} !Int ,cons :: {-# UNPACK #-} !(STRef s (CircList s )) }
Is there something in the following that should be "strictified"?
twoItemList :: forall s. ST s (CircList s) twoItemList = do rec item1 <- liftM (ConsM 0) $ newSTRef item2 item2 <- liftM (ConsM 1) $ newSTRef item1 return item1
or in
-- | Defines a circular list of length n nlist :: forall s. Int -> ST s (CircList s) nlist ns = do rec allItems <- let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns) in V.generateM ns next return $ (V.!) allItems 0 where nextCons v n = do n' <- newSTRef n return $ ConsM v n'
comment:11 Changed 7 years ago by
The laziness here is in the contents of the STRef
, where you can't put a bang pattern. Now the problem is that because you're creating a circular list, there must be at least one lazy element in it, because you're using laziness to create the circularity. So you could probably add one bang pattern in twoItemList
, but not two (I haven't tried this).
Smaller version of the same thing