Opened 2 years ago
Last modified 2 years ago
#14035 new bug
Weird performance results.
Reported by: | danilo2 | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | Compiler | Version: | 8.0.1 |
Keywords: | Cc: | ||
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | None/Unknown | Test Case: | |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Rev(s): | ||
Wiki Page: |
Description (last modified by )
Hi! I was recently testing performance of a critical code in a product we are shipping and I'm getting really weird results.
The code is compiled with -XStrict
enabled globally. The full source code for this ticket is attached, while the exposed code below uses ...
to hide some non-important implementations.
To get desired results, we use following GHC flags: -O2 -funfolding-use-threshold=10000
.
Let's consider the following program. It is just a pseudo-parser implementation. It consumes 'a' chars in a loop and fails on empty input in the end:
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! -- | WARNING: -XStrict enabled in this file !!! -- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! module Main where imports ... (full source attached to this ticket) ------------------------ -- === Primitives === -- ------------------------ -- === Strict Either === -- data Either e a = Left e | Right a deriving (Eq, Generic, Ord, Read, Show, Functor) newtype EitherT e m a = EitherT { runEitherT :: m (Either e a) } instance Monad m => Functor (EitherT e m) where ... instance Monad m => Applicative (EitherT e m) where ... instance Monad m => Monad (EitherT e m) where ... -- === Strict Bool === -- data XBool = XTrue | XFalse deriving (Show, Generic) (|||) :: XBool -> XBool -> XBool (|||) !a !b = case a of XTrue -> a XFalse -> b {-# INLINE (|||) #-} -- === Strict Tuple === -- data T a b = T !a !b deriving (Generic, Show, Functor) ------------------------ -- === FailParser === -- ------------------------ -- === Definition === -- -- | It is just like EitherT, but also contains progress indicator - a field of type XBool -- which tells us if we've already parsed a char or not yet. In this snippet code however, -- it does not do anything valuable - it just stores the value. newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T XBool a) } deriving (Functor) instance Monad m => Applicative (FailParser m) where pure = undefined (<*>) = undefined instance Monad m => Monad (FailParser m) where return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-} FailParser ma >>= f = FailParser $ do T !b !a <- ma T !b' !a' <- fromFailParser $ f a return $ T (b ||| b') a' {-# INLINE (>>=) #-} _ >> _ = undefined ; {-# INLINE (>>) #-} -- === Running === -- failParser :: m (Either () (T XBool a)) -> FailParser m a failParser a = FailParser $ EitherT a ; {-# INLINE failParser #-} runFailParser :: forall m a. FailParser m a -> m (Either () (T XBool a)) runFailParser f = runEitherT $ fromFailParser f ; {-# INLINE runFailParser #-} -- === MonadFailedParser === -- -- | Behaves just like "left" - lifts until it hits MonadFailedParser class Monad m => MonadFailedParser m where failed :: m a instance {-# OVERLAPPABLE #-} (MonadFailedParser m, MonadTrans t, Monad (t m)) => MonadFailedParser (t m) where failed = lift failed ; {-# INLINE failed #-} instance Monad m => MonadFailedParser (FailParser m) where failed = failParser $ return $ Left () ; {-# INLINE failed #-} ----------------------- -- === Main loop === -- ----------------------- parserLoop :: StateT Text (FailParser Identity) Bool parserLoop = parserStep >> parserLoop parserStep :: StateT Text (FailParser Identity) Char parserStep = get >>= \s -> case Text.uncons s of Just (!t, !s') -> if t == 'a' then put s' >> return t else failed Nothing -> failed {-# INLINE parserStep #-} -- === Criterion === -- instance NFData XBool instance (NFData l, NFData r) => NFData (Either l r) instance (NFData a, NFData b) => NFData (T a b) genText :: Int -> Text genText i = fromString $ replicate i 'a' ; {-# INLINE genText #-} a_parsing_main :: IO () a_parsing_main = do defaultMain [ env (return $ genText $ 10^6) $ bench "a*" . nf (runIdentity . runFailParser . evalStateT parserLoop) ] main = a_parsing_main
The most important part is the bind
implementation of FailParser
:
FailParser ma >>= f = FailParser $ do T b a <- ma T b' a' <- fromFailParser $ f a return $ T (b ||| b') a'
There are several performance related observations and problems:
- INFO: Everything is compiled with
-XStrict
and every field in this code is fully evaluated, in particularb
andb'
are fully evaluated, strict values of typeXBool
.
- INFO: Neither
b
norb'
are used anywhere else in the code. They are just fields inFailParser
which should be used to store information if we did consume a letter or we did not.
- PROBLEM: When provided with
10^6
characters this code works in 1ms. If we replace(b ||| b')
with(b' ||| b)
or with(b')
the time do NOT change. However, if we replace it with(b)
, we've got 15 times slowdown. Moreover, the resulting core is changed drastically in some places.
- PROBLEM: Another interesting observation is that the value of
XBool
is created only in one place in the code, namely in:FailParser
'sMonad
implementation, inreturn
function:return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}
. We never change the XFalse, so this is the only value that could appear in this code. If we change it toXTrue
in this implementation however, we again get 15 times slowdown.
- INFO: The order of
case
expressions in definition of(|||)
or the order of constructor defintions of any datatype does not affect the above results.
Attachments (1)
Change History (11)
comment:1 Changed 2 years ago by
Description: | modified (diff) |
---|
comment:2 Changed 2 years ago by
Description: | modified (diff) |
---|
comment:3 Changed 2 years ago by
Description: | modified (diff) |
---|
comment:4 Changed 2 years ago by
Description: | modified (diff) |
---|
Changed 2 years ago by
comment:5 Changed 2 years ago by
comment:6 Changed 2 years ago by
I have made some progress.
(1), I discovered that -XStrict
was generating some stunningly bad desugarings for very ordinary function bindings. I have a fix in the works. This seems to be responsible for almost all the performance loss.
(2), look at your code
FailParser ma >>= f = FailParser $ do T b a <- ma T b' a' <- fromFailParser $ f a return $ T (b ||| b') a'
If b
turns out to be XFalse
, this amounts to
FailParser ma >>= f = FailParser $ do T b a <- ma T b' a' <- fromFailParser $ f a return $ T b' a'
and GHC can re-use the (T b' a') that fromFailParser
returned. Moreover in the critical inner loop b
is indeed XFalse
:
parserLoop = parserStep >> parserLoop
because return
returns XFalse
.
But if you change the (>==)
to
return $ T b a'
now GHC can't re-use anthing, and so allocate a fresh T
every time round the loop. So an apparently simpler program is actually more complicated!
But (2) is not a massive effect. The big thing is the desugaring. Stay tuned.
Meanwhile, try without -XStrict
.
comment:7 Changed 2 years ago by
Simon, first of all, thank you very much for your time and help with this topic! I added some important notices to the points mentioned in your response:
(1) I'm so happy that you've found out that something is wrong and you've got fix for that! In generall, -XStrict
is awesome, we need it in high performance Haskell code, putting bangs everywhere (and remembering about it) could be cumbersome.
(2) You're of course right. I just opened the browser to add comment exactly about the same finding. The specification of (|||)
allows GHC to easily discover that if we always use XFalse
value, it could shorten the mentioned code to s@(T b' a') <- fromFailParser $ f a ; return s
(just reuse the value). There are however 3 other non-obvious questions involved:
(2a) Why GHC is able to optimize the code this way if we use everywhere -XFalse
but it does not when using everywhere -XTrue
? Very similar final core could be generated in the later case – if b
is XFalse
we can just reuse the output value, if it is XTrue
we can be sure the output always contains XTrue
as well.
(2b) Even if GHC needs to create code like T b' a' <- fromFailParser $ f a ; return $ T something a'
, why it takes so long? This is a strict, fully evaluated value, so why "updating a field" takes 10x longer than Char comparison?
(2c) Moreover, what is the reason to "allocate a fresh T
every time round the loop"? The fields of the tuple T
do not "interact" with each other, they are just 2 separate outputs from a function. I could of course be very wrong, but I think it should be possible to just optimize T a b
to (# a,b #)
and cut the "fresh T
allocation time" completely out, am I right? If GHC cannot do it for any reason, are we able to manualny optymise it somehow not to allocate new T every loop run?
(3)
I was testing performance in 3 different configurations - without -XStrict
, with -XStrict
and with manually inserted bangs literally everywhere.
During these tests I used 10^7
(instead of 10^6
) chars to get better understanding of the results:
Method | (b | b') [ms] | (b') [ms] | (b) [ms] |
---|---|---|---|
Without -XStrict | 149.2 | 145.3 | 150.9 |
With manual bangs | 010.9 | 010.7 | 143.8 |
With -XStrict | 010.8 | 010.9 | 136.5 |
As we can observe here, using -XStrict
and inserting bangs by hand gives identical results. This is especially interesting in combination with questions (2b) and (2c).
comment:8 Changed 2 years ago by
One more important thing to note here is that the provided code was shortened to the limits. It does not use the XBool
value in any place (it puts -XFalse
everywhere, even after successful parse). It implies that the problem (2a) is also not very important - it is just an optimization opportunity in a very special and rare use case.
We can easily fix the code and make it a real use case by inserting the following code:
class Monad m => ProgressMonad m where returnProgressed :: forall a. a -> m a instance {-# OVERLAPPABLE #-} (ProgressMonad m, Monad (t m), MonadTrans t) => ProgressMonad (t m) where returnProgressed = lift . returnProgressed ; {-# INLINE returnProgressed #-} instance Monad m => ProgressMonad (FailParser m) where returnProgressed a = failParser $ return $ Right $ T XTrue a ; {-# INLINE returnProgressed #-}
and replacing the line 125 to:
Just (!t, !s') -> if t == 'a' then put s' >> returnProgressed t else failed
The XBool
value would then be used to implement Alternative
instance, but we do not need it here. We can observe the same slowdown (10^6
chars parsed in 15ms with -XStrict
enabled). Which is expected, based on the results so far, however if we want to base on a real use case, this code help us transform abstract program to real one.
comment:9 Changed 2 years ago by
Let's do one thing at a time. My brain is too small to accommodate all these variations.
I'll commit my -XStrict
fix. You can try it out. If you are happy, close the ticket; if not, can you give a new repro case?
I've just shortened the example code and made the performance related questions simpler. I think it would be easier now debug what is going on under the hood.
I would be very thankful for any information regarding this issue. We've been talking with many people - both in the company I'm working in as well as on IRC and we do not see any reason why this code behaves in this way and why it is so sensitive to the changes. We started to be worried a lot about how we can use Haskell for high-performance parts at all, if it is not obvious if a very simple change do not affect performance a lot. This situation makes the source code both very fragile to any changes and unmaintainable as a result. I'm writing this because I'm deeply worried about where these problems originate from and I would really like to solve them / know why they appear.