Opened 2 years ago

Last modified 2 years ago

#14035 new bug

Weird performance results.

Reported by: danilo2 Owned by:
Priority: high Milestone:
Component: Compiler Version: 8.0.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by danilo2)

Hi! I was recently testing performance of a critical code in a product we are shipping and I'm getting really weird results.

The code is compiled with -XStrict enabled globally. The full source code for this ticket is attached, while the exposed code below uses ... to hide some non-important implementations.

To get desired results, we use following GHC flags: -O2 -funfolding-use-threshold=10000.

Let's consider the following program. It is just a pseudo-parser implementation. It consumes 'a' chars in a loop and fails on empty input in the end:

-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-- | WARNING: -XStrict enabled in this file !!!
-- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

module Main where

imports ... (full source attached to this ticket)

-- === Primitives === --

-- === Strict Either === --

data    Either  e   a = Left e | Right a deriving (Eq, Generic, Ord, Read, Show, Functor)
newtype EitherT e m a = EitherT { runEitherT :: m (Either e a) }

instance Monad m => Functor (EitherT e m) where ...
instance Monad m => Applicative (EitherT e m) where ...
instance Monad m => Monad (EitherT e m) where ...

-- === Strict Bool === --

data XBool = XTrue | XFalse deriving (Show, Generic)

(|||) :: XBool -> XBool -> XBool
(|||) !a !b = case a of
    XTrue  -> a
    XFalse -> b
{-# INLINE (|||) #-}

-- === Strict Tuple === --

data T a b = T !a !b deriving (Generic, Show, Functor)

-- === FailParser === --

-- === Definition === --
-- | It is just like EitherT, but also contains progress indicator - a field of type XBool
--   which tells us if we've already parsed a char or not yet. In this snippet code however,
--   it does not do anything valuable - it just stores the value.

newtype FailParser m a = FailParser { fromFailParser :: EitherT () m (T XBool a) } deriving (Functor)

instance Monad m => Applicative (FailParser m) where
    pure  = undefined
    (<*>) = undefined

instance Monad m => Monad (FailParser m) where
    return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}
    FailParser ma >>= f = FailParser $ do
        T !b  !a  <- ma
        T !b' !a' <- fromFailParser $ f a
        return $ T (b ||| b') a'
    {-# INLINE (>>=) #-}
    _ >> _ = undefined ; {-# INLINE (>>) #-}

-- === Running === --

failParser :: m (Either () (T XBool a)) -> FailParser m a
failParser a = FailParser $ EitherT a ; {-# INLINE failParser #-}

runFailParser :: forall m a. FailParser m a -> m (Either () (T XBool a))
runFailParser f = runEitherT $ fromFailParser f ; {-# INLINE runFailParser #-}

-- === MonadFailedParser === --
-- | Behaves just like "left" - lifts until it hits MonadFailedParser

class Monad m => MonadFailedParser m where
    failed :: m a

instance {-# OVERLAPPABLE #-} (MonadFailedParser m, MonadTrans t, Monad (t m))
      => MonadFailedParser (t m) where
    failed = lift failed ; {-# INLINE failed #-}

instance Monad m => MonadFailedParser (FailParser m) where
    failed = failParser $ return $ Left () ; {-# INLINE failed #-}

-- === Main loop === --

parserLoop :: StateT Text (FailParser Identity) Bool
parserLoop = parserStep >> parserLoop

parserStep :: StateT Text (FailParser Identity) Char
parserStep = get >>= \s -> case Text.uncons s of
    Just (!t, !s') -> if t == 'a' then put s' >> return t else failed
    Nothing        -> failed
{-# INLINE parserStep #-}

-- === Criterion === --

instance NFData XBool
instance (NFData l, NFData r) => NFData (Either l r)
instance (NFData a, NFData b) => NFData (T a b)

genText :: Int -> Text
genText i = fromString $ replicate i 'a' ; {-# INLINE genText #-}

a_parsing_main :: IO ()
a_parsing_main = do
        [ env (return $ genText $ 10^6) $ bench "a*" . nf (runIdentity . runFailParser . evalStateT parserLoop)

main = a_parsing_main

The most important part is the bind implementation of FailParser:

FailParser ma >>= f = FailParser $ do
    T b  a  <- ma
    T b' a' <- fromFailParser $ f a
    return $ T (b ||| b') a'

There are several performance related observations and problems:

  1. INFO: Everything is compiled with -XStrict and every field in this code is fully evaluated, in particular b and b' are fully evaluated, strict values of type XBool.
  1. INFO: Neither b nor b' are used anywhere else in the code. They are just fields in FailParser which should be used to store information if we did consume a letter or we did not.
  1. PROBLEM: When provided with 10^6 characters this code works in 1ms. If we replace (b ||| b') with (b' ||| b) or with (b') the time do NOT change. However, if we replace it with (b), we've got 15 times slowdown. Moreover, the resulting core is changed drastically in some places.
  1. PROBLEM: Another interesting observation is that the value of XBool is created only in one place in the code, namely in: FailParser's Monad implementation, in return function: return a = FailParser $ return $ (T XFalse a) ; {-# INLINE return #-}. We never change the XFalse, so this is the only value that could appear in this code. If we change it to XTrue in this implementation however, we again get 15 times slowdown.
  1. INFO: The order of case expressions in definition of (|||) or the order of constructor defintions of any datatype does not affect the above results.

Attachments (1)

Main.hs (3.8 KB) - added by danilo2 2 years ago.

Download all attachments as: .zip

Change History (11)

comment:1 Changed 2 years ago by danilo2

Description: modified (diff)

comment:2 Changed 2 years ago by danilo2

Description: modified (diff)

comment:3 Changed 2 years ago by danilo2

Description: modified (diff)

comment:4 Changed 2 years ago by danilo2

Description: modified (diff)

Changed 2 years ago by danilo2

Attachment: Main.hs added

comment:5 Changed 2 years ago by danilo2

I've just shortened the example code and made the performance related questions simpler. I think it would be easier now debug what is going on under the hood.

I would be very thankful for any information regarding this issue. We've been talking with many people - both in the company I'm working in as well as on IRC and we do not see any reason why this code behaves in this way and why it is so sensitive to the changes. We started to be worried a lot about how we can use Haskell for high-performance parts at all, if it is not obvious if a very simple change do not affect performance a lot. This situation makes the source code both very fragile to any changes and unmaintainable as a result. I'm writing this because I'm deeply worried about where these problems originate from and I would really like to solve them / know why they appear.

comment:6 Changed 2 years ago by simonpj

I have made some progress.

(1), I discovered that -XStrict was generating some stunningly bad desugarings for very ordinary function bindings. I have a fix in the works. This seems to be responsible for almost all the performance loss.

(2), look at your code

FailParser ma >>= f = FailParser $ do
    T b  a  <- ma
    T b' a' <- fromFailParser $ f a
    return $ T (b ||| b') a'

If b turns out to be XFalse, this amounts to

FailParser ma >>= f = FailParser $ do
    T b  a  <- ma
    T b' a' <- fromFailParser $ f a
    return $ T b' a'

and GHC can re-use the (T b' a') that fromFailParser returned. Moreover in the critical inner loop b is indeed XFalse:

parserLoop = parserStep >> parserLoop

because return returns XFalse.

But if you change the (>==) to

    return $ T b a'

now GHC can't re-use anthing, and so allocate a fresh T every time round the loop. So an apparently simpler program is actually more complicated!

But (2) is not a massive effect. The big thing is the desugaring. Stay tuned.

Meanwhile, try without -XStrict.

comment:7 Changed 2 years ago by danilo2

Simon, first of all, thank you very much for your time and help with this topic! I added some important notices to the points mentioned in your response:

(1) I'm so happy that you've found out that something is wrong and you've got fix for that! In generall, -XStrict is awesome, we need it in high performance Haskell code, putting bangs everywhere (and remembering about it) could be cumbersome.

(2) You're of course right. I just opened the browser to add comment exactly about the same finding. The specification of (|||) allows GHC to easily discover that if we always use XFalse value, it could shorten the mentioned code to s@(T b' a') <- fromFailParser $ f a ; return s (just reuse the value). There are however 3 other non-obvious questions involved:

(2a) Why GHC is able to optimize the code this way if we use everywhere -XFalse but it does not when using everywhere -XTrue? Very similar final core could be generated in the later case – if b is XFalse we can just reuse the output value, if it is XTrue we can be sure the output always contains XTrue as well.

(2b) Even if GHC needs to create code like T b' a' <- fromFailParser $ f a ; return $ T something a', why it takes so long? This is a strict, fully evaluated value, so why "updating a field" takes 10x longer than Char comparison?

(2c) Moreover, what is the reason to "allocate a fresh T every time round the loop"? The fields of the tuple T do not "interact" with each other, they are just 2 separate outputs from a function. I could of course be very wrong, but I think it should be possible to just optimize T a b to (# a,b #) and cut the "fresh T allocation time" completely out, am I right? If GHC cannot do it for any reason, are we able to manualny optymise it somehow not to allocate new T every loop run?

(3) I was testing performance in 3 different configurations - without -XStrict, with -XStrict and with manually inserted bangs literally everywhere.

During these tests I used 10^7 (instead of 10^6) chars to get better understanding of the results:

Method (b | b') [ms] (b') [ms] (b) [ms]
Without -XStrict 149.2 145.3 150.9
With manual bangs 010.9 010.7 143.8
With -XStrict 010.8 010.9 136.5

As we can observe here, using -XStrict and inserting bangs by hand gives identical results. This is especially interesting in combination with questions (2b) and (2c).

Last edited 2 years ago by danilo2 (previous) (diff)

comment:8 Changed 2 years ago by danilo2

One more important thing to note here is that the provided code was shortened to the limits. It does not use the XBool value in any place (it puts -XFalse everywhere, even after successful parse). It implies that the problem (2a) is also not very important - it is just an optimization opportunity in a very special and rare use case.

We can easily fix the code and make it a real use case by inserting the following code:

class Monad m => ProgressMonad m where
    returnProgressed :: forall a. a -> m a

instance {-# OVERLAPPABLE #-} (ProgressMonad m, Monad (t m), MonadTrans t) => ProgressMonad (t m) where
    returnProgressed = lift . returnProgressed ; {-# INLINE returnProgressed #-}

instance Monad m => ProgressMonad (FailParser m) where
    returnProgressed a = failParser $ return $ Right $ T XTrue a ; {-# INLINE returnProgressed #-}

and replacing the line 125 to:

Just (!t, !s') -> if t == 'a' then put s' >> returnProgressed t else failed

The XBool value would then be used to implement Alternative instance, but we do not need it here. We can observe the same slowdown (10^6 chars parsed in 15ms with -XStrict enabled). Which is expected, based on the results so far, however if we want to base on a real use case, this code help us transform abstract program to real one.

comment:9 Changed 2 years ago by simonpj

Let's do one thing at a time. My brain is too small to accommodate all these variations.

I'll commit my -XStrict fix. You can try it out. If you are happy, close the ticket; if not, can you give a new repro case?

comment:10 Changed 2 years ago by Simon Peyton Jones <simonpj@…>

In 4636886/ghc:

Improve the desugaring of -XStrict

Trac #14035 showed that -XStrict was generating some TERRIBLE
desugarings, espcially for bindings with INLINE pragmas. Reason: with
-XStrict, all AbsBinds (even for non-recursive functions) went via the
general-case deguaring for AbsBinds, namely "generate a tuple and
select from it", even though in this case there was only one variable
in the tuple.  And that in turn interacts terribly badly with INLINE

This patch cleans things up:

* I killed off AbsBindsSig completely, in favour of a boolean flag
  abs_sig in AbsBinds.  See Note [The abs_sig field of AbsBinds]

  This allowed me to delete lots of code; and instance-method
  declarations can enjoy the benefits too.  (They could have
  before, but no one had changed them to use AbsBindsSig.)

* I refactored all the AbsBinds handling in DsBinds into a new
  function DsBinds.dsAbsBinds.  This allowed me to handle the
  strict case uniformly
Note: See TracTickets for help on using tickets.