Opened 2 years ago

Last modified 18 months ago

#14208 new bug

Performance with O0 is much better than the default or with -O2, runghc performs the best

Reported by: harendra
Owned by: osa1
Priority: normal
Milestone:
Component: Compiler
Version: 8.2.1
Keywords:
Cc: simonpj, sibi, lelf
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description (last modified by harendra)

In this particular case -O2 or the default is 2x slower than -O0, and -O0 is 2x slower than runghc. Please see the GitHub repo https://github.com/harendra-kumar/ghc-perf to reproduce the issue; the README in the repo has instructions.

The issue seems to occur when the code is placed in a different module. When all the code is in the same module the problem does not occur. In that case -O2 or the default is faster than -O0. However, when the code is split into two modules the performance gets inverted.

Also, it does not always occur; when I tried to simplify the code for a minimal repro, the problem did not occur.

Attachments (1)

t14208_.tar.xz (12.4 KB) - added by osa1 18 months ago.


Change History (40)

comment:1 Changed 2 years ago by harendra

I added a branch named does-not-occur in the repo with a simpler case where the problem does not occur even when the code is split across modules.

comment:2 Changed 2 years ago by mpickering

I can reproduce this. Thanks for the excellent test case.

You can also build everything with cabal new-build benchmarks.

Adding an {-# INLINABLE toList #-} pragma improves the situation a lot, but it is still slower than with -O0.
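
For concreteness, the pragma goes on the definition in the library module. A sketch matching the simplified List.hs that appears later in this ticket:

module List where

import Control.Monad (liftM)

data List a = Stop | Yield a (List a)

-- Expose the unfolding in the interface file so that importing
-- modules can inline or specialise toList.
{-# INLINABLE toList #-}
toList :: Monad m => List a -> m [a]
toList m =
    case m of
        Stop      -> return []
        Yield a r -> liftM (a :) (toList r)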

comment:3 Changed 2 years ago by harendra

Earlier I thought the problem occurred with -O2, but that is not the case; -O2 is irrelevant. -O2 gives the same performance as the default, i.e. no optimization flags at all. The difference is between passing -O0 explicitly and passing nothing. Adding -O0 improves performance drastically. I have updated the GitHub repo and removed -O2.

This is such an ironic case: runghc has the best performance, the next best is -O0, and any optimization is the worst! Maybe we should just reverse the flags :-)

comment:4 Changed 2 years ago by harendra

Description: modified (diff)
Summary: Performance with -O2 is worse than with -O0 which is worse than runghc → Performance with O0 is much better than the default or with -O2, runghc performs the best

comment:5 Changed 2 years ago by harendra

It was hard to find in the manual what the difference is between -O0 and the default. The manual says "O0 is the default", which seems to be incorrect (NEEDS TO BE FIXED). So I had to turn to the GHC code. Alright, so -O0 seems to ignore or omit "interface pragmas", whatever the heck those are; from DynFlags.hs:

    , ([0],     Opt_IgnoreInterfacePragmas)
    , ([0],     Opt_OmitInterfacePragmas)

When I used "-fignore-interface-pragmas" I got the same improvement in performance. The GHC manual documents this flag but says nothing about what this really means (NEED TO BE FIXED). I can only guess. Since this has to do something with the interface, this also explains why the performance is good when the whole code is in the same module and bad when the code is split into two modules.

comment:6 Changed 2 years ago by harendra

Alright, I guessed that ignoring the interface disables inlining, so the problem has something to do with inlining. I looked at all the available inlining options and narrowed the problem down to -fno-pre-inlining. Just using -fno-pre-inlining brings us back to the performance we see with -O0. The fine manual does not say what pre-inlining is, so I have to look at the GHC code again.

I am confused by the fine manual. It says:

-fno-pre-inlining 
Default:	off

Does that mean pre-inlining is on by default? Can we make this clearer and explicit?

comment:7 Changed 2 years ago by harendra

Cc: simonpj added

The pre-inlining flag maps to pre_inline_unconditionally in SimplUtils.hs. Ccing SPJ who seems to have touched it last via commit 33452dfc6cf.
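
As far as I can tell, pre-inlining substitutes a let-binding whose binder occurs exactly once (and not inside a lambda) before its right-hand side is even simplified. An illustrative sketch, not GHC output:

-- Before pre-inlining (x occurs exactly once, not under a lambda):
example1 :: Int -> Int
example1 y = let x = y + 1 in x * 2

-- After pre-inlining, as if the programmer had written:
example2 :: Int -> Int
example2 y = (y + 1) * 2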

comment:8 Changed 2 years ago by harendra

I can confirm that this problem does not manifest in GHC 8.0.2.

comment:9 Changed 2 years ago by harendra

I measured runghc vs ghc performance for this test case on 7.10.3, 8.0.2 and 8.2.1. It seems runghc has always been faster, though the difference was small in 7.10.3; runghc seems to have improved a lot in 8.0, performing noticeably better than ghc.

7.10.3 ghc      : time                 11.43 ms   (11.05 ms .. 11.75 ms)
7.10.3 runghc   : time                 10.55 ms   (9.461 ms .. 11.46 ms)

8.0.2 ghc       : time                 11.00 ms   (10.64 ms .. 11.38 ms)
8.0.2 runghc    : time                 6.441 ms   (6.025 ms .. 6.790 ms)

8.2.1 ghc (O0)  : time                 8.986 ms   (8.728 ms .. 9.313 ms)
8.2.1 runghc    : time                 4.598 ms   (4.350 ms .. 4.890 ms)

It would be awesome if ghc could be as good as runghc.

comment:10 Changed 2 years ago by MikolajKonarski

Interesting. From what you write, my guess would be that when all functions are inlined (e.g., everything is placed in the same module) or when none get inlined, performance is good, but some combination of inlined and not-inlined functions causes the slowdown (such things do happen and they are impossible to avoid completely). If I were to try to minimize this example, I'd put everything in the same module and then explore all combinations of INLINE and NOINLINE pragmas on the functions. BTW, does the slowdown vanish if you enable -fexpose-all-unfoldings (and -fspecialise-aggressively for good measure)?

comment:11 Changed 2 years ago by bgamari

It was hard to find in the manual what the difference is between -O0 and the default. The manual says "O0 is the default", which seems to be incorrect (NEEDS TO BE FIXED).

I've opened #14214 to track this.

comment:12 Changed 2 years ago by harendra

my guess would be that when all functions are inlined (e.g., everything is placed in the same module) or when none get inlined, performance is good,

That seems to be a correct observation. When everything is in the same module, the simplifier can choose to inline any of the functions, whereas when they are in different modules only the functions marked INLINE/INLINABLE get inlined. This is supported by the fact that marking the toList function INLINE eliminates the difference to a large extent. Also, the slowdown does not go away with -fexpose-all-unfoldings, but it does go away with -fspecialise-aggressively. I also suspect an interaction between foldr/build fusion and the simplifier.

comment:13 Changed 2 years ago by harendra

I added a much simpler example on the "simplified" branch in the same repo. I can paste it here as well:

Main.hs

import List
...

len :: IO Int
len = do
    xs <- toList $ (foldr (<>) mempty $ map (\x -> Yield x Stop) [1..100000 :: Int])
    return (length xs)

List.hs

module List where

import Control.Monad  (liftM)

data List a = Stop | Yield a (List a)

instance Monoid (List a) where
    mempty = Stop
    mappend x y =
        case x of
            Stop -> y
            Yield a r -> Yield a (mappend r y)

toList :: Monad m => List a -> m [a]
toList m =
    case m of
        Stop -> return []
        Yield a r -> liftM (a :) (toList r)

It essentially generates a custom list in the main module and calls the toList function from another module to convert it into a Haskell list. The perf difference is not as dramatic as in the previous example, but it is significant.

All in the same module:

-O0              : 14ms
-O1              : 8ms
-fno-pre-inlining: 4ms

Different modules:

-O0              : 8ms
-O1              : 14ms
-fno-pre-inlining: 8ms
INLINE toList    : 4ms

comment:14 Changed 2 years ago by harendra

When I look at the -ddump-rewrite-rules output, the list fusion rules seem to fire in the -fno-pre-inlining case but not in the -O1 case. Specifically, I do not see the "map" rule (and the other rules it subsequently triggers) in the -O1 output.

If we understand why it is happening and there is no easy way to fix it, I guess that is fine. But I hope this can be fixed so that we do not spend time wondering about and figuring out such things, manually tweaking options and pragmas to find the right combination.

comment:15 Changed 2 years ago by harendra

Another question I am seeking an answer to: is there a combination of options that makes a multi-module program behave, performance-wise, as if all the code were in a single module? I expected -fexpose-all-unfoldings to do that for me, but it does not seem to be equivalent. I thought it was equivalent to marking all functions INLINABLE, but it does not seem to be doing that: even with this flag I need to mark a function INLINABLE to get it inlined. What exactly does this flag do?

comment:16 Changed 2 years ago by MikolajKonarski

Great job simplifying the example! Every bit of code eliminated helps GHC hackers immensely in tackling this (which may take a while anyway, especially given the busy time of year), even if the contrived example looks nonsensical and silly: there is less noise, there are fewer suspects, and the Core is less overwhelming.

The combination of -fexpose-all-unfoldings and -fspecialise-aggressively is close to (or even equivalent to) putting everything in a single module. See https://ghc.haskell.org/trac/ghc/ticket/12463#comment:19 and other comments in that thread. The only drawback is that GHC takes much more memory. A hack around that (at least before 8.2.1) is to restart GHC during compilation when it hogs too much memory (or travis_retry after an out-of-memory segfault in travis scripts).

If you see worse fusion behaviour with -O1 than with -O0, I guess there is hope it can be fine-tuned in that particular case, or even that it's a bug. I wonder who is hacking on the fusion machinery these days...

But in general, inlining (especially of only a subset of functions) that makes performance worse is a fact of life, though GHC strives hard not to _automatically_ inline in such suspect cases. I wonder: if you marked toList NOINLINE and the Monoid methods INLINE, but put everything in the same module, would you still see the bad behaviour? That would hint that the (partial) inlining inhibits fusion, and would show which combination of inlining decisions is responsible (so that GHC may be improved for that combination, or prevented from automatically generating such partial inlining). A sketch of that experiment is below.
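
A sketch of the single-module experiment, reusing the simplified example from comment:13 (the pragmas are the only change):

-- Single module: keep toList out of the inliner's hands while
-- encouraging the Monoid methods to be inlined.
{-# NOINLINE toList #-}
toList :: Monad m => List a -> m [a]
toList m =
    case m of
        Stop      -> return []
        Yield a r -> liftM (a :) (toList r)

instance Monoid (List a) where
    {-# INLINE mempty #-}
    mempty = Stop
    {-# INLINE mappend #-}
    mappend x y =
        case x of
            Stop      -> y
            Yield a r -> Yield a (mappend r y)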

Last edited 2 years ago by MikolajKonarski (previous) (diff)

comment:17 Changed 2 years ago by harendra

I guess some function getting inlined too early is preventing list fusion.

The combination of -fexpose-all-unfoldings and -fspecialise-aggressively is not "exactly" equivalent to putting everything in the same module: -O1 with everything in one module finishes in 8ms, while the combination of these two flags finishes in 4ms. So they do something more; I guess the added effect is that they make everything INLINABLE.

When everything is in the same module and toList is marked NOINLINE, it takes 14ms (i.e. the worst case), irrespective of whether the monoid functions are marked INLINE.

comment:18 Changed 2 years ago by mpickering

Harendra, have you read my post about how this part of GHC works? It might be useful for you, or at least useful to point other people to. Your tickets have good examples in them; I will try to get back to them once I am back from holiday.

http://mpickering.github.io/posts/2017-03-20-inlining-and-specialisation.html

comment:19 in reply to:  17 Changed 2 years ago by MikolajKonarski

Replying to harendra:

The combination of -fexpose-all-unfoldings and -fspecialise-aggressively is not "exactly" equivalent to putting everything in the same module: -O1 with everything in one module finishes in 8ms, while the combination of these two flags finishes in 4ms. So they do something more; I guess the added effect is that they make everything INLINABLE.

Yep, forgot that bit. That's exactly what I use the two options for: to be able to split things among modules while avoiding INLINABLE on every polymorphic function. With this, I only ever need an occasional INLINE in random places, but then it's not for specialisation, but for real inlining.

When everything is in the same module and toList is marked NOINLINE, it takes 14ms (i.e. the worst case), irrespective of whether the monoid functions are marked INLINE.

And what if they are marked NOINLINE too? In any case, this means we now have an example of failed fusion that fits in one module. Additionally, we know that GHC can effectively generate such an example from an innocent-looking set of modules by automatically inlining too much (and not enough).

Last edited 2 years ago by MikolajKonarski (previous) (diff)

comment:20 Changed 2 years ago by harendra

Thanks Matthew, that post is a good reference and it confirmed a few things I was guessing about. Some of this can perhaps go into the GHC manual.

Mikolaj, there is no difference when the Monoid instance functions are also marked NOINLINE in addition to toList.

comment:21 Changed 2 years ago by harendra

A similar issue has been reported in this Stack Overflow question as well: https://stackoverflow.com/questions/46296919/haskell-webframeworks-speed-ghci-vs-compiled . There is an appalling difference between the ghc-compiled code and the ghci code; in the case of snap it is a 50x difference! That sounds unacceptable, whatever the reason may be.

comment:22 Changed 2 years ago by simonpj

This is odd. I can't reproduce what you are seeing. I get this, measuring allocation:

HEAD, separate modules, but with {-# INLINABLE #-} on toList

-O0                        496M
-O1                        160M
-O1 -fno-pre-inlining      312M


HEAD, One module

-O0                         496M
-O1                         160M
-O1 -fno-pre-inlining       312M


GHC-8.2-branch, one module

-O0                         496M
-O1                         160M
-O1 -fno-pre-inlining       312M

Here's the code I'm using (I had to add an instance Semigroup (List a)):

module Main where


import Control.Monad  (liftM)
import Data.Semigroup (Semigroup(..))

-- import T14208a

-------------------------------
data List a = Stop | Yield a (List a)

instance Monoid (List a) where
    mempty = Stop
    mappend x y =
        case x of
            Stop -> y
            Yield a r -> Yield a (mappend r y)

instance Semigroup (List a) where
   (<>) = mappend

toList :: Monad m => List a -> m [a]
toList m =
    case m of
        Stop -> return []
        Yield a r -> liftM (a :) (toList r)
--------------------------------

len :: IO Int
len = do
    xs <- toList $ (foldr mappend mempty $ map (\x -> Yield x Stop) [1..1000000 :: Int])
    return (length xs)

main = do { x <- len; print x }

With "two files" I pushed the stuff between the lines into a separate file.

I did add an INLINABLE pragma for toList, and I multiplied the limit count by 10. I did not use Criterion (less to depend on).

Can you say more precisely how to reproduce the problem?

comment:23 Changed 2 years ago by sibi

Cc: sibi added

comment:24 Changed 2 years ago by harendra

It seems that in my repo the cabal file builds the library without any optimization flags, and across the different benchmark runs the flag was being changed only on the benchmark module, not on the library. So optimization flags were mixed up. Here is the new matrix taking that into account:

Main.hs            List.hs            INLINE    Default
--------------------------------------------------------
                 Identical Flags
--------------------------------------------------------

-O1                -O1                 4.6 ms   14.2 ms
-O0                -O0                14.2 ms   14.2 ms
-fno-pre-inlining  -fno-pre-inlining   4.6 ms    9.9 ms

--------------------------------------------------------
                   Mixed Flags
--------------------------------------------------------

-fno-pre-inlining  -O1                 4.6 ms    8.8 ms
-O0                -O1                 8.8 ms    8.8 ms

--------------------------------------------------------
                     runghc
--------------------------------------------------------

runghc             -O0                 5.2 ms    5.2 ms
runghc             -O1                 4.7 ms    4.7 ms

Observations:

  1. When toList is INLINEd, the results are more or less as expected. Simon, what you are seeing is the INLINE column with identical flags.
  2. In the default case (no pragmas used), -fno-pre-inlining does better than -O1, and runghc seems to do well irrespective of the flag used to build the library (i.e. List.hs). Does that mean -O1 could also do better in this case?
  3. Mixing optimization flags brings one more variable into the picture; I would like to ignore those cases. What does GHC recommend? If mixing is not recommended, is there a way to warn the user when flags are mixed? If not, would it be possible to implement something like that?
  4. In my original package I still see -O0 as well as runghc doing much better than -O2, even when using INLINE pragmas and identical optimization options for all the code. I guess I need to work again on a simplified example keeping these points in mind.

comment:25 Changed 19 months ago by lelf

Cc: lelf added

comment:26 Changed 18 months ago by bgamari

Owner: set to osa1

Omer will look.

comment:27 Changed 18 months ago by osa1

I can somewhat reproduce this with HEAD.

I'm currently focusing on the compiled code issues, ignoring GHCi.

My setup: I have two files

Main.hs:

{-# LANGUAGE CPP #-}

module Main where

import Criterion.Main (defaultMain, bench, nfIO)

-- Uncomment this to have all the code in one module
-- #define SINGLE_MODULE
#ifndef SINGLE_MODULE
import List
#else
import Control.Monad  (liftM)

data List a = Stop | Yield a (List a)

instance Semigroup (List a) where
    x <> y =
      case x of
          Stop -> y
          Yield a r -> Yield a (mappend r y)

instance Monoid (List a) where
    -- {-# INLINE mempty #-}
    mempty = Stop
    -- {-# INLINE mappend #-}
    mappend = (<>)

-- {-# NOINLINE toList #-}
toList :: Monad m => List a -> m [a]
toList m =
    case m of
        Stop -> return []
        Yield a r -> liftM (a :) (toList r)

#endif

{-# NOINLINE len #-}
len :: IO Int
len = do
    xs <- toList $ (foldr mappend mempty $ map (\x -> Yield x Stop) [1..100000 :: Int])
    return (length xs)

main :: IO ()
main = defaultMain [ bench "len" $ nfIO len ]

When I'm measuring allocations I remove criterion imports and use this main:

main = len >>= print

Note that I put a NOINLINE on len to avoid it being optimised at the use site in the benchmark. The original report does not have this.

List.hs:

module List where

import Control.Monad  (liftM)

data List a = Stop | Yield a (List a)

instance Semigroup (List a) where
    x <> y =
      case x of
          Stop -> y
          Yield a r -> Yield a (mappend r y)

instance Monoid (List a) where
    mempty = Stop
    mappend = (<>)

toList :: Monad m => List a -> m [a]
toList m =
    case m of
        Stop -> return []
        Yield a r -> liftM (a :) (toList r)

I have six configurations:

  • -O0
  • -O1
  • -O2
  • -O0 -DSINGLE_MODULE
  • -O1 -DSINGLE_MODULE
  • -O2 -DSINGLE_MODULE

I first run all these with +RTS -s using main = len >>= print as the main function.

============ -O0 ===============================================================
      49,723,096 bytes allocated in the heap
      25,729,264 bytes copied during GC
       6,576,744 bytes maximum residency (5 sample(s))
          29,152 bytes maximum slop
              13 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        41 colls,     0 par    0.011s   0.011s     0.0003s    0.0008s
  Gen  1         5 colls,     0 par    0.010s   0.010s     0.0020s    0.0047s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.011s  (  0.012s elapsed)
  GC      time    0.021s  (  0.021s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.032s  (  0.033s elapsed)

  %GC     time      64.0%  (63.8% elapsed)

  Alloc rate    4,366,732,069 bytes per MUT second

  Productivity  35.6% of total user, 35.9% of total elapsed

============ -O1 ===============================================================
      28,922,528 bytes allocated in the heap
      18,195,344 bytes copied during GC
       4,066,200 bytes maximum residency (5 sample(s))
         562,280 bytes maximum slop
              13 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        22 colls,     0 par    0.008s   0.008s     0.0004s    0.0016s
  Gen  1         5 colls,     0 par    0.008s   0.008s     0.0016s    0.0029s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.009s  (  0.009s elapsed)
  GC      time    0.016s  (  0.016s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.025s  (  0.025s elapsed)

  %GC     time      63.8%  (63.9% elapsed)

  Alloc rate    3,262,174,222 bytes per MUT second

  Productivity  35.3% of total user, 35.3% of total elapsed

============ -O2 ===============================================================
      28,922,528 bytes allocated in the heap
      18,195,344 bytes copied during GC
       4,066,200 bytes maximum residency (5 sample(s))
         562,280 bytes maximum slop
              13 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        22 colls,     0 par    0.008s   0.008s     0.0003s    0.0008s
  Gen  1         5 colls,     0 par    0.008s   0.008s     0.0017s    0.0029s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.008s  (  0.008s elapsed)
  GC      time    0.016s  (  0.016s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.024s  (  0.024s elapsed)

  %GC     time      66.6%  (66.6% elapsed)

  Alloc rate    3,714,684,268 bytes per MUT second

  Productivity  32.7% of total user, 32.7% of total elapsed

============ -O0 -DSINGLE_MODULE ===============================================
      49,723,032 bytes allocated in the heap
      25,729,184 bytes copied during GC
       6,576,728 bytes maximum residency (5 sample(s))
          29,152 bytes maximum slop
              13 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        41 colls,     0 par    0.010s   0.010s     0.0003s    0.0008s
  Gen  1         5 colls,     0 par    0.010s   0.010s     0.0019s    0.0042s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.011s  (  0.011s elapsed)
  GC      time    0.020s  (  0.020s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.031s  (  0.031s elapsed)

  %GC     time      65.0%  (65.0% elapsed)

  Alloc rate    4,609,752,610 bytes per MUT second

  Productivity  34.8% of total user, 34.8% of total elapsed

============ -O1 -DSINGLE_MODULE ===============================================
      16,122,496 bytes allocated in the heap
       7,392,664 bytes copied during GC
       3,438,424 bytes maximum residency (4 sample(s))
          55,464 bytes maximum slop
               7 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        10 colls,     0 par    0.004s   0.004s     0.0004s    0.0008s
  Gen  1         4 colls,     0 par    0.005s   0.005s     0.0012s    0.0019s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.004s  (  0.004s elapsed)
  GC      time    0.009s  (  0.009s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.014s  (  0.014s elapsed)

  %GC     time      66.5%  (66.6% elapsed)

  Alloc rate    3,663,260,346 bytes per MUT second

  Productivity  32.5% of total user, 32.5% of total elapsed

============ -O2 -DSINGLE_MODULE ===============================================
      13,722,496 bytes allocated in the heap
       6,798,640 bytes copied during GC
       2,158,376 bytes maximum residency (3 sample(s))
          33,248 bytes maximum slop
               7 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         9 colls,     0 par    0.007s   0.007s     0.0008s    0.0021s
  Gen  1         3 colls,     0 par    0.004s   0.005s     0.0015s    0.0030s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.004s  (  0.004s elapsed)
  GC      time    0.012s  (  0.012s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.016s  (  0.016s elapsed)

  %GC     time      74.2%  (74.3% elapsed)

  Alloc rate    3,479,572,009 bytes per MUT second

  Productivity  25.2% of total user, 25.2% of total elapsed

Summary: allocations consistently decrease as the optimisation level increases.

Secondly, I run the criterion benchmark to measure runtime, using the same configurations:

============ -O0 ===============================================================
benchmarking len
time                 13.50 ms   (13.23 ms .. 13.71 ms)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 13.55 ms   (13.35 ms .. 13.81 ms)
std dev              613.5 μs   (424.7 μs .. 918.2 μs)
variance introduced by outliers: 18% (moderately inflated)

============ -O1 ===============================================================
benchmarking len
time                 15.83 ms   (15.62 ms .. 16.02 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 15.92 ms   (15.75 ms .. 16.10 ms)
std dev              463.5 μs   (340.2 μs .. 669.1 μs)

============ -O2 ===============================================================
benchmarking len
time                 15.70 ms   (15.51 ms .. 15.90 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 15.74 ms   (15.59 ms .. 15.87 ms)
std dev              355.2 μs   (271.2 μs .. 470.7 μs)

============ -O0 -DSINGLE_MODULE ===============================================
benchmarking len
time                 14.85 ms   (13.81 ms .. 16.06 ms)
                     0.976 R²   (0.959 R² .. 0.997 R²)
mean                 13.60 ms   (13.22 ms .. 14.14 ms)
std dev              1.152 ms   (773.1 μs .. 1.614 ms)
variance introduced by outliers: 41% (moderately inflated)

============ -O1 -DSINGLE_MODULE ===============================================
benchmarking len
time                 6.802 ms   (6.702 ms .. 6.922 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 6.845 ms   (6.765 ms .. 6.945 ms)
std dev              261.8 μs   (201.3 μs .. 336.8 μs)
variance introduced by outliers: 18% (moderately inflated)

============ -O2 -DSINGLE_MODULE ===============================================
benchmarking len
time                 6.614 ms   (6.501 ms .. 6.712 ms)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 6.399 ms   (6.317 ms .. 6.472 ms)
std dev              239.1 μs   (201.7 μs .. 292.5 μs)
variance introduced by outliers: 18% (moderately inflated)

So:

  • Everything works as expected in the single-module case: both runtime and allocations get lower as the optimisation level increases.
  • In multi-module, -O1 and -O2 produce identical outputs; the runtime difference is just noise.
  • In multi-module, we get better allocations with -O1 vs. -O0, but the runtime gets somewhat worse. This is what we should investigate.

To see why we allocate less in multi-module with -O1, I compared the STG outputs (multi-module -O0 vs. multi-module -O1); the answer is fusion kicking in with -O1. We have an intermediate function application for foldr mappend mempty in the -O0 output which disappears with -O1.
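
Roughly, the shape of the change (a sketch of the effect using the List type from the simplified example; illustrative names, not GHC's actual output):

-- -O0 shape: map builds an intermediate [List Int] that foldr then consumes.
unfused :: Int -> List Int
unfused n = foldr mappend mempty (map (\x -> Yield x Stop) [1..n])

-- -O1 shape after fusion: the intermediate list never materialises.
fused :: Int -> List Int
fused n = go 1
  where
    go i | i > n     = Stop                 -- mempty
         | otherwise = Yield i (go (i + 1)) -- Yield i Stop `mappend` <rest>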

Why does the runtime get worse? I don't know, but I suspect it's just noise. The code really is better (as in, it does less work) with -O1 than with -O0.

I also compared single-module -O1 with multi-module -O1; the reason single-module is better is because the toList function is not inlined cross-module but is inlined within the module. Correction: it's never inlined, but it is specialised in the single-module case.

So I think there are no problems in the compiled case. The only remaining question is why GHCi is faster than compiled code.

I've attached a tarball with my setup + outputs. It includes Core/STG outputs of all 6 configurations and criterion and +RTS -s outputs as well.

Last edited 18 months ago by osa1 (previous) (diff)

Changed 18 months ago by osa1

Attachment: t14208_.tar.xz added

comment:28 Changed 18 months ago by simonpj

So I think there are no problems in the compiled case.

OK good; that's reassuring.

Do you know why the single-module case gets better? I suspect it may be that toList is specialised. If you add {-# INLINABLE toList #-} does the difference go away?

Perhaps this isn't a big deal -- it's reasonable for single module to be faster -- but GHC does make real efforts NOT to penalise you for multi-module, so I'm curious.

The only remaining question is why GHCi is faster than compiled code.

Can you reproduce this difference? It is indeed puzzling!

comment:29 Changed 18 months ago by osa1

Do you know why the single-module case gets better? I suspect it may be that toList is specialised.

Yes, the reason is that toList is specialised only when it's in the same module.

Multi-module:

Main.len1
  :: GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Types.Int #)
[GblId, Arity=1, Str=<S,U>, Unf=OtherCon []] =
    [] \r [s_s5yc]
        case Main.len_go 1# of sat_s5yd {
          __DEFAULT ->
              case List.toList GHC.Base.$fMonadIO sat_s5yd s_s5yc of {
                (#,#) ipv_s5yf [Occ=Once] ipv1_s5yg [Occ=Once] ->
                    let {
                      sat_s5yi [Occ=Once] :: GHC.Types.Int
                      [LclId] =
                          [ipv1_s5yg] \u []
                              case GHC.List.$wlenAcc ipv1_s5yg 0# of ww2_s5yh {
                                __DEFAULT -> GHC.Types.I# [ww2_s5yh];
                              };
                    } in  (#,#) [ipv_s5yf sat_s5yi];
              };
        };

-- toList in another module
List.toList [Occ=LoopBreaker]
  :: forall (m :: * -> *) a. GHC.Base.Monad m => List.List a -> m [a]
[GblId,
 Arity=2,
 Caf=NoCafRefs,
 Str=<S,U(A,C(C1(U)),A,C(U),A)><S,1*U>,
 Unf=OtherCon []] =
    [] \r [$dMonad_s36Z m1_s370]
        case m1_s370 of {
          List.Stop -> GHC.Base.return $dMonad_s36Z GHC.Types.[];
          List.Yield a1_s372 [Occ=OnceL] r_s373 [Occ=Once] ->
              let {
                sat_s377 [Occ=Once] :: [a_a1fh] -> m_a1fg [a_a1fh]
                [LclId] =
                    [$dMonad_s36Z a1_s372] \r [x1_s375]
                        let {
                          sat_s376 [Occ=Once] :: [a_a1fh]
                          [LclId] =
                              CCCS :! [a1_s372 x1_s375];
                        } in  GHC.Base.return $dMonad_s36Z sat_s376; } in
              let {
                sat_s374 [Occ=Once] :: m_a1fg [a_a1fh]
                [LclId] =
                    [$dMonad_s36Z r_s373] \u [] List.toList $dMonad_s36Z r_s373;
              } in  GHC.Base.>>= $dMonad_s36Z sat_s374 sat_s377;
        };

Single module:

Main.len1
  :: GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Types.Int #)
[GblId, Arity=1, Caf=NoCafRefs, Str=<S,U>, Unf=OtherCon []] =
    [] \r [s_s5uU]
        case Main.len_go 1# of sat_s5uV {
          __DEFAULT ->
              case Main.len2 sat_s5uV s_s5uU of {
                (#,#) ipv_s5uX [Occ=Once] ipv1_s5uY [Occ=Once] ->
                    let {
                      sat_s5v0 [Occ=Once] :: GHC.Types.Int
                      [LclId] =
                          [ipv1_s5uY] \u []
                              case GHC.List.$wlenAcc ipv1_s5uY 0# of ww2_s5uZ {
                                __DEFAULT -> GHC.Types.I# [ww2_s5uZ];
                              };
                    } in  (#,#) [ipv_s5uX sat_s5v0];
              };
        };

Main.len2 [Occ=LoopBreaker]
  :: forall a.
     Main.List a
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, [a] #)
[GblId,
 Arity=2,
 Caf=NoCafRefs,
 Str=<S,1*U><S,U>,
 Unf=OtherCon []] =
    [] \r [m_s5uw eta_s5ux]
        case m_s5uw of {
          Main.Stop -> (#,#) [eta_s5ux GHC.Types.[]];
          Main.Yield a1_s5uz [Occ=Once] r_s5uA [Occ=Once] ->
              case Main.len2 r_s5uA eta_s5ux of {
                (#,#) ipv_s5uC [Occ=Once] ipv1_s5uD [Occ=Once] ->
                    let {
                      sat_s5uE [Occ=Once] :: [a_X1Bw]
                      [LclId] =
                          CCCS :! [a1_s5uz ipv1_s5uD];
                    } in  (#,#) [ipv_s5uC sat_s5uE];
              };
        };

If you add {-# INLINABLE toList #-} does the difference go away?

With INLINE toList or INLINABLE toList, multi-module and single-module become identical with -O1 and -O2. (There are some small changes in the STG outputs, but nothing that changes runtime or allocations.)

Can you reproduce this difference? It is indeed puzzling!

I can, by running the criterion benchmark in GHCi:

$ ghc-stage2 --interactive Main.hs
GHCi, version 8.5.20180322: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /home/omer/rcbackup/.ghci
[1 of 2] Compiling List             ( List.hs, interpreted )
[2 of 2] Compiling Main             ( Main.hs, interpreted )
Ok, two modules loaded.
λ:1> main
benchmarking len
time                 10.79 ms   (10.68 ms .. 10.98 ms)
                     0.993 R²   (0.979 R² .. 1.000 R²)
mean                 10.66 ms   (10.54 ms .. 10.98 ms)
std dev              480.8 μs   (147.7 μs .. 952.1 μs)
variance introduced by outliers: 20% (moderately inflated)

λ:2>
Leaving GHCi.

$ ghc-stage2 --interactive Main.hs -DSINGLE_MODULE
GHCi, version 8.5.20180322: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /home/omer/rcbackup/.ghci
[1 of 1] Compiling Main             ( Main.hs, interpreted )
Ok, one module loaded.
λ:1> main
benchmarking len
time                 11.30 ms   (11.20 ms .. 11.42 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 10.77 ms   (10.64 ms .. 10.90 ms)
std dev              346.5 μs   (317.7 μs .. 398.6 μs)

-O1 and -O2 with -DSINGLE_MODULE are faster than GHCi, but otherwise GHCi is faster than the other four configurations.

Last edited 18 months ago by osa1 (previous) (diff)

comment:30 Changed 18 months ago by osa1

Just updated the previous comment: toList is never inlined, but when it's in the same module as the code using it, or marked INLINABLE, it gets specialised.

comment:31 Changed 18 months ago by harendra

In case you need another data point: my original streaming library, which made me file this issue, still exhibits the same behavior. GHCi is 6x faster than my regular compiled code. I even tried compiling everything, including all dependencies, with exactly the same optimization flags, to make sure there is no funny business due to mixing opt flags. You can see the behavior in the tree available on GitHub here: https://github.com/composewell/streamly/tree/199e20dd4b62ac2dafea0a40dc2ce3d97c307542

You can clone the repo and run the experiment like this:

$ stack bench
benchmarked streaming ops
time                 34.39 ms   (32.99 ms .. 35.67 ms)
                     0.995 R²   (0.991 R² .. 0.998 R²)
mean                 33.97 ms   (33.24 ms .. 35.43 ms)

$ stack runghc benchmark/Main.hs
benchmarked streaming ops
time                 6.215 ms   (5.684 ms .. 6.860 ms)
                     0.945 R²   (0.896 R² .. 0.978 R²)
mean                 6.610 ms   (6.333 ms .. 6.991 ms)

If I change the optimization flag to -O0 for the benchmark stanza in the cabal file, I can get close to ghci performance. The code I am benchmarking is like this:

{-# INLINE streamlyOp #-}
streamlyOp :: IO Int
streamlyOp = do
    xs <- S.toList $ S.serially $
          S.each [1..100000 :: Int]
        & fmap (+1)
        & fmap (+1)
        & fmap (+1)
        & fmap (+1)
    return (Prelude.length xs)

It seems the problem is with the fmap operation (I may be wrong); it is 6 times slower with GHC, and with every additional fmap the benchmark timings increase but the ratio remains the same. I tried using an INLINE on fmap, and I also tried to SPECIALIZE it to the IO and Int types, but no change.

The fmap op is defined in src/Streamly/Streams.hs like this:

instance Monad m => Functor (StreamT m) where
    fmap f (StreamT (Stream m)) = StreamT $ Stream $ \_ stp yld ->
        let yield a Nothing  = yld (f a) Nothing
            yield a (Just r) = yld (f a)
                               (Just (getStreamT (fmap f (StreamT r))))
        in m Nothing stp yield

comment:32 Changed 18 months ago by mpickering

fmap in that module doesn't have an INLINE pragma on it? Should it?

comment:33 Changed 18 months ago by harendra

I tried using INLINE, and I tried SPECIALIZE as well but it made no difference.

comment:34 Changed 18 months ago by mpickering

I can reproduce this when all the dependencies are installed: ghc benchmarks/Main.hs -isrc/ -O2 is slow compared to ghc benchmarks/Main.hs -isrc.

comment:35 Changed 18 months ago by mpickering

I noticed two things from a brief look.

  1. fmap is not SATed, which means the static f won't be eliminated.
  2. A bad(?) loop breaker is chosen in Prelude.foldrM: yield rather than go.

Neither of these makes a real difference though. I'm unsure exactly what happens to make it slower, but perhaps Omer can make more progress.
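
"SATed" here means the static argument transformation. A hand-SATed fmap for the ticket's simple List type would look roughly like this (a sketch; the real StreamT fmap is in CPS and more involved):

-- The static argument f is captured once by a local loop instead of
-- being passed on every recursive call, which makes the loop easier
-- for GHC to inline and specialise at call sites with a known f.
fmapList :: (a -> b) -> List a -> List b
fmapList f = go
  where
    go Stop        = Stop
    go (Yield a r) = Yield (f a) (go r)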

Last edited 18 months ago by mpickering (previous) (diff)

comment:36 Changed 18 months ago by mpickering

Another possible answer is that your library has a lot of recursive functions in it and the base types are written in CPS, which means things don't optimise too well. Again, this is not an answer as to why the optimiser makes the program slower.

comment:37 Changed 18 months ago by harendra

@mpickering, is there something to read about how things do and don't get optimized with CPS, especially with GHC? What are the pitfalls, and why does it not optimize well? I am very interested in this; any pointers would be appreciated. I was using the direct style before and had some trouble with it: I had to use a lot of SPECIALIZE-to-IO pragmas to extract decent performance wherever I was using monad-polymorphic instances, and the performance did not seem much better than CPS. Maybe I am wrongly attributing the problem to CPS-vs-direct style and it was something stupid that I did at the time. But I never had any similar problem in CPS style, and I never went back to direct.

Though I want to try and see whether direct style would be any better, it will take some time and effort. However, there is some evidence that direct style is not very different: the "streaming" library's performance is more or less the same as this library's, and "streaming" is essentially the same thing written in direct style. See the performance comparison between "streaming" and "streamly" here:

https://github.com/composewell/streaming-benchmarks/blob/master/charts-1/CheaperOperations.svg

comment:38 Changed 18 months ago by simonpj

If I change the optimization flag to -O0 for the benchmark stanza in the cabal file, I can get close to ghci performance.

That contradicts what Omer found in comment:27.

Nevertheless, if what you say is true, it'd be easier to debug with -O0 than with GHCi (which brings the bytecode generator into the picture).

GHCi is 6x faster than my regular compiled code

This is totally bonkers and we MUST find out what is happening :-).

I suggest not getting diverted into speculation about CPS. We have a repro case; let's just dig into it and find out what is going on.

My suggestions

  • In comment:31, does the same thing happen with -O0 vs -O, or only with GHCi vs -O?
  • In all repros, do the huge differences also show up in the bytes-allocated numbers? (If so, we don't need the Criterion apparatus.)
  • I notice that in comment:27, in the 2-module case, comparing -O0 and -O1:
    • Allocation is about halved in -O1
    • But runtime actually increases

That is most peculiar.

  • Matthew says in comment:34 "I can reproduce this..". That's great, but what is "this", precisely? Which version of GHC? What timing data? What happened to the allocation and GC numbers?

Somehow a 6x increase in execution time ought not to be hard to find!

comment:39 Changed 18 months ago by mpickering

I cloned the repo - https://github.com/composewell/streamly/tree/199e20dd4b62ac2dafea0a40dc2ce3d97c307542

Checked out commit - 199e20dd4b62ac2dafea0a40dc2ce3d97c307542

I could build all the dependencies with ghc-8.2.2.

Then I ran, in the root of the repo:

ghc benchmarks/Main.hs -isrc/ -O2

and then ran the executable.

The benchmark numbers were slower than when doing the same without -O2.
