Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#12603 closed bug (fixed)

INLINE and manually inlining produce different code

Reported by: bgamari Owned by:
Priority: high Milestone: 8.2.1
Component: Compiler Version: 8.0.1
Keywords: Cc: mikolaj.konarski@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #12747 #12781 Differential Rev(s):
Wiki Page:

Description (last modified by MikolajKonarski)

Mikolaj reported that he was seeing significantly different code generated in the case of an INLINE pragma versus manually inlining. I haven't looked into what the cause is, and this isn't necessarily problematic; this is just a reminder to look into what is happening.

See https://github.com/LambdaHack/LambdaHack/blob/97724fe8c73e80b329ddf326a8eb001020870b2d/Game/LambdaHack/Common/Color.hs#L99.

Edit, by Mikolaj: here is a minimal example:

-- ghc --make Main.hs -O1; ./Main +RTS -s -RTS

seqFrame2 :: [Int] -> IO ()
{-# NOINLINE seqFrame2 #-}
seqFrame2 l = do
  let crux = attrFromInt
  --   Total   time    2.052s  (  2.072s elapsed)
  -- but the following version is many times slower:
  -- let crux = attrFromIntINLINE
  --   Total   time    7.896s  (  7.929s elapsed)
  mapM_ (\a -> crux a `seq` return ()) l

main :: IO ()
main = seqFrame2 $ replicate 100000000 0


data Attr = Attr !Int  --- the bang is essential

attrFromInt :: Int -> Attr
{-# NOINLINE attrFromInt #-}
attrFromInt w = Attr (w + (2 ^ (8 :: Int)))

fgFromInt :: Int -> Int
{-# INLINE fgFromInt #-}  -- removing this INLINE makes it many times faster
                          -- just like the manually inlined version 
                          -- and NOINLINE lands in between
fgFromInt w = w + (2 ^ (8 :: Int))

attrFromIntINLINE :: Int -> Attr
{-# NOINLINE attrFromIntINLINE #-}
attrFromIntINLINE w = Attr (fgFromInt w)

My guess is that the inlining forced with INLINE is just broken in GHC 8.0.1 --- it omits some optimizations that GHC would normally do on manually inlined code, such as floating out common subexpressions.

Change History (38)

comment:1 Changed 3 years ago by bgamari

Owner: set to bgamari

comment:2 Changed 3 years ago by MikolajKonarski

Description: modified (diff)

comment:3 Changed 3 years ago by MikolajKonarski

Now on the branch https://github.com/LambdaHack/LambdaHack/tree/ghc-bug-12603 the manually inlined code is 3 times faster than the code with INLINE. I tried creating a small example by starting just with the offending module and Main.hs with a loop, but it's not enough, so I gave up for now.

comment:4 Changed 3 years ago by MikolajKonarski

Cc: mikolaj.konarski@… added

comment:5 Changed 3 years ago by simonpj

That is indeed very odd. I'd love to see a test case. Thanks! -- I know it's real work to create one

Simon

comment:6 Changed 3 years ago by MikolajKonarski

I've compared the resulting Core. The difference is that the version with INLINE recomputes (2 ^ (8 :: Int) - 1) every time, while the manually inlined version uses a value computed just once (note that the value is of type Word32 --- I haven't verified whether that matters). In the source code the constant is, e.g., here: https://github.com/LambdaHack/LambdaHack/blob/ghc-bug-12603/Game/LambdaHack/Common/Color.hs#L94

Here are the relevant snippets of Core.

379842537fadd870f9dd3304e0182c0b
  attrCharFromW1 :: Word32
  {- Unfolding: (case $wf 2## 8# of ww { DEFAULT ->
                 W32# (narrow32Word# (minusWord# ww 1##)) }) -}
0a91b1a84599352c004a37572b9588c1
  $wf :: Word# -> Int# -> Word#
  {- Arity: 2, HasNoCafRefs, Strictness: <L,U><S,U>, Inline: [0] -}
41b3e215bf375441beb5fb607472bd73
  $wattrCharFromW32 :: Word# -> (# Attr, Char# #)
  {- Arity: 1, Strictness: <S,U>, Inline: [0],
     Unfolding: (\ (ww :: Word#) ->
                 case attrCharFromW1 of wild1 { W32# y# ->
                 let {
                   x :: Int# = word2Int# (and# (uncheckedShiftRL# ww 8#) y#)
                 } in
                 case tagToEnum# @ Bool (>=# x 0#) of wild {
                   False -> case $fBinaryColor3 x ret_ty (# Attr, Char# #) of {}
                   True
                   -> case tagToEnum# @ Bool (<=# x 15#) of wild2 {
                        False -> case $fBinaryColor3 x ret_ty (# Attr, Char# #) of {}
                        True
                        -> case tagToEnum# @ Color x of dt { DEFAULT ->
                           let {
                             x1 :: Int# = word2Int# (and# ww y#)
                           } in
                           case tagToEnum# @ Bool (>=# x1 0#) of wild3 {
                             False -> case $fBinaryColor3 x1 ret_ty (# Attr, Char# #) of {}
                             True
                             -> case tagToEnum# @ Bool (<=# x1 15#) of wild4 {
                                  False -> case $fBinaryColor3 x1 ret_ty (# Attr, Char# #) of {}
                                  True
                                  -> case tagToEnum# @ Color x1 of dt1 { DEFAULT ->
                                     let {
                                       i# :: Int# = word2Int# (uncheckedShiftRL# ww 16#)
                                     } in
                                     case tagToEnum#
                                            @ Bool
                                            (leWord# (int2Word# i#) 1114111##) of wild5 {
                                       False -> case chr2 i# ret_ty (# Attr, Char# #) of {}
                                       True -> (# Attr dt dt1, chr# i# #) } } } } } } } }) -}
8b854d1c31d32d5f70ae60f00eb8d52b
  $wattrCharFromW32' :: Word# -> (# Attr, Char# #)
  {- Arity: 1, Strictness: <S,U>, Inline: [0],
     Unfolding: (\ (ww :: Word#) ->
                 case $wf 2## 8# of ww1 { DEFAULT ->
                 let {
                   x :: Int#
                   = word2Int#
                       (and#
                          (uncheckedShiftRL# ww 8#)
                          (narrow32Word# (minusWord# ww1 1##)))
                 } in
                 case tagToEnum# @ Bool (>=# x 0#) of wild {
                   False -> case $fBinaryColor3 x ret_ty (# Attr, Char# #) of {}
                   True
                   -> case tagToEnum# @ Bool (<=# x 15#) of wild1 {
                        False -> case $fBinaryColor3 x ret_ty (# Attr, Char# #) of {}
                        True
                        -> case tagToEnum# @ Color x of dt { DEFAULT ->
                           let {
                             x1 :: Int#
                             = word2Int# (and# ww (narrow32Word# (minusWord# ww1 1##)))
                           } in
                           case tagToEnum# @ Bool (>=# x1 0#) of wild2 {
                             False -> case $fBinaryColor3 x1 ret_ty (# Attr, Char# #) of {}
                             True
                             -> case tagToEnum# @ Bool (<=# x1 15#) of wild3 {
                                  False -> case $fBinaryColor3 x1 ret_ty (# Attr, Char# #) of {}
                                  True
                                  -> case tagToEnum# @ Color x1 of dt1 { DEFAULT ->
                                     let {
                                       i# :: Int# = word2Int# (uncheckedShiftRL# ww 16#)
                                     } in
                                     case tagToEnum#
                                            @ Bool
                                            (leWord# (int2Word# i#) 1114111##) of wild4 {
                                       False -> case chr2 i# ret_ty (# Attr, Char# #) of {}
                                       True -> (# Attr dt dt1, chr# i# #) } } } } } } } }) -}

comment:7 Changed 3 years ago by MikolajKonarski

Now I remember: I was seeing similar problems in GHC 8 quite a few times, and I've just stumbled on another, unrelated one. I have a big local function used only once. When I INLINE the function, I get 3,421,310,504 bytes allocated in the heap (runtime 5.32s, but there is a much wider measurement error margin than with allocation); when I NOINLINE it, I get 2,932,616,792 (5.17s); and when I leave it alone (I guess GHC inlines it somehow differently), I get 4,309,699,560 (5.57s).

This is with -O1 and generally nothing special in the .cabal file. Alternating between -A1m and -A99m has almost no effect on total allocation, though it does affect GC, which in turn scales proportionally to the total allocation in each case.

(BTW, in a profiling build the left-alone version wins; the numbers are, respectively: 5,315,881,616 bytes allocated in the heap, 4,916,409,824, and 4,738,887,896.)

Last edited 3 years ago by MikolajKonarski (previous) (diff)

comment:8 Changed 3 years ago by simonpj

When I INLINE the function, I get 3,421,310,504 bytes allocated in the heap (runtime 5.32s, but there is much wider measurement error margin that with allocation), when I NOINLINE it, I get 2,932,616,792 (5.17s) and when I leave it alone (I guess GHC inlines it somehow differently), I get 4,309,699,560 (5.57s).

This isn't necessarily surprising. Consider

module M( f, g, h ) where
  f x = BIG
  g x = (f x, True)
  h x = ...(g x)...

Without an INLINE on f, GHC won't inline it (because it's big). But g is small, so it'll get inlined into h, and good things may happen because h can see the pair and True.

But if you add an INLINE pragma to f, then g becomes big, so GHC won't inline it.

These effects can be large, and are very hard to predict. GHC makes no guarantees, I'm afraid.

It's a bit more puzzling that you say your big function is called only once; so it might come down to a race as to whether f gets auto-inlined before g does. That's a bit mysterious I admit.

However a difference between 2.9G and 4.3G is very large, and it would be great to get more insight into why. I use -ticky to investigate this kind of thing.

comment:9 Changed 3 years ago by simonpj

I've compared the resulting Core. The difference is that the version with INLINE recomputes (2 ^ (8 :: Int) - 1) every time, while the manually inlined version uses a value computed just once.

Here's how that could happen:

f x y = (expensive x) + y

g x ys = map (f x) ys

Executed as-is, each call to (f x yi) will evaluate (expensive x) afresh. In this particular example it'd be better if GHC transformed it to

f x = let v = expensive x in 
      \y -> v + y

but GHC's full laziness transformation never separates adjacent lambdas. (Doing so can be very bad in other ways.) But if you INLINE f we get

g x ys = map (\y -> expensive x + y) ys

and now full laziness can float (expensive x) outwards.

To make your program robust, I'd write f with the local let-binding as I do above. Then it shouldn't repeatedly evaluate (expensive x) regardless of optimisation or inlining.
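
A runnable sketch of this advice (the body of expensive is a made-up placeholder; only the shape of f matters):

```haskell
expensive :: Int -> Int
expensive x = x * x * x   -- placeholder for a genuinely costly computation

-- The let sits between the two lambdas, so the result of
-- 'expensive x' is shared by every application of the inner lambda.
f :: Int -> Int -> Int
f x = let v = expensive x
      in \y -> v + y

g :: Int -> [Int] -> [Int]
g x ys = map (f x) ys     -- every element reuses the shared 'v'
```

With this shape, g 3 [1, 2, 3] computes expensive 3 once and yields [28, 29, 30], regardless of whether f is inlined.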

I'm guessing a bit, of course. It could just be a bug. I'm really swamped right now, but maybe I've given you enough to investigate further. If you think it's a bug, it'd be really helpful to boil it down to a smaller example with full repro instructions.

comment:10 in reply to:  8 Changed 3 years ago by MikolajKonarski

Replying to simonpj:

When I INLINE [...], when I NOINLINE it [...] and when I leave it alone [...]

This isn't necessarily surprising. Consider [...]

Thank you for the example. I've fixed the inlining status of all enclosing or competing functions in the module, but the strange behaviour persists. Now I suspect my complaints may not be related after all: 1. constants not being floated out with INLINE, as opposed to identical manual inlining; 2. erratic/unpredictable/surprising/buggy behaviour of INLINE vs NOINLINE vs <nothing>; 3. three different figures for these, instead of two.

I've just opened a feature request ticket for 3: https://ghc.haskell.org/trac/ghc/ticket/12747#ticket. This comment is about 2. If I find a smaller or simpler example for 2 and it's still as surprising, I will open a new bug report for it. As you point out, it's possible that 2 is not a bug, but just an example of GHC being smarter than either of us.

But if you add an INLINE pragma to f, then g becomes big, so GHC won't inline it. [...] These effects can be large, and are very hard to predict. GHC makes no guarantees, I'm afraid.

The guarantee that would help greatly would be that the behaviour of the program with neither INLINE nor NOINLINE for function f is the same as the behaviour with INLINE or that with NOINLINE. It would help tremendously with profiling experiments, because then I could fix the state of inlining of f and tune g and h without worry that inlining of f changes silently. But if f has the lowest allocation only without any pragmas at all, I can't fix it. See the feature request.

It's a bit more puzzling that you say your big function is called only once

I meant syntactically (even taking into account inlining of any other functions). But it's called inside a loop. It's referenced exactly once and fully applied, here: https://github.com/LambdaHack/LambdaHack/blob/e7b31a0d937b6ef6e53665eab23663dcaf4ced81/Game/LambdaHack/Client/UI/DrawM.hs#L273

However a difference between 2.9G and 4.3G is very large, and it would be great to get more insight into why. I use -ticky to investigate this kind of thing.

Thank you for the tips.

Last edited 3 years ago by MikolajKonarski (previous) (diff)

comment:11 Changed 3 years ago by simonpj

The guarantee that would help greatly would be that the behaviour of the program with neither INLINE nor NOINLINE for function f is the same as the behaviour with INLINE or that with NOINLINE.

I think that's very problematic. Currently GHC decides on a call-site-by-call-site basis whether to inline a given function. See module CoreUnfold and in particular callSiteInline. You could change it to do something different but I believe that'd make GHC generate either bigger code or slower code or both. But do try!

INLINE/NOINLINE let you take control; otherwise you are letting GHC decide.

comment:12 in reply to:  11 Changed 3 years ago by MikolajKonarski

I think that's very problematic. Currently GHC decides on a call-site-by-call-site basis whether to inline a given function.[...]

When there is only one call site, IMHO it's reasonable that the outcome should be reproducible with either INLINE or NOINLINE (and it is not currently, see above).

INLINE/NOINLINE let you take control; otherwise you are letting GHC decide.

I'm all for letting GHC outsmart me, but I'd like to be able to then fix the result GHC came up with and tweak it further, or tweak other bits of code while keeping this part constant, or stick to it in order to debug my program with slightly varying other portions of code, or easily come up with a minimized example for a GHC bug, without GHC sneakily interfering. Currently I can't. Let's move the discussion to https://ghc.haskell.org/trac/ghc/ticket/12747#ticket, where I also suggest that INLINABLE+inline+noinline should let the programmer reproduce GHC's choices in the multi-call-site case.

comment:13 Changed 3 years ago by MikolajKonarski

Phew, you goaded me into spending some extra time and using some extra pragmas; I managed to concoct a tiny example that reproduces the original problem (many times slower with INLINE vs manual inlining that exactly mimics the supposed GHC behaviour; allocation is the same). I haven't checked, but most probably in the INLINE version the constants are not floated out, just as in the Core of the original problem shown above.

import Data.Bits (unsafeShiftR, (.&.))
import Data.Word (Word32)

-- ghc --make Main.hs -O1; ./Main +RTS -s -RTS

seqFrame2 :: [AttrW32] -> IO ()
{-# NOINLINE seqFrame2 #-}
seqFrame2 l = do
  let crux = attrCharFromW32
  --   Total   time    2.052s  (  2.072s elapsed)
  -- let crux = attrCharFromW32'
  --   Total   time    7.896s  (  7.929s elapsed)
  mapM_ (\a -> crux a `seq` return ()) l

main :: IO ()
main = seqFrame2 $ replicate 100000000 $ AttrW32 0


data Attr = Attr !Int !Int  --- bangs here are essential

newtype AttrW32 = AttrW32 {attrW32 :: Word32}

attrCharFromW32 :: AttrW32 -> Attr
{-# NOINLINE attrCharFromW32 #-}
attrCharFromW32 w =
        Attr (fromEnum $ unsafeShiftR (attrW32 w) 8 .&. (2 ^ (8 :: Int) - 1))
             (fromEnum $ attrW32 w .&. (2 ^ (8 :: Int) - 1))

fgFromW32 :: AttrW32 -> Int
{-# INLINE fgFromW32 #-}
fgFromW32 w = fromEnum $ unsafeShiftR (attrW32 w) 8 .&. (2 ^ (8 :: Int) - 1)

bgFromW32 :: AttrW32 -> Int
{-# INLINE bgFromW32 #-}
bgFromW32 w = fromEnum $ attrW32 w .&. (2 ^ (8 :: Int) - 1)

attrCharFromW32' :: AttrW32 -> Attr
{-# NOINLINE attrCharFromW32' #-}
attrCharFromW32' w = Attr (fgFromW32 w) (bgFromW32 w)
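
For comparison, a source-level workaround in the spirit of comment:9 (a sketch, not part of the ticket; the top-level binding mask is an invented name) is to bind the constant once, so it is shared no matter how the accessors are inlined:

```haskell
import Data.Bits (unsafeShiftR, (.&.))
import Data.Word (Word32)

newtype AttrW32 = AttrW32 {attrW32 :: Word32}

-- Bound once at the top level (a CAF), so the mask is computed a
-- single time, however the accessors below end up being inlined.
mask :: Word32
mask = 2 ^ (8 :: Int) - 1

fgFromW32 :: AttrW32 -> Int
{-# INLINE fgFromW32 #-}
fgFromW32 w = fromEnum $ unsafeShiftR (attrW32 w) 8 .&. mask

bgFromW32 :: AttrW32 -> Int
{-# INLINE bgFromW32 #-}
bgFromW32 w = fromEnum $ attrW32 w .&. mask
```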

comment:14 Changed 3 years ago by simonpj

Terrific, thanks. So what change do I make to the source code to exhibit the change in perf?

comment:15 Changed 3 years ago by MikolajKonarski

In the line let crux = attrCharFromW32, add a prime at the end; it should then be ~4 times slower, as indicated.

comment:16 Changed 3 years ago by simonpj

Thanks. Sorry for missing that.

comment:17 Changed 3 years ago by MikolajKonarski

[Edit: simpler example is just below.]

I have an example that may or may not capture the original case 2 (allocation bloat due to INLINE, here by a factor of 2000). Perhaps it's just INLINE pushing the subexpression mapVT n over the threshold beyond which some kind of simplification and/or floating out is no longer done. If it's interesting, please let me know and I will file a new bug report.

{-# LANGUAGE BangPatterns, RankNTypes #-}
import Control.Monad.ST.Strict
import qualified Data.IntMap.Strict as IM
import Data.List
import Data.Maybe

-- ghc --make -O1 InlineBloat.hs; ./InlineBloat +RTS -s

data P = P !Int

instance Enum P where
  fromEnum (P x) = x
  toEnum n = undefined

main = do
  let {-# NOINLINE z #-}
      z = 44
      dis :: Int -> ()
      {-# INLINE dis #-}  -- change here to NOINLINE
--
-- with INLINE:
--           384,409,080 bytes allocated in the heap
-- with NOINLINE:
--               169,080 bytes allocated in the heap
      dis pI =
        let p0 = let (_, x) = pI `quotRem` z in P x
            p1 = let (y, _) = pI `quotRem` z in P y
            !_ = isJust $ IM.lookup (fromEnum p0) IM.empty
            !_ = isJust $ IM.lookup (fromEnum p1) IM.empty
        in ()
      {-# NOINLINE l #-}
      l = [0..1600]
      mapVT :: forall s. Int -> ST s ()
      {-# INLINE mapVT #-}
      mapVT _ = mapM_ (\x -> return $! dis x) l
      !runRes = foldl' (\() n -> runST (mapVT n)) () [1..10000]
  return ()

Last edited 3 years ago by MikolajKonarski (previous) (diff)

comment:18 Changed 3 years ago by MikolajKonarski

Here is an even simpler version of the INLINE allocation bloat, but with a smaller difference vs NOINLINE. I guess virtually every feature of this example is needed to trigger the bloat.

{-# LANGUAGE RankNTypes #-}
import Control.Monad.ST.Strict
import qualified Data.IntMap.Strict as IM

-- ghc --make -O1 InlineBloat.hs; ./InlineBloat +RTS -s

data P = P Int

instance Enum P where
  fromEnum (P x) = x
  toEnum n = undefined

main = do
  let {-# NOINLINE z #-}
      z = 44
      dis :: Int -> ()
      {-# INLINE dis #-}  -- change here to NOINLINE and observe lower alloc
      dis pI =
        let p0 = let (_, x) = pI `quotRem` z in P x
            p1 = let (_, x) = pI `quotRem` z in P x
            m = IM.lookup (fromEnum p0) IM.empty
            b = IM.member (fromEnum p1) IM.empty
        in m == Just 'c' `seq` b `seq` ()
      {-# NOINLINE l #-}
      l = [0..10000000]
      mapVT :: forall s. () -> ST s ()
      {-# INLINE mapVT #-}
      mapVT _ = mapM_ (\x -> return $! dis x) l
  return $! runST (mapVT ())

comment:19 Changed 3 years ago by MikolajKonarski

Description: modified (diff)

I've created a separate ticket for case 2 (much more allocation with INLINE than NOINLINE): https://ghc.haskell.org/trac/ghc/ticket/12781

So now cases 3 and 2 have separate tickets, and I'm moving (a new version of) the minimal example for case 1 to the main ticket description.

comment:20 Changed 3 years ago by MikolajKonarski

comment:21 Changed 3 years ago by MikolajKonarski

Description: modified (diff)
Priority: normal → high

IMHO, inlining forced with INLINE is just broken in GHC 8.0.1 --- it omits some optimizations that GHC normally performs when inlining without pragmas (this may be related to the fact that in .hi files, the code of functions marked INLINE is not optimized, as opposed to INLINABLE or no pragma). So I'm bumping the priority.

comment:22 Changed 3 years ago by MikolajKonarski

One more clue: INLINE vs. (INLINABLE + inline at each call site) produce different code --- at least the binary size differs: the one with INLINE yields bigger executables; possibly also slower code, but I haven't pumped up the difference enough to be sure.

comment:23 Changed 3 years ago by MikolajKonarski

Type of failure: None/Unknown → Runtime performance bug

comment:24 Changed 3 years ago by MikolajKonarski

Type: task → bug

comment:25 Changed 3 years ago by MikolajKonarski

Description: modified (diff)

comment:26 Changed 3 years ago by simonpj

OK I know what is happening here.

Without an INLINE on fgFromInt we get:

-- Initially
  fgFromInt w = w + (2^8)
  attrFromIntINLINE w = Attr (fgFromInt w)

-- After float-out
  lvl = 2^8
  fgFromInt w = w + lvl
  attrFromIntINLINE w = Attr (fgFromInt w)

-- After inlining
  attrFromIntINLINE w = case w of I# w' ->
                        case lvl of I# lvl' ->
                        Attr (w' +# lvl')

The Attr constructor has one strict field, which is represented unboxed. We only compute (2^8) once.

But with an INLINE on fgFromInt we get this:

-- Initially
  fgFromInt w = w + (2^8)
  attrFromIntINLINE w = Attr (fgFromInt w)

-- After float-out
  lvl = 2^8
  fgFromInt w = w + lvl
    {- INLINE rhs = w + (2^8) -}
  attrFromIntINLINE w = Attr (fgFromInt w)

-- After inlining
  attrFromIntINLINE w = case w of I# w' ->
                        case 2# ^# 8# of lvl' ->
                        Attr (w' +# lvl')

The INLINE pragma promises to inline what you wrote, not some optimised version thereof. So we inline w + (2^8). In principle we should optimise that just as well after it has been inlined. We've missed the first float-out pass, but there's another one later.

Alas, however, by the time the second float-out pass runs, the (2^8) has been transformed to its unboxed form, and currently we don't float those. As a result we compute (2^8) on each iteration, rather than just once.

There's even a Note about it in SetLevels:

Note [Unlifted MFEs]
~~~~~~~~~~~~~~~~~~~~
We don't float unlifted MFEs, which potentially loses big opportunities.
For example:
        \x -> f (h y)
where h :: Int -> Int# is expensive. We'd like to float the (h y) outside
the \x, but we don't because it's unboxed.  Possible solution: box it.

What to do? The bad thing is really the inability to float expressions of unboxed type, because that inability makes GHC vulnerable to these "phase effects", where optimisation depends delicately on the exact order in which things happen.

Perhaps we could fix that by boxing. So, just before doing the floating stuff SetLevels could do

    e :: Int#
--->
    (case e of y -> \void. y) void
--->
    let v = /\a -> case e of y -> \void. y
    in v a void

at least when e is not cheap. Now it can float the (case e of y -> I# y).

comment:27 Changed 3 years ago by MikolajKonarski

Thank you for looking into this. A naive question: why can't we perform the programmer-accessible INLINE very, very early, macro-like, and save the smart stuff for INLINE[k], some new INLINE*, or the spontaneous inlining that GHC does without pragmas? Why not assume that a programmer brave enough to use a pragma knows his code and his data, and that he has benchmarked his code well enough to be sure he wants something totally equivalent to manual inlining but without sacrificing code quality?

For me, 'INLINE' (or 'INLINABLE' and 'inline', when I need more control) serves two purposes. The first is simply forcing a particular trade-off between code duplication and speed. The second is benchmarking and optimization: I add INLINE and NOINLINE to several related functions and, based on which combination GHC compiles to faster code, I then rewrite the functions to make sure the ones that need to be inlined have only one call site, etc. (I know GHC can sometimes do that for me, but I neither need nor want to rely on that).

Of these two, the second purpose is more important, because I can always inline things manually in the final code, but if I had to inline manually while tweaking and benchmarking I'd go mad. And this is why INLINE has to have a clear, simple, deterministic semantics, close enough to the semantics of manual inlining. If GHC outsmarts me and compiles faster code without any pragmas (or with the INLINE* superpragma), all the better; I would experiment and remove some (NO)INLINEs from the final code. But for benchmarking, I need GHC to be dumb wrt the matrix of variables (INLINE/NOINLINE on a few functions) that I tweak. And ideally, I'd like to be able to manually tweak a few more knobs that also translate directly to source-code manipulations, like FLOATOUT and NOFLOATOUT, etc.

Perhaps I should just switch to Rust, which is specifically designed for manual control, but perhaps it's possible to make optimizing with GHC more like computer-aided proving, where the proof can always be inspected manually (and, based on that, the set of tactics and orders for the prover modified), and less like operating an immensely powerful black box that knows better.

comment:28 Changed 3 years ago by simonpj

It's all to do with interactions between multiple INLINEs. If we did as you say, and inlined earlier, then we would also inline functions like (+) which also have INLINE pragmas on them. So we might get

  attrFromIntINLINE w = case w of I# w' ->
                        case 2# ^# 8# of lvl' ->
                        Attr (w' +# lvl')

which, as I say, is currently unfloatable. You may say that the INLINE on (+) should be delayed and that would be one solution. But perhaps a better one would be to make floating more robust.

comment:29 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In bc3d37d/ghc:

Float unboxed expressions by boxing

This patch makes GHC's floating more robust, by allowing it
to float unboxed expressions of at least some common types.

See Note [Floating MFEs of unlifted type] in SetLevels.

This was all provoked by Trac #12603

comment:30 Changed 3 years ago by simonpj

Mikolaj, I think the above patch will make it robust for you. Can you check?

Meanwhile, I'm going to leave this open because I also want to investigate making INLINE pragmas fire in the "gentle" phase, on the grounds that that's what the programmer said.

Simon

comment:31 Changed 3 years ago by MikolajKonarski

Mikolaj, I think the above patch will make it robust for you. Can you check?

Thank you. I will check and report back when your patch is back in HEAD and when I manage to make my package compile with HEAD (the parsec problem is gone, thank you; I'm currently prodding the maintainers of other packages that I depend on, mostly broken due to changes in cabal). This way I can avoid compiling GHC and instead use a nightly HEAD build (e.g. the one here: https://launchpad.net/~hvr/+archive/ubuntu/ghc).

comment:32 Changed 3 years ago by bgamari

Owner: bgamari deleted

comment:33 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In 432f952/ghc:

Float unboxed expressions by boxing

This patch makes GHC's floating more robust, by allowing it
to float unboxed expressions of at least some common types.

See Note [Floating MFEs of unlifted type] in SetLevels.

This was all provoked by Trac #12603

In working this through I also made a number of other corner-case
changes in SetLevels:

* Previously we inconsistently use exprIsBottom (which checks for
  bottom) instead of exprBotStrictness_maybe (which checks for
  bottoming functions).  As well as being inconsistent it was
  simply less good.

  See Note [Bottoming floats]

* I fixed a case where we were unprofitably floating an
  expression because we thought it escaped a value lambda
  (see Note [Escaping a value lambda]).  The relevant code is
       float_me = (dest_lvl `ltMajLvl` (le_ctxt_lvl env)
                  && not float_is_lam)   -- NEW

* I made lvlFloatRhs work properly in the case where abs_vars
  is non-empty.  It wasn't wrong before, but it did some stupid
  extra floating.

comment:34 Changed 3 years ago by simonpj

The patch was reverted because (oddly) it made a couple of GHCi-debugger tests behave differently. I worked some more on it, fixing some extra things as above, and behold, the debugger problems went away.

So I've re-pushed it: see comment:33.

I still want to try the effect of earlier inlining of INLINE things... working on that now.

Last edited 3 years ago by bgamari (previous) (diff)

comment:35 Changed 3 years ago by bgamari

Resolution: fixed
Status: newclosed

comment:36 Changed 3 years ago by MikolajKonarski

Phab link for the "I still want to try the effect of earlier inlining of INLINE things" part: https://phabricator.haskell.org/D3203

comment:37 Changed 3 years ago by simonpj

It's committed in HEAD.

commit 2effe18ab51d66474724d38b20e49cc1b8738f60
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Tue Feb 28 16:07:20 2017 -0500

    The Early Inline Patch
    
    This very small patch switches on sm_inline even in the InitialPhase
    (aka "gentle" phase).   There is no reason not to... and the results
    are astonishing.

Back to you Mikolaj!

comment:38 Changed 3 years ago by MikolajKonarski

I've checked the original big project with GHC 8.2.1-rc1, and based on non-scientific benchmarks, the problem is gone. BTW, the program also runs ~10% faster than with GHC 8.0.1, but I can't rule out that this is caused by new versions of other packages rather than GHC itself.
