Opened 6 years ago

Last modified 3 years ago

#8662 new bug

GHC does not inline cheap inner loop when used in two places

Reported by: nh2 Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.6.3
Keywords: Inlining Cc: mail@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

When I made a Criterion benchmark of Neil Michell's "Tight Inner Loop" blog post (http://neilmitchell.blogspot.co.uk/2014/01/optimising-haskell-for-tight-inner-loop.html), I noticed that GHC 7.6.3 will not inline the performance-critical function when it's called from two different places (inside the same module).

break0_2pleaseinline :: (Char -> Bool) -> ByteString0 -> (ByteString, ByteString0)
break0_2pleaseinline f (BS0 bs) = (BS.unsafeTake i bs, BS0 $ BS.unsafeDrop i bs)
    where
        i = Internal.inlinePerformIO $ BS.unsafeUseAsCString bs $ \ptr -> do
            let start = castPtr ptr :: Ptr Word8
            let end = go start
            return $! Ptr end `minusPtr` start

        go s@(Ptr a) | c == '\0' || f c = a
                     | otherwise = go $ inc s
            where c = chr s

versionInl1 :: ByteString0 -> (ByteString, ByteString0)
versionInl1 str = break0_2pleaseinline test str
  where
    test x = x <= '$' && (x == ' ' || x == '\r' || x == '\n' || x == '$')

versionInl2 :: ByteString0 -> (ByteString, ByteString0)
versionInl2 str = break0_2pleaseinline test str
  where
    test x = x <= '$' && (x == ' ' || x == '\r' || x == '\n' || x == '$')

Full code here: https://github.com/nh2/inner-loop-benchmarks/blob/6715d1e9946d6b5e6d9bb53203982ed3d2ed32ff/Bench.hs#L166.

break0_2pleaseinline does not get inlined, which makes versionInl1 and versionInl2 over 4 times slower than when inlined. The inlining doens't happen, not even with -O2 and -O3, only an INLINE pragma will move GHC to do it.

If I was a compiler, I would so inline that function!

I am surprised that GHC doesn't decide to do so.

Change History (4)

comment:1 Changed 6 years ago by schyler

Maybe you just need to tune the numbers higher.

Will -funfolding-use-threshold=16 do it?

Maybe this is relevant to #1371

Edit: Some naive fiddling. ghc -O2 -optlo=-O3 -fllvm -funfolding-use-threshold=100 -funfolding-creation-threshold=100 -fforce-recomp Trac.hs doesn't do anything, and versionInl{1,2} are performing 20x worse than version7 no matter what I adjust those to.

Edit2: Success! I was able to get -funfolding-use-threshold=1000 -funfolding-creation-threshold=1000 to match version6 performance, but still 10us slower than version7.

GHC probably shouldn't do this automatically.

I'd very much like to know what makes them 10us (~20%) slower even when they are inlined completely. Maybe that's a performance bug that can be looked into instead?

Last edited 6 years ago by schyler (previous) (diff)

comment:2 Changed 6 years ago by nh2

Why should it not do this automatically?

It is a pretty cheap function (from my human-not-compiler perspective).

Why do you need to set such a big threshold to get it inlined? Is there a page that describes how these thresholds work?

comment:3 Changed 6 years ago by simonpj

GHC decides when to inline like this:

  • It computes the "size" of the function
  • At a call site, it takes the size of the function, subtracts a call-site "discount", and if the result is less than the "unfolding threshold" it does the inlining.
  • The discount is increased if both (a) an actual argument has some structure (eg is a constructor application) and (b) that argument is scrutinised by a case expression in the function body.
  • There is also a "result discount" if (a) the function call is consumed by a case, and (b) the function body returns a constructor application

All this is computed in module CoreUnfold, so it is nicely separated from the rest of the compiler.

Do by all means have a go at changing the computation of size or discount. Then measure (a) the change in code size and complication time vs (b) the change in allocation and/or run-time. If you can reduce (b) without increasing (a) you are winning!

But it's easy to mess up. I have spent many happy hours fiddling with these heuristics.

Simon

comment:4 Changed 3 years ago by mpickering

Keywords: Inlining added
Note: See TracTickets for help on using tickets.