Opened 2 years ago

Last modified 2 years ago

#14211 new bug

Compiler is unable to INLINE as well as the programmer can manually

Reported by: harendra Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.2.1
Keywords: StaticArgumentTransformation, Inlining Cc: mpickering
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:


The test case is in this repo on the inlining-issue branch:

Performance with manually inlining a function is more than 10% faster compared to factoring out code and using INLINE pragma.

stack bench for compiler inlined code

time                 46.71 ms   (45.53 ms .. 47.79 ms)

stack bench --flag ghc-perf:manual for manually inlined code

time                 39.46 ms   (38.92 ms .. 39.94 ms)

Here is the relevant code:

{-# INLINE bindWith #-}
    :: (forall c. AsyncT m c -> AsyncT m c -> AsyncT m c)
    -> AsyncT m a
    -> (a -> AsyncT m b)
    -> AsyncT m b
bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
    let run x = (runAsyncT x) Nothing stp yld
        yield a _ Nothing  = run $ f a
        yield a _ (Just r) = run $ f a `k` (bindWith k r f)
    in m Nothing stp yield

instance Monad m => Monad (AsyncT m) where
    return a = AsyncT $ \ctx _ yld -> yld a ctx Nothing

    AsyncT m >>= f = AsyncT $ \_ stp yld ->
        let run x = (runAsyncT x) Nothing stp yld
            yield a _ Nothing  = run $ f a
            yield a _ (Just r) = run $ f a <> (r >>= f)
        in m Nothing stp yield
    (>>=) = bindWith (<>)

I have seen this many times.

Change History (8)

comment:1 Changed 2 years ago by mpickering

Keywords: StaticArgumentTransformation Inlining added

bindWith is a self-recursive function so GHC won't inline it by itself.

If you rewrite bindWith to

    :: (forall c. AsyncT m c -> AsyncT m c -> AsyncT m c)                          
    -> AsyncT m a                                                                  
    -> (a -> AsyncT m b)                                                           
    -> AsyncT m b                                                                  
bindWith k m f = go m                                                              
    where go (AsyncT m) =                                                          
            AsyncT $ \_ stp yld ->                                                 
            let run x = (runAsyncT x) Nothing stp yld                              
                yield a _ Nothing  = run $ f a                                     
                yield a _ (Just r) = run $ f a `k` (go r)                          
            in m Nothing stp yield

so it delegates the recursion to the go function then it seems like the performance is the same.

This is the transformation the static argument transformation is meant to achieve (turned on by -fstatic-argument-transformation). It didn't seem to work just by turning on this flag but I don't have time to investigate more now.

comment:2 Changed 2 years ago by harendra

That information is a gem for optimization! I need to change how I code recursive functions. Every programmer must know that a static argument impacts performance significantly and can be easily optimized. It not only recovered the lost performance but gave a lot more because the manually inlined bind function also has a static argument which got removed in this version.

I think the problem is not really inlining. GHC won't be able to inline the recursive call of a function to itself, of course, but it can still inline the whole function. So bindWith is likely getting inlined, I can see that with -ddump-inlinings. I guess, the problem is that we have added one more static argument which degrades the performance. The compiler is able to inline the function but it cannot undo that static argument.

I see that the INLINE section in the GHC manual mentions mutually recursive and recursive functions, however it does not mention -fstatic-argument-transformation. It may be useful to mention because the programmer naively expects that inlining will give us equivalent code but in reality the added static argument is an overhead that cannot be reverted by the compiler.

I have a few questions, suggestions:

1) The INLINE section of the manual should mention -fstatic-argument-transformation and how factoring out of code can add a static argument that cannot be undone by the compiler unless we use this option. Also, an example of how the programmer can manually perform static argument transformation and achieve what (s)he wants in case this option does not work.

2) Can we have a warning option which warns the programmer about missed inlining? If I have placed an INLINE pragma and the compiler is unable to inline the function for whatever reason I want to know. The -ddump-inlinings shows what got inlined but what did not get inlined will be more useful information.

comment:3 Changed 2 years ago by mpickering

There is a warning in core lint which warns about INLINE pragmas on loops breakers but I don't know why it is not on by default.

comment:4 Changed 2 years ago by mpickering

I looked at this issue all of this afternoon and don't feel any closer to understanding what is going on but have some diagnosis.

Manually performing SAT makes the program much faster. The SAT pass itself does nothing to the definition when there is an INLINE pragma on the definition. Removing the INLINE pragma causes SAT is happen but makes the program much slower as it is not inlined. Adding -fexpose-all-unfoldings to the defining module again makes the program much faster.

I also noticed some interaction with SAT and inline pragmas, obviously the unsatted definition is included as the unfolding when an INLINE pragma is present even if it is a loop-breaker. SAT has the effect of changing a definition from a loop-breaker into an inlinable function but because we only export one unfolding this is then not usable across modules. All very unsatisfactory.

comment:5 Changed 2 years ago by harendra

Turn on heavyweight intra-pass sanity-checking within GHC, at Core level. (It checks GHC’s sanity, not yours.)

Someone who is not working on GHC would not even look at such flags and rightly so. We need a flag for regular programmers helping them in the optimization process, something like -Wmissed-specialisations or does that flag cover inlining as well?

comment:6 Changed 2 years ago by harendra

Thanks Matthew for looking into this. It looks like as of now inline and SAT are exclusive to each other. What's the solution to this - we either need to inline the SATed definition or can we do SAT again after inlining?

I observed that my package had several SAT opportunities that I fixed manually but -fstatic-argument-transformation did not make any difference, it could be because of this same reason perhaps.

comment:7 Changed 2 years ago by simonpj

I'm a bit lost in all this. Is there a small reproducible test case I can look at? What exactly is being compared with what?

I'm also lost with the SAT discussion. I think SAT is off by default; and it usually applies to recursive definitions, which in turn usually do not have ININE pragmas.

Sorry to be dim. Maybe start from zero and explain one bit at a time?

comment:8 Changed 2 years ago by mpickering

The reproducible example is in this github repo in the inlining-issue branch.

You can compile it with cabal new-build all and then run the benchmark.

The interesting point is the definition of bindWith which is a self-recursive function with 2 static arguments.

Harendra observes that by "manually inlining" it, ie replacing the arguments with the statically known values produces much faster code than GHC produces on its own.

I then observe that

  1. Removing the INLINE pragma and turning on -fstatic-argument-transformation has the same effect.
  2. The INLINE (or INLINABLE) pragma inhibits -fstatic-argument-transformation by creating a mutually recursive group (see
  3. However, there is no way to make sure the optimised satted unfolding for bindWith is exposed so it can be inlined across modules (hence the comment about -fexpose-all-unfoldings).

Harendra then has two questions about how this information should be communicated to users.

  1. Whether the INLINE section should mention about SAT.
  2. Whether we have a warning option which warns the programmer about missed inlining. This is only communicated by running -dcore-lint currently.

Does that clear things up?

Note: See TracTickets for help on using tickets.