Opened 20 months ago

Last modified 14 months ago

#14827 new feature request

Recognize when inlining would create a join point

Reported by: ersetzen
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 8.2.2
Keywords: JoinPoints
Cc: sgraf, nh2
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description (last modified by ersetzen)

The discussion that prompted this ticket revolved around a program that runs 10x faster under ghci than when compiled.

One way to solve this is to remove a superfluous INLINE pragma, which allows the following transformation to happen:

letrec {
  f a = case e of {
    p1 -> f a';
    p2 -> (# l, r #);
  }
} in case f e2 of { (# l, r #) -> e3; }

into

joinrec {
  f a = case e of {
    p1 -> jump f a';
    p2 -> e3;
  }
} in jump f e2

More generally, consider a recursive let binding that is called exactly once from the outside. If all recursive calls are tail calls and the outside call isn't, then we could safely move the binding to that call site and end up with join points. In this case it means a 10x speedup, so it might be worth doing in general.

letrec { f_i = e_i; } in ... (f_j e) ...   =>   ... (joinrec { f_i = e_i; } in jump f_j e) ...

Self-contained example
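
A minimal hypothetical sketch of this shape (not the original example; the module and names below are made up): go is recursive, every recursive call of go is a tail call, and its single call from the outside is scrutinised, so that call is not a tail call.

{-# LANGUAGE UnboxedTuples #-}
module Sketch where

-- Hypothetical stand-in for the shape in the transformation above.
longest :: Int -> Int
longest n = case go n 0 of (# l, r #) -> l + r   -- the one outside call; not a tail call
  where
    go :: Int -> Int -> (# Int, Int #)
    go i acc
      | i <= 0    = (# acc, n #)           -- corresponds to  p2 -> (# l, r #)
      | otherwise = go (i - 1) (acc + i)   -- corresponds to  p1 -> f a'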

Change History (14)

comment:1 Changed 20 months ago by ersetzen

Description: modified (diff)

comment:2 Changed 20 months ago by nomeata

This is a duplicate of #14068, isn’t it?

comment:3 Changed 20 months ago by nomeata

Can you check whether my branch (wip/14068) fixes the problem?

comment:4 Changed 20 months ago by sgraf

Seeing the problem so clearly spelled out, I'm pretty sure this can be solved by loopification.

comment:5 Changed 20 months ago by simonpj

Great example. Can someone explain to me why GHCi is faster? And what it has to do with INLINE pragmas?

Looking only at the Description: yes, loopification should do a good job here. We'd get this:

  letrec
     f a = case e of
             p1 -> f a'
             p2 -> (# l, r #)
  in case f e2 of (# l, r #) -> e3

==>  (loopify)

  let
     f a = joinrec jf a = case e of
                             p1 -> jump jf a'
                             p2 -> (# l, r #)
           in jump jf a
  in case f e2 of (# l, r #) -> e3

==> (inline `f`)

  case (let a = e2 in
        joinrec jf a = case e of
                         p1 -> jump jf a'
                         p2 -> (# l, r #)
        in jump jf a) of (# l, r #) -> e3

==> (move the outer case inwards, into the RHS of the joinrec)

  let a = e2 in
  joinrec jf a = case e of
                   p1 -> jump jf a'
                   p2 -> case (# l, r #) of (# l, r #) -> e3
  in jump jf a

==> (case of known constructor; and inline e2)

  joinrec jf a = case e of
                   p1 -> jump jf a'
                   p2 -> e3
  in jump jf e2

as desired.
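
At the source level, the loopify step corresponds roughly to this rewrite (a hypothetical sketch with made-up names, not output from GHC or the branch):

module LoopifySketch where

-- Before: f is self-recursive, so the simplifier will not inline it at its
-- single call site.
f :: Int -> (Int, Int)
f a | a > 100   = (a, a * 2)
    | otherwise = f (a + 1)

-- After loopify: the recursion is pushed into a local loop jf, leaving the
-- wrapper non-recursive; the simplifier can then inline it at the call site
-- and turn jf into a joinrec, as in the steps above.
fLoopified :: Int -> (Int, Int)
fLoopified a0 = jf a0
  where
    jf a | a > 100   = (a, a * 2)
         | otherwise = jf (a + 1)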

comment:6 Changed 20 months ago by ersetzen

Oh, you are right, this totally falls under loopification. Curiously enough, wip/T14068 doesn't solve this, though. I think it gets stuck at the inline `f` step. Here is an excerpt of the simplifier dump. I was always under the impression that non-recursive let-bound variables that are used exactly once (which is the case here) are inlined unconditionally, but that doesn't happen here.

Here is a slightly more inlined version. Removing {-# INLINE indices #-} allows buildTable to be inlined, which removes the call. Looking at the Cmm, scan saves and restores 96 bytes of data when calling buildTable, so that might be a large part of the performance drop?

Side note: Neither the loopification branch nor GHC HEAD runs faster with ghci.

comment:7 Changed 20 months ago by sgraf

Cc: sgraf added

comment:8 Changed 20 months ago by simonpj

I'm too swamped to follow up on why f doesn't get inlined on the branch, but I know that Joachim is planning to get back to loopification.

Side note: Neither the loopification branch nor GHC HEAD runs faster with ghci.

The Description says that ghci is 10x faster (in 8.2.2). So in HEAD, do we have the 10x performance in both cases (i.e. good), or the 1x performance in both cases (i.e. bad)?

I still don't understand why GHCi wins so big in any version.

comment:9 Changed 20 months ago by ersetzen

It turns out there was an issue that's really easy to miss with liberate-case turned on:

useSite :: Int -> Int
useSite i = 10 * (delayedInline i)

delayedInline :: Int -> Int
delayedInline i = inline i
{-# INLINE [1] delayedInline #-}

inline :: Int -> Int
inline nlen = loop 0
  where
    shouldFloat i
      | i > 100   = i
      | otherwise = shouldFloat (i + 1)
    loop i
      | i > 5     = 0
      | otherwise = loop (i + skip)
      where !skip = shouldFloat nlen
{-# INLINE inline #-}
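
(For reference, "shouldFloat is floated out", as discussed next, amounts to the hand-written version below; a sketch with a made-up name, assuming BangPatterns, which the original !skip already requires.)

inlineFloated :: Int -> Int
inlineFloated nlen = loop 0
  where
    !skip = shouldFloat nlen            -- computed once, outside the loop
    shouldFloat i
      | i > 100   = i
      | otherwise = shouldFloat (i + 1)
    loop i
      | i > 5     = 0
      | otherwise = loop (i + skip)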

First, inline is optimized as expected and shouldFloat is floated out. Then delayedInline is inlined using the original code, and in that copy shouldFloat remains inside loop:

-- RHS size: {terms: 24, types: 9, coercions: 0, joins: 1/1}
inline
inline
  = \ nlen_a1u2 ->
      case nlen_a1u2 of { I# ww_s31T ->
      case $wshouldFloat_s31Y ww_s31T of ww_s31X { __DEFAULT ->
      joinrec {
        $wloop_s328
        $wloop_s328 ww_s326
          = case tagToEnum# (># ww_s326 5#) of {
              False -> jump $wloop_s328 (+# ww_s326 ww_s31X);
              True -> lvl_s2ZH
            }; } in
      jump $wloop_s328 0#
      }
      }

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
delayedInline
delayedInline = inline

-- RHS size: {terms: 37, types: 14, coercions: 0, joins: 2/3}
useSite
useSite
  = \ w_s32l ->
      case w_s32l of { I# ww_s32o ->
      joinrec {
        $wloop_s32k
        $wloop_s32k ww_s32i
          = let {
              lvl_s33v
              lvl_s33v = tagToEnum# (># ww_s32i 5#) } in
            joinrec {
              $wshouldFloat_s32e
              $wshouldFloat_s32e ww_s32c
                = case tagToEnum# (># ww_s32c 100#) of {
                    False -> jump $wshouldFloat_s32e (+# ww_s32c 1#);
                    True ->
                      case lvl_s33v of {
                        False -> jump $wloop_s32k (+# ww_s32i ww_s32c);
                        True -> lvl_s2ZH
                      }
                  }; } in
            jump $wshouldFloat_s32e ww_s32o; } in
      jump $wloop_s32k 0#
      }

Branch and HEAD ghci run with 1x performance for me, i.e. bad.

comment:10 Changed 20 months ago by simonpj

Thanks. But that code does not look too bad to me. Yes, shouldFloat is not floated, but it's just a loop, so floating it would not really change its performance.

The badness, perhaps, is that lvl_s33v :: Bool is allocated (as a thunk) on every iteration of $wloop.

Can you show the 10x-faster code? Perhaps by removing the pragmas or something?

Incidentally, Joachim's exitification pass (in HEAD and 8.4) will remove the inefficient thunk for lvl_s33v. Does it go 10x faster?

comment:11 Changed 20 months ago by ersetzen

Here are both versions and the corresponding dump-simpl output (only the last line is changed).

I had to rewrite it somewhat because the original created a 10k-line Core function. This version created ~450 lines of Core when I compiled it with -O2 -ddump-simpl -ddump-stg -dsuppress-uniques -dsuppress-all -fno-liberate-case -ddump-to-file -fforce-recomp -fno-spec-constr -ticky -ticky-LNE, which admittedly is a bit of a mouthful.

I think the ticky output first is probably simplest. Without the INLINE pragma:

**************************************************

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
      15847     380328          0   0                      lvl2{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4DX
     302632   24210560          0   0                      lvl5{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4ER
   63931922          0          0   1 i                    $wcandidateMatch{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4ER
  135874515          0          0   0                      $j{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4ER
  136224692          0          0   1 i                    $wscan{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun) in s4Eu
   21810147   36418408          0   3 iwi                  $wbuildTable{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4E5
      63392          0          0   2 SC                   snoc'{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun)
     366024          0          0   1 L                    checkAll{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4DX
     142632    4057088          0   1 L                    go1{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun) in s4DR
      16029    1014272          0   1 L                    go{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun) in rj
        182          0          0   2 LS                   go1{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in rj
          1         16          0   1 L                    longestCommonSubstring{v} (fun)
          4         96          0   0                      main1{v} (fun)
          1          0          0   0                      main4{v} (fun)
          1          0          0   0                      main{v} (fun)

**************************************************

With the INLINE pragma:

**************************************************

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
      15847     380328          0   0                      lvl2{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4Ho
     302632   24210560          0   0                      lvl3{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4Ih
   63931922          0          0   1 i                    $wcandidateMatch{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4Ih
  135874515          0          0   0                      $j{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4Ih
 8551263604          0          0   3 iwi                  $wbuildTable{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun) in s4Hw
  136224692          0          0   1 i                    $wscan{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun) in s4Hw
      63392          0          0   2 SC                   snoc'{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun)
     366024   47624072          0   1 L                    checkAll{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in s4Ho
     142632    4057088          0   1 L                    go1{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun) in s4Hi
      16029    1014272          0   1 L                    go{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (fun) in rj
        182          0          0   2 LS                   go1{v} (Profile-0.1.0.0-95VECWVHYo03uC96ITYgBw:Lib) (LNE) in rj
          1         16          0   1 L                    longestCommonSubstring{v} (fun)
          4         96          0   0                      main1{v} (fun)
          1          0          0   0                      main4{v} (fun)
          1          0          0   0                      main{v} (fun)

**************************************************

Removing the INLINE pragma moves the result allocation from checkAll to $wbuildTable, and we don't have to allocate the $wbuildTable closure since it becomes a join point.

More drastically, the $wbuildTable entries go down from 8551263604 to 21810147! Perf also shows that in the INLINE version the shiftLeft in $wbuildTable is the hottest instruction by quite some margin.

Finally, a diff between the two dump-simpl outputs. Notably, all values that are floated out are unlifted, so this doesn't save any heap allocations. Of those, only lvl4 = +# dt2 1# and the $wbuildTable result are used multiple times.

Sorry that this got a bit long.

Last edited 20 months ago by ersetzen

comment:12 Changed 20 months ago by nh2

Cc: nh2 added

comment:13 Changed 15 months ago by ersetzen

Posting into this ticket feels somewhat like necromancy, but I found a similar issue and the cause is actually quite simple:

main :: IO ()
main = print foo

foo :: Int
foo = bar 3
{-# INLINE [1] foo #-}

bar :: Int -> Int
bar i = i + 1
{-# INLINE [~1] bar #-}

We inline bar into foo but not into the unfolding of foo:

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
foo [InlPrag=INLINE[1] (sat-args=0)] :: Int
[LclId,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False,
         Guidance=ALWAYS_IF(arity=0,unsat_ok=False,boring_ok=False)
         Tmpl= bar (GHC.Types.I# 3#)}]
foo = GHC.Types.I# 4#

Later we inline the unfolding of foo, which still references bar, but by then the phase for inlining bar is over.

 main _s26f=
   ...
    case bar (GHC.Types.I# 3#) of { GHC.Types.I# ww3_a26u ->
   ...

So bar never gets inlined at the use site even though everything along the way had INLINE pragmas. There isn't really a single decision that was wrong, but the result is very unintuitive and can break fusion. A compiler warning for this type of phase collision might be worthwhile, if it is easily doable.
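
One way to avoid this particular collision (a sketch of a possible fix, not something proposed in the ticket, and assuming the intent is that bar is fully inlined in the end) is to make bar's activation window cover foo's:

main :: IO ()
main = print foo

foo :: Int
foo = bar 3
{-# INLINE [1] foo #-}

-- Active in phases 1 and 0 (phases count down), i.e. whenever foo's
-- unfolding can be dropped in at a use site, bar can still inline there.
bar :: Int -> Int
bar i = i + 1
{-# INLINE [1] bar #-}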

Last edited 15 months ago by ersetzen

comment:14 Changed 14 months ago by simonpj

I've got lost in this ticket.

  1. comment:13 shows (correctly) that if you specify phases wrong you can get bad results. Yes, it's possible we could warn about this.
  2. comment:11, I think, shows a case where an INLINE pragma makes things worse. I have not dug into it.
  3. The original thing in the Description got tangled up with the loopification discussion, so I'm not sure of its status.

Are (1) and (2) connected? If not, can you make a separate ticket for (1)?

Is (3) dealt with now? That is, are we left with (2) only?
