Opened 13 months ago

Last modified 8 months ago

#15578 merge bug

Honour INLINE pragmas on 0-arity bindings

Reported by: simonpj Owned by:
Priority: normal Milestone: 8.6.2
Component: Compiler Version: 8.4.3
Keywords: Cc: sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #15519 Differential Rev(s): Phab:D5137
Wiki Page:

Description (last modified by simonpj)

Currently if we see

x = factorial 200
{-# INLINE x #-}

f y = ...x...

we won't inline x, lest we duplicate the work of factorial 200. But this caution has some bad consequences:

  • Suppose x is used exactly once, not inside a lambda, thus
    x = blah
    {-# INLINE x #-}
    g = ...x...
    

Then, if there is no INLINE pragma, x will get inlined (by preInlineUnconditionally. But if there is an INLINE pragma, currently x is not inlined by preInlineUnconditionally: see Note [Stable unfoldings and preInlineUnconditionally] in SimplUtils. This is insane!

  • INLINE says "inline me at every saturated call", where "saturated" is determined by the number of arguments syntactically to the left of the "=" in the source bindings. In this case, there are no arguments to the left, so every occurrence is saturated. So it's inconsistent not to inline.
  • Occasionally it's very important to inline x: see Trac #15519 for a real-world example. Ignoring the users explicit instruction to do so seems silly.

Bottom line: if a 0-ary binding has an INLINE pragma, I think we should inline it at every use site

  • Regardless of the work duplication
  • Including inside lambdas

Note, however, that there is a real risk that full laziness will float it right back out again. Consider again

x = factorial 200
{-# INLINE x #-}
f y = ...x...

After inlining we get

f y = ...(factorial 200)...

but it's entirely possible that full laziness will do

lvl23 = factorial 200
f y = ...lvl23...

That's a problem for another day. Presumably the reason the user wanted to inline it was to get some rule to fire, and this change gives at least some chance that will happen, and makes INLINE behave consistently.

Attachments (1)

nofib-analyse (211.7 KB) - added by tdammers 13 months ago.
nofib analyse output: "clean" is current HEAD, "patched" is the same with the proposed change applied

Download all attachments as: .zip

Change History (23)

comment:1 Changed 13 months ago by simonpj

Description: modified (diff)

comment:2 Changed 13 months ago by simonpj

Description: modified (diff)

comment:3 Changed 13 months ago by simonpj

Looking at CoreUnfold.mkInlineUnfolding, a function with an INLINE pragma has its unfolding built with mkInlineUnfoldingWithArity (the arity being the syntactic arity in the source code). This function gives the unfolding an UnfoldingGuidance of

    guide = UnfWhen { ug_arity = arity
                    , ug_unsat_ok = needSaturated
                    , ug_boring_ok = boring_ok }
    boring_ok = inlineBoringOk expr'

And in fact, in callSiteInline, you can see that if ug_arity = 0 then x will inline if boring_ok is True.

What is boring_ok? It says whether to inline in an utterly boring context. In

  \ y x -> Just (f x y)

we don't inline f even if f has an INLINE pragma because the context of the call us utterly boring: nothing is known about x or y or the calling context. So there is really no point in inlining.

I don't think this applies to 0-ary INLINE pragmas, as Trac #15519 shows.

So I think it may be enough to say, in mkInlineUnfoldingWithArity,

    boring_ok | arity == 0 = True
              | otherwise  = inlineBoringOk expr'

comment:4 Changed 13 months ago by sgraf

Cc: sgraf added

comment:5 Changed 13 months ago by sgraf

comment:6 Changed 13 months ago by tdammers

Perf test suite passes after applying the proposed change.

Nofib shows no significant overall differences except for 8.6% more allocations and total memory use for cacheprof, and 1.6% more for maillist.

maillist deviates 14.3% on GC(1) count, but this does not affect GC time.

cacheprof deviates 5.8% on GC Work.

Full nofib-analyse output attached.

Changed 13 months ago by tdammers

Attachment: nofib-analyse added

nofib analyse output: "clean" is current HEAD, "patched" is the same with the proposed change applied

comment:7 Changed 13 months ago by simonpj

Great! Can you also confirm that this change makes test3 (described in comment:11 of #15519, under "Workarounds") work as fast as test0; and does so

  • Without removing the INLINE pragma on testGrammar
  • Without removing the other uses of testGrammar1 in (say) test1
Last edited 13 months ago by simonpj (previous) (diff)

comment:8 Changed 13 months ago by simonpj

Great! Here's a Note to add, and refer to from the boring_ok change

Note [Honour INLINE on 0-ary bindings]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consider

   x = <expensive>
   {-# INLINE x #-}

   f y = ...x...

The semantics of an INLINE pragma is

  inline x at every call site, provided it is saturated;
  that is, applied to at least as many arguments as appear
  on the LHS of the Haskell source definition.

(This soure-code-derived arity is stored in the `ug_arity` field of
the `UnfoldingGuidance`.)

In the example, x's ug_arity is 0, so we should inline it at every use
site.  It's rare to have such an INLINE pragma (usually INLINE Is on
functions), but it's occasionally very important (Trac #15578, #15519).
In #15519 we had something like
   x = case (g a b) of I# r -> T r
   {-# INLINE x #-}
   f y = ...(h x)....

where h is strict.  So we got
   f y = ...(case g a b of I# r -> h (T r))...

and that in turn allowed SpecConstr to ramp up performance.

How do we deliver on this?  By adjusting the ug_boring_ok
flag in mkInlineUnfoldingWithArity; see
Note [INLINE pragmas and boring contexts]

NB: there is a real risk that full laziness will float it right back
out again. Consider again
  x = factorial 200
  {-# INLINE x #-}
  f y = ...x...

After inlining we get
  f y = ...(factorial 200)...

but it's entirely possible that full laziness will do
  lvl23 = factorial 200
  f y = ...lvl23...

That's a problem for another day.

Note [INLINE pragmas and boring contexts]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
An INLNE pragma uses mkInlineUnfoldingWithArity to build the
unfolding.  That sets the ug_boring_ok flag to False if the function
is not tiny (inlineBorkingOK), so that even INLINE functions are not
inlined in an utterly boring context.  E.g.
     \x y. Just (f y x)
Nothing is gained by inlining f here, even if it has an INLINE
pragma.

But for 0-ary bindings, we want to inline regardless; see
Note [Honour INLINE on 0-ary bindings].

I'm a bit worried that it's possible for the same kind of problem
to arise for non-0-ary functions too, but let's wait and see.

comment:9 in reply to:  7 Changed 13 months ago by tdammers

Replying to simonpj:

Great! Can you also confirm that this change makes test3 (described in comment:11 of #15519, under "Workarounds") work as fast as test0; and does so

  • Without removing the INLINE pragma on testGrammar
  • Without removing the other uses of testGrammar1 in (say) test1

Unfortunately, test3 is still almost as slow as without the patch:

HEAD / test0
377.779049
HEAD / test1
7812.787996
HEAD / test3
7996.584027
patched / test0
372.0234
patched / test1
7540.029795
patched / test3
7654.089574

comment:10 Changed 13 months ago by simonpj

OK here's a better patch

diff --git a/compiler/coreSyn/CoreUnfold.hs b/compiler/coreSyn/CoreUnfold.hs
index 68e7290..fe2ae62 100644
--- a/compiler/coreSyn/CoreUnfold.hs
+++ b/compiler/coreSyn/CoreUnfold.hs
@@ -159,7 +159,8 @@ mkInlineUnfoldingWithArity arity expr
     guide = UnfWhen { ug_arity = arity
                     , ug_unsat_ok = needSaturated
                     , ug_boring_ok = boring_ok }
-    boring_ok = inlineBoringOk expr'
+    boring_ok | arity == 0 = True
+              | otherwise  = inlineBoringOk expr'
 
 mkInlinableUnfolding :: DynFlags -> CoreExpr -> Unfolding
 mkInlinableUnfolding dflags expr
diff --git a/compiler/simplCore/Simplify.hs b/compiler/simplCore/Simplify.hs
index c8870c9..b0f6455 100644
--- a/compiler/simplCore/Simplify.hs
+++ b/compiler/simplCore/Simplify.hs
@@ -3402,14 +3402,18 @@ simplStableUnfolding env top_lvl mb_cont id unf rhs_ty
                            Just cont -> simplJoinRhs unf_env id expr cont
                            Nothing   -> simplExprC unf_env expr (mkBoringStop rhs_ty)
               ; case guide of
-                  UnfWhen { ug_arity = arity, ug_unsat_ok = sat_ok }  -- Happens for INLINE things
+                  UnfWhen { ug_arity = arity, ug_unsat_ok = sat_ok
+                          , ug_boring_ok = boring_ok }
+                     -- This branch happens for INLINE things
                      -> let guide' = UnfWhen { ug_arity = arity, ug_unsat_ok = sat_ok
-                                             , ug_boring_ok = inlineBoringOk expr' }
+                                             , ug_boring_ok = boring_ok || inlineBoringOk expr' }
                         -- Refresh the boring-ok flag, in case expr'
                         -- has got small. This happens, notably in the inlinings
                         -- for dfuns for single-method classes; see
                         -- Note [Single-method classes] in TcInstDcls.
                         -- A test case is Trac #4138
+                        -- But don't forget a boring_ok of True; e.g. see the
+                        -- way it is set in calcUnfoldingGuidanceWithArity
                         in return (mkCoreUnfolding src is_top_lvl expr' guide')
                             -- See Note [Top-level flag on inline rules] in CoreUnfold

I'd forgotten about the bit in Simplify.

Moreover, you do need

test3 = oneShot (runTokenParser testGrammar1)

Then I think we get good perf.

comment:11 Changed 12 months ago by tdammers

I have test3 src = runTokenParser testGrammar src, is that equivalent?

In any case, I'm getting good performance now with the patch applied as written. Given 3 binaries in the cwd, compiled with, unpatched and patched HEAD respectively (test-HEAD vs. test-patched), I get:

for TAG in HEAD patched; do for TEST in 0 1 3; do echo "$TAG / test$TEST"; ./test-$TAG $TEST; done; done
HEAD / test0
387.417648
HEAD / test1
7769.65168
HEAD / test3
7769.16221
patched / test0
385.068842
patched / test1
370.170719
patched / test3
396.022899

So the patched version outperforms the unpatched one by a factor of about 20 on the test1 and test3 test cases.

comment:12 Changed 12 months ago by simonpj

I have test3 src = runTokenParser testGrammar src, is that equivalent?

The oneShot makes it more robust.

Anyway this seems enough to commit the patch (having checked that nofib doesn't budge). I doubt you'll see any changes there at all.

Thanks!

comment:13 Changed 12 months ago by tdammers

Alright, had to change boring_ok=False to True in one test. Patch incoming, maybe you could take a few seconds to verify that this is OK?

comment:14 Changed 12 months ago by tdammers

Differential Rev(s): D5137
Status: newpatch

comment:15 Changed 12 months ago by simonpj

Differential Rev(s): D5137Phab:D5137

comment:16 Changed 12 months ago by Krzysztof Gogolewski <krz.gogolewski@…>

In b9b1f999/ghc:

Honor INLINE on 0-arity bindings (#15578)

Summary:
Fix test for #15578

By allowing 0-arity values to be inlined, we end up changing boringness
annotations, and this gets reflected in the Core output for this
particular test.

Add Notes for #15578

Test Plan: ./validate

Reviewers: simonpj, bgamari

Reviewed By: simonpj

Subscribers: rwbarton, carter

GHC Trac Issues: #15578

Differential Revision: https://phabricator.haskell.org/D5137

comment:17 Changed 12 months ago by monoidal

Status: patchmerge

comment:18 Changed 12 months ago by RyanGlScott

Milestone: 8.6.18.6.2

Moving to the 8.6.2 milestone, since these tickets were all recently marked as merge.

comment:19 Changed 12 months ago by bgamari

For the record, this patch put up resistance in the form of validation issues when I tried to merge it before 8.6.1. I can try again for 8.6.2.

comment:20 Changed 8 months ago by heisenbug

As suspected, the float-out thing is happening in the wild. See the (entire) mail thread: https://mail.haskell.org/pipermail/ghc-devs/2019-January/016835.html.

Maybe we should sometime look into deduplicating static data.

comment:21 in reply to:  19 Changed 8 months ago by heisenbug

Replying to bgamari:

For the record, this patch put up resistance in the form of validation issues when I tried to merge it before 8.6.1. I can try again for 8.6.2.

We need milestones e.g. for 8.6.4 :-)

Also, which kind of failures did you get? SEGVs? May we have a re-trial with the 8.6 branch?

comment:22 Changed 8 months ago by simonpj

I'm not sure what problem we are trying to solve here. The bug, as reported in the Description, is fixed. Yes, any INLINE pragma results in duplication, of code and/or data. Perhaps some of that duplication could be recovered by CSE; but the first thing is to remove INLINE pragmas that do nothing useful.

Note: See TracTickets for help on using tickets.