Opened 19 months ago

Last modified 10 months ago

#15056 new bug

Wrappers inlined too late

Reported by: simonpj Owned by:
Priority: normal Milestone: 8.10.1
Component: Compiler Version: 8.2.2
Keywords: Simplifier Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Consider these two modules

module Foo where

  test :: Int -> Bool
  test 0 = True
  test n = test (n-1)

  foo :: Foldable t => Int -> t Int -> Int
  foo n xs | test n
           = foldr (+) n xs
           | otherwise
           = n+7

and

module Bar where
  import Foo

  blam x = f 3 [1..x]

If we simply inline foo into blam, we get foldr/build fusion. And that is exactly what happens if you compile Foo with -fno-strictness.

But what actually happens is

  • When compiling Foo, strictness analysis does worker/wrapper for foo.
  • The wrapper currently gets an aciviation of "active only in phase 0"; see Note [Wrapper activation] in WorkWrap.
  • So foo is ultimately inlined (actually its worker is inlined too) but too late for fusion to take place.

This is bad: optimising Foo makes the overall runtime increase.

I have seen other examples of this. The general pattern is:

  • Without worker/wrapper, a function f may inline, which exposes opportunities for functions in its body to participate in rule rewrites
  • With worker/wrapper, no inlining happens until phase 0; and that is too late for the rules to fire.

Obvious thought: allow wrappers from strictness analysis to inline earlier. I'll try that.

Change History (7)

comment:1 Changed 19 months ago by simonpj

Another example comes from nofib/real/veritas. In Edlib.hs we have

(...) :: Xio_fn -> Xio_fn -> Xin -> Xio
(...) f g xin = f xin \\\ g

and in X_interface we get

x_send_argL argL = x_send x_multi_arg_esc_seq      ...
                   x_send [toEnum (length argL)]   ...
                   app ( map x_send_arg argL )     ...
                   x_send x_end_arg_esc_seq

It turns out that ... gets w/w'd (because it is CPR'd in fact), and that means it doesn't inline until phase 0. If we inline in phase 2 we get lots of fold/build fusion, which leads to a whopping 25% improvement in the allocation of the veritas benchmark.

comment:2 Changed 19 months ago by Simon Peyton Jones <simonpj@…>

In 8b10b896/ghc:

Inline wrappers earlier

This patch has a single significant change:

  strictness wrapper functions are inlined earlier,
  in phase 2 rather than phase 0.

As shown by Trac #15056, this gives a better chance for RULEs to fire.
Before this change, a function that would have inlined early without
strictness analyss was instead inlining late. Result: applying
"optimisation" made the program worse.

This does not make too much difference in nofib, but I've stumbled
over the problem more than once, so even a "no-change" result would be
quite acceptable.  Here are the headlines:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          -0.5%     -0.5%     +2.5%     +2.5%      0.0%
         fulsom          -1.0%     +2.6%     -0.1%     -0.1%      0.0%
           mate          -0.6%     +2.4%     -0.9%     -0.9%      0.0%
        veritas          -0.7%    -23.2%     0.002     0.002      0.0%
--------------------------------------------------------------------------------
            Min          -1.4%    -23.2%    -12.5%    -15.3%      0.0%
            Max          +0.6%     +2.6%     +4.4%     +4.3%    +19.0%
 Geometric Mean          -0.7%     -0.2%     -1.4%     -1.7%     +0.2%

* A worthwhile reduction in binary size.

* Runtimes are not to be trusted much but look as if they
  are moving the right way.

* A really big win in veritas, described in comment:1 of
  Trac #15056; more fusion rules fired.

* I investigated the losses in 'mate' and 'fulsom'; see #15056.

comment:3 Changed 19 months ago by simonpj

The mate regression was very delicate.

In Move.moveDetailsFor, we inline

  bd = Board.rmPieceAt c_a25I sq_a2np bd_a25J

into two branches of a case (to hopefully avoid allocating it). I think this is postInlineUnconditionally. But then it is floated out and has to be laboriously CSE'd again!

(Idea: this aggressive inlining could be done rather late, and perhaps be a late float-in pass. The only gain is less alloc.)

However, more delicately we end up with this, where rPA1 is short for rmPieceAt1:

$j1 (rPA1 x) --inline-> let v = rPA1 in ...x...x.. $j2 (rPA1 x) --no inline-> $j2 (rPA1 x)

In the first case (rPA1 x) gets RhsCtxt, and hence is inlined (not boring). But in the second, the context is boring, so not inlined. Then both these expressions are floated out, but they do not get CSEd togehter becuase they look different (one has been inlined and the other hasn't). So then we get

   lvl1 = ...x..x..
   lvl2 = rPA1 x

And now the rPA1 x is in an RhsCtxt and hence is inlined, giving two identical expressions. But it's too late for CSE. (Indeed, CSE is done immediately after float-out, but before any simplification.

All very delicate, and not the fault of this patch. Makes me wonder if we should should really distinguish between BoringCtxt and RhsCtxt. Idea: just try collapsing them.

comment:4 Changed 19 months ago by simonpj

The fulsom regression was in Csg.calc, which has nested join points. Several of them end up like this:

 join j (x :: Double) = Board (case x of { D# y -> ... })
                              ...
 in ....(j (D# v))....

So the call to $j$ ends up with an explicit (D# v) argument, and SpecConstr should catch it. But alas it does not because this happens nestedly, and SpecConstr is worried about exponential blowup.

Why is this a regression? Because something got inlined, that made a join point a bit bigger, which mean that the join point wasn't inlined. A classic inline-cascade problem.

comment:5 Changed 17 months ago by bgamari

Milestone: 8.6.18.8.1

It looks like you intend to continue pondering the regressions, simonpj, so I'll leave this open. However, it's unlikely much else will happen for 8.6 so I'm bumping the milestone.

comment:6 Changed 11 months ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of low-priority tickets.

comment:7 Changed 10 months ago by simonpj

Keywords: Simplifier added
Note: See TracTickets for help on using tickets.