#14737 closed bug (fixed)

Improve performance of Simplify.simplCast

Reported by: tdammers Owned by: goldfire
Priority: highest Milestone: 8.6.1
Component: Compiler Version: 8.2.2
Keywords: Cc: tdammers, goldfire, simonpj
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #11735 #14683 Differential Rev(s): Phab:D4385 Phab:D4568
Wiki Page:

Description

Splitting off task 3 from #11735. When compiling https://ghc.haskell.org/trac/ghc/attachment/ticket/14683/Grammar.hs, simplCast eats up more execution time than we think it should.

From https://ghc.haskell.org/trac/ghc/ticket/11735#comment:10:

Something is clearly wrong with Simplify.simplCast. I think I know what it is. Given

(fun |> co) @t1 @t2 ... @tn

we will call pushCoTyArg n times, and hence does n singleton substitutions, via the n calls to piResultTy.

Solution: gather up those type arguments (easy) and define

pushCoTyArgs :: Coercion -> [Type] -> Maybe ([Type], Coercion)

And https://ghc.haskell.org/trac/ghc/ticket/11735#comment:41:

OK. I looked at pushCoTyArg and friends, and I have a very simple solution: just move the isReflexiveCo case in addCoerce (a local function within Simplify.simplCast) to the top. That should do it. Then pushCoTyArg is never called with a reflexive coercion, and so the piResultTy case won't happen.

Now, pushCoArgs might still call pushCoTyArg with a reflexive coercion, but it can be taught not to as well: Have pushCoArgs return a Maybe ([CoreArg], Maybe Coercion) and pushCoArg return a Maybe (CoreArg, Maybe Coercion). If the second return values are Nothing, that means that there is no cast (i.e., that the cast would have been reflexive). The only client of pushCoArg(s) is exprIsConApp_maybe, which simply omits a cast if pushCoArgs returns Nothing. Then, we never have to bother creating the reflexive coercions.

This should be an easy win all around.

Change History (34)

comment:2 Changed 22 months ago by tdammers

Resolution: fixed
Status: newclosed

Fixing via #11735.

comment:3 Changed 22 months ago by goldfire

Resolution: fixed
Status: closednew

comment:4 Changed 22 months ago by goldfire

Differential Rev(s): Phab:D4385
Status: newpatch

comment:5 Changed 22 months ago by simonpj

Implemented in: ​https://phabricator.haskell.org/D4385

I'd much prefer a separate patch. (Not hard.)

comment:6 Changed 20 months ago by tdammers

Now that we have implemented the fix as outlined in D4395, it turns out that while the change is desirable overall, it does not actually improve performance significantly; simplCast still appears at the top of the profile, as evidenced here:

	Mon Mar 26 15:24 2018 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/D4395-modified/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp

	total time  =       19.66 secs   (19655 ticks @ 1000 us, 1 processor)
	total alloc = 24,638,084,488 bytes  (excludes profiling overheads)

COST CENTRE     MODULE     SRC                                                 %time %alloc

simplCast       Simplify   compiler/simplCore/Simplify.hs:(1213,5)-(1215,37)    73.8   76.0
tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)      8.3    8.4
CoreTidy        HscMain    compiler/main/HscMain.hs:1253:27-67                   2.8    2.2
SimplTopBinds   SimplCore  compiler/simplCore/SimplCore.hs:770:39-74             2.7    1.9
coercionKind    Coercion   compiler/types/Coercion.hs:1698:3-7                   1.8    3.4
zonkTopDecls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    1.3    1.3
deSugar         HscMain    compiler/main/HscMain.hs:511:7-44                     1.1    0.8

We clearly shouldn't be spending 15 seconds in simplCast, so more digging is required.

comment:7 Changed 20 months ago by tdammers

Preliminary result: digging into simplCast some deeper, one of the biggest contributors is the call to isReflexiveCo. By rewriting it from this:

isReflexiveCo = isJust . isReflexiveCo_maybe

...to this:

isReflexiveCo (Refl {}) = True
isReflexiveCo co = eqType ty1 ty2
  where Pair ty1 ty2 = coercionKind co

...cuts execution time for compiling Grammar.hs down from 20 seconds to 12.

For reference, isReflexiveCo_maybe is defined as:

isReflexiveCo_maybe (Refl r ty) = Just (ty, r)
isReflexiveCo_maybe co
  | ty1 `eqType` ty2
  = Just (ty1, r)
  | otherwise
  = Nothing
  where (Pair ty1 ty2, r) = coercionKindRole co

So we're really just replicating the logic here, with two differences that seem to improve performance drastically:

  • We skip calculating the role, using coercionKind rather than coercionKindRole.
  • Instead of Maybe, we use boolean logic directly, since we are not interested in the actual roles and types, we just want to know if there are any.

In theory, neither of these should matter, because the expensive calculations should mostly just thunk up and never get evaluated, but we still see a huge improvement. So this could be due to one or more of the following:

  • We might evaluate more deeply into coercionKindRole than expected
  • The boolean logic might unbox and optimize into more efficient code requiring fewer allocations
  • coercionKindRole could have some unexpected inefficiencies compared to coercionKind
  • Writing it like this might lead to more beneficial inlining behavior

comment:8 Changed 20 months ago by simonpj

Yes, use that defn of isReflexiveCo; and then delete isReflexiveCo_maybe because it is not otherwise called.

However, that may still leave simplCast at the top... does it? If it was taking 78% of 20sec before, that's 16 sec. So reducing that to near-zero (which it should be) would take use to 5s, not 12s.

What happens if you leave out the call to isReflexiveCo altogether? It'll get done in the next round anyway, by optCoercion.

comment:9 Changed 20 months ago by tdammers

However, that may still leave simplCast at the top... does it? If it was taking 78% of 20sec before, that's 16 sec. So reducing that to near-zero (which it should be) would take use to 5s, not 12s.

Correct; isReflexiveCo accounted for about half the execution time spent in simplCast. With this change, isReflexiveCo drops to 1.1% of overall execution time, and simplCast to about 61%. So this is a big improvement, but we're not done yet. The remaining main culprit is pushCoTyArg.

What happens if you leave out the call to isReflexiveCo altogether? It'll get done in the next round anyway, by optCoercion.

I'll try that, but considering that isReflexiveCo is no longer critical, I don't expect it to make much of a difference.

For reference, here's the current profiler output:

        Wed Mar 28 20:44 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/T14737/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp

        total time  =       12.35 secs   (12354 ticks @ 1000 us, 1 processor)
        total alloc = 14,410,284,936 bytes  (excludes profiling overheads)

COST CENTRE     MODULE     SRC                                                 %time %alloc

pushCoTyArg     Simplify   compiler/simplCore/Simplify.hs:1229:63-83            54.6   55.5
tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)     13.1   14.3
CoreTidy        HscMain    compiler/main/HscMain.hs:1253:27-67                   4.8    3.9
SimplTopBinds   SimplCore  compiler/simplCore/SimplCore.hs:770:39-74             4.2    3.2
coercionKind    Coercion   compiler/types/Coercion.hs:1701:3-7                   2.8    6.0
simplCast       Simplify   compiler/simplCore/Simplify.hs:(1213,5)-(1215,37)     2.5    2.5
zonkTopDecls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    2.3    2.2
deSugar         HscMain    compiler/main/HscMain.hs:511:7-44                     1.5    1.3
isReflexiveCo   Simplify   compiler/simplCore/Simplify.hs:1260:40-55             1.1    1.0
Parser          HscMain    compiler/main/HscMain.hs:(316,5)-(384,20)             1.1    1.6

comment:10 Changed 20 months ago by simonpj

Try getting rid of the first equation for puchCoTyArg

pushCoTyArg co ty
  | tyL `eqType` tyR
  = Just (ty, mkRepReflCo (piResultTy tyR ty))

This is another big pile of type-equalities, rather like calling isReflexiveCo at the wrong moment.

Claim: if it happens that tyL = tyR, but we go ahead with all that mkCoherenceLeftCo stuff anyway, then the coercion optimiser will get rid of it later. Richard: will it?

But try that change anyway. NO WAY should pushCoTyArg take 54% of compile time!

comment:11 Changed 20 months ago by goldfire

Yes, I think Simon's comment:10 is correct. Try removing that.

Note that Phab:D4395 currently removes the piResultTy from that case, but it's quite possible that the eqType call is what's taking up the time.

You might also try removing the similar clause from pushCoValArg.

comment:12 in reply to:  10 Changed 20 months ago by tdammers

Replying to simonpj:

Try getting rid of the first equation for puchCoTyArg

pushCoTyArg co ty
  | tyL `eqType` tyR
  = Just (ty, mkRepReflCo (piResultTy tyR ty))

This is another big pile of type-equalities, rather like calling isReflexiveCo at the wrong moment.

Claim: if it happens that tyL = tyR, but we go ahead with all that mkCoherenceLeftCo stuff anyway, then the coercion optimiser will get rid of it later. Richard: will it?

But try that change anyway. NO WAY should pushCoTyArg take 54% of compile time!

Plain out removing that case branch gets us down by another 4 seconds:

        Tue Apr  3 11:09 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/T14737/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp

        total time  =        7.86 secs   (7864 ticks @ 1000 us, 1 processor)
        total alloc = 10,150,661,432 bytes  (excludes profiling overheads)

COST CENTRE     MODULE     SRC                                                 %time %alloc

mkInstCo        CoreOpt    compiler/coreSyn/CoreOpt.hs:982:33-84                31.7   40.6
tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)     20.6   20.4
CoreTidy        HscMain    compiler/main/HscMain.hs:1253:27-67                   7.2    5.5
SimplTopBinds   SimplCore  compiler/simplCore/SimplCore.hs:770:39-74             6.6    4.6
simplCast       Simplify   compiler/simplCore/Simplify.hs:(1213,5)-(1215,37)     3.7    3.5
zonkTopDecls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    3.5    3.1
deSugar         HscMain    compiler/main/HscMain.hs:511:7-44                     2.4    1.9
coercionKind    Coercion   compiler/types/Coercion.hs:1716:3-7                   1.9    4.6
isReflexiveCo   Simplify   compiler/simplCore/Simplify.hs:1260:40-55             1.8    1.4
Parser          HscMain    compiler/main/HscMain.hs:(316,5)-(384,20)             1.8    2.3
StgCmm          HscMain    compiler/main/HscMain.hs:(1428,13)-(1429,62)          1.6    0.7

I've added a few more SCC's to trace more deeply into simplCast, which is why simplCast itself has seemingly dropped to 3.7% - this isn't accurate, because mkInstCo makes up most of the rest of the simplCast call.

So I suggest committing the branch deletion (assuming that it won't break anything).

From here, I'm not 100% sure which is more promising: digging into mkInstCo to see if we can make it more efficient, or looking at simplCast to see if we can make it call mkInstCo less often.

Also:

Note that ​Phab:D4395 currently removes the piResultTy from that case, but it's quite possible that the eqType call is what's taking up the time.

The full profile from before the deletion (which, unfortunately, I no longer have around) clearly shows that eqType is what consumes all that time, not piResultTy.

comment:13 Changed 20 months ago by simonpj

Yes, delete the eqType branches in both pushCoTyArg and pushValArg. Leave a Note to say "don't do eqType here! (see Trac #14737)".

On to mkInstCo. I see that it does an inefficient single-variable substitution. So if we have

mkInstCo (mkInstCo (mkInstCo co (Refl t1)) 
                   (Refl t2))
         (Refl t3)

we will traverse co three times (and t1 twice etc). Bad bad.

Idea: leave that to the coercion optimiser. Try simply removing the first equation for mkInstCo leaving

mkInstCo = InstCo

Richard: I think that the coercion optimiser will do a good job here, right?

But I do see this in the InstCo case of opt_co4:

    -- See if it is a forall after optimization
    -- If so, do an inefficient one-variable substitution, then re-optimize

Are you sure that's a good idea?

comment:14 in reply to:  13 Changed 20 months ago by tdammers

Replying to simonpj:

On to mkInstCo. I see that it does an inefficient single-variable substitution. So if we have

mkInstCo (mkInstCo (mkInstCo co (Refl t1)) 
                   (Refl t2))
         (Refl t3)

we will traverse co three times (and t1 twice etc). Bad bad.

Idea: leave that to the coercion optimiser. Try simply removing the first equation for mkInstCo leaving

mkInstCo = InstCo

That doesn't seem to work well at all - I don't know how long exactly it takes, but it's been compiling Grammar.hs for about half an hour now, so I think it's safe to say that it's much slower now than the 8 seconds we got previously. Unfortunately it's still running, so I can't say where the time is going yet.

comment:15 Changed 20 months ago by goldfire

Interestingly, the coercion optimizer will not work on InstCo ... (Refl ...).

Other points of interest here:

  • The first line of the opt_co4 case for InstCo looks plain wrong, extending the substitution with a use of kind_co, where kind_co has not, itself, been substituted.
  • opt_co4 uses splitForAllCo_maybe, which doesn't look for Refls. Perhaps it should.
  • The only way InstCos can come into being is in the coercion optimizer. There is no call to mkInstCo beyond it. So perhaps we can take that into account when designing these functions.

comment:16 Changed 20 months ago by simonpj

Interestingly, the coercion optimizer will not work on InstCo ... (Refl ...).

At first I disagreed. The first equation for opt_co4 is this

opt_co4 env sym rep r (InstCo co1 arg)
    -- forall over type...
  | Just (tv, kind_co, co_body) <- splitForAllCo_maybe co1
  = opt_co4_wrap (extendLiftingContext env tv
                    (arg' `mkCoherenceRightCo` mkSymCo kind_co))
                 sym rep r co_body

If arg is Refl then kind_co is also Refl, so mkCoherenceRightCo is a no-op. So the argument to extendLiftingContext is Refl; and it looks like this

extendLiftingContext (LC subst env) tv (Refl _ ty)
  = LC (extendTvSubst subst tv ty) env

That is, it just extends the TvSubst. And that looks exactly what happens in mkInstCo. So I think that opc_co4 does in fact do exactly the same. But perhaps spotting a Refl argument would be a bit more direct?

Meanwhile

  • I agree that kind_co should be substituted. But how? By calling ope_co4 on it? Or opt_co3? I don't understand the hierarchy of opt_co functions.
  • I find that code for extendLiftingContext hard to grok. In the Refl case we extend the TvSubst; otherwise we extend the LiftCoEnv. Very mysterious. I looked at it a bit, but got lost in the deeply cryptic liftEnvSubst.

Returning to the main event, however, I agree we won't get an efficient substitution, because in the example in comment:13 the splitForAllCo_maybe will fail -- because co1 is another InstCo! What we need, as usual, is to accumulate those arguments in a list; then in the middle we should find a stack of ForAllCos; and we can extend the substitution as we pair them up. Would that be right? Perhaps that's why Tobias saw no improvement -- though I'm a bit surprised that it got a lot worse.

opt_co4 uses splitForAllCo_maybe, which doesn't look for Refls. Perhaps it should.

You mean that Refl (forall a. ty) can be regarded as a form of ForAllCo? Especially since mkForAllCo goes to some trouble to build a Refl if it can. So surely yes, splitForAllCo_maybe should split a Refl (forall a.ty).

The only way InstCos can come into being is in the coercion optimizer. There is no call to mkInstCo beyond it. So perhaps we can take that into account when designing these functions.

Not true: pushCoTyArg calls mkInstCo; that's where this entire conversation started! And I don't know what it means to "take it into account"

Things to do

  • Fix the missing substitution in opt_co4
  • Fix splitForAllCo_maybe
  • Fix opt_co4 to behave well on deeply nsted InstCos

Might you do these -- you are more likely to get them right than me.

comment:17 in reply to:  16 Changed 20 months ago by goldfire

Replying to simonpj:

If arg is Refl then kind_co is also Refl,

Why do you say this? kind_co must be reflexive, but it needn't be Refl. Especially because it hasn't been optimized yet!

But perhaps spotting a Refl argument would be a bit more direct?

I think it would.

Meanwhile

  • I agree that kind_co should be substituted. But how? By calling ope_co4 on it? Or opt_co3? I don't understand the hierarchy of opt_co functions.

You call the highest-numbered one for which you can meet the preconditions. In this case, that's opt_co4 because we know kind_co's role (nominal), and it should keep that role after optimization.

  • I find that code for extendLiftingContext hard to grok. In the Refl case we extend the TvSubst; otherwise we extend the LiftCoEnv. Very mysterious. I looked at it a bit, but got lost in the deeply cryptic liftEnvSubst.

You're right. This needs more documentation -- but a quick glance at this all suggests the code is currently correct.

What we need, as usual, is to accumulate those arguments in a list;

Yes, of course.

You mean that Refl (forall a. ty) can be regarded as a form of ForAllCo? Especially since mkForAllCo goes to some trouble to build a Refl if it can. So surely yes, splitForAllCo_maybe should split a Refl (forall a.ty).

Yes, that's exactly what I meant.

The only way InstCos can come into being is in the coercion optimizer.

I was searching over a very old checkout of the codebase, and I wrongly assumed this hadn't changed. You're right of course.

Things to do

  • Fix the missing substitution in opt_co4
  • Fix splitForAllCo_maybe
  • Fix opt_co4 to behave well on deeply nsted InstCos

Might you do these -- you are more likely to get them right than me.

OK. Hopefully can do in the next two weeks -- but I can't promise sooner. :(

comment:18 Changed 20 months ago by simonpj

Owner: set to goldfire

Fine, thanks. Let's park this for a couple of weeks then. It's not urgent -- but there is money on the table here that is not hard to get at.

comment:19 Changed 20 months ago by simonpj

Priority: normalhighest

comment:20 Changed 20 months ago by simonpj

Milestone: 8.6.1

comment:21 in reply to:  13 Changed 20 months ago by tdammers

Replying to simonpj:

Yes, delete the eqType branches in both pushCoTyArg and pushValArg. Leave a Note to say "don't do eqType here! (see Trac #14737)".

Patch here: https://phabricator.haskell.org/D4568

comment:22 Changed 20 months ago by tdammers

Differential Rev(s): Phab:D4385Phab:D4385 Phab:D4568

comment:23 Changed 20 months ago by simonpj

I have commented on D4568

comment:24 Changed 20 months ago by tdammers

This patch actually causes a few performance regression tests to fail. I bumped the expected values for now, but they need looking into at some point.

comment:25 Changed 20 months ago by Ben Gamari <ben@…>

In 7bb7f99/ghc:

Discard reflexive casts during Simplify

Trac #14735 (derived from Trac #11735) found that 75% of compile
time was being spent in simplCast. This patch is the first in a series
to deal with that problem.

This particular patch actually has very little effect on performance; it
just refactors simplCast so that it builds Refl coercions less often.
Refl coercions require us to compute the type to put inside them, and
even if that's done lazily it is still work and code. Instead we use
Maybe Coercion with Nothing for Refl. This change also percolates to
pushCoTyArg and pushValArg.

Reviewers: goldfire, bgamari, simonpj

Reviewed By: simonpj

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #14737

Differential Revision: https://phabricator.haskell.org/D4395

comment:26 Changed 19 months ago by tdammers

Add a stripped-down version of Grammar.hs as a regression test.

comment:27 Changed 19 months ago by tdammers

Simon PJ suggested: Try isReflCo rather than discarding the tyL=tyR guard altogether.

comment:28 Changed 19 months ago by tdammers

Unfortunately, the isReflCo approach does not make a difference. Which suggests that the cases for which the type equality check avoids bad things are not the trivial ones that isReflCo can find.

comment:29 Changed 19 months ago by simonpj

OK, well commit the patch, and let's investigate the regressions. The places I'd start would be (distinct alternatives)

  • Try a cheapEqType, so we get something in between isReflCo (dirt cheap) and isReflexiveCo (which uses eqType, can be expensive).
  • Take a regression and see what the (presumably) bigger coercion look like. Then we can see if there's an easy way to kill them off.

comment:30 Changed 19 months ago by Ben Gamari <ben@…>

In 2a5bdd9/ghc:

Remove unnecessary check in simplCast

The coercion optimizer will take care of it anyway, and the check is
prohibitively expensive.

See Trac #14737.

Reviewers: bgamari

Subscribers: simonpj, thomie, carter

Differential Revision: https://phabricator.haskell.org/D4568

comment:31 Changed 18 months ago by Ben Gamari <ben@…>

In d92c755/ghc:

Fix performance regressions from #14737

See #15019. When removing an unnecessary type equality check in #14737,
several regression tests failed. The cause was that some coercions that
are actually Refl coercions weren't passed in as such, which made the
equality check needlessly complex (Refl coercions can be discarded in
this particular check immediately, without inspecting the types at all).

We fix that, and get additional performance improvements for free.

Reviewers: goldfire, bgamari, simonpj

Reviewed By: bgamari, simonpj

Subscribers: simonpj, thomie, carter

Differential Revision: https://phabricator.haskell.org/D4635

comment:32 Changed 18 months ago by bgamari

What is the status of this, tdammers?

comment:33 Changed 18 months ago by tdammers

AFAICT, 2a5bdd9 fixes the original problem, and d92c755 fixes the regressions we caused. So I think we're done here.

comment:34 Changed 17 months ago by bgamari

Resolution: fixed
Status: patchclosed
Note: See TracTickets for help on using tickets.