Opened 3 years ago

Last modified 16 months ago

#11735 new task

Optimize coercionKind

Reported by: goldfire Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.10.3
Keywords: Cc: tdammers, goldfire
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #14683 #14975 #14737 Differential Rev(s): D4394 D4395
Wiki Page:

Description

Currently, the ForAllCo case of coercionKind does an inefficient one-variable substitution. This could be improved by looking for nested ForAllCos.

Furthermore, perhaps we don't need an in-scope set and the full substitution machinery here. The subst is simply propagating the update of a tyvar's kind. No structural changes at all. No need for smart coercion constructors or other processing. So if we're going to optimize this, it might be worth making a specialized version of subst_ty and subst_co that operate over a VarEnv Var instead of a full substitution.
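
To illustrate the idea (with toy, self-contained types -- not GHC's actual TyCoRep definitions), the specialized substitution could look roughly like this: the "substitution" is just an environment mapping a tyvar to the same tyvar with an updated kind, so no in-scope set or smart constructors are needed.

import qualified Data.Map as M

-- Toy stand-ins for GHC's Var, Type and VarEnv; names here are illustrative.
data TyVar = TyVar { tvName :: String, tvKind :: Ty } deriving Show
data Ty    = TyVarTy TyVar | AppTy Ty Ty | ForAllTy TyVar Ty deriving Show

type KindUpdates = M.Map String TyVar    -- plays the part of a VarEnv Var

-- Propagate kind updates through a type: purely structural, no renaming,
-- because each var is only ever replaced by itself with a new kind.
updateTyVarKinds :: KindUpdates -> Ty -> Ty
updateTyVarKinds env = go
  where
    go (TyVarTy tv)    = TyVarTy (M.findWithDefault tv (tvName tv) env)
    go (AppTy t1 t2)   = AppTy (go t1) (go t2)
    go (ForAllTy tv t) =
      let tv' = tv { tvKind = go (tvKind tv) }
      in  ForAllTy tv' (updateTyVarKinds (M.delete (tvName tv) env) t)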

Change History (87)

comment:1 Changed 3 years ago by goldfire

This originated from commentary on Phab:D2024, though you don't have to read that to understand this ticket.

comment:2 Changed 3 years ago by simonpj

see #11598, which is closely related.

comment:3 Changed 19 months ago by tdammers

Cc: tdammers added

comment:4 Changed 19 months ago by tdammers

This could be a useful building block, if not the solution, for #14683. Any concrete ideas / plans on how to implement this?

comment:5 Changed 19 months ago by simonpj

Let's just do this. Pretty easy. Roughly like the InstCo case. Here is an untested patch for coercionKind.

diff --git a/compiler/types/Coercion.hs b/compiler/types/Coercion.hs
index 3f83b09..daebf35 100644
--- a/compiler/types/Coercion.hs
+++ b/compiler/types/Coercion.hs
@@ -1707,17 +1707,13 @@ coercionKind co = go co
     go (Refl _ ty)          = Pair ty ty
     go (TyConAppCo _ tc cos)= mkTyConApp tc <$> (sequenceA $ map go cos)
     go (AppCo co1 co2)      = mkAppTy <$> go co1 <*> go co2
-    go (ForAllCo tv1 k_co co)
-      = let Pair _ k2          = go k_co
-            tv2                = setTyVarKind tv1 k2
-            Pair ty1 ty2       = go co
-            subst = zipTvSubst [tv1] [TyVarTy tv2 `mk_cast_ty` mkSymCo k_co]
-            ty2' = substTyAddInScope subst ty2 in
-            -- We need free vars of ty2 in scope to satisfy the invariant
-            -- from Note [The substitution invariant]
-            -- This is doing repeated substitutions and probably doesn't
-            -- need to, see #11735
-        mkInvForAllTy <$> Pair tv1 tv2 <*> Pair ty1 ty2'
+
+    go co@(ForAllCo tv1 k_co co1)
+       | isReflCo k_co            = mkInvForAllTy tv1 <$> go co1
+       | otherwise                = go_forall empty_subst co
+       where
+         empty_subst = mkEmptyTCvSubst (mkInScopeSet $ tyCoVarsOfCo co)
+
     go (FunCo _ co1 co2)    = mkFunTy <$> go co1 <*> go co2
     go (CoVarCo cv)         = coVarTypes cv
     go (HoleCo h)           = coVarTypes (coHoleCoVar h)
@@ -1769,10 +1765,16 @@ coercionKind co = go co
     go_app (InstCo co arg) args = go_app co (arg:args)
     go_app co              args = piResultTys <$> go co <*> (sequenceA $ map go args)
 
-    -- The real mkCastTy is too slow, and we can easily have nested ForAllCos.
-    mk_cast_ty :: Type -> Coercion -> Type
-    mk_cast_ty ty (Refl {}) = ty
-    mk_cast_ty ty co        = CastTy ty co
+    go_forall subst (ForAllCo tv1 k_co co)
+      = mkInvForAllTy <$> Pair tv1 tv2 <*> go_forall subst' co
+      where
+        Pair _ k2 = go k_co
+        tv2       = setTyVarKind tv1 (substTy subst k2)
+        subst' | isReflCo k_co = extendTCvInScope subst tv1
+               | otherwise     = extendTvSubst (extendTCvInScope subst tv2) tv1 $
+                                 TyVarTy tv2 `mkCastTy` mkSymCo k_co
+    go_forall subst other_co
+      = substTy subst `pLiftSnd` go other_co
 
 -- | Apply 'coercionKind' to multiple 'Coercion's
 coercionKinds :: [Coercion] -> Pair [Type]

Richard: can you check?

We should do the same thing to coercionKindRole. (Tiresomely.)

comment:6 Changed 19 months ago by simonpj

We should do the same thing to coercionKindRole. (Tiresomely.)

Actually, looking at it, I think it'd be better to define

coercionRole :: Coercion -> Role

directly (a simple, fast recursive function), and then define

coercionKindRole :: Coercion -> (Pair Type, Role)
coercionKindRole co = (coercionKind co, coercionRole co)

Less duplication, and (I strongly suspect) faster. Could you try that and check perf?

comment:7 Changed 19 months ago by tdammers

Applying the coercionKind patch brings down compilation time for the Grammar.hs example from #14683 to about 30 seconds. Relevant profiling data:

	Wed Jan 24 14:33 2018 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib Grammar.hs -ddump-stg -ddump-simpl -ddump-to-file -fforce-recomp

	total time  =       32.48 secs   (32475 ticks @ 1000 us, 1 processor)
	total alloc = 43,424,804,608 bytes  (excludes profiling overheads)

COST CENTRE           MODULE     SRC                                                 %time %alloc

CoreTidy              HscMain    compiler/main/HscMain.hs:1253:27-67                  20.3   22.6
subst_ty              TyCoRep    compiler/types/TyCoRep.hs:2225:28-32                 17.7   23.9
Stg2Stg               HscMain    compiler/main/HscMain.hs:1489:12-44                  13.9   16.3
simplCast             Simplify   compiler/simplCore/Simplify.hs:871:62-87             12.8   10.7
simplCast-addCoerce   Simplify   compiler/simplCore/Simplify.hs:1225:53-71            11.6    9.2
tc_rn_src_decls       TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)      6.4    4.6
addCoerce-pushCoTyArg Simplify   compiler/simplCore/Simplify.hs:(1236,12)-(1237,72)    5.0    4.3
preprocessFile        GhcMake    compiler/main/GhcMake.hs:(2416,1)-(2443,37)           1.0    0.0
coercionKind          Coercion   compiler/types/Coercion.hs:1707:3-7                   1.0    1.9

comment:8 in reply to:  6 Changed 19 months ago by tdammers

Replying to simonpj:

We should do the same thing to coercionKindRole. (Tiresomely.)

Actually, looking at it, I think it'd be better to define

coercionRole :: Coercion -> Role

directly (a simple, fast recursive function), and then define

coercionKindRole :: Coercion -> (Pair Type, Role)
coercionKindRole co = (coercionKind co, coercionRole co)

Less duplication, and (I strongly suspect) faster. Could you try that and check perf?

I'm not sure I understand what coercionRole would look like. Also, does that mean that the inline notes in the existing coercionKindRole and coercionRole functions don't apply anymore?

comment:9 Changed 19 months ago by tdammers

Oh wait, I think I get it. Never mind.

comment:10 Changed 19 months ago by simonpj

Cc: goldfire added

Great. Next step:

  1. We should not spend 20% of our time in CoreTidy, I think. Drill down.
  2. Something is clearly wrong with Simplify.simplCast. I think I know what it is. Given
    (fun |> co) @t1 @t2 ... @tn
    
    we will call pushCoTyArg n times, and hence do n singleton substitutions, via the n calls to piResultTy.

Solution: gather up those type arguments (easy) and define

pushCoTyArgs :: Coercion -> [Type] -> Maybe ([Type], Coercion)

using an accumulating substitution. I spent a few minutes trying to write down pushCoTyArgs but my brain melted in a mess of kind casts, and I ran out of time.
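
For intuition, here is a toy illustration (simplified types, not GHC code) of why one accumulated, simultaneous substitution beats n single-variable substitutions:

import qualified Data.Map as M

data Ty = Var String | App Ty Ty deriving Show   -- toy type syntax

-- One variable at a time: every call walks the whole body again, so
-- instantiating n binders costs O(n * size of body).
substOne :: (String, Ty) -> Ty -> Ty
substOne (x, t) = go
  where
    go (Var y) | y == x    = t
               | otherwise = Var y
    go (App a b) = App (go a) (go b)

-- Simultaneous: accumulate all the bindings first, walk the body once.
substMany :: M.Map String Ty -> Ty -> Ty
substMany env = go
  where
    go (Var y)   = M.findWithDefault (Var y) y env
    go (App a b) = App (go a) (go b)

-- foldr substOne body binds             -- n traversals, quadratic overall
-- substMany (M.fromList binds) body     -- one traversal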

Richard can you help with (2), please?

comment:11 Changed 19 months ago by simonpj

Type of failure: None/Unknown → Compile-time performance bug

comment:12 Changed 19 months ago by tdammers

Refactoring coercionKindRole / coercionRole gives us another performance boost:

	Wed Jan 24 17:17 2018 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib Grammar.hs -ddump-stg -ddump-simpl -ddump-to-file -fforce-recomp

	total time  =       24.18 secs   (24176 ticks @ 1000 us, 1 processor)
	total alloc = 29,250,375,752 bytes  (excludes profiling overheads)

COST CENTRE           MODULE     SRC                                                 %time %alloc

CoreTidy              HscMain    compiler/main/HscMain.hs:1253:27-67                  24.8   28.5
Stg2Stg               HscMain    compiler/main/HscMain.hs:1489:12-44                  20.1   24.2
simplCast             Simplify   compiler/simplCore/Simplify.hs:871:62-87             18.6   15.9
tc_rn_src_decls       TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)      8.1    6.9
addCoerce-pushCoTyArg Simplify   compiler/simplCore/Simplify.hs:(1236,12)-(1237,72)    7.3    6.4
subst_ty              TyCoRep    compiler/types/TyCoRep.hs:2225:28-32                  4.2    5.1
preprocessFile        GhcMake    compiler/main/GhcMake.hs:(2416,1)-(2443,37)           1.4    0.0
coercionKind          Coercion   compiler/types/Coercion.hs:1707:3-7                   1.4    3.0
zonkTopDecls          TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    1.3    1.1
simplExprF1-Lam       Simplify   compiler/simplCore/Simplify.hs:896:5-39               1.0    1.1

Now on to the 25% spent on CoreTidy.

comment:13 Changed 19 months ago by simonpj

Refactoring coercionKindRole / coercionRole gives us another performance boost

That is totally bonkers (in a good way). From 43G down to 29G from that one change. Wow.

comment:14 Changed 19 months ago by goldfire

I'm dubious of the coercionRole refactor, as it undoes a refactoring I put in a few years ago, for performance reasons. The problem is that coercionRole depends on coercionKind in the NthCo case. So, if these functions are separated, then coercionRole recurs via both coercionKind and coercionRole sometimes, causing a lot of extra work.

Note that a standalone coercionKind does exist, also for performance reasons.

To be clear, I'm not doubting your numbers on your particular test case, but I'm not sure how far this would generalize. That said, the degree to which your case improved suggests there's some other inefficiency here. Maybe it's all the tuples? If we make them unboxed, does that fix the problem?

comment:15 Changed 19 months ago by goldfire

I can take a look at comment:10 on Friday, but not before.

Also, tdammers, thanks thanks thanks for doing this!

comment:16 Changed 19 months ago by simonpj

I'm dubious of the coercionRole refactor, as it undoes a refactoring I put in a few years ago, for performance reasons.

I looked at what code would be needed for coercionRole and it's a remarkably short and simple function. Tobias, would you like to post the code?

Richard, did you have a concrete reason for that refactoring? Or just a general worry?

comment:17 in reply to:  16 Changed 19 months ago by tdammers

Replying to simonpj:

I'm dubious of the coercionRole refactor, as it undoes a refactoring I put in a few years ago, for performance reasons.

I looked at what code would be needed for coercionRole and it's a remarkably short and simple function. Tobias, would you like to post the code?

Of course. Here's the patch:

diff --git a/compiler/types/Coercion.hs b/compiler/types/Coercion.hs
index e1a5b7cde0..36874a4c4d 100644
--- a/compiler/types/Coercion.hs
+++ b/compiler/types/Coercion.hs
@@ -1783,77 +1783,53 @@ coercionKinds tys = sequenceA $ map coercionKind tys
 -- | Get a coercion's kind and role.
 -- Why both at once?  See Note [Computing a coercion kind and role]
 coercionKindRole :: Coercion -> (Pair Type, Role)
-coercionKindRole = go
+coercionKindRole co = (coercionKind co, coercionRole co)
+
+-- | Retrieve the role from a coercion.
+coercionRole :: Coercion -> Role
+coercionRole = go
   where
-    go (Refl r ty) = (Pair ty ty, r)
-    go (TyConAppCo r tc cos)
-      = (mkTyConApp tc <$> (sequenceA $ map coercionKind cos), r)
-    go (AppCo co1 co2)
-      = let (tys1, r1) = go co1 in
-        (mkAppTy <$> tys1 <*> coercionKind co2, r1)
-    go (ForAllCo tv1 k_co co)
-      = let Pair _ k2          = coercionKind k_co
-            tv2                = setTyVarKind tv1 k2
-            (Pair ty1 ty2, r)  = go co
-            subst = zipTvSubst [tv1] [TyVarTy tv2 `mkCastTy` mkSymCo k_co]
-            ty2' = substTyAddInScope subst ty2 in
-            -- We need free vars of ty2 in scope to satisfy the invariant
-            -- from Note [The substitution invariant]
-            -- This is doing repeated substitutions and probably doesn't
-            -- need to, see #11735
-        (mkInvForAllTy <$> Pair tv1 tv2 <*> Pair ty1 ty2', r)
-    go (FunCo r co1 co2)
-      = (mkFunTy <$> coercionKind co1 <*> coercionKind co2, r)
+    go (Refl r _) = r
+    go (TyConAppCo r _ _) = r
+    go (AppCo co1 _) = go co1
+    go (ForAllCo _ _ co) = go co
+    go (FunCo r _ _) = r
     go (CoVarCo cv) = go_var cv
     go (HoleCo h)   = go_var (coHoleCoVar h)
-    go co@(AxiomInstCo ax _ _) = (coercionKind co, coAxiomRole ax)
-    go (UnivCo _ r ty1 ty2)  = (Pair ty1 ty2, r)
-    go (SymCo co) = first swap $ go co
-    go (TransCo co1 co2)
-      = let (tys1, r) = go co1 in
-        (Pair (pFst tys1) (pSnd $ coercionKind co2), r)
+    go (AxiomInstCo ax _ _) = coAxiomRole ax
+    go (UnivCo _ r _ _)  = r
+    go (SymCo co) = go co
+    go (TransCo co1 co2) = go co1
     go (NthCo d co)
       | Just (tv1, _) <- splitForAllTy_maybe ty1
       = ASSERT( d == 0 )
-        let (tv2, _) = splitForAllTy ty2 in
-        (tyVarKind <$> Pair tv1 tv2, Nominal)
+        Nominal
 
       | otherwise
       = let (tc1,  args1) = splitTyConApp ty1
             (_tc2, args2) = splitTyConApp ty2
         in
         ASSERT2( tc1 == _tc2, ppr d $$ ppr tc1 $$ ppr _tc2 )
-        ((`getNth` d) <$> Pair args1 args2, nthRole r tc1 d)
+        (nthRole r tc1 d)
 
       where
-        (Pair ty1 ty2, r) = go co
-    go co@(LRCo {}) = (coercionKind co, Nominal)
+        (Pair ty1 ty2, r) = coercionKindRole co
+    go (LRCo {}) = Nominal
     go (InstCo co arg) = go_app co [arg]
-    go (CoherenceCo co1 co2)
-      = let (Pair t1 t2, r) = go co1 in
-        (Pair (t1 `mkCastTy` co2) t2, r)
-    go co@(KindCo {}) = (coercionKind co, Nominal)
-    go (SubCo co) = (coercionKind co, Representational)
-    go co@(AxiomRuleCo ax _) = (coercionKind co, coaxrRole ax)
+    go (CoherenceCo co1 _) = go co1
+    go (KindCo {}) = Nominal
+    go (SubCo _) = Representational
+    go (AxiomRuleCo ax _) = coaxrRole ax
 
     -------------
-    go_var cv = (coVarTypes cv, coVarRole cv)
+    go_var = coVarRole
 
     -------------
-    go_app :: Coercion -> [Coercion] -> (Pair Type, Role)
+    go_app :: Coercion -> [Coercion] -> Role
     -- Collect up all the arguments and apply all at once
     -- See Note [Nested InstCos]
     go_app (InstCo co arg) args = go_app co (arg:args)
-    go_app co              args
-      = let (pair, r) = go co in
-        (piResultTys <$> pair <*> (sequenceA $ map coercionKind args), r)
-
--- | Retrieve the role from a coercion.
-coercionRole :: Coercion -> Role
-coercionRole = snd . coercionKindRole
-  -- There's not a better way to do this, because NthCo needs the *kind*
-  -- and role of its argument. Luckily, laziness should generally avoid
-  -- the need for computing kinds in other cases.
+    go_app co              args = go co
 
 {-
 Note [Nested InstCos]

Or, for increased clarity, just the new version of coercionRole:

-- | Retrieve the role from a coercion.
coercionRole :: Coercion -> Role
coercionRole = go
  where
    go (Refl r _) = r
    go (TyConAppCo r _ _) = r
    go (AppCo co1 _) = go co1
    go (ForAllCo _ _ co) = go co
    go (FunCo r _ _) = r
    go (CoVarCo cv) = go_var cv
    go (HoleCo h)   = go_var (coHoleCoVar h)
    go (AxiomInstCo ax _ _) = coAxiomRole ax
    go (UnivCo _ r _ _)  = r
    go (SymCo co) = go co
    go (TransCo co1 co2) = go co1
    go (NthCo d co)
      | Just (tv1, _) <- splitForAllTy_maybe ty1
      = ASSERT( d == 0 )
        Nominal

      | otherwise
      = let (tc1,  args1) = splitTyConApp ty1
            (_tc2, args2) = splitTyConApp ty2
        in
        ASSERT2( tc1 == _tc2, ppr d $$ ppr tc1 $$ ppr _tc2 )
        (nthRole r tc1 d)

      where
        (Pair ty1 ty2, r) = coercionKindRole co
    go (LRCo {}) = Nominal
    go (InstCo co arg) = go_app co [arg]
    go (CoherenceCo co1 _) = go co1
    go (KindCo {}) = Nominal
    go (SubCo _) = Representational
    go (AxiomRuleCo ax _) = coaxrRole ax

    -------------
    go_var = coVarRole

    -------------
    go_app :: Coercion -> [Coercion] -> Role
    -- Collect up all the arguments and apply all at once
    -- See Note [Nested InstCos]
    go_app (InstCo co arg) args = go_app co (arg:args)
    go_app co              args = go co

I didn't make an effort at completely understanding what this is supposed to do, really; what I did, in a nutshell, was:

  • Rename coercionKindRole to coercionRole and change its type
  • Remove the first tuple element from each returned value
  • Fix recursive calls and local functions to not take the now-redundant first parameter and not return the first tuple element
  • Write a new coercionKindRole function that simply calls coercionKind and coercionRole and tuples up the results

So it's entirely possible that I did something wrong somewhere. Anyway, I don't know if this would qualify as "remarkably short", but it is fairly simple.

Richard, did you have a concrete reason for that refactoring? Or just a general worry?

I believe the explanation is in the note at the bottom:

Note [Nested InstCos]
~~~~~~~~~~~~~~~~~~~~~
In Trac #5631 we found that 70% of the entire compilation time was
being spent in coercionKind!  The reason was that we had
   (g @ ty1 @ ty2 .. @ ty100)    -- The "@s" are InstCos
where
   g :: forall a1 a2 .. a100. phi
If we deal with the InstCos one at a time, we'll do this:
   1.  Find the kind of (g @ ty1 .. @ ty99) : forall a100. phi'
   2.  Substitute phi'[ ty100/a100 ], a single tyvar->type subst
But this is a *quadratic* algorithm, and that blew up Trac #5631.
So it's very important to do the substitution simultaneously;
cf Type.piResultTys (which in fact we call here).

I'm not entirely sure whether this still applies though; I would expect the separate coercionRole and coercionKind functions to perform better individually than the combined one, except when both are actually needed in concert. And even then, I'm skeptical; the recursive calls would drag along both data structures (kinds and role). Is it possible that this particular bit of code looked crucially different back when #5631 was ongoing? Or did I un-refactor in a way that doesn't replicate a crucial problem from the original code?

comment:18 Changed 19 months ago by tdammers

Oh, and also this note further up:

Note [Computing a coercion kind and role]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To compute a coercion's kind is straightforward: see coercionKind.
But to compute a coercion's role, in the case for NthCo we need
its kind as well.  So if we have two separate functions (one for kinds
and one for roles) we can get exponentially bad behaviour, since each
NthCo node makes a separate call to coercionKind, which traverses the
sub-tree again.  This was part of the problem in Trac #9233.

Solution: compute both together; hence coercionKindRole.  We keep a
separate coercionKind function because it's a bit more efficient if
the kind is all you want.

Which leads me to believe that maybe this particular input I'm testing doesn't hit the NthCo branch enough to make a dent there, and thus fares better with the separated-out implementation, but other code might.

comment:19 Changed 19 months ago by simonpj

  • In your coercionRole, go_app never uses its second argument! So you can get rid of it entirely, I think
        go (InstCo co _) = go co
    
  • Note [Nested InstCos] applies to coercionKind but not, I think, to coercionRole
  • Note [Computing a coercion kind and role] claims that computing the result of coercionRole (NthCo d co) requires coercionKind. But it manifestly does not. Richard, is this note simply wrong?

comment:20 Changed 19 months ago by goldfire

Thanks for finding Note [Computing a coercion kind and role]. I believe this note is correct and is as relevant today as it was when I wrote it. And, I'm afraid to say that coercionRole (NthCo ...) manifestly does require coercionKind. See the call in the where clause in the suggested patch.

However, perhaps the solution lies in including the role of an NthCo in the NthCo, essentially caching this result. I think this would be easy to do, especially as the caller of mkNthCo generally has to know what role it's getting -- mkNthCo could take this role as a parameter. Lint could check that the supplied role is the same as what would have been calculated. With this change to NthCo, coercionRole really would be independent of coercionKind and we could separate the functions.
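
Roughly, the shape I have in mind (a simplified, self-contained sketch with toy constructors, not the real Coercion type) is:

data Role = Nominal | Representational | Phantom deriving (Eq, Show)

data Coercion
  = Refl Role ()                -- payloads elided in this sketch
  | SubCo Coercion
  | NthCo Role Int Coercion     -- the cached role is the new, extra field
  -- ... other constructors elided
  deriving Show

-- The caller supplies the role it already knows; Lint would check elsewhere
-- that it agrees with what would have been computed.
mkNthCo :: Role -> Int -> Coercion -> Coercion
mkNthCo r n co = NthCo r n co

-- With the cache, coercionRole no longer needs coercionKind in the NthCo case:
coercionRole :: Coercion -> Role
coercionRole (Refl r _)    = r
coercionRole (SubCo _)     = Representational
coercionRole (NthCo r _ _) = r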

comment:21 in reply to:  20 Changed 19 months ago by tdammers

Replying to goldfire:

Thanks for finding Note [Computing a coercion kind and role]. I believe this note is correct and is as relevant today as it was when I wrote it.

And, I'm afraid to say that coercionRole (NthCo ...) manifestly does require coercionKind. See the call in the where clause in the suggested patch.

Correct. The original coercionKindRole function recursed via the go worker directly, but now that I have rewritten it as coercionRole, the NthCo case still required the coercion kind to do its thing, so it calls coercionKindRole, which, post-unrefactoring, calls both coercionKind and coercionRole.

However, perhaps the solution lies in including the role of an NthCo in the NthCo, essentially caching this result. I think this would be easy to do, especially as the caller of mkNthCo generally has to know what role its getting -- mkNthCo could take this role as a parameter. Lint could check that the supplied role is the same as what would have been calculated. With this change to NthCo, then coercionRole really would be independent of coercionKind and we could separate the functions.

Sounds like a reasonable plan; however, before I venture into implementing this, I would love to have some hard evidence that we actually do have a problem. Do you have any single-module example code ready that I could use to prove that the "un-refactored" code performs significantly worse than current HEAD on some inputs?

comment:22 in reply to:  19 Changed 19 months ago by tdammers

Replying to simonpj:

  • In your coercionRole, go_app never uses its second argument! So you can get rid of it entirely, I think
        go (InstCo co _) = go co
    

Good catch, hadn't noticed. Thanks!

  • Note [Computing a coercion kind and role] claims that computing the result of coercionRole (NthCo d co) requires coercionKind. But it manifestly does not. Richard, is this note simply wrong?

No, it's not wrong, see my earlier comment - coercionRole now recurses via coercionKindRole in the where clause.

comment:23 Changed 19 months ago by simonpj

I'm afraid to say that coercionRole (NthCo ...) manifestly does require coercionKind

Aha. So it does.

Very well, yes let's do this

  • Cache the role (of the result of NthCo i co) in the NthCo data constructor
  • Make mkNth compute that cached role by calling coercionKind and coercionRole (in the final equation where we build NthCo).
  • Possible gloss: in some calls to mkNthCo we know the role, and we even know that the various short-cuts in mkNthCo won't succeed. So we can just use NthCo directly -- or define mkNthCoDirect = NthCo and call mkNthCoDirect. Perhaps not worth it; but when we do know the role it seems a bit silly not to use it.

comment:24 Changed 19 months ago by simonpj

I would love to have some hard evidence that we actually do have a problem.

Well, without caching we know that getting the role of NthCo i1 (NthCo i2 (NthCo i3 ...)) will take time at least quadratic in the nesting depth. Caching cheaply avoids this nasty corner case. And not much code is involved.

comment:25 in reply to:  24 Changed 19 months ago by tdammers

Replying to simonpj:

I would love to have some hard evidence that we actually do have a problem.

Well, without caching we know that getting the role of NthCo i1 (NthCo i2 (NthCo i3 ...)) will take time at least quadratic in the nesting depth. Caching cheaply avoids this nasty corner case. And not much code is involved.

Fair enough, I'll get to it. Just wanted a test case for practical verification.

comment:26 Changed 19 months ago by tdammers

The caching approach implemented directly (i.e., adding a Role field to the NthCo constructor) is a fairly pervasive one; it turns out that many modules depend on that constructor, either directly or via the isomorphic IfaceCoercion type (which, I believe, should inherit the additional field, correct me if I'm wrong), so adding the extra field is going to touch a lot of source files.

Not a fundamental problem per se, but it means that this will take a little while and will deserve some extra scrutiny before unleashing it.

comment:27 Changed 19 months ago by tdammers

OK, new profiling result:

	Thu Jan 25 13:11 2018 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib Grammar.hs -ddump-stg -ddump-simpl -ddump-to-file -fforce-recomp

	total time  =       20.99 secs   (20989 ticks @ 1000 us, 1 processor)
	total alloc = 29,250,375,256 bytes  (excludes profiling overheads)

COST CENTRE           MODULE     SRC                                                 %time %alloc

CoreTidy              HscMain    compiler/main/HscMain.hs:1253:27-67                  24.2   28.5
Stg2Stg               HscMain    compiler/main/HscMain.hs:1489:12-44                  20.3   24.2
simplCast             Simplify   compiler/simplCore/Simplify.hs:871:62-87             18.7   15.9
tc_rn_src_decls       TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)      9.0    6.9
addCoerce-pushCoTyArg Simplify   compiler/simplCore/Simplify.hs:(1236,12)-(1237,72)    7.4    6.4
subst_ty              TyCoRep    compiler/types/TyCoRep.hs:2237:28-32                  4.3    5.1
zonkTopDecls          TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    1.4    1.1
coercionKind          Coercion   compiler/types/Coercion.hs:1725:3-7                   1.3    3.0
simplExprF1-Lam       Simplify   compiler/simplCore/Simplify.hs:896:5-39               1.0    1.1

This is the same Grammar.hs, compiled with the same GHC code as before, but with NthCo extended with an extra Role field, and the kind calculation from coercionRole moved out into mkNthCo. I ended up having to make changes in 16 modules, but most of them were straightforward, discarding or forwarding the extra field in a pattern match. I think this shouldn't have any negative impact, because forwarding the role can only make things better (avoiding future calls to coercionRole), and discarding it retains the old status.

Conclusions:

  • We're shaving off another 4 seconds of execution time, and allocations remain the same. So this doesn't seem to make things worse for the Grammar.hs case.
  • We are not actually reducing allocations any further.
  • CoreTidy is worth looking into.
  • In order to verify that this change really makes an impact for the better, I would still love to test this against source code that would perform badly without it. Test cases very welcome.
  • 20 seconds is still awfully long.

comment:28 Changed 19 months ago by simonpj

You don't need to propagate the change into IfaceCoercion: it's just a cache, so you can recompute it when turning IfaceNthCo into NthCo. On the whole I think it'd be better not to change IfaceCoercion: less redundancy in the Iface data types.

Shaving off 4 seconds is fantastic. Take it!

Also:

  • Yes CoreTidy is worth looking into
  • But also so is simplCast. See comment:10. Richard I need your help!

comment:29 Changed 19 months ago by goldfire

I still say that most uses of mkNthCo already know the right role -- though perhaps this role may not be as obvious to someone who isn't me. Just supplying the role would be much better than computing it, of course.

I don't agree with Simon's suggestion about mkNthCoDirect -- this is a first step toward invariants that are not upheld. Instead, I would have mkNthCo require a role and make a new mkNthCoNoRole that computes it. The naming of the functions should discourage the use of the second.

The Note explaining why I originally bundled coercionKind and coercionRole points to a test case. Does that give a concrete testing ground? You might also look in the git history around that note to see if that gives you any pointers.

Agreed about not cluttering IfaceCoercion.

And, yes, I will take a look at simplCast tomorrow. (Though it's deeply scary when Simon says "My brain melted" and then asks me to take a look!)

comment:30 Changed 19 months ago by goldfire

By the way, if @tdammers can make the patch available (on GitHub, say), I could look into patching the patch. That might be easier.

comment:31 Changed 19 months ago by simonpj

Instead, I would have mkNthCo require a role and make a new mkNthCoNoRole that computes it.

Yes, but then mkNthCo relies on the caller to guarantee the invariant. That's fine; Lint will check. That's all I meant mkNthCoDirect to do. But I'm not fussy about naming, so long as the comments are clear.

comment:32 in reply to:  29 Changed 19 months ago by tdammers

Replying to goldfire:

I still say that most uses of mkNthCo already know the right role -- though perhaps this role may not be as obvious to someone who isn't me. Just supplying the role would be much better than computing it, of course.

It is indeed a bit less obvious to me; I have found a few places (2 or 3 I think) where we have a role ready that could be a candidate, but not being sure whether it would be the right thing, I opted for the conservative option.

I don't agree with Simon's suggestion about mkNthCoDirect -- this is a first step toward invariants that are not upheld. Instead, I would have mkNthCo require a role and make a new mkNthCoNoRole that computes it. The naming of the functions should discourage the use of the second.

I have absolutely no opinion on this one; happy to implement it either way.

The Note explaining why I originally bundled coercionKind and coercionRole points to a test case. Does that give a concrete testing ground? You might also look in the git history around that note to see if that gives you any pointers.

OK, will take a look.

Agreed about not cluttering IfaceCoercion.

Yes, makes total sense. Undoing as we speak.

comment:33 in reply to:  30 Changed 19 months ago by tdammers

Replying to goldfire:

By the way, if @tdammers can make the patch available (on GitHub, say), I could look into patching the patch. That might be easier.

Will do ASAP. I made a bit of a mess with git, but once that's cleaned up I'll upload somewhere (phab?)

comment:35 Changed 19 months ago by goldfire

I have a lingering concern: why did the old coercionKindRole perform so miserably? In a call to coercionRole, the kind calculations should never be forced. So what takes up all the memory? Is it really just the tuples? If so, then we've discovered a major way to speed up other areas of GHC: convert tuples to be unboxed. Even better, we've discovered a major missing optimization, which could probably automate the unboxing somehow.

So I wonder if there are more opportunities here. None of this changes the current direction of travel (caching is a good idea, regardless of my question here), but perhaps suggests another future line of inquiry.

comment:36 Changed 19 months ago by simonpj

The GHC branch is wip/tdammers/T11735

comment:37 in reply to:  35 Changed 19 months ago by tdammers

Replying to goldfire:

I have a lingering concern: why did the old coercionKindRole perform so miserably? In a call to coercionRole, the kind calculations should never be forced. So what takes up all the memory? Is it really just the tuples? If so, then we've discovered a major way to speed up other areas of GHC: convert tuples to be unboxed. Even better, we've discovered a major missing optimization, which could probably automate the unboxing somehow.

I think the hypothesis was something like this: Suppose we have a deeply nested NthCo, and we want to calculate its coercionKind. In order to do that, we need its coercionRole, which involves recursing through the whole thing (O(n)), but once we have that result, we only use it to decide whether we need to keep recursing, and then throw it away. And then in the next step, we calculate the coercionRole again. So the whole thing ends up as O(n²). Whereas if we cache the roles throughout, we only ever calculate each of them once, at construction time, so we never get the quadratic badness. So it's not that the role calculation forces the kind calculation, but the other way around - in order to calculate the correct kind for an NthCo, we need to know its role, but potentially also the roles of all of its children. So in a way, the caching serves as memoization.

So I wonder if there are more opportunities here. None of this changes the current direction of travel (caching is a good idea, regardless of my question here), but perhaps suggests another future line of inquiry.

Considering how the Grammar.hs example still takes about 20 seconds to compile, and there are a few rather whopping candidates popping up in the profile, yes, I think it is very likely that we can find other opportunities. I will definitely look into the tuple unboxing thing, and also try to get to the bottom of the CoreTidy and simplCast cost centres. Who knows, maybe they're somehow related, even.

comment:38 Changed 19 months ago by goldfire

I'm sorry -- I don't understand the first part of comment:37. Getting a kind should never require getting a role. That's why there is a version of coercionKind that's a standalone function. Let's assume you got these two swapped. Even then, I'm not sure what you're describing; it seems you're describing your "un-refactored" version keeping roles and kinds separate. If they are together (as in HEAD), I don't see the quadratic behavior.

And yet, something goes terribly wrong in HEAD, even without this quadratic behavior. But what?? Or maybe I'm completely missing something here.

comment:39 in reply to:  38 Changed 19 months ago by tdammers

Replying to goldfire:

I'm sorry -- I don't understand the first part of comment:37. Getting a kind should never require getting a role. That's why there is a version of coercionKind that's a standalone function. Let's assume you got these two swapped. Even then, I'm not sure what you're describing; it seems you're describing your "un-refactored" version keeping roles and kinds separate.

I'm sorry, yes, I was being confused there; coercionKind and coercionRole are mutually recursive in the "un-refactored" version only. However, the un-refactoring *does* produce a performance improvement, so there must be *something* going on here - I assumed that the original coercionKindRole would ultimately amount to a similar recursion pattern, but it probably doesn't.

If they are together (as in HEAD), I don't see the quadratic behavior. And yet, something goes terribly wrong in HEAD, even without this quadratic behavior. But what?? Or maybe I'm completely missing something here.

I think I have found it. For clarity, this is the relevant code on HEAD:

    go (NthCo d co)
      | Just (tv1, _) <- splitForAllTy_maybe ty1
      = ASSERT( d == 0 )
        let (tv2, _) = splitForAllTy ty2 in
        (tyVarKind <$> Pair tv1 tv2, Nominal)

      | otherwise
      = let (tc1,  args1) = splitTyConApp ty1
            (_tc2, args2) = splitTyConApp ty2
        in
        ASSERT2( tc1 == _tc2, ppr d $$ ppr tc1 $$ ppr _tc2 )
        ((`getNth` d) <$> Pair args1 args2, nthRole r tc1 d)

      where
        (Pair ty1 ty2, r) = go co

So the rough shape of the recursion is simple - we hit the otherwise case repeatedly until we get to the d == 0 case; O(n). But inside the otherwise branch, there's this pesky getNth call, which is linear in d (being essentially a linked-list lookup), and another one via nthRole.

The problem goes away when we calculate the role at construction time, because we are either constructing an NthCo that doesn't wrap another NthCo, which makes the role calculation constant-time, or we are constructing one that *does* wrap another NthCo, but that one already has its role calculated, so it is also constant.

Hope I'm making sense now.

comment:40 Changed 19 months ago by goldfire

Sorry, still confused. :(

How are coercionKind and coercionRole mutually recursive? I see that coercionRole calls coercionKind but not the other way around.

But you're right that I'm trying to understand better why there's a performance improvement in this patch (even before any caching). In the nested NthCo case, I'm pretty sure your refactor would be worse. But in the test case at hand (which I assume doesn't have nested NthCos -- haven't looked), your change is clearly an improvement.

However, I don't think your analysis above is really the problem. I would expect the running time of coercionKind or coercionRole on nested NthCos to be linear in the sum of the ds -- that is, we'll have to add together all the indices. You've shown above that the old recursion pattern (from coercionKindRole) traverses down the linked list twice (once in getNth and once in nthRole), but this shouldn't change asymptotic complexity. And, usually, d is quite small, and so I wouldn't expect this to show up at all, really. I still don't think we've quite gotten to the bottom of why separating out coercionKind and coercionRole should effect a performance improvement.

On the other hand, the separated version really is quadratic... and yet it's faster (on this test case)! That's the conundrum.

Please don't let my nit-picking slow you down or discourage you. It's just that I think you've hit something quite interesting, and, well, I'm interested. :)

comment:41 Changed 19 months ago by goldfire

OK. I looked at pushCoTyArg and friends, and I have a very simple solution: just move the isReflexiveCo case in addCoerce (a local function within Simplify.simplCast) to the top. That should do it. Then pushCoTyArg is never called with a reflexive coercion, and so the piResultTy case won't happen.

Now, pushCoArgs might still call pushCoTyArg with a reflexive coercion, but it can be taught not to as well: Have pushCoArgs return a Maybe ([CoreArg], Maybe Coercion) and pushCoArg return a Maybe (CoreArg, Maybe Coercion). If the second return value is Nothing, that means there is no cast (i.e., that the cast would have been reflexive). The only client of pushCoArg(s) is exprIsConApp_maybe, which simply omits a cast if pushCoArgs returns Nothing. Then, we never have to bother creating the reflexive coercions.

This should be an easy win all around.
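
Roughly (a toy, self-contained sketch with illustrative names and types, not GHC's actual API), the return convention would be:

data Co  = ReflCo | Co String deriving Show    -- toy coercions
data Arg = Arg String deriving Show            -- toy Core arguments

isReflCo :: Co -> Bool
isReflCo ReflCo = True
isReflCo _      = False

-- Push a cast past one argument.  The second component is Nothing when the
-- residual cast would be reflexive, so we never build a Refl coercion at all.
pushCoArg :: Co -> Arg -> Maybe (Arg, Maybe Co)
pushCoArg co arg
  | isReflCo co = Just (arg, Nothing)
  | otherwise   = Just (arg, Just co)

-- The single client then simply omits the cast whenever it gets Nothing back,
-- instead of wrapping the result in a reflexive cast.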

comment:42 Changed 19 months ago by goldfire

I've taken the liberty of pushing directly to the wip/tdammers/T11735 branch. I've done a bit of testing, and things seem OK, but a full testsuite run is surely warranted. Also, I've done no performance study of any kind, so that will surely need to be done, too.

I'm curious to see what you find!

comment:43 Changed 19 months ago by goldfire

OK. I've fixed a few testsuite failures. I think this is probably correct now. (CircleCI reports a bunch of errors in profiling, but I can't repro locally.)

What happens when you benchmark?

comment:44 in reply to:  43 Changed 19 months ago by tdammers

Replying to goldfire:

OK. I've fixed a few testsuite failures. I think this is probably correct now. (CircleCI reports a bunch of errors in profiling, but I can't repro locally.)

What happens when you benchmark?

Curiously, the latest version clocks in at about 25 seconds for Grammar.hs, so almost as fast, but not quite, as the fastest we've seen so far (~21 seconds):

	Mon Jan 29 09:43 2018 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib Grammar.hs -ddump-stg -ddump-simpl -ddump-to-file -fforce-recomp

	total time  =       25.95 secs   (25950 ticks @ 1000 us, 1 processor)
	total alloc = 33,815,717,032 bytes  (excludes profiling overheads)

COST CENTRE           MODULE     SRC                                                 %time %alloc

simplCast             Simplify   compiler/simplCore/Simplify.hs:871:62-87             26.0   21.8
CoreTidy              HscMain    compiler/main/HscMain.hs:1253:27-67                  19.9   24.7
Stg2Stg               HscMain    compiler/main/HscMain.hs:1489:12-44                  17.6   20.9
addCoerce-pushCoTyArg Simplify   compiler/simplCore/Simplify.hs:(1251,12)-(1254,36)   10.5    9.0
tc_rn_src_decls       TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)      7.3    6.0
subst_ty              TyCoRep    compiler/types/TyCoRep.hs:2240:28-32                  3.5    4.4
coercionKind          Coercion   compiler/types/Coercion.hs:1711:3-7                   2.1    4.6
zonkTopDecls          TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    1.1    0.9

comment:45 in reply to:  40 ; Changed 19 months ago by tdammers

Replying to goldfire:

Sorry, still confused. :(

How are coercionKind and coercionRole mutually recursive? I see that coercionRole calls coercionKind but not the other way around.

Sorry for the confusion, I shouldn't be trying to explain things I only half understand myself when tired.

Looking at HEAD on master, we have the coercionKindRole function that is simply recursive for NthCo, and has the nth call embedded. So for nested NthCo's, its behavior should be quadratic.

The un-refactored version has coercionKind, which is simply recursive (O(n)), and we have coercionRole, which recurses via coercionKindRole. The latter is really just a matter of calling coercionKind and coercionRole individually though; the coercionRole call makes coercionRole simply recursive (O(n)), but the coercionKind call introduces another O(n), making the entire thing also quadratic.

Calling this "mutually recursive" is of course a brainfart on my side, since coercionKind never calls back into coercionRole.

So in terms of big-O, HEAD and "un-refactored" should be the same. Why one performs better than the other is somewhat unclear to me though.

But you're right that I'm trying to understand better why there's a performance improvement in this patch (even before any caching). In the nested NthCo case, I'm pretty sure your refactor would be worse. But in the test case at hand (which I assume doesn't have nested NthCos -- haven't looked), your change is clearly an improvement.

However, I don't think your analysis above is really the problem. I would expect the running time of coercionKind or coercionRole on nested NthCos to be linear in the sum of the ds -- that is, we'll have to add together all the indices. You've shown above that the old recursion pattern (from coercionKindRole) traverses down the linked list twice (once in getNth and once in nthRole), but this shouldn't change asymptotic complexity. And, usually, d is quite small, and so I wouldn't expect this to show up at all, really. I still don't think we've quite gotten to the bottom of why separating out coercionKind and coercionRole should effect a performance improvement.

Indeed, it's not the problem, or rather, the un-refactoring alone doesn't fix anything - if it makes a difference at all, then it's most likely just a constant-factor improvement, nothing fundamental. But that's not why we did it anyway; the reason we did it was so that we could more easily implement the caching (storing precalculated roles in the NthCo itself), which breaks down one of the linear terms to constant.

On the other hand, the separated version really is quadratic... and yet it's faster (on this test case)! That's the conundrum.

Yes, but we don't know if it's actually big-O-faster, or just happens to have a more favorable constant factor.

Please don't let my nit-picking slow you down or discourage you. It's just that I think you've hit something quite interesting, and, well, I'm interested. :)

Absolutely not, your input has been super helpful, and I much prefer hard criticism over empty praise. Bring it on :)

comment:46 Changed 19 months ago by tdammers

I ran a selection of commits against two test inputs: Grammar.hs, as above, and the test_rules2.hs file from #5631. Here's a list of execution times for both:

--- ./cases/Grammar.hs ---
452dee3ff4: 245.51
4572849929: 19.62
4eb140f564: 12.25
8a6aa5030d: 12.26
d74b37d565: 16.14
--- ./cases/test_rules2.hs ---
452dee3ff4: 1.20
4572849929: 1.19
4eb140f564: 1.19
8a6aa5030d: 1.23
d74b37d565: 1.18

The commits are:

452dee3ff4: GHC `master` before branching off on this issue (should be the same as GHC HEAD)
4572849929: Simon's patch from [comment:5] applied.
4eb140f564: After "un-refactoring"
8a6aa5030d: With coercion roles cached, and mkNthCo taking an extra Role argument
d74b37d565: current HEAD of wip/tdammers/T11735

(Note that the execution times are a bit faster overall in this run because I didn't pass any -ddump flags, so GHC spends no time pretty-printing the dumps).

So, conclusions:

  • The comment:5 patch makes a huge difference for Grammar.hs (down to less than 10% in execution time)
  • "un-refactoring", strangely enough, improves performance on Grammar.hs by roughly another 25% or so
  • caching coercion roles in NthCo doesn't seem to make a difference at all, and even makes things slightly worse (this one is truly baffling IMO)
  • improvements from comment:41 make the Grammar.hs test case worse (this, too, is unexpected)
  • none of the changes here seem to affect performance for the test_rules2.hs case much, if at all

comment:47 Changed 19 months ago by tdammers

Currently doing a full test run, and I don't have the full result yet, but it seems that many tests are failing. Not sure if this is due to these particular code changes though, I will look into it once the run finishes.

comment:48 Changed 19 months ago by simonpj

There are a number of different things going on.

  1. comment:5 gets rid of a non-linearity in coercionKind for the ForAllCo case. This is a big win. We definitely want it.
  2. The same non-linearity is present in coercionKindRole so I suggested getting rid of the duplication between the two. That seems possible by caching the role in NthCo. Doing so doesn't solve any known performance problems, but seems to be nicer code.
  3. We have a clear problem in simplCast, which comes from transforming
    (f |> co) @t1 @t2 ... @tn
    ===>
    ((f @t1') |> co1) @t2 ... @tn
    ===> (etc)
    (f @t1' @t2' ... @tn') |> con
    
    Doing these steps one at a time gives clearly non-linear behaviour. Hence comment:10. I think comment:41 is in this territory.
  4. Spending too much time in CoreTidy: to be investigated.

I think it'd be helpful to separate these issues, possibly into separate tickets.

Last edited 19 months ago by simonpj

comment:49 in reply to:  45 ; Changed 19 months ago by goldfire

Replying to tdammers:

Looking at HEAD on master, we have the coercionKindRole function that is simply recursive for NthCo, and has the nth call embedded. So for nested NthCo's, its behavior should be quadratic.

I disagree here. I think the old coercionKindRole was not quadratic in this way (ignoring the ForAllCo change). Every coercionKindRole recurrence calls getNth (twice), but that doesn't lead to quadratic behavior in the depth of NthCo nesting, which is what we're worried about. It does mean that running time will be proportional to the sum of the indices in the NthCos, but that's to be expected. Let's put this another way: pretend all the indices in the NthCos are 1, a constant. (This is fairly close to reality, anyway.) Then, we're linear, not quadratic.

The un-refactored version has coercionKind, which is simply recursive (O(n)), and we have coercionRole, which recurses via coercionKindRole. The latter is really just a matter of calling coercionKind and coercionRole individually though; the coercionRole call makes coercionRole simply recursive (O(n)), but the coercionKind call introduces another O(n), making the entire thing also quadratic.

Agreed here. But note that this would remain quadratic even if all the indices in the NthCos were 1. So, if you call the original coercionKindRole quadratic, then this would be cubic (but I don't think that's a fair characterization -- there's nothing raised to the third power here).

So in terms of big-O, HEAD and "un-refactored" should be the same. Why one performs better than the other is somewhat unclear to me though.

Disagree here, as explained above. The old coercionKindRole was asymptotically better. But now that we cache roles, it should be good again.

On the other hand, the separated version really is quadratic... and yet it's faster (on this test case)! That's the conundrum.

Yes, but we don't know if it's actually big-O-faster, or just happens to have a more favorable constant factor.

Fair enough.

I'm stymied as to why my patches make things worse. Maybe moving the isReflexiveCo check in addCoerce to the top was a bad idea? Try moving that back to where it was and then try again. The simplCast stuff should be vastly better than it was!

Thanks!

comment:50 Changed 19 months ago by simonpj

Let's get (1) and (2) nailed before starting in on (3) and (4). Maybe move (3) and (4) to fresh tickets.

comment:51 in reply to:  49 Changed 19 months ago by tdammers

Replying to goldfire:

It does mean that running time will be proportional to the sum of the indices in the NthCos, but that's to be expected.

Ah, you are right, this is where I went wrong - the nth calls follow the indices, not the nested NthCos themselves. Also explains why I'm not seeing the kind of differences in the profiles that I expected.

The un-refactored version has coercionKind, which is simply recursive (O(n)), and we have coercionRole, which recurses via coercionKindRole. The latter is really just a matter of calling coercionKind and coercionRole individually though; the coercionRole call makes coercionRole simply recursive (O(n)), but the coercionKind call introduces another O(n), making the entire thing also quadratic.

Agreed here. But note that this would remain quadratic even if all the indices in the NthCos were 1. So, if you call the original coercionKindRole quadratic, then this would be cubic (but I don't think that's a fair characterization -- there's nothing raised to the third power here).

So in terms of big-O, HEAD and "un-refactored" should be the same. Why one performs better than the other is somewhat unclear to me though.

Disagree here, as explained above. The old coercionKindRole was asymptotically better. But now that we cache roles, it should be good again.

Profiling results don't agree, but other than that it seems plausible. Maybe we're hitting some sort of edge case here?

On the other hand, the separated version really is quadratic... and yet it's faster (on this test case)! That's the conundrum.

Yes, but we don't know if it's actually big-O-faster, or just happens to have a more favorable constant factor.

Fair enough.

I'm stymied as to why my patches make things worse. Maybe moving the isReflexiveCo check in addCoerce to the top was a bad idea? Try moving that back to where it was and then try again. The simplCast stuff should be vastly better than it was!

This one has me baffled as well. There's a slight possibility that the profiling runs I did were contaminated with other activity on the same machine, so I might do another run on a separate machine with nothing else going on.

comment:52 Changed 19 months ago by tdammers

Split off the simplCast part into #14737, suggest we continue the discussion there.

comment:53 Changed 19 months ago by tdammers

CoreTidy performance (part 4) moved to #14738.

comment:54 Changed 19 months ago by tdammers

Re-running my profiling job on a pristine machine without significant background noise gives me this:

--- ./cases/Grammar.hs ---
d74b37d565: 16.21
8ac966971e: 12.87
4eb140f564: 12.38
73a99750e1: 20.00
4572849929: 19.67
452dee3ff4: 256.07
--- ./cases/test_rules2.hs ---
d74b37d565: 1.12
8ac966971e: 1.10
4eb140f564: 1.10
73a99750e1: 1.10
4572849929: 1.07
452dee3ff4: 1.03

Again, with meanings:

d74b37d565: simplCore improvements
8ac966971e: With NthCo Role caching
4eb140f564: Un-refactored
73a99750e1: Added some SCC's
4572849929: Simon's patch applied
452dee3ff4: baseline, GHC `master` before branching off

Which mostly confirms the earlier conclusions. The test_rules2.hs example seems to get worse as we proceed through the changes though.

comment:55 Changed 19 months ago by simonpj

Do you have allocation numbers for these runs?

The mysterious thing is that NthCo caching makes things slightly worse. Of course the NthCo constructors have an extra field, but I'm still quite surprised that's a visible worsening. So I think it might be worth a little more digging into the worsening in 8ac966971e, really just to discover if we have accidentally left money on the table.

comment:56 Changed 19 months ago by simonpj

NB: the final one, d74b37d565 (simplCore improvements), is the domain of #14737

comment:57 in reply to:  55 Changed 19 months ago by tdammers

Replying to simonpj:

Do you have allocation numbers for these runs?

The mysterious thing is that NthCo caching makes things slightly worse. Of course the NthCo constructors have an extra field, but I'm still quite surprised that's a visible worsening. So I think it might be worth a little more digging into the worsening in 8ac966971e, really just to discover if we have accidentally left money on the table.

Yes, of course.

This is 8ac966971e:

        Tue Jan 30 11:25 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp

        total time  =       12.87 secs   (12865 ticks @ 1000 us, 1 processor)
        total alloc = 14,385,409,080 bytes  (excludes profiling overheads)

COST CENTRE             MODULE     SRC                                                 %time %alloc

simplCast               Simplify   compiler/simplCore/Simplify.hs:871:62-87             33.7   32.4
addCoerce-pushCoTyArg   Simplify   compiler/simplCore/Simplify.hs:(1236,12)-(1237,72)   13.1   13.1
tc_rn_src_decls         TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)     12.3   14.0
subst_ty                TyCoRep    compiler/types/TyCoRep.hs:2237:28-32                  6.7   10.4
CoreTidy                HscMain    compiler/main/HscMain.hs:1253:27-67                   4.4    3.9
coercionKind            Coercion   compiler/types/Coercion.hs:1725:3-7                   2.6    6.0
zonkTopDecls            TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    2.0    2.2
load'.checkHowMuch      GhcMake    compiler/main/GhcMake.hs:(270,9)-(272,27)             2.0    0.0
simplCast-simplCoercion Simplify   compiler/simplCore/Simplify.hs:1224:57-77             1.6    1.5
simplExprF1-Lam         Simplify   compiler/simplCore/Simplify.hs:896:5-39               1.6    2.2
deSugar                 HscMain    compiler/main/HscMain.hs:511:7-44                     1.5    1.3
simplCast-addCoerce     Simplify   compiler/simplCore/Simplify.hs:1225:53-71             1.4    1.3
tcRnImports             TcRnDriver compiler/typecheck/TcRnDriver.hs:240:20-50            1.0    0.1
Parser                  HscMain    compiler/main/HscMain.hs:(316,5)-(384,20)             0.9    1.6

Which isn't actually significantly worse than 4eb140f564:

        Tue Jan 30 11:45 2018 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp

        total time  =       12.38 secs   (12380 ticks @ 1000 us, 1 processor)
        total alloc = 14,385,403,880 bytes  (excludes profiling overheads)

COST CENTRE             MODULE     SRC                                                 %time %alloc

simplCast               Simplify   compiler/simplCore/Simplify.hs:871:62-87             34.3   32.4
addCoerce-pushCoTyArg   Simplify   compiler/simplCore/Simplify.hs:(1236,12)-(1237,72)   13.3   13.1
tc_rn_src_decls         TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)     12.7   14.0
subst_ty                TyCoRep    compiler/types/TyCoRep.hs:2225:28-32                  7.0   10.4
CoreTidy                HscMain    compiler/main/HscMain.hs:1253:27-67                   4.5    3.9
coercionKind            Coercion   compiler/types/Coercion.hs:1707:3-7                   2.9    6.0
zonkTopDecls            TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    2.0    2.2
simplExprF1-Lam         Simplify   compiler/simplCore/Simplify.hs:896:5-39               1.8    2.2
simplCast-simplCoercion Simplify   compiler/simplCore/Simplify.hs:1224:57-77             1.7    1.5
deSugar                 HscMain    compiler/main/HscMain.hs:511:7-44                     1.6    1.3
simplCast-addCoerce     Simplify   compiler/simplCore/Simplify.hs:1225:53-71             1.4    1.3
simplIdF                Simplify   compiler/simplCore/Simplify.hs:868:61-79              1.0    0.5
Parser                  HscMain    compiler/main/HscMain.hs:(316,5)-(384,20)             1.0    1.6

This is only about 4% more execution time, and 0.000035% more allocations.

comment:58 Changed 19 months ago by tdammers

Split off tasks 1 (the original patch from comment:5) and 2 (further refactoring of coercionKind/coercionKindRole) into separate branches, cherry-picking commits onto current GHC HEAD.

Branches are wip/tdammers/T11735-1 and wip/tdammers/T11735-2 (the latter also contains commits from the former, because I figured that would make more sense).

I also pushed the patch for task 1 to phabricator D4355.

comment:59 Changed 19 months ago by simonpj

So the allocations are really close. Where does the extra time go?

Can you show mutator and GC times? Maybe residency, and hence GC time, is higher? Perhaps just an accident of the moments at which major GC happens. What happens if you use just one generation?
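For instance, something like

    ghc-stage2 -B<libdir> ./cases/Grammar.hs -o ./a -fforce-recomp +RTS -s -G1 -RTS

(the same invocation as in the profiles above, with <libdir> standing in for the inplace lib path; -s prints mutator/GC times and max residency, and -G1 restricts the RTS to a single generation) should tell us whether the extra time is going to GC.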

comment:60 Changed 19 months ago by goldfire

I think I see what's going on here. The "cached role" results are from commit 8ac966971ec6c30cc3681a913fb9fb1c2342f6cc. This version eagerly computes the role for an NthCo in mkNthCo. This means that every (seqed) NthCo requires a role calculation. Perhaps sometimes, we create an NthCo and never ever check its role. If that's the case, then 8ac96 will be worse.

On the other hand, what if you try my commit 8a6aa5030d34592200fbe799bf38abf3701544db? (Do not be thrown off by the same first two characters of the hash! I was.) That commit supplies the role to mkNthCo, as it can often be deduced by the caller of mkNthCo without too much trouble. No redundant role computation. This one should really be a clear win.

By the way, I slightly changed the way I cached the role: I allowed a more permissive role in the NthCo. For example, suppose you have g :: [a] ~N [b] and you want h :: a ~R b. Before these patches, you would use h = SubCo (NthCo 0 g). With the patch, you can now say h = NthCo Representational 0 g, where Lint checks to make sure that the role is appropriate. Because of this new degree of freedom, it became necessary to serialize the role in iface files. (The serializer could instead be clever and convert an NthCo into a SubCo and an NthCo, but that seemed not to be worth it.) I forget whether this realization about iface files made it into the original patch -- that change might be in a later commit. But the commit I referenced above should work fine outside of iface files.
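To make the shape of this concrete, here is a minimal, self-contained Haskell sketch of the caching idea. The types and names below (Co, Leaf, the simplified mkNthCo) are toy stand-ins rather than GHC's real definitions; the point is only that a caller-supplied role is stored once and read back without any recursive computation.

data Role = Nominal | Representational | Phantom
  deriving (Eq, Ord, Show)

-- Toy coercion type: NthCo carries its role, as in the patch.
data Co
  = NthCo Role Int Co   -- cached role, index, underlying coercion
  | SubCo Co            -- weaken a nominal coercion to representational
  | Leaf Role           -- stand-in for every other coercion form

-- The caller usually knows the role already, so we just store it;
-- nothing has to be recomputed when the NthCo is later forced.
mkNthCo :: Role -> Int -> Co -> Co
mkNthCo = NthCo

-- Reading the role of an NthCo is now O(1).
coercionRole :: Co -> Role
coercionRole (NthCo r _ _) = r
coercionRole (SubCo _)     = Representational
coercionRole (Leaf r)      = r

main :: IO ()
main = do
  let g = Leaf Nominal                    -- think of g :: [a] ~N [b]
      h = mkNthCo Representational 0 g    -- with the patch: h :: a ~R b
  print (coercionRole h)                  -- Representational, straight from the cache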

comment:61 Changed 19 months ago by simonpj

It's nice that OptCoercion for NthCo gets simpler when we have the role cached.

I allowed a more permissive role in the NthCo

OK, but

  • I didn't see a change to mkSubCo that would take advantage of this (by changing the role rather than wrapping in SubCo)
  • The comment on the constructor NthCo should state the invariant. Given (NthCo r n co) we require lteRole r (nthRole (coercionRole co) n), or something like that.
  • Why doesn't the same flexibility apply to the other coercions that cache roles: FunCo, Refl, TyConAppCo, UnivCo? We should state their invariants too...

comment:62 Changed 19 months ago by goldfire

All good points. Reconsidering this, I think it's probably best for me to undo this choice -- it adds a subtle complication to the theory for a very tiny upside. I probably won't get to do this until Friday, but I'll make this tweak.

comment:63 Changed 19 months ago by goldfire

Note that this last point should have almost no effect at all on performance, and so any data collected now would still be valid.

comment:64 Changed 19 months ago by simonpj

Fine. In undoing, could you also add the invariants for all Role annotations to the data type itself?

  • The comment on the constructor NthCo should state the invariant. Given (NthCo r n co) we require lteRole r (nthRole (coercionRole co) n), or something like that.

comment:65 Changed 19 months ago by Ben Gamari <ben@…>

In e4ab65bd/ghc:

Optimize coercionKind (Trac #11735)

Reviewers: simonpj, goldfire, bgamari

Reviewed By: simonpj

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #11735

Differential Revision: https://phabricator.haskell.org/D4355

comment:66 in reply to:  60 Changed 19 months ago by tdammers

Replying to goldfire:

I think I see what's going on here. The "cached role" results are from commit 8ac966971ec6c30cc3681a913fb9fb1c2342f6cc. This version eagerly computes the role for an NthCo in mkNthCo. This means that every (seqed) NthCo requires a role calculation. Perhaps sometimes, we create an NthCo and never ever check its role. If that's the case, then 8ac96 will be worse.

On the other hand, what if you try my commit 8a6aa5030d34592200fbe799bf38abf3701544db? (Do not be thrown off by the same first two characters of the hash! I was.) That commit supplies the role to mkNthCo, as it can often be deduced by the caller of mkNthCo without too much trouble. No redundant role computation. This one should really be a clear win.

I've done another run using just these three commits, and it shows that the last one is slightly better than the original situation for the test_rules2 case, and about the same for the Grammar.hs case:

--- ./cases/Grammar.hs ---
4eb140f564: 12.31
8ac966971e: 12.70
8a6aa5030d: 12.31
--- ./cases/test_rules2.hs ---
4eb140f564: 1.11
8ac966971e: 1.14
8a6aa5030d: 1.09

So the 8ac... commit makes things slightly worse, but at 8a6... we're back at equal-or-better.

comment:67 Changed 19 months ago by Simon Peyton Jones <simonpj@…>

In db5a4b83/ghc:

Re-center improved perf for T3064

There's a 6% reduction in allocation on T3064.  I think it's due to

  commit e4ab65bd57b2c39f4af52879654514bb6d5b42a0
  Author: Tobias Dammers <tdammers@gmail.com>
  Date:   Wed Jan 31 21:39:45 2018 -0500

      Optimize coercionKind (Trac #11735)

I'm not certain -- but, hey, it's good news

comment:68 in reply to:  67 Changed 19 months ago by tdammers

Replying to Simon Peyton Jones <simonpj@…>:

In db5a4b83/ghc:

Re-center improved perf for T3064

There's a 6% reduction in allocation on T3064.  I think it's due to

  commit e4ab65bd57b2c39f4af52879654514bb6d5b42a0
  Author: Tobias Dammers <tdammers@gmail.com>
  Date:   Wed Jan 31 21:39:45 2018 -0500

      Optimize coercionKind (Trac #11735)

I'm not certain -- but, hey, it's good news

We already knew that this change gives us insane improvements in at least one case, so it's good to see that reflected somewhere in the test suite. I wonder what T3064 does and whether it's related.

comment:69 in reply to:  59 Changed 19 months ago by tdammers

Replying to simonpj:

So the allocations are really close. Where does the extra time go?

Can you show mutator and GC times? Maybe residency, and hence GC time, is higher? Perhaps just an accident of the moments at which major GC happens. What happens if you use just one generation?

I've done a run with -s, which gives us:

--- ./cases/Grammar.hs ---
4eb140f564: 12.31 sec, max res. 257,842,544
8ac966971e: 12.70 sec, max res. 425,496,520
8a6aa5030d: 12.31 sec, max res. 254,680,256
--- ./cases/test_rules2.hs ---
4eb140f564: 1.11 sec, max res. 84,030,624
8ac966971e: 1.14 sec, max res. 81,325,072
8a6aa5030d: 1.09 sec, max res. 56,322,200

So it looks like your hunch was about right: in the Grammar.hs case, max residency jumps by roughly 65% at 8ac..., but 8a6... brings it back to normal; in the test_rules2.hs case we see only improvement.

comment:70 Changed 19 months ago by tdammers

Status:

  • Thing 1: Getting rid of non-linearity in coercionKind for the ForAllCo case. Done.
  • Thing 2: Attacking non-linearity in coercionKindRole on NthCos. I think this part is also more or less done, but I haven't seen anything on Phab yet.
  • Thing 3: Fixing performance issues with simplCast / SimplCore. Moved to #14737; I believe @goldfire is still working on this.
  • Thing 4: Investigating CoreTidy. Moved to #14738; however, the problem no longer seems to manifest on GHC HEAD.

comment:71 Changed 19 months ago by goldfire

I'm no longer working on this. I pushed the changes I had in mind to the tdammers branch. If the performance is OK there, could you please put them up on Phab and then commit? As for simplCast, the changes I had in mind are all on your branch, including removing the inefficient one-variable substitution. If my changes aren't an improvement, try undoing the change I made to the ordering of cases in simplCast: I moved the isReflexiveCo case to the top, but perhaps that check is too expensive and is better at the bottom. Regardless, the changes I made in CoreOpt should be enough to improve performance.

Thanks!

comment:72 Changed 19 months ago by tdammers

https://phabricator.haskell.org/D4385 should have all the relevant changes for subtasks 2 and 3, plus the SCCs I added to (unsuccessfully) hunt down subtask 4.

comment:73 Changed 19 months ago by simonpj

Good progress

  • Let's split the commits for Thing 2 and Thing 3; they are quite different.
  • We agreed in comment:62 not to do the subtle SubCo stuff for NthCo; that means some changes (simplifications).
  • I have not properly reviewed the simplCast changes, but they don't appear to follow the plan I suggest in comment:10. Richard, can you comment on that?

comment:74 Changed 19 months ago by goldfire

Yes, I know changing NthCo caching is outstanding.

For the simplCast changes: you're right -- I ignored your suggestions, opting for something simpler. In HEAD, pushTyCoArg (along with friends) goes to some trouble to create a type-correct Refl coercion. But this coercion soon ends up being used in a cast, where Refls are simply discarded. So all that work in building the coercion is wasted. My change is to return Maybe Coercion, preferring Nothing over Just (Refl ...). This avoids the inefficient one-variable substitution that we were worrying about. It should be a clear win; we're doing strictly less work. The small complication is that there are these Maybes floating about, but I think it's worth it. I'm hoping the comments on the push functions are clear enough, but perhaps I didn't include any information on why Maybe Coercion is better than Coercion here.
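A rough, self-contained sketch of that pattern (toy types and made-up names, not the real pushCoTyArg/pushValArg): return Nothing instead of manufacturing a Refl coercion, and let the cast site treat Nothing as "no cast needed".

-- Toy stand-ins; only the Maybe-Coercion idea is the point here.
data Co = ReflCo String | OtherCo String
  deriving Show

-- Old style would always return a Co, forcing us to compute the type to
-- put inside ReflCo even though the cast site immediately drops it.
-- New style: Nothing stands for a reflexive coercion, so no work is done.
pushArg :: Maybe Co -> Maybe Co
pushArg Nothing           = Nothing     -- nothing to push
pushArg (Just (ReflCo _)) = Nothing     -- reflexive: drop it early
pushArg (Just co)         = Just co     -- a real coercion to keep

-- The cast site simply omits the cast when there is no coercion.
mkCastMaybe :: String -> Maybe Co -> String
mkCastMaybe e Nothing   = e
mkCastMaybe e (Just co) = e ++ " |> " ++ show co

main :: IO ()
main = do
  putStrLn (mkCastMaybe "f x" (pushArg (Just (ReflCo "T"))))    -- prints: f x
  putStrLn (mkCastMaybe "f x" (pushArg (Just (OtherCo "g"))))   -- prints: f x |> OtherCo "g"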

comment:75 Changed 19 months ago by simonpj

OK: once we have a separate Phab for the simplCast change, I'll review it. Ping me, Tobias!

comment:76 Changed 18 months ago by goldfire

OK. I've pushed my refactoring of NthCo to the branch we've been working on. But I'm unsure of the current state of affairs with the Phab stuff, so I didn't upload anything there.

comment:77 in reply to:  75 Changed 18 months ago by tdammers

Replying to simonpj:

OK: once we have a separate Phab for the simplCast change, I'll review it. Ping me, Tobias!

I split up D4385 into https://phabricator.haskell.org/D4394 (the NthCo stuff) and https://phabricator.haskell.org/D4395 (simplCast / Simplify). May I assume that the refactoring mentioned in comment:76 should go in D4394?

comment:78 Changed 18 months ago by goldfire

Yes, please. Many thanks!

comment:79 Changed 18 months ago by tdammers

Alright, all the changes from the wip/11735 branch should be on either D4394 or D4395 now.

comment:80 Changed 17 months ago by tdammers

Effect of D4394 on compilation of Grammar.hs:

Compiling with GHC HEAD (02b3dad195), built with the quick profile:

tobias@zoidberg:~/well-typed/devel/ghc-T14683/ > time ../ghc/HEAD/inplace/bin/ghc-stage2 grammar-hs/Grammar.hs -fforce-recomp -O2
[1 of 1] Compiling Grammar          ( grammar-hs/Grammar.hs, grammar-hs/Grammar.o )
../ghc/HEAD/inplace/bin/ghc-stage2 grammar-hs/Grammar.hs -fforce-recomp -O2  465.26s user 0.52s system 100% cpu 7:45.42 total

Compiling with D4394 applied (6493976fdb), built with the quick profile:

tobias@zoidberg:~/well-typed/devel/ghc-T14683/ > time ../ghc/D4394/inplace/bin/ghc-stage2 grammar-hs/Grammar.hs -fforce-recomp -O2                     
[1 of 1] Compiling Grammar          ( grammar-hs/Grammar.hs, grammar-hs/Grammar.o )
../ghc/D4394/inplace/bin/ghc-stage2 grammar-hs/Grammar.hs -fforce-recomp -O2  25.95s user 0.36s system 94% cpu 27.977 total

comment:81 Changed 17 months ago by tdammers

Differential Rev(s): D4394 D4395

comment:82 Changed 17 months ago by tdammers

Profile after applying D4395 (coercionKind / Role refactoring):

	Mon Mar 26 15:24 2018 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -RTS -B/home/tobias/well-typed/devel/ghc/D4395-modified/inplace/lib ./cases/Grammar.hs -o ./a -fforce-recomp

	total time  =       19.66 secs   (19655 ticks @ 1000 us, 1 processor)
	total alloc = 24,638,084,488 bytes  (excludes profiling overheads)

COST CENTRE     MODULE     SRC                                                 %time %alloc

simplCast       Simplify   compiler/simplCore/Simplify.hs:(1213,5)-(1215,37)    73.8   76.0
tc_rn_src_decls TcRnDriver compiler/typecheck/TcRnDriver.hs:(494,4)-(556,7)      8.3    8.4
CoreTidy        HscMain    compiler/main/HscMain.hs:1253:27-67                   2.8    2.2
SimplTopBinds   SimplCore  compiler/simplCore/SimplCore.hs:770:39-74             2.7    1.9
coercionKind    Coercion   compiler/types/Coercion.hs:1698:3-7                   1.8    3.4
zonkTopDecls    TcRnDriver compiler/typecheck/TcRnDriver.hs:(445,16)-(446,43)    1.3    1.3
deSugar         HscMain    compiler/main/HscMain.hs:511:7-44                     1.1    0.8

This means that performance is on par, but simplCast still dominates the profile at nearly 74% of time and 76% of allocations, so we should do more digging there.

comment:83 Changed 17 months ago by tdammers

Further digging into simplCast performance to be handled in #14737.

comment:84 Changed 17 months ago by simonpj

Yes!! simplCast should not take 73% of compiler time!!

comment:85 Changed 17 months ago by tdammers

comment:86 Changed 17 months ago by Ben Gamari <ben@…>

In 7bb7f99/ghc:

Discard reflexive casts during Simplify

Trac #14737 (derived from Trac #11735) found that 75% of compile
time was being spent in simplCast. This patch is the first in a series
to deal with that problem.

This particular patch actually has very little effect on performance; it
just refactors simplCast so that it builds Refl coercions less often.
Refl coercions require us to compute the type to put inside them, and
even if that's done lazily it is still work and code. Instead we use
Maybe Coercion with Nothing for Refl. This change also percolates to
pushCoTyArg and pushValArg.

Reviewers: goldfire, bgamari, simonpj

Reviewed By: simonpj

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #14737

Differential Revision: https://phabricator.haskell.org/D4395

comment:87 Changed 16 months ago by Ben Gamari <ben@…>

In 2fbe0b5/ghc:

Caching coercion roles in NthCo and coercionKindsRole refactoring

While addressing nonlinear behavior related to coercion roles,
particularly `NthCo`, we noticed that coercion roles are recalculated
often even though they should be readily at hand already in most cases.
This patch adds a `Role` to the `NthCo` constructor so that we can cache
them rather than having to recalculate them on the fly.
https://ghc.haskell.org/trac/ghc/ticket/11735#comment:23 explains the
approach.

Performance improvement over GHC HEAD, when compiling Grammar.hs (see below):

GHC 8.2.1:
```
ghc Grammar.hs  176.27s user 0.23s system 99% cpu 2:56.81 total
```

before patch (but with other optimizations applied):
```
ghc Grammar.hs -fforce-recomp  175.77s user 0.19s system 100% cpu 2:55.78 total
```

after:
```
../../ghc/inplace/bin/ghc-stage2 Grammar.hs  10.32s user 0.17s system 98% cpu 10.678 total
```

Introduces the following regressions:

- perf/compiler/parsing001 (possibly false positive)
- perf/compiler/T9872
- perf/compiler/haddock.base

Reviewers: goldfire, bgamari, simonpj

Reviewed By: simonpj

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #11735

Differential Revision: https://phabricator.haskell.org/D4394