Opened 13 years ago

Last modified 16 months ago

#855 new task

Improvements to SpecConstr

Reported by: simonpj
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 6.4.2
Keywords: SpecConstr
Cc: hackage.haskell.org@…, ekmett@…, mpickering, sgraf, maoe
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case: N/A
Blocked By:
Blocking: #915
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

There are several possible improvements to SpecConstr, described in the source code in compiler/specialise/SpecConstr:

  • Specialising for constant parameters
  • Specialising for lambda parameters
  • Two ideas to do with strictness that look more tricky

Some of them look quite straightforward.

Change History (21)

comment:1 Changed 13 years ago by igloo

Milestone: 6.8
Test Case: N/A

comment:2 Changed 12 years ago by simonpj

Milestone: 6.8 branch → _|_

I'll look at this again when someone finds that SpecConstr really isn't doing the Right Thing for them. My guess is that this will happen either in array fusion for nested data parallelism, or in the use of stream fusion for lists (Coutts et al ICFP'07).

Meanwhile, here is the last message I had from Roman on the subject, which summarises what he did to make SpecConstr work reasonably well for their ICFP paper.

Specialising on constants

Consider the following program (admittedly contrived, but we get this kind of code with stream fusion):

lvl = Just True

foo :: Maybe Bool -> Int -> Int
foo (Just True) i = i
foo _           i = foo lvl i

SpecConstr doesn't optimise this even though it is supposed to specialise on global variables. If I understand correctly, the reason for this is the following test in argToPat:

   -- Check if the argument is a variable that
   -- is in scope at the function definition site
   -- It's worth specialising on this if
   --   (a) it's used in an interesting way in the body
   --   (b) we know what its value is
argToPat in_scope con_env (Var v) arg_occ
   | not (isLocalId v) || v `elemInScopeSet` in_scope,
     case arg_occ of { UnkOcc -> False; other -> True },        -- (a)
     isValueUnfolding (idUnfolding v)                   -- (b)
   = return (True, Var v)

The problem with this is that v might have been cloned (via extendBndr*) and in that case, it won't have an unfolding. That is precisely what happens in the above example, I believe.

I solved this by explicitly recording which binders have a known value, but you might have a better idea.
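
For reference, the specialisation one would hope SpecConstr to produce for the example above looks roughly like this. This is a hand-written sketch, not actual compiler output; foo_spec and the rule name are made up:

foo_spec :: Int -> Int
foo_spec i = i   -- foo specialised for the call pattern where the first
                 -- argument is lvl, whose known value is Just True

{-# RULES "SC:foo" forall i. foo lvl i = foo_spec i #-}

With that rule, the recursive call foo lvl i in the second equation rewrites to foo_spec i.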

The threshold

I don't like the spec-threshold idea at all. In fact, it was the first thing I had to disable for the ICFP paper since it just doesn't make sense for the kind of code stream fusion generates. It can produce some very large loops which *must* be specialised; indeed, SpecConstr will usually decrease the program size.

Specialising on lambdas

Specialising on lambdas is absolutely crucial for stream fusion (for nested lists). I completely agree that the direct approach was too fragile. What I did was a rather bad hack but it worked.

  • I extended FloatOut (or, rather, SetLevels) such that it would float out lambdas in recursive arguments, i.e., given
           foo x = ... foo (C (\x. e)) ...
    
    it would produce
           f a1 ... an x = e
           foo x = ... foo (C (f a1 ... an)) ...
    
    where a1,...,an are the free variables of e.
  • This floating would happen before every SpecConstr pass (see below).
  • I extended SpecConstr (very hackishly) to specialise on partial applications of global variables. For the above, it would produce
           foo' a1 ... an = ...
    
    and the rewrite rule
           forall a1 ... an. foo (C (f a1 ... an)) = foo' a1 ... an
    
    Now, f is used directly in foo'. We have a bad staging problem here, however: we definitely want to inline f in foo' (i.e., it should get an INLINE pragma) but not in foo (at least, not before the rule has fired). I solved this with a terrible hack involving core notes, but clearly something better is needed here. Note that we might generate several specialisations which use the same f, and we want f inlined in every one of them. (A concrete, hypothetical sketch of this scheme follows after this list.)
  • Finally, inlining f in foo' and simplifying might expose further opportunities for SpecConstr. For the ICFP paper, I implemented a FloatOut/SpecConstr/Simplify loop. Stream fusion with nested lists requires as many iterations as there are nesting levels (each iteration would peel away one nesting level). Note that the current specLoop doesn't help here because we really have to do full simplification (including inlining).
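
To make the scheme above concrete, here is a hand-written, hypothetical instance (the names foo, f, foo' and the constructor C come from the scheme above; the concrete definitions and the rule are invented for illustration, and are what the transformations are intended to produce, not actual compiler output):

data T = C (Int -> Int)

-- Before floating: a lambda with free variable n in the recursive argument.
foo :: T -> Int -> Int
foo (C g) n
  | n == 0    = g 0
  | otherwise = foo (C (\x -> x + n)) (n - 1)

-- The extended FloatOut gives the lambda a name, closed over its free
-- variable, so foo's recursive call becomes: foo (C (f n)) (n - 1)
f :: Int -> Int -> Int
f n x = x + n

-- SpecConstr then specialises foo for the call pattern (C (f a1)):
foo' :: Int -> Int -> Int
foo' a1 m
  | m == 0    = f a1 0                   -- f is used directly here: inline it
  | otherwise = foo (C (f m)) (m - 1)    -- becomes foo' m (m - 1) once the rule fires

{-# RULES "SC:foo" forall a1 m. foo (C (f a1)) m = foo' a1 m #-}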

That was basically all I did. We didn't really use the static argument transformation because specialising on global variables essentially has the same effect.

comment:3 Changed 11 years ago by simonmar

Architecture: Unknown → Unknown/Multiple

comment:4 Changed 11 years ago by simonmar

Operating System: Unknown → Unknown/Multiple

comment:5 Changed 7 years ago by morabbin

Type of failure: None/Unknown

Do we have a place for the kind of knowledge buried in simonpj's comment above? Or a ticket type? (My apologies if I could have found this info elsewhere.)

comment:6 Changed 7 years ago by liyang

Cc: hackage.haskell.org@… added

comment:7 Changed 6 years ago by ekmett

Cc: ekmett@… added

comment:8 Changed 4 years ago by thomie

Type of failure: None/Unknown → Runtime performance bug

comment:9 Changed 3 years ago by mpickering

Cc: mpickering added
Keywords: SpecConstr added

comment:10 Changed 22 months ago by sgraf

Cc: sgraf added

comment:11 Changed 22 months ago by sgraf

Blocking: 915 added

comment:12 Changed 22 months ago by sgraf

Blocking: 915 removed

Re: Specialising on lambdas.

To have a concrete example to work on, here is stream-fusion.hs, which contains a minimal stream fusion harness using just singletonS, enumFromToS, sumS, mapS and concatMapS. There are three example functions ex{1,2,3} with increasing nesting of concatMapS that should all reduce to the same function goal. More examples can follow once we get these running.

Currently, GHC HEAD (8.5, 8529fbba309cd692bbbb0386321515d05a6ed256) produces this infamous piece of Core (-ddump-spec) for ex1. The goal is to specialise go for the call-pattern including a lambda here.

Before I found this ticket, I basically re-implemented part of what simonpj did; the result is in this commit. I tracked CallOccurrences similarly to ScrutOccs.

With this specialisation for lambdas, things currently look like this. The function has been specialised alright, but the free variables of said lambda, which include the constant Step, are passed to the specialisation as arguments.

We want to specialise for this new Call pattern! However, without -fspec-constr-keen, SpecConstr will only fire when it finds a matching ScrutOcc. Looking at the output of -ddump-spec, the corresponding ScrutOcc only becomes visible in the specialised RHS, but we currently only specialise the original RHS. Even then, I imagine that in the general case we need a simplifier run in between to reliably detect that we scrutinise the new parameter. But that's not actually a problem, because we can use GHC.Types.SPEC to tell the compiler to specialise without seeing an ArgOcc.

Still, GHC needs to see a Call pattern with that constant argument, which will emerge here, but only after the corresponding RULEs have fired.
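
(As an aside, here is a minimal, hypothetical illustration of the GHC.Types.SPEC trick mentioned above; the module and function names are made up:)

{-# LANGUAGE BangPatterns #-}
module SpecDemo where

import GHC.Types (SPEC (..))

-- Threading a SPEC-typed argument through the loop asks SpecConstr to
-- specialise 'go' for every call pattern it sees, without first having
-- to observe a matching ArgOcc/ScrutOcc for the Maybe argument.
sumWith :: Maybe Int -> Int -> Int
sumWith mb n = go SPEC mb 0 n
  where
    go :: SPEC -> Maybe Int -> Int -> Int -> Int
    go !_ m !acc i
      | i <= 0    = acc
      | otherwise = go SPEC m (acc + step m) (i - 1)

    step Nothing  = 1
    step (Just k) = k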

Here are some ideas:

  1. We can query exprFreeVarsList here, see which free variables of the lambda have a known constant value, and include these in the specialisation. So far this has been fruitless for this case, as the free var for the Yield ... wasn't included in the ValueEnv. I'm not sure this is even enough to reach the fixed point in all cases.
  2. Specialise specialisations (and fix anything non-terminating) (+ mini simplifier passes) + mini RULE engine. I'm uncomfortable with that much code duplication.
  3. Include call-pattern specialisation in the Simplifier, for a radical change. On an abstract level, this seems reasonable: Specialisation is a more modest form of inlining with code size in mind, one that even works for recursive functions. It could chew on stuff the inliner gave up on because of code size requirements (even non-recursive functions).

comment:13 Changed 22 months ago by sgraf

Blocking: 915 added

Sorry, didn't mean to remove that Blocking thing.

comment:14 Changed 22 months ago by simonpj

Include call-pattern specialisation in the Simplifier, for a radical change

I'm not keen on this:

  • The Simplifier and SpecConstr are both very complex passes. Combining them would be ... difficult.
  • SpecConstr relies on a pretty elaborate form of occurrence analysis, which I'm not sure we'd want to replicate in the simplifier.

Let's be really sure that nothing else will work first.

What about just running SpecConstr twice (with simplification in between)? It's a bit brutal (and why not three times?), but it might help.

Specialising on lambdas is pretty ambitious.

Last edited 22 months ago by simonpj

comment:15 Changed 22 months ago by simonpj

Another way of thinking about this: think in terms of defunctionalisation.

Consider this higher-order function:

let f :: (Int -> Bool) -> Int -> Char
    f g x = ....(g e)...
in
   ...(f (\x.y+x)).... (f (\v.v*p*q))....(f h)...

Now defunctionalise by making a version of f that takes a data structure as its argument:

data G_Fun = G1 Int          -- \x.y+x
           | G2 Int Int      -- \v.v*p*q
           | G3 (Int->Bool)  -- Escape hatch

applyG :: G_Fun -> Int -> Bool
applyG (G1 y)   = \x.y+x
applyG (G2 p q) = \v.v*p*q
applyG (G3 g)   = g

let f' :: G_Fun -> Int -> Char
    f' ga x = ...(applyG ga e)...
in
 ...(f' (G1 y))...(f' (G2 p q))...(f' (G3 h))

(I guess you could do this via a w/w kind of transformation, but for now it's purely hypothetical.)

Now we are back in the land of data-constructors, where SpecConstr thrives. Suppose the call is actually

  ...(f' (G1 (Yield e1 e2 e3)))...

Should we specialise on (G1 x) or on the deeper pattern (G1 (Yield a b c))? It depends how much f' scrutinises its argument. And you can see that from what applyG does.

I think you could follow all this reasoning without actually creating G_Fun etc.

comment:16 Changed 22 months ago by sgraf

Thanks for your insights! I have to read up on defunctionalisation, but what you suggest is basically what I had in mind in 1. above: query the free variables of the lambda and see if they are in the ValueEnv (meaning their RHSs are in WHNF, too). For some reason they weren't present where I expected them; I'm looking into this now.

Should we specialise on (G1 x) or on the deeper pattern (G1 (Yield a b c))? It depends how much f' scrutinises its argument. And you can see that from what applyG does.

Well, I'm not sure we can see that without running some kind of simplification first. But with SPEC, the matching ArgOcc is irrelevant, so we can look into this later. I conveniently left CallOcc with an ArgOcc field for that, i.e. for when the result of the call is scrutinised.

I'll go and see if I can get the free vars of the lambda into the ValueEnv.

comment:17 Changed 22 months ago by sgraf

This is what I have so far: diff

The reason why I couldn't find the lambda's free vars in the ValueEnv earlier is that the matching argToPat case didn't add them before. That's a problem for our specific case, where the free vars of the lambda are bound within the call-pattern, e.g. Just (let x = 4 in \y. x + y).

With that fix in place, the free var values are detected just fine, and, guided by the ConVal case, I could implement the specialisation code. From what I can tell, the generated specialisation looks great, but it isn't called anywhere, because the rules no longer fire.

My particular solution would just introduce let bindings into the lambda body that re-bind things we find to be values (similar to how arguments of a data con are handled). Example:

let {
  lvl_s4md = *# sc_s4rb sc_s4rb
  lvl_s4lL = I# lvl_s4md
  lvl_s4ll = Yield @ Bool @ Int False lvl_s4lL } in
MkStream
  @ Int
  @ Bool
  (\ (ds_d2Fz :: Bool) ->
    case ds_d2Fz of {
      False -> Stop @ Bool @ Int;
      True -> lvl_s4ll
    })
  True

==>

let {
  lvl_s4md = *# sc_s4rb sc_s4rb
  lvl_s4lL = I# lvl_s4md
  lvl_s4ll = Yield @ Bool @ Int False lvl_s4lL } in
MkStream
  @ Int
  @ Bool
  (\ (ds_d2Fz :: Bool) ->
    let {
      lvl_s4ll = Yield @ Bool @ Int False (I# sc_s4md) } in
    case ds_d2Fz of {
      False -> Stop @ Bool @ Int;
      True -> lvl_s4ll
    })
  True

Kind of a manual float-in of all bindings that we know are values and that have a corresponding ScrutOcc (which we ignore at the moment).

This results in a much better specialisation for Just (MkStream <lam> True), I believe (-ddump-simpl):

$s$wgo_s4rg :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
$s$wgo_s4rg (sc_s4rf :: GHC.Prim.Int#)
            (sc1_s4re :: GHC.Prim.Int#)
            (sc2_s4rd :: GHC.Prim.Int#)
  = jump $wgo_s4ph
      GHC.Types.SPEC
      (GHC.Prim.+# sc2_s4rd sc_s4rf)
      sc1_s4re
      (GHC.Base.Just
        @ (Stream Int)
        (Main.MkStream
            @ Int
            @ Bool
            (\ (ds_d2Fz :: Bool) ->
              case ds_d2Fz of {
                False -> Main.Stop @ Bool @ Int;
                True ->
                  Main.Yield @ Bool @ Int GHC.Types.False (GHC.Types.I# sc_s4rf)
              })
            GHC.Types.False));

There's a problem though: That function doesn't get called any more, because the corresponding RULE looks like this:

"SC:$wgo2" [0]
  forall (sc_s4rj :: GHC.Prim.Int#)
          (sc_s4rk :: Bool)
          (sc_s4ri :: GHC.Prim.Int#)
          (sc_s4rh :: GHC.Prim.Int#).
    $wgo_s4ph GHC.Types.SPEC
              sc_s4rh
              sc_s4ri
              (GHC.Base.Just
                  @ (Stream Int)
                  (Main.MkStream
                    @ Int
                    @ Bool
                    (\ (ds_d2Fz :: Bool) ->
                        let {
                          sc_s4rf :: GHC.Prim.Int#
                          [LclId]
                          sc_s4rf = sc_s4rj } in
                        let {
                          lvl_s4ll :: Step Bool Int
                          [LclId]
                          lvl_s4ll
                            = Main.Yield
                                @ Bool
                                @ Int
                                GHC.Types.False
                                (GHC.Types.I# sc_s4rf) } in
                        case ds_d2Fz of {
                          False -> Main.Stop @ Bool @ Int;
                          True -> lvl_s4ll
                        })
                    sc_s4rk))
    = jump $s$wgo_s4rr sc_s4rj sc_s4rk sc_s4ri sc_s4rh

That is, the extra bindings I introduced obstruct the RULE engine. I can probably figure out how to generate the right rules tomorrow.

Please yell if I'm missing some critical pass or function to do this!

comment:18 Changed 22 months ago by simonpj

Calling the very simple optimiser (in CoreOpt) on the LHS might help.

comment:19 Changed 22 months ago by sgraf

Long story short, I had success with the following diff. Currently, it optimizes ex1 to this code:

-- RHS size: {terms: 39, types: 21, coercions: 0, joins: 3/3}
Main.$wex1 [InlPrag=NOINLINE] :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Caf=NoCafRefs, Str=<S,U>, Unf=OtherCon []]
Main.$wex1
  = \ (ww_s4pl :: GHC.Prim.Int#) ->
      joinrec {
        $s$wgo_s4rk
          :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
        [LclId[JoinId(3)], Arity=3, Str=<L,A><S,U><S,U>]
        $s$wgo_s4rk _ [Occ=Dead]
                    (sc1_s4ri :: GHC.Prim.Int#)
                    (sc2_s4rh :: GHC.Prim.Int#)
          = jump $s$wgo2_s4rc sc1_s4ri sc2_s4rh;
        $s$wgo1_s4rg
          :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
        [LclId[JoinId(3)], Arity=3, Str=<S,U><S,U><S,U>]
        $s$wgo1_s4rg (sc_s4rf :: GHC.Prim.Int#)
                     (sc1_s4re :: GHC.Prim.Int#)
                     (sc2_s4rd :: GHC.Prim.Int#)
          = jump $s$wgo_s4rk sc_s4rf sc1_s4re (GHC.Prim.+# sc2_s4rd sc_s4rf);
        $s$wgo2_s4rc [Occ=LoopBreaker]
          :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
        [LclId[JoinId(2)], Arity=2, Str=<S,U><S,U>, Unf=OtherCon []]
        $s$wgo2_s4rc (sc_s4rb :: GHC.Prim.Int#) (sc1_s4ra :: GHC.Prim.Int#)
          = case GHC.Prim.<# sc_s4rb ww_s4pl of {
              __DEFAULT -> sc1_s4ra;
              1# ->
                jump $s$wgo1_s4rg
                  (GHC.Prim.*# sc_s4rb sc_s4rb) (GHC.Prim.+# sc_s4rb 1#) sc1_s4ra
            }; } in
      jump $s$wgo2_s4rc 1# 0#

This is just a stone's throw (or rather an inlining pass) away from the goal!

For the examples ex{2,3} with increasing nesting levels, things don't look so bright yet. Also, I've got some cleanup work to do tomorrow.

comment:20 Changed 21 months ago by sgraf

A quick update: ex1 is optimized to code similar to the above, but without resorting to a forced SPEC now. For example, inference of ArgOccs is much better, because it now looks at occs from specialised RHSs (specialising for lambdas gives rise to new occs). This entailed a rewrite of the spec loop. Also, I had to pass on occs from recursive calls to achieve something like the static argument transformation.
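
(For reference, a tiny, hypothetical example of the static argument transformation mentioned above; mapList and satMapList are made-up names:)

-- 'f' is a static argument: it is passed unchanged to every recursive call.
mapList :: (a -> b) -> [a] -> [b]
mapList f []     = []
mapList f (x:xs) = f x : mapList f xs

-- After the static argument transformation the loop no longer passes 'f'
-- around, which makes it easier to specialise/inline for a particular 'f'.
satMapList :: (a -> b) -> [a] -> [b]
satMapList f = go
  where
    go []     = []
    go (x:xs) = f x : go xs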

I'll write things up in a wiki page once I'm done trying to optimize ex{2,3}. This is the code currently generated for ex1:

-- RHS size: {terms: 41, types: 23, coercions: 0, joins: 3/3}
Main.$wex1 [InlPrag=NOINLINE] :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Caf=NoCafRefs, Str=<S,U>, Unf=OtherCon []]
Main.$wex1
  = \ (ww_s4m8 :: GHC.Prim.Int#) ->
      joinrec {
        $s$wgo_s4or
          :: GHC.Prim.Int#
             -> GHC.Prim.Int# -> GHC.Prim.Int# -> Bool -> GHC.Prim.Int#
        [LclId[JoinId(4)], Arity=4, Str=<S,U><S,U><L,A><L,A>]
        $s$wgo_s4or (sc_s4oq :: GHC.Prim.Int#)
                    (sc1_s4op :: GHC.Prim.Int#)
                    _ [Occ=Dead]
                    _ [Occ=Dead]
          = jump $s$wgo2_s4nk sc1_s4op sc_s4oq;
        $s$wgo1_s4o0
          :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
        [LclId[JoinId(3)], Arity=3, Str=<S,U><S,U><S,U>]
        $s$wgo1_s4o0 (sc_s4nZ :: GHC.Prim.Int#)
                     (sc1_s4nY :: GHC.Prim.Int#)
                     (sc2_s4nX :: GHC.Prim.Int#)
          = jump $s$wgo_s4or
              (GHC.Prim.+# sc2_s4nX sc_s4nZ) sc1_s4nY sc_s4nZ GHC.Types.False;
        $s$wgo2_s4nk [Occ=LoopBreaker]
          :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
        [LclId[JoinId(2)], Arity=2, Str=<S,U><S,U>, Unf=OtherCon []]
        $s$wgo2_s4nk (sc_s4nj :: GHC.Prim.Int#) (sc1_s4ni :: GHC.Prim.Int#)
          = case GHC.Prim.<# sc_s4nj ww_s4m8 of {
              __DEFAULT -> sc1_s4ni;
              1# ->
                jump $s$wgo1_s4o0
                  (GHC.Prim.*# sc_s4nj sc_s4nj) (GHC.Prim.+# sc_s4nj 1#) sc1_s4ni
            }; } in
      jump $s$wgo2_s4nk 1# 0#

-flate-dmd-anal gets rid of the superfluous $s$wgo_s4or, but what pass is responsible for contracting the recursive group into a single binding (by inlining until we hit the loopbreaker)?

Version 0, edited 21 months ago by sgraf

comment:21 Changed 16 months ago by maoe

Cc: maoe added