Opened 21 months ago

Last modified 10 months ago

#14461 new task

Reuse free variable lists through nested closures

Reported by: tdammers Owned by: alexbiehl
Priority: normal Milestone:
Component: Compiler Version: 8.2.1
Keywords: CodeGen Cc: simonmar, dfeuer, andreask, sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: 7258 Differential Rev(s):
Wiki Page: NestedClosures

Description

Consider NestedClosures and 7258; essentially, deeply nested closures exhibit quadratic compiler performance due to the fact that when allocating registers, each nesting level will have the compiler unpack the entire parent closure and then re-pack the variables into the child closure.

To solve this, check if the parent closure can be carried along wholesale, and pull variables from there so that the repackaging can be bypassed.

Change History (20)

comment:2 Changed 21 months ago by tdammers

Owner: set to alexbiehl

comment:3 Changed 21 months ago by alexbiehl

Transforming the free variables in the codegen (as done in the mentioned patch) remains hairy.

I have another plan: I will introduce a new data GenStgFreeVar occ = StgFreeVar occ [occ] type and calculate the sharing in CoreToStg as we have all the relevant data available and in good shape anyway.

comment:4 Changed 21 months ago by simonpj

I have another plan: I will introduce a new data GenStgFreeVar occ = StgFreeVar occ [occ] type and calculate the sharing in CoreToStg as we have all the relevant data available and in good shape anyway.

I think that's a great plan: see NestedClosures under "Implementation".

If you could describe the moving parts of your plan in the wiki page, it'd help us to offer you support and feedback.

comment:5 in reply to:  4 Changed 21 months ago by alexbiehl

Replying to simonpj:

I think that's a great plan: see NestedClosures under "Implementation".

Ups, I didn't see that paragraph! I will write down my plan there.

I have another thing, somewhat related: If we find something like

  case e of x {
    T a b c -> 
      let [a b c] f = \[] -> ... 
  }

Why not make it

  case e of x { -- evaluate e to not change strictness
    T _ _ _ -> 
      let [x] f = \[] -> 
        case x of 
          T a b c -> ...
  }

This could even be done at Core level I suppose.

Last edited 21 months ago by alexbiehl (previous) (diff)

comment:6 Changed 21 months ago by simonpj

Yes you could do that in Core, but the simplifier would immediately undo it by eliminating the inner case.

However, in our proposed STG-level transformation, I think we could indeed transform

  case e of x {
    T a b c -> 
      let [a b c] f = \[] -> ... 
  }

to

  case e of x {
    T _ _ _ -> 
      let [x{a,b,c}] f = \[] -> ... 
  }

using the notation on the wiki page.

I'm beginning to think that it might be easier to do this free-var-finding stuff as a STG-to-STG pass with a single purpose, rather than as part of Core-to-STG.

comment:7 in reply to:  6 Changed 21 months ago by alexbiehl

Replying to simonpj:

I'm beginning to think that it might be easier to do this free-var-finding stuff as a STG-to-STG pass with a single purpose, rather than as part of Core-to-STG.

Seems reasonable. I will take that into consideration.

Last edited 21 months ago by alexbiehl (previous) (diff)

comment:8 Changed 21 months ago by alexbiehl

I have hit a problem. Using ticket #7258 as a base line I figured that most of the closures we generate during Read/Show deriving are updatable thunks:

let [fv1 fv2 ... fvn] outer = \u []  -- 'u' for updatable
  let [fv1 fv2 ... fvn] inner = \u []
     ... 

Applying the NestedClosure idea gets us:

let [fv1 fv2 ... fvn] outer = \u []
  let [outer{fv1 fv2 ... fvn}] inner = \u []
     ... 

Now, when forcing outer we push an update frame which overwrites outers closure to be an indirection to the resulting value hereby turning the free variables effectively into garbage. We now have no safe way to access outers free variables from inner.

I will try to come up with a new scheme to solve this more generally for all kind of closures.

Last edited 21 months ago by alexbiehl (previous) (diff)

comment:9 Changed 21 months ago by alexbiehl

Ok, compiling with optimizations somewhat reduces the amount of updatable thunks. Possibly because of demand analysis.

We must be careful to not share updatable thunks. I have to measure the impact but I suspect this will greatly reduce the effect of this transformation and we should think about a different scheme.

comment:10 Changed 21 months ago by simonpj

Now, when forcing outer we push an update frame which overwrites outers closure to be an indirection to the resulting value hereby turning the free variables effectively into garbage.

Ah! That's an excellent point, one that I had totally neglected.

Possibilities:

  • Step 1. Don't do this for thunks. We have no data about what the troublesome cases are, but the one case we do know about has functions.
  • Step 2. Still don't do the this for thunks, but
    let k1 = [fv1, ..., fvn] \ []
             let k2 = [fv1, ..., fvn] \ [] -> ...
             in ...
    in ...
    
    ===>
    
    let t = (fv1, ..., fvn) in
    let k1 = [t{fv1, ..., fvn}] \ [] -> 
             let k2 = [t{fv1, ..., fvn}]] \ [] -> ...
             in ...
    in ...
    
    NB: here I am allowing the t{fv1,..,fvn} as a way to access the free vars of a data constructor, as well as a function closure. I don't think this should be hard to arrange. And it might be useful anyway.

Now the shared bit is a separate tuple, which won't get updated.

One thing that would help would be some data on how common this kind of thing is. Even deciding what to measure is non-obvious. For example:

  • For each STG let-binding
         let x = [fv1..fvn] \ argsx -> ex
    
    How often is it the case that there is an enclosing binding
         let y = [gv1,...gvm] \ argsy -> ey
    
    where the [gv1,..gvm] is a subset of [fv1,..fvn]? And what is the distribution of savings?

By "enclosing" here, in the RHS of a let, I include the let-binding itself, even if it is non-recursive. All the examples above have this form.

Some data along these lines would be extremely helpful in guiding what we do.

Another variant: currently we insist that [gv1,..gvm] is a subset of [fv1,..fvn], to avoid space leaks. But with some RTS support we could do better: the closure for x could point to the closure for y, but also have a bitmap (in x's info table) to say which of y's free vars are actually used by x. That would open up the possibility for much more flexible sharing, at the cost of more bitmaps etc. E.g.

x = [fv1,...fv100] \ [a] ->
    let y = [fv1,...fv99, a] \ [] -> ...
    in ...

Here y needs all but one of x's free vars. Data on how often this happens would be would be jolly useful.

comment:11 in reply to:  10 Changed 21 months ago by alexbiehl

I hacked up a POC of this transformation (it excludes updatable thunks from consideration). I had a quick (non-scientific) glance over the amount of CMM code generated for W2.hs (example from #7258):

vanilla ghc ~380000 LOC
hacked  ghc ~270000 LOC

I noticed a lots of small, nested closures which only needed a fraction of the free variables of their outer closures and hence didn't benefit from the transformation.

Some data along these lines would be extremely helpful in guiding what we do.

I will instrument my GHC to collect some data to guide us from here.

Another variant: currently we insist that [gv1,..gvm] is a subset of [fv1,..fvn], to avoid space leaks. But with some RTS support we could do better: the closure for x could point to the closure for y, but also have a bitmap (in x's info table) to say which of y's free vars are actually used by x. That would open up the possibility for much more flexible sharing, at the cost of more bitmaps etc. E.g.

x = [fv1,...fv100] \ [a] ->
    let y = [fv1,...fv99, a] \ [] -> ...
    in ...

Sound reasonable. Let's see what the data shows..

comment:12 Changed 21 months ago by dfeuer

As discussed today, the plan in comment:10 seems to offer the ability to deal also with the case where several, but not all, of the free variables are reused.

comment:13 Changed 21 months ago by simonmar

Cc: simonmar added

comment:14 Changed 20 months ago by dfeuer

Cc: dfeuer added

comment:15 Changed 12 months ago by bgamari

Type of failure: None/UnknownRuntime performance bug

comment:16 Changed 12 months ago by simonpj

@alexbiehl your POC in comment:11 looks so promising that it seems a shame to let this lie. Do you (or anyone else) have time to carry it through?

comment:17 Changed 12 months ago by simonpj

Cc: dfeuer. andreask added; dfeuer removed

AndreasK: maybe you'd be interested?

Last edited 12 months ago by simonpj (previous) (diff)

comment:18 Changed 12 months ago by potato44

Cc: dfeuer added; dfeuer. removed

comment:19 Changed 11 months ago by simonpj

Keywords: CodeGen added

comment:20 Changed 10 months ago by sgraf

Cc: sgraf added
Note: See TracTickets for help on using tickets.