# Late lambda lifting

## Overview

Late Lambda Lifting is a not-very-well-explored optimisation for GHC. The basic idea is to transform

f x = let g y = ...y...x... in ...(g v1)...(g v2)... ===> g' x y = ...y...x... g x = ...(g' x v1)...(g' x v2)...

That is, pretty standard lambda-lifting. The idea is that instead of *allocating a closure* for `g`

, we pass extra parameter(s) to it. More below, under "Background".

Since heap allocation is expensive, this has the possibility of making programs run faster. The challenge is all about getting consistent speedups.

There is a ticket to track progress: #9476. There's a paper in the making at https://github.com/sgraf812/late-lam-lift/blob/master/paper.pdf.

## Tickets

Use Keyword = `LateLamLift`

to ensure that a ticket ends up on these lists.

**Open Tickets:**

**Closed Tickets:** NB: closed tickets may be closed as duplicates; but may still have very illuminating test cases.

## Implementation

The implementation is spread over three modules:

StgLiftLams ├── Analysis ├── LiftM └── Transformation

`Transformation`

contains the actual lambda-lifting implementation on STG
programs. For the decision of whether or not to lift a binding, it calls out
into `Analysis.goodToLift`

. The transformation produces and ultimately runs
effects in `LiftM`

, which nicely encapsulates nitty gritty details such as
dealing with substitution, conjuring fresh binders, cost centre stacks and
caffyness.

### Flags

Late lambda lifting is configurable via these flags:

data GeneralFlag = ... | Opt_StgLiftLam -- ^ Enable late lambda lifting. -- -fstg-lift-lams -- Enabled by default in -O*. ... data DynFlags = DynFlags { ... liftLamsRecArgs :: Maybe Int, -- ^ Maximum number of arguments after lambda lifting a -- recursive function. -- -fstg-lift-lams-rec-args liftLamsNonRecArgs :: Maybe Int, -- ^ Maximum number of arguments after lambda lifting a -- non-recursive function. -- -fstg-lift-lams-nonrec-args liftLamsKnown :: Bool, -- ^ Lambda lift even when this turns a known call -- into an unknown call. -- -fstg-lift-lams-known ... }

## Short wrapup on history

The original work by Nicolas Frisby started out in 2013 as part of his internship on the `wip/llf`

branch.
Sebastian Graf has rebased this branch in mid April 2018. After some debugging
and fixups, it passed `./validate`

(modulo some compiler perf tests) in
c1f16ac.
The most recent variant of Nicolas' original Core transformation can be found
here: https://github.com/sgraf812/ghc/tree/llf.

In July 2018, Sebastian argued that it's probably a good idea to reimplement the transformation on STG instead of Core, the promising implementation of which is available here.

If you want to know more about the original work on Core, look into the history of this Wiki page (before October 2018).

## Background

The primary objective of late lambda lifting is to replace heap allocation of closures by an increase in the number of arguments passed, by floating local functions to top-level.

The following examples aren't strictly in STG, so assume that all non-trivial arguments are given a name and all closures are properly annotated with their free variables.

Consider lambda-lifting `f`

out of `before`

.

before a b = let f x = RHS[f, b, x] in BODY[f, a, b]

Lifting `f`

out of `before`

would yield:

$lf b x = RHS[$lf b, b, x] after a b = BODY[$lf b, a, b]

Note that all occurrences of `f`

in `RHS`

and `BODY`

have been
replaced by the entire expression.

`before1`

(an instance of `before`

) is the basic motivating case for
the LLF.

before1 a b c = let f1 x y = (b + x) + (c + y) in f1 a b * f1 c a

$lf1 b c x y = (b + x) + (c + y) after1 a b c = $lf1 b c a b * $lf1 b c c a

In this case, `after1`

allocates less on the heap than does
`before1`

. `before1`

allocates a closure for `f1`

and then enters it
twice. `after1`

does no such allocation, but it does pass more arguments
when entering the statically allocated `$f1`

. If `before1`

is called
numerous times, then this transformation could drastically reduce the
overall program allocation. The run-time effect of the additional
arguments may be negligible.

## When to lift

The considerations of this section can be found in `StgLiftLams.Analysis`

.

### Why lambda-lift late?

We lift late in the pipeline primarily because lambda-lifting is known to drastically interfere with many of the other optimizations in the pipeline.

Ultimately, we lambda-lift at the end of the STG pipeline, just before
code gen. In fact, we like to think about LLF as a *code gen strategy*
that turns closures into extra arguments *when doing so is viable*.

### Syntactic Consequences of Lifting

Lifting a dynamic function closure has four immediate consequences.

C1It eliminates the let.

C2It creates a new top-level definition.

C3It replaces all occurrences of the lifted function in the let's body with a (partial) application. E.g., all occurences of`f1`

were replaced by`$lf1 b`

.

C4All non-top-level variables that occured in the let's RHS become occurrences of parameters.

### Operational Consequences of Lifting

The resulting operational consequences can be good or bad, both statically and
dynamically. These consequences finally lead to a number of heuristics,
implemented in `StgLiftLams.Analysis.goodToLift`

.

- C1 reduces heap allocation, unless the function is let-no-escape (LNE), which don't allocate to begin with. Consequently, we don't lift LNEs.

- C3 might increase heap allocation and run-time by creating PAPs if
the lifted function was previously a function argument. While this won't
*always*increase allocation, it certainly will only improve allocations if this pushes allocation away from the hot path into some cold leaf. We don't allow this, because it also significantly complicates the transformation for arguably very questionable gain.

- C3 might increase or decrease heap allocation. If
`f`

occurs in a closure in the let body, then

A) it can increase heap allocation, if the additional arguments to

`f`

did not previously occur in that closure

B) it can decrease heap allocation, if those additional arguments did already occur in the closure.

We call the sum result of A and B

closure growth. Estimating closure growth is paramount to achieving unambiguously good results. See "Closure Growth" below.

- C4 might increase stack allocation and/or increase register pressure, depending on how many arguments the new top-level function function would take. We determined 5 extra arguments (the number of available free argument registers on x86_64) to be a sweet spot for cut off.

- C4 might cause a slowdown because it can convert known calls into unknown calls when the lifted function previously closed over a function. We don't perform the lift in this case.

Finally, we obviously only lift functions, no closures or constructor applications.

### Closure Growth

Both A and B above could hold within the same let body. Moreover, A or B could hold wrt a closure that is under a (multi-shot) lambda; so that the changes in heap allocation could happen 0 or billions of times. The conservative choice, then, is that A under a lambda infinitely increases the heap allocation and B under a lambda is ignored.

We look at demand information for a more precise, yet still conservative estimate: Usage information can be harnessed to identify one-shot bodies for A under lambdas, while strictness information can be used to propagate the benefits of B under lambdas.

`StgLiftLams.Analysis.closureGrowth`

estimates closure growth, operating
on a *skeleton* of the actual STG tree. Skeletons are an abstraction, only
capturing the relevant syntactic constructs. E.g., closures with their fvs,
RHSs with relative cardinality, sequential composition, choice and a neutral
element.
This is so that not every lifting decision has to traverse the entire syntax
tree, while also keeping `closureGrowth`

plain and simple. To this purpose, each
let binder of the STG syntax tree is tagged with the Skeleton of its RHS.

From the resulting closure growth, the effects of C1 are subtracted to determine whether the lift would be beneficial.