The ExpPatFrame Parser Refactoring

This page outlines a new design for the expression/pattern parser and the motivation for it.

The Problem

There are places in the grammar where we do not know whether we are parsing an expression or a pattern without infinite lookahead (which we do not have in happy):

  1. View patterns:
         f (Con a b     ) = ...  -- 'Con a b' is a pattern
         f (Con a b -> x) = ...  -- 'Con a b' is an expression
  1. do-notation:
         do { Con a b <- x } -- 'Con a b' is a pattern
         do { Con a b }      -- 'Con a b' is an expression
  1. Guards:
         x | True <- p && q = ...  -- 'True' is a pattern
         x | True           = ...  -- 'True' is an expression
  1. Top-level value/function declarations (FunBind/PatBind):
         f !a         -- TH splice
         f !a = ...   -- function declaration
    Until we encounter the = sign, we don't know if it's a top-level TemplateHaskell splice where ! is an infix operator, or if it's a function declaration where ! is a strictness annotation.

The approach GHC uses is to parse patterns as expressions and rejig later. This turns out to be suboptimal:

  • We can't handle corner cases. For instance, the following function declaration LHS is not a valid expression (see #1087):
          !a + !b = ...
  • There are points in the pipeline where the representation is awfully incorrect. For instance,
          f !a b !c = ...
    is first parsed as
          (f ! a b) ! c = ...
  • We have to extend HsExpr with pattern-specific constructs: EAsPat, EViewPat, ELazyPat, etc. It isn't particularly elegant and we don't want such constructors to show up in GHC API.

Backtracking with Parser Combinators

One might think we could avoid this issue by using a backtracking parser and doing something along the lines of try pExpr <|> pPat. I proposed this in a ghc-devs thread: The situation turned out to be more complex. As there can be patterns inside expressions (e.g. via case, let, do) and expressions inside patterns (e.g. view patterns), naive backtracking would be devastating to performance (asymptotically).

Common Structure

There are common syntactic elements of expressions and patterns (e.g. both of them must have balanced parentheses), and we can capture this common structure in an intermediate data type, ExpPatFrame:

data ExpPatFrame
  = FrameVar RdrName
    -- ^ Identifier: Just, map, BS.length
  | FrameIPVar HsIPName
    -- ^ Implicit parameter: ?x
  | FrameOverLabel FastString
    -- ^ Overloaded label: #label
  | FrameTuple [LTupArgFrame] Boxity
    -- ^ Tuple (section): (a,b) (a,b,c) (a,,) (,a,)
  | FrameViewPat LExpPatFrame LExpPatFrame
    -- ^ View pattern: e -> p
  | FrameTySig LExpPatFrame (LHsSigWcType GhcPs)
    -- ^ Type signature: x :: ty
  | FrameAsPat LExpPatFrame LExpPatFrame
    -- ^ As-pattern: x@(D a b)

ExpPatFrame is a union of all syntactic elements between expressions and patterns, so it includes both expression-specific constructs (e.g. overloaded labels) and pattern-specific constructs (e.g. view/as patterns).

As soon as we have parsed far enough to decide whether it is an expression or a pattern, we can convert to HsExpr or HsPat accordingly. In the future we might want to extend 'ExpPatFrame' (expression/pattern frame) to ExpPatTyFrame (expression/pattern/type frame), because with Dependent Haskell (or even smaller features, such as visible dependent quantification in terms) we will have types inside expressions and expressions inside types.

The nice thing about having a dedicated type such as ExpPatFrame is keeping parsing concerns local to the parser: no matter what hacky intermediate structures we add to ExpPatFrame, we can keep HsExpr and HsPat clean.

Trees that Grow

During the discussion of Phab:D5408, an alternative plan came up: create a new GHC pass, GhcPrePs, and extend HsExpr GhcPrePs with pattern-specific and command-specific constructors. Then disambiguate during the conversion from GhcPrePs to GhcPs.

The reason this design does not work is that some parts of HsExpr should be disambiguated much sooner than we parse an expression in its entirety. We have:

data HsExpr p
  | HsLet ...
                (LHsLocalBinds p)

data Match p body
  = Match {
        m_pats :: [LPat p],

data GRHSs p body
  = GRHSs {
      grhssLocalBinds :: LHsLocalBinds p

data StmtLR idL idR body
  | BindStmt
             (LPat idL)

Imagine we're parsing a HsExpr GhcPrePs – it will contain LHsLocalBinds GhcPrePs and LPat GhcPrePs. Converting them to GhcPs is extra code and extra runtime – we don't want that. Instead, in ExpPatFrame we store LHsLocalBinds GhcPs and LPat GhcPs in corresponding places. Therefore, ExpPatFrame does not constitute a proper pass: we pre-parse little fragments that store GhcPs subtrees and then convert these fragments to HsExpr GhcPs, HsPat GhcPs, or HsPat GhcPs.

Embedding Common Terms

We (simonpj, int-index, goldfire, and I (Shayan)) had an email discussion to overview the design space and the results are documented as the following.

0) The process of parsing includes two subprocesses of parsing ambiguous bits into small in-between datatypes (in linear time), and then resolving ambiguities and converting (in linear time) from the in-between datatypes to the relevant HsSyn types of GhsPs pass.

1) Q) What are the other, besides the ones involved D5408, noticeable ambiguities that could benefit from an in-between datatype?

A)There are [other] ambiguities when parsing data declarations, resolved them with an intermediate data type, too, in D5180.

For example the following definition of TyEl:

data TyEl = TyElOpr RdrName | TyElOpd (HsType GhcPs)
          | TyElTilde | TyElBang
          | TyElUnpackedness ([AddAnn], SourceText, SrcUnpackedness)
          | TyElDocPrev HsDocString

2) I proposed to consider the possibility of defining a common core "pre-expression" where expressions, patterns and commands are made of.

It would possibly help to define the three as a composition of the common core and the bits specific them. Then we can have declarations similar to

data HsTerm trm
   = ... -- the common term constructs as in your `ExpPatFrame`

data HsExpr x
  = TermE (HsTerm  (HsExpr x))
  | ... -- the other constructors specific  to expressions

data Pat x
  = TermP (HsTerm (Pat x))
  | ...

... -- same for commands

There are different possible variations to above, but this is essentially to say

HsExpr = Common Core + Other expression-specific constructs

Pat = Common Core + Other pattern-specific constructs

(Notice one can split things up all the way to achieve one datatype-per-constructor, but a couple of years ago we've found it overkilling)

We identified the following pros and cons.

Pros are that there will be fewer constructors, and more importantly, parts that look the same may be pretty-printed (or possibly renamed, desugared, ...) the same.

Cons are that there will be one more matching to go under TermX constructor and such nesting has a (possibly negligible) performance impact.

Minimizing ExpPatFrame

We'd like to keep ExpPatFrame as small as possible. It means that instead of duplicating all of HsExpr and HsPat constructors in it, we'd rather embed them directly when unambiguous. For example, patterns cannot contain if, case, or do, so we'd rather have this:

data ExpPatFrame
  = ...
  | ...
  | FrameExpr (HsExpr GhcPs)

than this:

data ExpPatFrame
  = ...
  | ...
  | ...
  | FrameIf LExpPatFrame LExpPatFrame LExpPatFrame
    -- ^ If-expression: if p then x else y
  | FrameMultiIf [LFrameGRHS]
    -- ^ Multi-way if-expression: if | p = x \n | q = x
  | FrameCase LExpPatFrame [LFrameMatch]
    -- ^ Case-expression: case x of { p1 -> e1; p2 -> e2 }
  | FrameDo (HsStmtContext Name) [LFrameStmt]
    -- ^ Do-expression: do { s1; a <- s2; s3 }

data FrameStmt
  = ...
  | ...
  | ...
  | FrameBindStmt (LPat GhcPs) LExpPatFrame
    -- ^ Binding statement: p <- e
  | FrameBodyStmt LExpPatFrame
    -- ^ Body statement: e
  | FrameLetStmt (LHsLocalBinds GhcPs)
    -- ^ Let statement: let p = t

Unfortunately, while do or let cannot be used in patterns, they can be used in commands, so we end up duplicating most of HsExpr constructors for the sake of HsCmd. If not for this, we'd be able to make ExpPatFrame smaller.

Nevertheless, in the final iteration we will include constructors for unambiguous cases:

data ExpPatFrame
  = ...
  | ...
  | FrameExpr (HsExpr GhcPs) -- unambiguously an expression
  | FramePat (HsPat GhcPs) -- unambiguously a pattern
  | FrameCommand (HsCmd GhcPs) -- unambiguously a command

Hopefully, this will allow us to remove at least some duplication, even if not much.

Implementation Plan

Initially I (int-index) tried to do the refactoring and fix #1087 simultaneously, and I have a semi-working proof-of-concept passing (almost all) tests:

This prototype also revealed that doing the ExpPatFrame refactoring allows for better error messages in some cases. Compare:

GHCi, version 8.4.3:  :? for help
Prelude> foo# = ()

<interactive>:1:6: error:
   parse error on input '='
   Perhaps you need a 'let' in a 'do' block?
   e.g. 'let x = 5' instead of 'x = 5'


GHCi, version 8.7.20181026:  :? for help
Prelude> foo # = ()

<interactive>:1:5: error:
   Operator is missing a right-hand side argument: #
Prelude> foo# = ()

<interactive>:2:4: error:
   Operator is missing a right-hand side argument: #
   Perhaps you meant to enable MagicHash?

However, after a few hellish rebases, I decided to split this effort into many smaller patches. Here's the new plan:

  1. Introduce ExpPatFrame with as little churn as possible.
  2. Use it to clean up the definition of HsExpr, removing EAsPat, EWildPat, EViewPat, ELazyPat. Perhaps also HsArrApp and HsArrForm.
  3. Use it as the basis for fixing #1087
  4. Investigate its viability for term/type parser unification
Last modified 8 months ago Last modified on Dec 10, 2018 10:54:36 PM