Opened 5 years ago

Closed 5 years ago

#9628 closed feature request (fixed)

Add Annotations to the AST to simplify source to source conversions

Reported by: alanz Owned by: alanz
Priority: normal Milestone:
Component: Compiler Version: 7.9
Keywords: Cc: ndmitchell@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): D297
Wiki Page:

Description (last modified by alanz)

Add Annotations to HsSyn to explicitly track the locations of all non-blank source code elements to allow tools to parse a Haskell file, modify the AST and then produce an updated version of the source preserving the layout for unchanged parts.

Discussion of the feature is at GhcAstAnnotations

Note: an early effort was at Phab:D246, but this was abandoned as unworkable.

Change History (43)

comment:1 Changed 5 years ago by alanz

Owner: set to alanz

comment:2 Changed 5 years ago by simonpj

Alan writes (in email): I have decided to first tackle adding a type parameter to the entire AST, so that tool writers can add custom information as required. My first stab at this is to do is as follows

data HsModule r name
  = HsModule {
      ann :: r, -- ^ Annotation for external tool writers
      hsmodName :: Maybe (Located ModuleName),
        -- ^ @Nothing@: \"module X where\" is omitted (in which case the next
        --     field is Nothing too)
      hsmodExports :: Maybe [LIE name],
     ....

Salient points

  1. It comes as the first type parameter, and is called r
  2. It gets added as the first field of the syntax element
  3. It is always called ann

Before undertaking this particular change, I would appreciate some feedback.

comment:3 Changed 5 years ago by simonpj

Whoa! There appear to be two quite separate things going on here:

  • In GhcAstAnnotations the issue is attaching a SrcSpan to keywords
  • In comment:1 the issue appears to be about attaching an annotation of client-specified type to every node in the tree.

These seem quite orthogonal to me.

For the latter I would suggest looking at the Located type, instead of what you suggest in comment:1. The Located type is wrapped around almost every node in the tree, and if you want to add some ubiquitous annotation type, it would be the place to do so.

Simon

comment:4 Changed 5 years ago by alanz

Firstly, I agree that this is in fact two orthogonal things, but I think it is good to discuss them together because some potential solutions could couple them together.

Located is something that I considered for the annotation parameter, but I am concerned that it is so baked in to everything else that if it changed it would cause other unforeseen problems.

In order to use it in this way it would have to become a parameter to the AST too, effectively replacing all instances of Located with GenLocated.

Reference:

data GenLocated l e = L l e
  deriving (Eq, Ord, Typeable, Data)

type Located e = GenLocated SrcSpan e

comment:5 Changed 5 years ago by simonpj

Yes, that's right. Looks much nicer to me. The syntax tree is already heavily decorated with SrcSpans. Just parameterise over that, and you can decorate with something else instead.

An alternative would be to insist that there was always a SrcSpan, plus perhaps something else:

data GenLocated l e = L SrcSpan l e
type Located e = GenLocated () e

That's tiresome because there are lots of () values, but does mean you can always find a SrcSpan. One would have to explore use cases.

Simon

comment:6 Changed 5 years ago by alanz

I was considering requiring the annotation to be an instance of a class that always returned a SrcSpan, basically move

getLoc :: GenLocated l e -> l

into

class ApiAnnotation l where
    getLoc :: GenLocated l e -> SrcSpan
    annNoLoc :: e -> GenLocated l e
Last edited 5 years ago by alanz (previous) (diff)

comment:7 Changed 5 years ago by alanz

Once the parameter is added to the hsSyn structures, there are two ways to work it forward.

  1. Work it in as a parameter throughout all the compiler passes
  2. Hard code it to SrcSpan for renamer onwards

Option 1 allows tool writers to add annotations and then submit for further processing, but sprinkles the parameter variable throughout the code base.

Option 2 provided the most value for tool developers.

Which road? I am leaning towards working it all the way through.

comment:8 Changed 5 years ago by alanz

On further reflection, option 1 will allow for example a smart editor which uses the annotated ParsedSource as a primary data structure and submits changes through for incremental renaming/type checking and eventual code generation.

This way the normal command line invocation will lock the annotation down as a SrcSpan, but it can be replaced all the way through when used for tooling.

comment:9 Changed 5 years ago by alanz

Question: can I rip hooks out of DynFlags, to be managed separately?

comment:10 Changed 5 years ago by simonpj

You'll need to be much more explicit before anyone is likely to venture an opinion. What is the impact of "managed separately"? What is the motivation for making a change at all?

comment:11 Changed 5 years ago by alanz

Differential Rev(s): D246

comment:12 Changed 5 years ago by simonpj

I'm afraid I'm very confused by this thread.

  • There are two different Phab tickets: Phab:D246 is linked to this ticket, but Phab:D297 (I believe) may supercede it. If so please let's redirect the "Differential revision" field of this ticket, and explicit mark the moribund one as moribund.
  • The wiki page GhcAstAnnotations does not appear to reflect any of the discussion. Indeed it appears to describe only the first bullet from comment:3
  • There has been quite a lot of traffic on ghc-devs that is not captured anywhere. That's fine: an email list is good for discussion. But my input bandwidth is low and struggle to make sense of it all. And the conclusions from the discussion may be useful.
  • Alan has posted a useful summary to Haskell Cafe, which isn't captured on a wiki anywhere.
  • Alan has done some work identifying users for the new features, and written some email notes about that; again this would be useful to capture.

I am too slow to take a big patch and try to reverse-engineer the thought process that went into it. Would be possible to update the wiki page (presumably GhcAstAnnotations) to state

  • The problem we are trying to solve
  • The user-visible (or at least visible-to-client-of-GHC-API) design
  • Other notes about the implementation.

Covering the larger picture about the GHC API improvements you are making (eg no landmines) would be helpful. Maybe you need more than one page.

I'm delighted you are doing this. But I don't want to throw a lot of code into GHC without a clear, shared consensus about what it is we are trying do to, and how we are doing it.

Thanks.

Simon (drowning in review requests) PJ

comment:13 Changed 5 years ago by alanz

Description: modified (diff)
Differential Rev(s): D246D297
Version: 7.8.37.9

comment:14 Changed 5 years ago by alanz

Description: modified (diff)

comment:15 Changed 5 years ago by alanz

I have updated the Wiki page to reflect the current status, including an overview of the current design https://ghc.haskell.org/trac/ghc/wiki/GhcAstAnnotations#design

I have also captured the summary status in GhcApiWorkInProgress.

comment:16 Changed 5 years ago by simonpj

Thanks for the wiki page -- extremely helpful.

I'm concerned about the proliferation of data types. As I read it you intend to have a new data type for each constructor of each data type in HsSyn. That's a LOT of new data types! And I bet you'll soon want Eq, Ord, Data instances for them as well as Typeable. Indeed you say

data AnnClassDecl = AnnClassDecl
        { aclassdecl_class   :: SrcSpan
        , aclassdecl_mwhere  :: Maybe SrcSpan
        , aclassdecl_mbraces :: Maybe (SrcSpan,SrcSpan) }
            deriving (Eq,Data,Typeable,Show)

I wonder if something simpler and more dynamically-typed might do. Suppose you had

lookupApiAnns :: Typeable value => ApiAnns -> SrcSpan -> String -> Maybe value

so that ApiAnns is really a map from (SrcSpan, String, TypeRep) to values, where TypeRep there is the TypeRep of the value. The String is the dynamic bit. Now you could say

processHsClassDecl :: ApiAnns -> LTyClDecl n -> ...
processHsClassDecl anns (L l (ClassDecl { ..} )) = r
  where
    Just kwd_loc = lookupAPiAnns anns loc "class-keyword" :: Maybe SrcSpan
    Just mb_loc  = lookupApiAnns anns loc "class-mwhere" :: Maybe (Maybe SrcSpan)
    ...

OK so you might type those strings in wrong -- but if you do the look up will fail.

I don't want this to sink under the sheer weight of gratuitous declarations.

Oh and you could use the same string in lots of places. e.g. "where-keyword" might be used in a number of constructs.

comment:17 Changed 5 years ago by NeilMitchell

Alan and I debated this point, and I was the one who suggested the many data types approach (although I wouldn't go as far as saying I advocate it - I'm somewhat neutral). My concern was that if you do the dynamic approach you have to document each one of those strings, and the documentation has to be correct or the users won't have a clue what is going on, and if you update it all your clients will break and no one will realise. The documentation will basically be as long as the data types, but won't be statically checked on either side, and will very likely end up being wrong.

Maybe a hybrid approach is to have location information for keywords like "of", "case", "where", "(" and ")". You can have a fixed static enumeration of keywords, so you have type safety, and every keyword is optionally in every AST. The number of keywords is relatively small, and which keyword is contained within each AST node doesn't require documenting (the lexical structure makes it pretty obvious). You end up with no requirement for fragile docs, one data type with a fairly small number of entries, and reasonable safety for both the producer and consumer.

comment:18 Changed 5 years ago by NeilMitchell

Cc: ndmitchell@… added

comment:19 Changed 5 years ago by alanz

I share Neil's concern about the fragility of the string lookup.

A user of this feature would have to keep track of the string key, as well as the shape of the returned structure used for the particular annotation, and would not know until runtime failure that there was a problem.

I think the hybrid approach where there is a specific enumerated type for the keyword/feature being looked up could be workable, but requires the API user to consult the parser.y.pp file to see exactly what fits in where.

From a user of the feature point of view, having a straightforward mapping from the constructor name to the annotation name which can then be interrogated via e.g. ghci is definitely beneficial.

This does require an explosion of types, but this is pretty much a once-off exercise. Is there a code size problem if many classes are derived for each one?

comment:20 Changed 5 years ago by NeilMitchell

Why does the user require consulting parser.y.pp? Things like the Let constructor obviously have a Let keyword and an In keyword. Things like instance declaration obviously have Instance and Where. They seem fairly predictable, and if you do want to comment it, that could be as short as "-- instance ... [where]". Or do you have in mind nodes which wouldn't be as clear, or places where the thing you are indexing isn't so obviously a keyword?

comment:21 Changed 5 years ago by simonpj

I don't have data, but people are already complaining about the amount of code generated by data type declarations #9669. Have you counted how many data constructors there are in HsSyn? It's a LOT.

It just feels like a sledgehammer to crack a nut.

Simon

comment:22 in reply to:  20 Changed 5 years ago by alanz

Replying to NeilMitchell:

Why does the user require consulting parser.y.pp? Things like the Let constructor obviously have a Let keyword and an In keyword. Things like instance declaration obviously have Instance and Where. They seem fairly predictable, and if you do want to comment it, that could be as short as "-- instance ... [where]". Or do you have in mind nodes which wouldn't be as clear, or places where the thing you are indexing isn't so obviously a keyword?

Some of the structures are nested quite deeply, an/or reused in multiple roles. But I think anyone working at the detail level of the annotations is going to be looking closely at the relationship between the source, the parse tree and the annotations anyway, so it should not be a problem.

comment:23 Changed 5 years ago by alanz

The multiple annotation approach seems promising, first stab results in

Annotation types

data AnnModule = AnnModule SrcSpan
            deriving (Eq,Data,Typeable,Show)

data AnnWhere = AnnWhere SrcSpan
            deriving (Eq,Data,Typeable,Show)

-- Should this be AnnBraces? The tokens are ITocurly / ITccurly
data AnnCurlies = AnnCurlies (SrcSpan,SrcSpan)
            deriving (Eq,Data,Typeable,Show)

data AnnSemi = AnnSemi SrcSpan
            deriving (Eq,Data,Typeable,Show)

data AnnComma = AnnComma SrcSpan
            deriving (Eq,Data,Typeable,Show)

-- | Pragma declaration, e.g. '{-# SOURCE' '#-}'
data AnnPragma = AnnPragma (SrcSpan,SrcSpan)
            deriving (Eq,Data,Typeable,Show)

Helper functions in parser

-- | Given a list of @Maybe annotation@, add the @Just@ ones to the
-- given location
ams :: Located a -> [Maybe (SrcSpan -> P ())] -> P (Located a)
ams a@(L l _) bs = (mapM_ (\a -> a l) $ catMaybes bs) >> return a

mj :: (Outputable a,Typeable a,Show a,Eq a)
   => (SrcSpan -> a) -> Located e -> Maybe (SrcSpan -> P ())
mj c l = Just (\s -> addAnnotation s (c (gl $l)))

mm :: (Outputable a,Typeable a,Show a,Eq a)
   => (SrcSpan -> a) -> Maybe SrcSpan -> Maybe (SrcSpan -> P ())
mm c Nothing = Nothing
mm c (Just l) = Just (\s -> addAnnotation s (c l))

mm2 :: (Outputable a,Typeable a,Show a,Eq a)
    => ((SrcSpan,SrcSpan) -> a) -> Maybe (SrcSpan,SrcSpan)
    -> Maybe (SrcSpan -> P ())
mm2 c Nothing = Nothing
mm2 c (Just (l1,l2)) = Just (\s -> addAnnotation s (c (l1,l2)))

Example rule in parser

module  :: { Located (HsModule RdrName) }
        : maybedocheader 'module' modid maybemodwarning maybeexports 'where' body
                {% fileSrcSpan >>= \ loc ->
                   ams (L loc (HsModule (Just $3) $5 (fst $ snd $7) (snd $ snd $7) $4 $1
                      ) )
                     [mj AnnModule $3, mj AnnWhere $6
                     ,mm2 AnnCurlies (fst $ fst $7),mm AnnSemi (snd $ fst $7)] }

comment:24 Changed 5 years ago by NeilMitchell

I had a slightly different design in mind.

data Ann = AnnModule
         | AnnWhere
         | AnnSemi
         | AnnOpen
         | AnnClose

Then you can basically have a function:

getAnnotation :: ASTNode -> Ann -> Maybe SrcSpan

You'd be down to only a single annotation. Things like Pragma could use AnnOpen for the {-# and AnnClose for the #-} bit. As long as you have a separate annotation constructor for each part of a pair, then everything is associated with a Maybe SrcSpan, and you don't have to pack the SrcSpan inside the Ann field.

comment:25 Changed 5 years ago by alanz

That could work, and it is a lot simpler. The stuff for paired SrcSpans just doubles everything up at the moment, as I work it through.

comment:26 Changed 5 years ago by simonpj

Yes, comment:24 seems better.

Simon

comment:27 Changed 5 years ago by alanz

I updated the work in progress on D297 with this approach, and it feels much better, and is less invasive.

comment:28 Changed 5 years ago by alanz

There is one other issue required for source to source conversions, namely original literal values.

When literals are parsed, they are converted into an internal representation, and the original string is discarded.

I am proposing to amend HsLit to be

data HsLit
  = HsChar          String Char               -- Character
  | HsCharPrim      String Char               -- Unboxed character
  | HsString        String FastString         -- String
  | HsStringPrim    String ByteString         -- Packed bytes
  | HsInt           String Integer            -- Genuinely an Int; arises from
                                              --     TcGenDeriv, and from TRANSLATION
  | HsIntPrim       String Integer            -- literal Int#
  | HsWordPrim      String Integer            -- literal Word#
  | HsInt64Prim     String Integer            -- literal Int64#
  | HsWord64Prim    String Integer            -- literal Word64#
  | HsInteger       String Integer  Type      -- Genuinely an integer; arises only from
                                              --   TRANSLATION (overloaded literals are
                                              --   done with HsOverLit)
  | HsRat           String FractionalLit Type -- Genuinely a rational; arises only from
                                              --   TRANSLATION (overloaded literals are
                                              --   done with HsOverLit)
  | HsFloatPrim     String FractionalLit      -- Unboxed Float
  | HsDoublePrim    String FractionalLit      -- Unboxed Double
  deriving (Data, Typeable)

where the first parameter in each case is the original source code represetation of the literal.

Should this be a String or a FastString?

comment:29 Changed 5 years ago by simonpj

I suggest doing so only if the two can differ. In the case of String there can be string gaps, thus

foo :: String
foo = "blah blah\
      \more blah blah\
      \and more"

and I guess you want to have all that layout reproduced. Fine. But for integers like 3234242329423, I don't see how the displayed form could differ.

For Words perhaps there is binary/hex forms?

Regardless, I'm not against this, but very keen that the reasons for keeping the two are documented on a per-literal basis, as I have begun to do above.

comment:30 Changed 5 years ago by simonpj

Well done for making progress. Some thoughts

  • If the patch is ready for review, is GhcAstAnnotations also fully up to date? Could you move any discussion of alternatives to the end, under "Other possible design alternatives" so that what remains is actually a description of the feature you propose, and a sketch of its implementation? I'm unsure about which bits of the wiki page are rejected ideas and which are the ones you adopted.
  • Floating around is also ExtraCommas. I think the two are somewhat orthogonal, right?
  • Does your design say where comments are? That is, can you really round-trip source code?

In particular, an excellent criterion could be: can you do Haddock this way? Currently Haddock has a lot of Haddock-specific fields in HsSyn. Could they all be replaced with annotations in your style? If not, what would take to make that possible? It would be highly cool; after all, Haddock may be privileged, but the more we can make it possible for others to do Haddock-like things without changing GHC itself, the better.

  • You outlined a number of "customers" in an earlier post. Would it be worth adding them to the wiki page?

Simon

comment:31 Changed 5 years ago by alanz

I have updated the GhcAstAnnotations page to reflect the current design, as well as to add a section outlining an approach to comments, which is https://ghc.haskell.org/trac/ghc/wiki/GhcAstAnnotations#comments

The comments have been added in Phab:D297, causing the file ApiAnnotations to be moved into Lexer.x

comment:32 Changed 5 years ago by simonpj

Thank you. Now I understand that all this is focused solely on the source-location of keywords. (Previously I thought that there might be other client-specific information involved.)

Could we then just take a moment to review the pros and cons of the proposed approach (the ApiAnns approach) compared with the earlier idea of just adding SrcSpan info to the HsSyn syntax tree itself. That is, like "Richard Eisenberg's response" and "SPJ response to concern re extra noise in AST" in the wiki page. I'll call that the "in-tree approach". Note that using record update syntax, a pass like the renamer would never need to explicitly match or mention the keyword location fields.

Advantages of the ApiAnns approach

  • Very non-invasive
  • Programs that don't need the information don't touch it; this seems like the strongest advantage

Advantages of the in-tree approach

  • The things you want are right there; no extra mappings to carry around, no lookups, no AnnColon2 stuff.
  • Seems simple and uniform with the existing source locations. Eg. if a keyword 'do' could so be mdo, you might have a Located Bool to indicate. But if it's just do then we'd get Located (), and then we drop it altogether. The in-tree story amounts to not dropping it.

This is really a matter for our customers. Alan (Mr HaRE) is one. Do we have other known customers for this feature? What do they say? Or don't they case?

comment:33 Changed 5 years ago by alanz

Summary of AZ understanding of the landscape

Pros and cons of current approach

Pros

Very non-invasive

Programs that don't need the information don't touch it; this seems like the strongest advantage

Can also bring out comments, tied to smallest enclosing SrcSpan having an annotation. The comments can be in Haddock mode, to aid tool processing. This makes use of the standard Opt_Haddock and Opt_KeepRawTokenStream settings.

Cons

Parser.y.pp is more complex

Actually using the annotations is more complex

Pros and cons of embedded annotations as per Richard Eisenberg

Pros

The things you want are right there; no extra mappings to carry around, no lookups, no AnnColon2 stuff.

Seems simple and uniform with the existing source locations. Eg. if a keyword 'do' could so be mdo, you might have a Located Bool to indicate. But if it's just do then we'd get Located (), and then we drop it altogether. The in-tree story amounts to not dropping it.

Simpler parser

cons

Breaks lots of existing code, both in GHC and tools that make use of the AST.

Other thoughts

In terms of HaRe, the AST mainipulation will happen via ghc-exactprint. For this, the annotations are used as a starting point for processing.

The mechanism intended in ghc-exactprint is to convert the absolute locations to relative locations, much as combinators in a pretty printer. This way when tooling modifies the AST, the output will honour the original layout as much as possible, given changes.

Examples

Lifting a declaration will preserve the layout of the original but dedented appropriately.

Changing the name of an identifier, resulting in a longer or shorter name, just shifts things around naturally.

     foo x = x + y where fn a = a + 2
                         y = fn x

becomes

     foo xlong = xlong + y where fn a = a + 2
                                 y = fn xlong
  • The annotations can be easily accessed via a traversal of the appropriate kind.
  • There is a middle ground, where each AST element that has annotations tied to it in the current system has a single additional field containing [(SrcSpan,Ann]).

Equally, the annotations retrieval function can provide all annotations attached to a given SrcSpan, via the current mechanism.

  • The most natural way to access the annotations is indirectly via a library such as HaRe or ghc-exactprint, as indicated by e.g. hlint, yi and IHaskell developers.

comment:34 Changed 5 years ago by alanz

Status report.

Using Phab:D297 applied to head, ghc-exactprint is able to fully reproduce the module header and exports of a haskell source file.

Working with the annotations feels natural, and it works.

Also, there has been no feedback, adverse or otherwise on wether this is the right approach.

Given that HaRe will be the primary user initially, I would be happy to go ahead with this.

comment:35 Changed 5 years ago by simonpj

Fair enough in general. I've added some comments on Phab

comment:36 Changed 5 years ago by alanz

Status report

Phab:D297 has been withdrawn, as it has been split up into

At this point Phab:D412 is set up to depend on Phab:D438 which in turn depends on Phab:D426

These are all ready for review

comment:37 Changed 5 years ago by alanz

I have updated Phab:D426, Phab:D438 and Phab:D412 to rebase against current master, and to work in the feedback received.

They are once more ready for review.

comment:38 Changed 5 years ago by alanz

Oops, not yet, need to document the annotations per AST element

comment:39 Changed 5 years ago by alanz

I have updated ​Phab:D426, ​Phab:D438 and ​Phab:D412 to include documentation about what annotations can be expected against each AST element, and also rebases against current master.

The process of documentation showed up a number of shortcomings where annotations were missing or duplicated.

I have also changed the getAnnotation result from Maybe SrcSpan to [SrcSpan] and allow multiple annotations of the same type on a given AST element. This means that the horrible AnnColon2 and similar are now gone.

The users of the annotations should be able to work out what is what from the surrounding context.

They are once more ready for final review and hopefully merge for 7.10

comment:40 Changed 5 years ago by Austin Seipp <austin@…>

In 7927658ed1dcf557c7dd78e4b9844100521391c8/ghc:

AST changes to prepare for API annotations, for #9628

Summary:
AST changes to prepare for API annotations

Add locations to parts of the AST so that API annotations can
then be added.

The outline of the whole process is captured here
https://ghc.haskell.org/trac/ghc/wiki/GhcAstAnnotations

This change updates the haddock submodule.

Test Plan: sh ./validate

Reviewers: austin, simonpj, Mikolaj

Reviewed By: simonpj, Mikolaj

Subscribers: thomie, goldfire, carter

Differential Revision: https://phabricator.haskell.org/D426

GHC Trac Issues: #9628

comment:41 Changed 5 years ago by Austin Seipp <austin@…>

In 803fc5db31f084b73713342cdceaed5a9c664267/ghc:

Add API Annotations

Summary:
The final design and discussion is captured at
https://ghc.haskell.org/trac/ghc/wiki/GhcAstAnnotations

This is a proof of concept implementation of a completely
separate annotation structure, populated in the parser,and tied to the
AST by means of a virtual "node-key" comprising the surrounding
SrcSpan and a value derived from the specific constructor used for the
node.

The key parts of the design are the following.

== The Annotations ==

In `hsSyn/ApiAnnotation.hs`

```lang=haskell
type ApiAnns = (Map.Map ApiAnnKey SrcSpan, Map.Map SrcSpan [Located Token])

type ApiAnnKey = (SrcSpan,AnnKeywordId)

-- ---------------------------------------------------------------------

-- | Retrieve an annotation based on the @SrcSpan@ of the annotated AST
-- element, and the known type of the annotation.
getAnnotation :: ApiAnns -> SrcSpan -> AnnKeywordId -> Maybe SrcSpan
getAnnotation (anns,_) span ann = Map.lookup (span,ann) anns

-- |Retrieve the comments allocated to the current @SrcSpan@
getAnnotationComments :: ApiAnns -> SrcSpan -> [Located Token]
getAnnotationComments (_,anns) span =
  case Map.lookup span anns of
    Just cs -> cs
    Nothing -> []

-- | Note: in general the names of these are taken from the
-- corresponding token, unless otherwise noted
data AnnKeywordId
         = AnnAs
         | AnnBang
         | AnnClass
         | AnnClose -- ^ } or ] or ) or #) etc
         | AnnComma
         | AnnDarrow
         | AnnData
         | AnnDcolon
         ....
```

== Capturing in the lexer/parser ==

The annotations are captured in the lexer / parser by extending PState to include a field

In `parser/Lexer.x`

```lang=haskell
data PState = PState {
        ....
        annotations :: [(ApiAnnKey,SrcSpan)]
        -- Annotations giving the locations of 'noise' tokens in the
        -- source, so that users of the GHC API can do source to
        -- source conversions.
     }
```

The lexer exposes a helper function to add an annotation

```lang=haskell
addAnnotation :: SrcSpan -> Ann -> SrcSpan -> P ()
addAnnotation l a v = P $ \s -> POk s {
  annotations = ((AK l a), v) : annotations s
  } ()

```

The parser also has some helper functions of the form

```lang=haskell
type MaybeAnn = Maybe (SrcSpan -> P ())

gl = getLoc
gj x = Just (gl x)

ams :: Located a -> [MaybeAnn] -> P (Located a)
ams a@(L l _) bs = (mapM_ (\a -> a l) $ catMaybes bs) >> return a
```

This allows annotations to be captured in the parser by means of

```
ctypedoc :: { LHsType RdrName }
        : 'forall' tv_bndrs '.' ctypedoc {% hintExplicitForall (getLoc $1) >>
                                            ams (LL $ mkExplicitHsForAllTy $2 (noLoc []) $4)
                                                [mj AnnForall $1,mj AnnDot $3] }
        | context '=>' ctypedoc         {% ams (LL $ mkQualifiedHsForAllTy   $1 $3)
                                               [mj AnnDarrow $2] }
        | ipvar '::' type               {% ams (LL (HsIParamTy (unLoc $1) $3))
                                               [mj AnnDcolon $2] }
        | typedoc                       { $1 }
```

== Parse result ==

```lang-haskell
data HsParsedModule = HsParsedModule {
    hpm_module    :: Located (HsModule RdrName),
    hpm_src_files :: [FilePath],
       -- ^ extra source files (e.g. from #includes).  The lexer collects
       -- these from '# <file> <line>' pragmas, which the C preprocessor
       -- leaves behind.  These files and their timestamps are stored in
       -- the .hi file, so that we can force recompilation if any of
       -- them change (#3589)
    hpm_annotations :: ApiAnns
  }

-- | The result of successful parsing.
data ParsedModule =
  ParsedModule { pm_mod_summary   :: ModSummary
               , pm_parsed_source :: ParsedSource
               , pm_extra_src_files :: [FilePath]
               , pm_annotations :: ApiAnns }
```

This diff depends on D426

Test Plan: sh ./validate

Reviewers: austin, simonpj, Mikolaj

Reviewed By: simonpj, Mikolaj

Subscribers: Mikolaj, goldfire, thomie, carter

Differential Revision: https://phabricator.haskell.org/D438

GHC Trac Issues: #9628

comment:42 Changed 5 years ago by Austin Seipp <austin@…>

In c0ad5bc03e02ce0d7d545599e4b1a68a6f727f2b/ghc:

Capture original source for literals

Summary:
Make HsLit and OverLitVal have original source strings, for source to
source conversions using the GHC API

This is part of the ongoing AST Annotations work, as captured in
https://ghc.haskell.org/trac/ghc/wiki/GhcAstAnnotations and
https://ghc.haskell.org/trac/ghc/ticket/9628#comment:28

The motivations for the literals is as follows

```lang=haskell
x,y :: Int
x = 0003
y = 0x04

s :: String
s = "\x20"

c :: Char
c = '\x20'

d :: Double
d = 0.00

blah = x
  where
    charH = '\x41'#
    intH = 0004#
    wordH = 005##
    floatH = 3.20#
    doubleH = 04.16##
    x = 1
```

Test Plan: ./sh validate

Reviewers: simonpj, austin

Reviewed By: simonpj, austin

Subscribers: thomie, goldfire, carter, simonmar

Differential Revision: https://phabricator.haskell.org/D412

GHC Trac Issues: #9628

comment:43 Changed 5 years ago by alanz

Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.