Changes between Version 1 and Version 2 of ApiAnnotations


Ignore:
Timestamp:
Jan 15, 2015 7:29:25 PM (5 years ago)
Author:
alanz
Comment:

First draft description of API Annotations as implemented

Legend:

Unmodified
Added
Removed
Modified
  • ApiAnnotations

    v1 v2  
    1010Rather than pollute the AST with information irrelevant to the actual
    1111compilation process, these locations are captured in the lexer /
    12 parser and returned as a separate structure `ApiAnns` structure in the
     12parser and returned as a separate `ApiAnns` structure in the
    1313`ParsedSource`.
    1414
    1515This keeps both the keyword location and the original comments with their
    16 locations.  WIP AZ, change of venue
     16locations.
    1717
    1818{{{
     
    2424}}}
    2525
    26 Each AST element that needs an annotation has an entry in this Map,
    27 which as a key comprising the SrcSpan of the original element and the
    28 TyeRep of the stored annotation, if it were wrapped in a Just.
     26Each AST element with annotations has an entries in this Map. The key comprises the SrcSpan of the original element and the AnnKeywordId of the stored annotation. The value is a list of SrcSpans where that particular keyword appears. This is a list to cater for e.g. ";;;;". Top level elements are captured against 'nullSrcSpan'.
     27
     28So for the source file "examples/Test.hs" having contents
     29
     30{{{#!haskell
     31-- |Comment not in a SrcSpan
     32foo x = -- Compute foo
     33  let a = 4 -- using a
     34  in a + x
     35}}}
     36
     37The returned ApiAnns are
     38
     39{{{#!haskell
     40([((examples/Test.hs:(2,1)-(4,10), AnnEqual), [examples/Test.hs:2:7]),
     41  ((examples/Test.hs:(2,1)-(4,10), AnnFunId), [examples/Test.hs:2:1-3]),
     42  ((examples/Test.hs:(2,1)-(4,10), AnnSemi),  [examples/Test.hs:5:1]),
     43  ((examples/Test.hs:(3,3)-(4,10), AnnIn),    [examples/Test.hs:4:3-4]),
     44  ((examples/Test.hs:(3,3)-(4,10), AnnLet),   [examples/Test.hs:3:3-5]),
     45  ((examples/Test.hs:3:7-11,       AnnEqual), [examples/Test.hs:3:9]),
     46  ((examples/Test.hs:3:7-11,       AnnFunId), [examples/Test.hs:3:7]),
     47  ((examples/Test.hs:4:6-10,       AnnVal),   [examples/Test.hs:4:8]),
     48  ((<no location info>, AnnEofPos),           [examples/Test.hs:5:1])],
     49
     50 [(examples/Test.hs:(2,1)-(4,10),
     51   [AnnLineComment "-- Compute foo"]),
     52  (examples/Test.hs:(3,3)-(4,10), [AnnLineComment "-- using a"]),
     53  (<no location info>,
     54   [AnnLineComment "-- |Comment not in a SrcSpan"])])
     55}}}
    2956
    3057This allows the annotation to be retrieved by
     
    3259{{{
    3360#!haskell
    34 -- | Retrieve an annotation based on the SrcSpan of the annotated AST
    35 -- element, and the known type of the annotation.
    36 getAnnotation :: ApiAnns -> SrcSpan -> Ann -> Maybe SrcSpan
    37 getAnnotation anns span ann = Map.lookup (AK span ann) anns
     61-- | Retrieve a list of annotation 'SrcSpan's based on the 'SrcSpan'
     62-- of the annotated AST element, and the known type of the annotation.
     63getAnnotation :: ApiAnns -> SrcSpan -> AnnKeywordId -> [SrcSpan]
     64getAnnotation (anns,_) span ann
     65   = case Map.lookup (span,ann) anns of
     66       Nothing -> []
     67       Just ss -> ss
    3868}}}
    3969
     
    6292  1. `AnnOpen` / `AnnClose` capture all bracketed structures.
    6393
    64   2. In the few places where the same annotation type can repeat, multiple indices are provided, hence `AnnColon` and `AnnColon2`.
    65 
    66   3. Where a value is being captured via e.g. `getINTEGER` the annotation index is called `AnnVal`.
     94  2. Where a value is being captured via e.g. `getINTEGER` the annotation index is called `AnnVal`.
    6795
    6896
     
    143171}}}
    144172
    145 === Implications ===
    146 
    147 This approach has minimal implications on the rest of GHC, except that some AST elements will require to be `Located` to enable the annotation to be looked up.
    148 
    149 Also, initial `./validate` tests show that haddock complains of increased memory usage, due to the extra information being captured in the AST. If this becomes a major problem a flag could be introduced when invoking the parser as to whether to actually capture the annotations or not.
    150173
    151174== Notes / Shortcomings ==
     
    169192}}}
    170193
    171 = [=#comments Possible] Extension for Comments =
    172 
    173 A possible extension to the above annotations would be to add comments as annotations, if enabled by the existing `Opt_KeepRawTokenStream` flag.
    174 
    175 This would add an additional structure to the `PState` for comments indexed by `SrcSpan`, and also a queue of pending comments.
    176 
    177 The basic mechanism would be to modify `lexToken` in `Lexer.x` to accumulate the comments in the queue as parsing progressed, and when the helper function `addComments` is called with a `SrcSpan` argument it moves any comments in the queue that fit within the `SrcSpan` into the comment annotation structure indexed by the given `SrcSpan`.
     194= Comments =
     195
     196Comments are returned as annotations, if enabled by the existing `Opt_KeepRawTokenStream` flag.
     197
     198There is an additional structure to the `PState` for comments indexed by `SrcSpan`, and also a queue of pending comments.
     199
     200The basic mechanism is for `lexToken` in `Lexer.x` to accumulate the comments in the queue as parsing progressed, and when the helper function `addComments` is called with a `SrcSpan` argument it moves any comments in the queue that fit within the `SrcSpan` into the comment annotation structure indexed by the given `SrcSpan`.
    178201
    179202These are then available for retrieval at the and, as with the existing annotations.
     
    1852082. It is possible to put the lexer into `Opt_Haddock` mode, a flag which is currently unset when `Opt_KeepRawTokenStream` is enabled. If these were made inclusive, it would be possible to explicitly tag the comments as being of type haddock, so that at a future date the annotations could possibly be used directly by haddock, rather than the very complicated parsing rules currently in place.
    186209
    187  
    188 = Early design discussion =
    189 
    190 == Richard Eisenberg response ==
    191 
    192 For what it's worth, my thought is not to use SrcSpanInfo (which, to me, is the wrong way to slice the abstraction) but instead to add SrcSpan fields to the relevant nodes. For example:
    193 {{{
    194 #!haskell
    195 
    196   | HsDo        SrcSpan              -- of the word "do"
    197                 BlockSrcSpans
    198                 (HsStmtContext Name) -- The parameterisation is unimportant
    199                                      -- because in this context we never use
    200                                      -- the PatGuard or ParStmt variant
    201                 [ExprLStmt id]       -- "do":one or more stmts
    202                 PostTcType           -- Type of the whole expression
    203 
    204 ...
    205 
    206 data BlockSrcSpans = LayoutBlock Int  -- the parameter is the indentation level
    207                                  ...  -- stuff to track the appearance of any semicolons
    208                    | BracesBlock ...  -- stuff to track the braces and semicolons
    209 }}}
    210 
    211 The way I understand it, the SrcSpanInfo proposal means that we would have lots of empty SrcSpanInfos, no? Most interior nodes don't need one, I think.
    212 
    213 Popping up a level, I do support the idea of including this info in the AST.
    214 
    215 == SPJ response to concern re extra noise in AST ==
    216 
    217  I thiink the key question is whether it is acceptable to sprinkle this kind of information throughout the AST. For someone interested in source-to-source conversions (like me) this is great, others may find it intrusive.
    218 
    219 It’s probably not too bad if you use record syntax; thus
    220 
    221 {{{
    222 #!haskell
    223   | HsDo  { hsdo_do_loc :: SrcSpan              -- of the word "do"
    224           , hsdo_blocks :: BlockSrcSpans
    225           , hsdo_ctxt   :: HsStmtContext Name
    226           , hsdo_stmts  :: [ExprLStmt id]
    227           , hsdo_type    :: PostTcType }
    228 
    229 }}}
    230 
    231 == Other issues ==
    232 
    233 The AST is initially created by the parser, and then changed through the renamer and type checker.
    234 
    235 From a source to source conversion perspective the `ParsedSource` is closest to the original source code, as it respects the original linear ordering of the declarations, which are each wrapped in an appropriate constructor from `HsDecl`.
    236 
    237 The `RenamedSource` gathers all the like declarations together, and strips out the `HsDecl`, as well as re-ordering binds to appear in dependency order.
    238 
    239 The `TypeCheckedSource` further changes the `RenamedSource` to replace the original type information with the calculated types.
    240 
    241 So manipulations need to happen at the `ParsedSource` level, but with the ability to query information from the `RenamedSource` or `TypecheckedSource` as required.
    242 
    243 At the moment HaRe manages this by building up a token tree indexed by SrcSpan with tokens at the leaves, constructed from the `ParsedSource`, and then indexes into it for changes based on the `RenamedSource`.  The token tree is fiddly and brittle, so it would be better to keep this information directy in the AST.
    244 
    245 == Abortive annotation parameter attempt ==
    246 
    247 [https://phabricator.haskell.org/D246 D246] captures an attempt to work through a type parameter. This exploded in complexity, and was abandoned.
    248 
    249 == SPJ alternative suggestion ==
    250 
    251     Another way to tackle this would be to ensure that syntax tree nodes have a "node-key" (a bit like their source location) that clients could use in a finite map, to map node-key to values of their choice.
    252 
    253 An initial investigation shows some complexity blow up too.  The same effect can be achieved with a virtual node key.
    254 
    255 == AZ Virtual node key proposal ==
    256 
    257 Instead of physically placing a "node-key" in each AST Node, a virtual
    258 node key can be generated from any `GenLocated SrcSpan e` comprising a
    259 combination of the `SrcSpan` value and a unique identifier from the
    260 constructor for `e`, perhaps using its `TypeRep`, since the entire AST
    261 derives Typeable.
    262 
    263 To further reduce the intrusiveness, a base Annotation type can be
    264 defined that captures the location of noise tokens for each AST
    265 constructor. This can then be emitted from the parser, if the
    266 appropriate flag is set to enable it.
    267 
    268 So
    269 
    270 {{{
    271 #!haskell
    272 
    273     data ApiAnnKey = AK SrcSpan TypeRep
    274 
    275     mkApiAnnKey :: (Located e) -> ApiAnnKey
    276     mkApiAnnKey = ...
    277 
    278     data Ann =
    279       ....
    280       | AnnHsLet    SrcSpan -- of the word "let"
    281                     SrcSpan -- of the word "in"
    282 
    283       | AnnHsDo     SrcSpan -- of the word "do"
    284 }}}
    285 
    286 And then in the parser
    287 
    288 {{{
    289 #!haskell
    290 
    291         | 'let' binds 'in' exp   { mkAnnHsLet $1 $3 (LL $ HsLet (unLoc $2) $4) }
    292 }}}
    293 
    294 The helper is
    295 
    296 {{{
    297 #!haskell
    298 
    299     mkAnnHsLet :: Located a -> Located b -> LHsExpr RdrName -> P (LHsExpr RdrName)
    300     mkAnnHsLet (L l_let _) (L l_in _) e = do
    301       addAnnotation (mkAnnKey e) (AnnHsLet l_let l_in)
    302       return e;
    303 }}}
    304 
    305 The Parse Monad would have to accumulate the annotations to be
    306 returned at the end, if called with the appropriate flag.
    307 
    308 There will be some boilerplate in getting the annotations and helper
    309 functions defined, but it will not pollute the rest.
    310 
    311 This technique can also potentially be backported to support older GHC
    312 versions via a modification to ghc-parser[1].
    313 
    314 
    315 https://hackage.haskell.org/package/ghc-parser
    316 
    317 == Neil Mitchell Response ==
    318 
    319 I was getting a bit lost between the idea and the implementation. Let
    320 me try rephrasing the idea in my own words.
    321 
    322 The goal: Capture inner source spans in AST syntax nodes. At the
    323 moment if ... then ... else ... captures the spans [if [...] then
    324 [...] else [...]]. We want to capture the spans for each keyword as
    325 well, so: [{if} [...] {then} [...] {else} [...]].
    326 
    327 The proposal: Rather than add anything to the AST, have a separate
    328 mapping `(SrcSpan,AstCtor)` to `[SrcSpan]`. So you give in the `SrcSpan`
    329 from the `IfThenElse` node, and some token for the `IfThenElse`
    330 constructor, and get back a list of `IfThenElse` for the particular
    331 keyword.
    332 
    333 I like the proposal because it adds nothing inside the AST, and
    334 requires no fresh invariants of the AST. I dislike it because the
    335 contents of that separate mapping are highly tied up with the AST, and
    336 easy to get out of sync. I think it's the right choice for three
    337 reasons, 1) it is easier to try out and doesn't break the AST, so we
    338 have more scope for changing our minds later; 2) the same technique is
    339 able to represent things other than `SrcSpan` without introducing a
    340 polymorphic src span; 3) the people who pay the complexity are the
    341 people who use it, which is relatively few people.
    342 
    343 That said, as a tweak to the API, rather than a single data type for
    344 all annotations, you could have:
    345 
    346 {{{
    347 #!haskell
    348 data AnnIfThenElse = AnnIfThenElse {posIf, posThen, posElse :: SrcSpan}
    349 data AnnDo = AnnDo {posDo :: SrcSpan}
    350 }}}
    351 
    352 Then you could just have an opaque `Map (SrcSpan, TypeRep) Dynamic`,
    353 with the invariant that the `TypeRep` in the key matches the `Dynamic`.
    354 Then you can have:
    355 
    356 {{{
    357 #!haskell
    358 getAnnotation :: Typeable a => Annotations -> SrcSpan -> Maybe a
    359 }}}
    360 
    361 I think it simplifies some of the TypeRep trickery
    362 you are engaging in with `mkAnnKey`.
    363 
    364 
    365 There was some further email between AZ and NDM (teaching AZ some basics) resulting in the following
    366 
    367 This allows code using the annotation to access this as follows
    368 
    369 {{{
    370 #!haskell
    371 processHsLet :: ApiAnns -> LHsExpr -> CustomReturnType
    372 processHsLet anns (L l (HsExpr localBinds expr)) = r
    373   where
    374     Just ann = getAnnotation anns l :: Maybe AnnHsLet
    375     ...
    376 }}}
    377 
    378 
    379 Key data structures
    380 
    381 {{{
    382 #!haskell
    383 type ApiAnns = Map.Map ApiAnnKey Value
    384 
    385 data ApiAnnKey = AK SrcSpan TypeRep
    386                   deriving (Eq,Ord,Show)
    387 
    388 mkApiAnnKey :: (Typeable a) => SrcSpan -> a -> ApiAnnKey
    389 mkApiAnnKey l a = AK l (typeOf (Just a))
    390 
    391 data Value = forall a . (Eq a, Show a, Typeable a, Outputable a) => Value a
    392 
    393 newValue :: (Eq a, Show a, Typeable a, Outputable a) => a -> Value
    394 newValue = Value
    395 
    396 typeValue :: Value -> TypeRep
    397 typeValue (Value x) = typeOf x
    398 
    399 fromValue :: Typeable a => Value -> a
    400 fromValue (Value x) = fromMaybe (error errMsg) $ res
    401   where
    402     res = cast x
    403     errMsg = "fromValue, bad cast from " ++ show (typeOf x)
    404                 ++ " to " ++ show (typeOf res)
    405 
    406 }}}
    407 
    408 Note that the `Value` type is based on the one in [https://github.com/ndmitchell/shake/blob/master/Development/Shake/Value.hs shake]
    409 
    410 This allows the annotation to be retrieved by
    411 
    412 {{{
    413 #!haskell
    414 
    415 -- | Retrieve an annotation based on the SrcSpan of the annotated AST
    416 -- element, and the known type of the annotation.
    417 getAnnotation :: (Typeable a) => ApiAnns -> SrcSpan -> Maybe a
    418 getAnnotation anns span = res
    419   where res = case  Map.lookup (AK span (typeOf res)) anns of
    420                        Nothing -> Nothing
    421                        Just d -> Just $ fromValue d
    422 }}}
    423 
    424 === Annotation structures
    425 
    426 Each annotation is a separate data structure, named specifically for the constructor of the AST element being retrieved.
    427 
    428 So if we have an AST element `L l ConstructorXXX` the corresponding annotation will be called `AnnConstructorXXX`.
    429 
    430 An examples
    431 
    432 {{{
    433 #!haskell
    434 
    435 -- TyClDecl
    436 data AnnClassDecl = AnnClassDecl
    437         { aclassdecl_class   :: SrcSpan
    438         , aclassdecl_mwhere  :: Maybe SrcSpan
    439         , aclassdecl_mbraces :: Maybe (SrcSpan,SrcSpan) }
    440             deriving (Eq,Data,Typeable,Show)
    441 
    442 }}}
    443 
    444 == Open Questions (AZ) ==
    445 
    446 I am currently working annotations into the parser, provided them as a separate structure at the end of the parse, indexed to the original by SrcSpan and AST element type.
    447 
    448 The question I have is how to capture commas and semicolons in lists of items.
    449 
    450 There are at least three ways of doing this
    451 
    452 1. Make sure each of the items is Located, and add the possible comma location to the annotation structure for it.
    453 
    454 This has the drawback that all instances of the AST item annotation have the possible comma location in them, and it does not cope with multiple separators where these are allowed.
    455 
    456 
    457 2. Introduce a new hsSyn structure to explicitly capture comma-separated lists.
    458 
    459 This is the current approach I am taking, modelled on the OrdList implementation, but with an extra constructor to capture the separator location.
    460 
    461 Thus
    462 
    463 {{{
    464 #!haskell
    465 data HsCommaList a
    466   = Empty
    467   | Cons a (HsCommaList a)
    468   | ExtraComma SrcSpan (HsCommaList a)
    469        -- ^ We need a SrcSpan for the annotation
    470   | Snoc (HsCommaList a) a
    471   | Two (HsCommaList a) -- Invariant: non-empty
    472         (HsCommaList a) -- Invariant: non-empty
    473 }}}
    474 
    475 
    476 3. Change the lists to be of type `[Either SrcSpan a]` to explicitly capture the comma locations in the list.
    477 
    478 
    479 4. A fourth way is to add a list of SrcSpan to the annotation for the parent structure of the list, simply tracking the comma positions. This will make working with the annotations complicated though.
    480 
    481 
    482 I am currently proceeding with option 2, but would appreciate some comment on whether this is the best approach to take.
    483 
    484 Option 2 will allow the AST to capture the extra commas in record constructors, as suggested by SPJ in the debate on that feature.
    485 
    486 However, the structure is being misused in that `ExtraComma` is used to capture ALL commas, as well as semicolons in the `{ .. ; .. }` idiom.
    487 
    488 == Update 2014-10-12 ==
    489 
    490 Based on further feedback from Neil Mitchell and SPJ, the basic annotation is now
    491 
    492 {{{
    493 #!haskell
    494 type ApiAnns = Map.Map ApiAnnKey SrcSpan
    495 
    496 data ApiAnnKey = AK SrcSpan Ann
    497                   deriving (Eq,Ord,Show)
    498 
    499 -- ---------------------------------------------------------------------
    500 
    501 -- | Retrieve an annotation based on the SrcSpan of the annotated AST
    502 -- element, and the known type of the annotation.
    503 getAnnotation :: ApiAnns -> SrcSpan -> Ann -> Maybe SrcSpan
    504 getAnnotation anns span ann = Map.lookup (AK span ann) anns
    505 
    506 -- --------------------------------------------------------------------
    507 
    508 -- | Note: in general the names of these are taken from the
    509 -- corresponding token, unless otherwise noted
    510 data Ann = AnnAs
    511          | AnnBang
    512          | AnnClass
    513          | AnnClose -- ^ } or ] or ) or #) etc
    514          | AnnComma
    515          | AnnDarrow
    516          | AnnData
    517          | AnnDcolon
    518          ....
    519 }}}
    520 
    521 This is a lot simpler than before.
    522