Compiling one module: HscMain

Here we are going to look at the compilation of a single module. There is a picture that goes with this description, which appears at the bottom of this page, but you'll probably find it easier to open this link in another window, so you can see it at the same time as reading the text.

You can also watch a video of Simon Peyton-Jones explaining the compilation pipeline here: Compiler Pipeline II (10'16")

Look at the picture first. The yellow boxes are compiler passes, while the blue stuff on the left gives the data type that moves from one phase to the next. The entire pipeline for a single module is run by a module called HscMain (compiler/main/HscMain.hs). Each data type's representation can be dumped for further inspection using a -ddump-* flag. (Consider also using -ddump-to-file: some of the dump outputs can be large!) Here are the steps it goes through:

  • The Front End processes the program in the big HsSyn type. HsSyn is parameterised over the types of the term variables it contains. The first three passes (the front end) of the compiler work like this:

    • The Parser produces HsSyn parameterised by RdrName. To a first approximation, a RdrName is just a string. (-ddump-parsed)

    • The Renamer transforms this to HsSyn parameterised by Name. To a first approximation, a Name is a string plus a Unique (number) that uniquely identifies it. In particular, the renamer associates each identifier with its binding instance and ensures that all occurrences that refer to the same binding instance share a single Unique. (-ddump-rn)

    • The Typechecker transforms this further, to HsSyn parameterised by Id. To a first approximation, an Id is a Name plus a type. In addition, the type-checker converts class declarations to Classes, and type declarations to TyCons and DataCons. And of course, the type-checker deals in Types and TyVars. The data types for these entities (Type, TyCon, Class, Id, TyVar) are pervasive throughout the rest of the compiler. (-ddump-tc)

These three passes can all discover programmer errors, which are sorted and reported to the user.
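To make the idea of a syntax tree parameterised over its identifier type concrete, here is a minimal sketch of a renamer over a toy expression type. It is not GHC's code: the types RdrName, Name and Expr below are heavily simplified, hypothetical stand-ins, and the function rename, its environment and its counter-based Unique supply are assumptions made purely for illustration.

```haskell
module Main where

import qualified Data.Map as Map

-- Toy stand-ins for GHC's identifier types (hypothetical, heavily simplified).
newtype RdrName = RdrName String deriving (Eq, Show)  -- "just a string"
data Name = Name String Int deriving (Eq, Show)       -- a string plus a Unique

-- A tiny syntax tree parameterised over its identifier type, in the way
-- HsSyn is parameterised over the type of the variables it contains.
data Expr id
  = Var id
  | Lam id (Expr id)
  | App (Expr id) (Expr id)
  deriving (Eq, Show)

-- Maps the strings currently in scope to the Name of their binding instance.
type Env = Map.Map String Name

-- Rename an expression, threading an Int as a supply of fresh Uniques.
-- Out-of-scope variables are reported as an error, since the renamer is
-- one of the passes that can discover programmer errors.
rename :: Env -> Int -> Expr RdrName -> Either String (Expr Name, Int)
rename env u (Var (RdrName s)) =
  case Map.lookup s env of
    Just n  -> Right (Var n, u)
    Nothing -> Left ("Not in scope: " ++ s)
rename env u (Lam (RdrName s) body) = do
  let n = Name s u                       -- fresh Unique for this binder
  (body', u') <- rename (Map.insert s n env) (u + 1) body
  Right (Lam n body', u')
rename env u (App f a) = do
  (f', u1) <- rename env u f
  (a', u2) <- rename env u1 a
  Right (App f' a', u2)

-- \x -> (\x -> x) x : the inner x shadows the outer one.
example :: Expr RdrName
example = Lam x (App (Lam x (Var x)) (Var x))
  where x = RdrName "x"

main :: IO ()
main = do
  print (fst <$> rename Map.empty 0 example)
  print (fst <$> rename Map.empty 0 (Var (RdrName "y")))
```

In the renamed example the outer x receives Unique 0 and the inner x Unique 1, so every occurrence can be matched to its binding instance without consulting scoping rules again. The second call fails because y is not in scope.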

  • The Desugarer (compiler/deSugar/Desugar.hs) converts from the massive HsSyn type to GHC's intermediate language, CoreSyn. (This Core-language data type is unusually tiny: just eight constructors.) (-ddump-ds)

    Generally speaking, the desugarer produces few user errors or warnings. But it does produce some. In particular, (a) pattern-match overlap warnings are produced here; and (b) when desugaring Template Haskell code quotations, the desugarer may find that THSyntax is not expressive enough. In that case, we must produce an error (compiler/deSugar/DsMeta.hs).

    This late desugaring is somewhat unusual. It is much more common to desugar the program before typechecking, or renaming, because that presents the renamer and typechecker with a much smaller language to deal with. However, GHC's organisation means that
    • error messages can display precisely the syntax that the user wrote; and
    • desugaring is not required to preserve type-inference properties.
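As a rough illustration of what translating a large surface syntax into a small core language involves, the following sketch desugars a toy source type into a toy core type. Both data types (HsExpr, Core) and the function desugar are hypothetical and far smaller than GHC's HsSyn and CoreSyn; in particular, the string-tagged Case alternatives are an assumption made to keep the example short.

```haskell
module Main where

-- A toy "surface" language with some syntactic sugar (hypothetical).
data HsExpr
  = HsVar String
  | HsLit Int
  | HsApp HsExpr HsExpr
  | HsIf HsExpr HsExpr HsExpr   -- sugar: if c then t else e
  | HsList [HsExpr]             -- sugar: [e1, e2, ...]
  deriving (Eq, Show)

-- A toy core language with fewer constructors (hypothetical, not CoreSyn).
data Core
  = Var String
  | Lit Int
  | App Core Core
  | Case Core [(String, Core)]  -- alternatives tagged by constructor name
  deriving (Eq, Show)

-- Translate every sugared form into the smaller core language.
desugar :: HsExpr -> Core
desugar (HsVar v)    = Var v
desugar (HsLit n)    = Lit n
desugar (HsApp f a)  = App (desugar f) (desugar a)
-- if-then-else becomes a case on the Boolean constructors.
desugar (HsIf c t e) =
  Case (desugar c) [("True", desugar t), ("False", desugar e)]
-- A list literal becomes nested applications of (:) ending in [].
desugar (HsList xs)  = foldr cons (Var "[]") xs
  where cons x rest = App (App (Var ":") (desugar x)) rest

main :: IO ()
main = print (desugar (HsIf (HsVar "b") (HsList [HsLit 1, HsLit 2]) (HsList [])))
```

Because a pass of this kind runs after typechecking in GHC, error messages from the earlier passes can still refer to the if-expression or list literal exactly as the user wrote it.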

  • Then the CoreTidy pass gets the code into a form in which it can be imported into subsequent modules (when using --make) and/or put into an interface file.

It makes a difference whether or not you are using -O at this stage. With -O (or rather, with -fno-omit-interface-pragmas, which is a consequence of -O), the tidied program (produced by tidyProgram) has unfoldings for Ids, and RULES. Without -O the unfoldings and RULES are omitted from the tidied program. And that, in turn, affects the interface file generated subsequently.

There are good notes at the top of the file compiler/main/TidyPgm.hs; the main function is tidyProgram, documented as "Plan B" ("Plan A" is a simplified tidy pass that is run when we have only typechecked, but haven't run the desugarer or simplifier).

The serialisation does (pretty much) nothing except serialise. All the intelligence is in the Core-to-IfaceSyn conversion; or, rather, in the reverse of that step.

  • The same, tidied Core program is now fed to the Back End. First there is a two-stage conversion from CoreSyn to GHC's intermediate language, StgSyn.
    • The first step is called CorePrep, a Core-to-Core pass that puts the program into A-normal form (ANF). In ANF, the argument of every application is a variable or literal; more complicated arguments are let-bound. Actually CorePrep does quite a bit more: there is a detailed list at the top of the file compiler/coreSyn/CorePrep.hs.
    • The second step, CoreToStg, moves to the StgSyn data type (compiler/stgSyn/CoreToStg.hs). The output of CorePrep is carefully arranged to exactly match what StgSyn allows (notably ANF), so there is very little work to do. However, StgSyn is decorated with lots of redundant information (free variables, let-no-escape indicators), which is generated on-the-fly by CoreToStg.
  • Now the path forks again:
    • If we are generating GHC's stylised C code, we can just pretty-print the C-- code as stylised C (compiler/cmm/PprC.hs)
    • If we are generating native code, we invoke the native code generator. This is another Big Mother (compiler/nativeGen).
    • If we are generating LLVM code, we invoke the LLVM code generator. This is a reasonably simple code generator (compiler/llvmGen).
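The A-normal form property that CorePrep establishes (every application argument is a variable or literal, with more complicated arguments let-bound) can be sketched on a toy expression type as follows. This is only an illustration under stated assumptions: the Expr type, the functions anf and isANF, and the "sat" prefix with a counter for fresh names are hypothetical, and the sketch assumes user variables never start with "sat". The real CorePrep does considerably more.

```haskell
module Main where

-- A toy core-like expression type (hypothetical, not GHC's CoreSyn).
data Expr
  = Var String
  | Lit Int
  | App Expr Expr
  | Lam String Expr
  | Let String Expr Expr
  deriving (Eq, Show)

-- Atoms are the only expressions allowed in argument position in ANF.
isAtom :: Expr -> Bool
isAtom (Var _) = True
isAtom (Lit _) = True
isAtom _       = False

-- Convert to ANF, threading an Int used to generate fresh variable names.
anf :: Int -> Expr -> (Expr, Int)
anf n e@(Var _)   = (e, n)
anf n e@(Lit _)   = (e, n)
anf n (Lam x b)   = let (b', n') = anf n b in (Lam x b', n')
anf n (Let x r b) =
  let (r', n1) = anf n r
      (b', n2) = anf n1 b
  in (Let x r' b', n2)
anf n (App f a) =
  let (f', n1) = anf n f
      (a', n2) = anf n1 a
  in if isAtom a'
       then (App f' a', n2)
       -- Non-atomic argument: let-bind it to a fresh variable.
       else let v = "sat" ++ show n2
            in (Let v a' (App f' (Var v)), n2 + 1)

-- Check the ANF property: every application argument is an atom.
isANF :: Expr -> Bool
isANF (Var _)     = True
isANF (Lit _)     = True
isANF (App f a)   = isANF f && isAtom a
isANF (Lam _ b)   = isANF b
isANF (Let _ r b) = isANF r && isANF b

-- f (g x) (h 1)
example :: Expr
example = App (App (Var "f") (App (Var "g") (Var "x"))) (App (Var "h") (Lit 1))

main :: IO ()
main = do
  let (e, _) = anf 0 example
  print e
  print (isANF example, isANF e)
```

The input f (g x) (h 1) is not in ANF because both arguments are applications; after conversion each argument has been replaced by a let-bound variable, which is the shape that a later move to StgSyn can rely on.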

The Diagram

This diagram is also located here, so that you can open it in a separate window.

Last modified on Sep 12, 2015 10:01:52 PM