#15103 closed task (fixed)

Speed optimizations for elimCommonBlocks

Reported by: AndreasK Owned by: AndreasK
Priority: normal Milestone: 8.6.1
Component: Compiler Version: 8.5
Keywords: CodeGen Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D4597
Wiki Page:

Description

Use toBlockList instead of revPostorder.

Block elimination works on a given Cmm graph by:

  • Getting a list of blocks.
  • Looking for duplicates in these blocks.
  • Removing all but one instance of duplicates.

There are two (reasonable) ways to get the list of blocks.

  • The fast way: toBlockList: This just flattens the underlying map into a list.
  • The convenient way: revPostorder: Start at the entry label, scan for reachable blocks and return only these. This has the advantage of removing all dead code.

If there is dead code the latter is better. Work done on unreachable blocks is clearly wasted work. However, by the time we run the common block elimination pass, the input graph has already had all dead code removed. This is done during control flow optimization in CmmContFlowOpt, which is our first Cmm pass.

This means common block elimination is free to use toBlockList, because revPostorder would return the same blocks (although in a different order).
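The difference between the two traversals can be sketched on a toy CFG (a minimal sketch, not GHC's actual Hoopl API — the names Label, Graph, and the successor representation here are hypothetical):

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl')

-- Hypothetical mini-CFG: labels are Ints, each block maps to its successors.
type Label = Int
type Graph = M.Map Label [Label]

-- The fast way: flatten the underlying map; dead blocks are included.
toBlockList :: Graph -> [Label]
toBlockList = M.keys

-- The convenient way: depth-first search from the entry, emitting blocks
-- in reverse postorder; unreachable (dead) blocks are dropped on the way.
revPostorder :: Label -> Graph -> [Label]
revPostorder entry g = order
  where
    (_, order) = go ([], []) entry
    go st@(seen, out) l
      | l `elem` seen = st
      | otherwise =
          let (seen', out') = foldl' go (l : seen, out)
                                        (M.findWithDefault [] l g)
          in (seen', l : out')
```

On a graph with a dead block 99, `toBlockList` returns all four blocks while `revPostorder` returns only the three reachable ones — which is why the two agree exactly when dead code has already been removed.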

Change the triemap used for grouping by a label list from (TM.ListMap UniqDFM) to ListMap (GenMap IntMap).

  • Using GenMap offers leaf compression, a trie optimization described in Note [Compressed TrieMap] in CoreSyn/TrieMap.hs.
  • Using an IntMap removes the overhead associated with UniqDFM.
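The idea behind leaf compression can be illustrated with a heavily simplified stand-in (hypothetical types; the real GenMap in Note [Compressed TrieMap] compresses a generic trie, not a Data.Map): a map stays a flat singleton until a second entry actually forces trie structure, so the common one-entry case pays no trie overhead.

```haskell
import qualified Data.Map.Strict as M

-- Simplified sketch of a compressed map: a single (key, value) pair
-- is stored flat; only a second insert builds real map structure.
data GenMap k v
  = EmptyMap
  | SingletonMap k v        -- compressed leaf: no structure allocated yet
  | MultiMap (M.Map k v)    -- stand-in for the real trie
  deriving Show

insertG :: Ord k => k -> v -> GenMap k v -> GenMap k v
insertG k v EmptyMap = SingletonMap k v
insertG k v (SingletonMap k' v')
  | k == k'   = SingletonMap k v                      -- overwrite in place
  | otherwise = MultiMap (M.fromList [(k, v), (k', v')])
insertG k v (MultiMap m) = MultiMap (M.insert k v m)

lookupG :: Ord k => k -> GenMap k v -> Maybe v
lookupG _ EmptyMap            = Nothing
lookupG k (SingletonMap k' v) = if k == k' then Just v else Nothing
lookupG k (MultiMap m)        = M.lookup k m
```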

The reasoning why this is deterministic after the change:

  • IntMap is deterministic given the same keys.
  • Labels have an Int representation, so for the same Labels we get the same keys, hence the same result for each run.
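The determinism argument is easy to demonstrate directly: IntMap's iteration order depends only on the keys present, never on insertion order, so any fold over the grouping map is reproducible across runs.

```haskell
import qualified Data.IntMap.Strict as IM

-- Two maps built with the same keys in different insertion orders.
groupA, groupB :: IM.IntMap String
groupA = IM.fromList [(3, "c"), (1, "a"), (2, "b")]
groupB = IM.fromList [(2, "b"), (3, "c"), (1, "a")]

-- Both iterate in ascending key order: [(1,"a"),(2,"b"),(3,"c")]
```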

Change History (11)

comment:1 Changed 17 months ago by AndreasK

Owner: set to AndreasK

comment:2 Changed 17 months ago by AndreasK

Status: new → patch

comment:3 Changed 17 months ago by simonpj

Keywords: CodeGen added

comment:4 in reply to:  3 Changed 17 months ago by AndreasK

Replying to simonpj: Is this tag accurate? After all, it doesn't change the code being generated.

It's strictly an optimization for making GHC faster.

comment:5 Changed 17 months ago by simonpj

Yes, but it makes it appear on this page which collects backend tickets generally. I'd be entirely happy with a finer-grain categorisation, if you would like to propose one, but I'm finding these topic pages quite helpful. Here's my list of topics.

comment:6 Changed 16 months ago by Ben Gamari <ben@…>

In bd43378d/ghc:

Optimizations for CmmBlockElim.

* Use toBlockList instead of revPostorder.

    Block elimination works on a given Cmm graph by:
     * Getting a list of blocks.
     * Looking for duplicates in these blocks.
     * Removing all but one instance of duplicates.

    There are two (reasonable) ways to get the list of blocks.
     * The fast way: `toBlockList`
       This just flattens the underlying map into a list.
     * The convenient way: `revPostorder`
       Start at the entry label, scan for reachable blocks and return
       only these. This has the advantage of removing all dead code.

    If there is dead code the latter is better. Work done on unreachable
    blocks is clearly wasted work. However by the point we run the
    common block elimination pass the input graph already had all dead code
    removed. This is done during control flow optimization in
    CmmContFlowOpt which is our first Cmm pass.

    This means common block elimination is free to use toBlockList
    because revPostorder would return the same blocks. (Although in
    a different order).

* Change the triemap used for grouping by a label list
  from `(TM.ListMap UniqDFM)` to `ListMap (GenMap LabelMap)`.

    * Using GenMap offers leaf compression. Which is a trie
      optimization described by the Note [Compressed TrieMap] in
      CoreSyn/TrieMap.hs

    * Using LabelMap removes the overhead associated with UniqDFM.

  This is deterministic since if we have the same input keys the same
  LabelMap will be constructed.

Test Plan: ci, profiling output

Reviewers: bgamari, simonmar

Reviewed By: bgamari

Subscribers: dfeuer, thomie, carter

GHC Trac Issues: #15103

Differential Revision: https://phabricator.haskell.org/D4597

comment:7 Changed 16 months ago by simonpj

What is the perf impact of this optimisation, if any? (I think the goal is to improve compile times, right? Not run-time.)

comment:8 in reply to:  7 Changed 16 months ago by AndreasK

Replying to simonpj:

What is the perf impact of this optimisation, if any?

Generated code is the same. Compiler performance changes for the better.

I think the file I looked at was nofib/spectral/simple

ghc-stage2 -fforce-recomp -c -O1 Main.hs +RTS -s -RTS

build                Allocated (bytes)  measured time
master               4,475,812,248      3.220 s (3.204 s .. 3.251 s)
toBlockList          4,474,452,216      3.196 s (3.192 s .. 3.204 s)
IntMap + blockList   4,461,727,496      3.189 s (3.174 s .. 3.203 s)

I've since added leaf compression as well and changed IntMap to LabelMap. The former should strictly improve things, while the latter should result in the same code, as it's just a newtype wrapper. So the above results should still be representative.
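The "just a newtype wrapper" point can be sketched as follows (hypothetical names mirroring the idea behind Hoopl's LabelMap, not its actual definition): a newtype around IntMap keyed by a label's Int representation erases at compile time, so it generates the same code as using IntMap directly.

```haskell
import qualified Data.IntMap.Strict as IM

-- Hypothetical sketch: Label wraps its Int representation, and
-- LabelMap wraps an IntMap keyed by that Int. Both newtypes have
-- zero runtime cost, so operations compile to plain IntMap calls.
newtype Label    = Label Int deriving (Eq, Show)
newtype LabelMap v = LabelMap (IM.IntMap v)

emptyLM :: LabelMap v
emptyLM = LabelMap IM.empty

insertLM :: Label -> v -> LabelMap v -> LabelMap v
insertLM (Label k) v (LabelMap m) = LabelMap (IM.insert k v m)

lookupLM :: Label -> LabelMap v -> Maybe v
lookupLM (Label k) (LabelMap m) = IM.lookup k m
```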

I think the goal is to improve compile times, right? Not run-time.

Yes, it's purely a "make GHC faster" patch.

Last edited 16 months ago by AndreasK (previous) (diff)

comment:9 Changed 16 months ago by simonpj

Terrific. Would you like to do a nofib run before-and-after, and dump the nofib-analyse summary in here? It's very helpful just to record the benefits of particular changes, for posterity.

comment:10 in reply to:  9 Changed 16 months ago by AndreasK

Replying to simonpj:

Terrific. Would you like to do a nofib run before-and-after, and dump the nofib-analyse summary in here? It's very helpful just to record the benefits of particular changes, for posterity.

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
       maillist           0.0%      0.0%     0.024     0.025     +6.5%
           mate          -0.0%      0.0%     +4.6%     +4.6%      0.0%
   cryptarithm1           0.0%      0.0%     -1.7%     -1.8%      0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -0.0%     -1.7%     -1.8%      0.0%
            Max          +0.0%     +0.1%     +4.6%     +4.6%     +6.5%
 Geometric Mean          -0.0%     +0.0%     +0.3%     +0.3%     +0.1%

Compile Times

-------------------------------------------------------------------------------
        Program             logNoOpt       logWithOpt
-------------------------------------------------------------------------------
        -1 s.d.                -----            -1.9%
        +1 s.d.                -----            +1.5%
        Average                -----            -0.2%

Compile Allocations

-------------------------------------------------------------------------------
        Program             logNoOpt       logWithOpt
-------------------------------------------------------------------------------
        -1 s.d.                -----            -0.3%
        +1 s.d.                -----            -0.1%
        Average                -----            -0.2%

These runtime results took me by surprise!

It turns out code layout depends on uniques, and the final uniques change with this patch, so we get slightly different code layout.

I verified that the code for mate is otherwise the same. Using different starting or increment values for uniques also led to different results, with up to the same performance spread. So I don't think the runtime change is representative.

Compiler performance impact is as expected.

comment:11 Changed 15 months ago by AndreasK

Resolution: fixed
Status: patch → closed

This has been merged.
