Opened 5 years ago

Closed 5 years ago

Last modified 4 years ago

#10137 closed task (fixed)

Rewrite switch code generation

Reported by: nomeata Owned by: nomeata
Priority: normal Milestone: 8.0.1
Component: Compiler (CodeGen) Version: 7.9
Keywords: Cc: bgamari, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #9157, #9159, #8326, #8317, #9159 Differential Rev(s):
Wiki Page:

Description (last modified by nomeata)

Inspired by #10124 I looked at the code generation for enumeration and integer types, and I think this can be improved in a few ways. My goals are:

  • Fancier code for integer types. Currently, the code for enumeration types will emit jump tables for dense choices; there is no reason to treat integer types differently.
  • The ability to behave differently if some of the case alternatives are equal, and, as an extension of that,
  • The possibility to create branchless code if multiple checks would go to the same jump.

The current scheme is roughly:

  • When we generate Cmm code for a STG case expression, we handle enumeration types and literals separately.
  • At this point, the decisions about what code to generate are made (jump tables (but only for enumeration types) or if-then-else trees).
  • The Common Block Optimization on Cmm happens later in the pipeline, making it non-trivial to detect branches that do the same thing.

My plan is the following:

  • In the STG→Cmm transformation, floats will be handled separately. Matching against literals is fishy anyways, so my suggestion is to simply generate a linear list of equality checks here – turning the intended operation (equality test) into something else (comparisons in a if-then-else tree) feels wrong to me for floats. And the rest would not work well for floats, so I’d like to have them out of the way.
  • The case of enumeration types will be reduced to word literals, and treated the same from now on.
  • For integer types, no code generation decisions is made at this point. Instead, always a CmmSwitch statement is generated.
  • In a separate Cmm transformation pass, which will run /after/ the common block optimization, we visit all CmmSwitches and create proper code for them.

I’d like to separate the algorithm that plans the code generation into a function (possibly even module) of its own, so that the decisions can easily by analyized and modified.

The algorithm has a few choices to make:

  • If multiple values point to the same code, we can generate branchless code (y := x == 1 || x == 5 || x = 7; if (y==0) then goto l1 else goto l2).
  • If there are whole ranges pointing to the same code, the above can also use comparisons.
  • If there are dense ranges (i.e. a range with more than half of the possible values mapped to something), we want to generate jump tables from them (still CmmSwitch values).
  • Unless all options are handled by one of these possibilities, they need to be combined using if-then-else trees.

The CmmSwitch constructor needs to change for that. It currently takes a [Maybe Label] argument, which is not suitable for before that pass, when its table may be sparse. I think an IntMap Label would work nicely.

Change History (15)

comment:1 Changed 5 years ago by simonpj

Sounds good. I do think that some work can be done in Core. For example:

case x of
  1# -> e1
  2# -> e2
  7# -> e1
  _  -> e2

might turn into

case (x ==# 1# `orI#` x ==# 7#) of
  0# -> e2   -- False
  1# -> e1   -- True

I would also dearly like to implement #8317 to get rid of all those stupid tagToEnum# calls that clutter up the code. We don't do this for a stupid reason: #8326. As part of your crusade, dealing with this would be fantastic.

I'd be happy to advise/help.

Simon

comment:2 Changed 5 years ago by jstolarek

The Common Block Optimization on Cmm happens later in the pipeline, making it non-trivial to detect branches that do the same thing.

#9157 demonstrates a case where CBE fails, although it definitely shouldn't.

Re #8326 I spent several weeks trying to fix it but ultimately I failed. You can find various experimental versions of my code on these branches: T8326-heap-checks T8326-heap-checks-aletr-gc_plan T8326-heap-checks-alternative-plan T8326-heap-checks-precompile-alts though it probably will be easier to write your own code.

comment:3 Changed 5 years ago by nomeata

All very well, but can we please keep this particular ticket for the discussion of better code generation from STG onwards :-) #10124 is more suitable for what we can do in Core, I think.

My rewrite will not only fix the issue of branchless cases, but also other (small) deficiencies in the current code (jump tables also for integer types; a better choice for if-then-else branching that leads to larger jump tables).

Anyway, I started on wip/T10137, but already my refactoring broke something and I get segfaults :-)

comment:4 Changed 5 years ago by rwbarton

Also see the comments of #9159.

The case of enumeration types will be reduced to word literals, and treated the same from now on.

Note that when matching on an enumeration type, we can assume that the constructor tag is within the range of possible tag values. We cannot make any such assumption for matches on ints. So, we should remember the extra information that we have about the range when doing this reduction. I imagine you will have a function to generate code for an int pattern match with known bounds anyways, as part of a binary search algorithm, so this should not complicate the code greatly.

Matching against literals literals is fishy anyways, so my suggestion is to simply generate a linear list of equality checks here – turning the intended operation (equality test) into something else (comparisons in a if-then-else tree) feels wrong to me for floats.

I don't see anything wrong with using comparisons for pattern matching on floats as long as the patterns are all finite numbers (though there is trickiness around negative zero to be aware of). But I also think float pattern matching is much less important than your other goals, so I would agree with doing something simple for floats, at least initially.

comment:5 Changed 5 years ago by nomeata

Note that when matching on an enumeration type, we can assume that the constructor tag is within the range of possible tag values. We cannot make any such assumption for matches on ints. So, we should remember the extra information that we have about the range when doing this reduction.

Right. The datatype is currently

data SwitchTargets =
    SwitchTargets (Maybe (Integer, Integer)) (Maybe Label) (M.Map Integer Label)

so there optionally is a definite range (absent when matching literal). The function addRange in the new CmmSwitch module (currently in branch wip/T10137) takes care of that case, before the general layout algorithm runs on the remaining range.

comment:6 Changed 5 years ago by nomeata

Owner: set to nomeata

I’m happy to have some code review on Phab:D720, but it is still work-in-progress.

comment:7 Changed 5 years ago by nomeata

Description: modified (diff)

comment:8 Changed 5 years ago by nomeata

Fist benchmarks results are in; binary sizes are reduced, runtime changes varies with a small tendency towards improvement:

                          Size    Allocs   Runtime   Elapsed  TotalMem
            Min          -1.8%     -0.0%     -4.5%     -7.8%      0.0%
            Max          -0.7%      0.0%     +3.3%     +3.4%     +2.9%
 Geometric Mean          -1.4%     -0.0%     -0.2%     -0.6%     +0.1%

The code still has some magic constants where an optimum could probably be found somehow.

comment:9 Changed 5 years ago by nomeata

Status: newpatch

Ok, I think the code is ready for some review; see Phab:D720. I’d like to land this before adding some of the extra bells and whistles, such as branches code in situations where this is possible.

Note that although #10124 and the story about branchless codes made me look at this, the rewrite of code generation for switches stands on its own, as it improves upon the previous situations. In particular, we currently never generate jump tables for case expressions on Int# and Word#, which is quite a waste (previously reported as #9159 as well), and we were building the if-then-else tree before creating jump tables, thus possibly splitting them in the middle.

comment:10 Changed 5 years ago by Joachim Breitner <mail@…>

In de1160be047790afde4ec76de0a81ba3be0c73fa/ghc:

Refactor the story around switches (#10137)

This re-implements the code generation for case expressions at the Stg →
Cmm level, both for data type cases as well as for integral literal
cases. (Cases on float are still treated as before).

The goal is to allow for fancier strategies in implementing them, for a
cleaner separation of the strategy from the gritty details of Cmm, and
to run this later than the Common Block Optimization, allowing for one
way to attack #10124. The new module CmmSwitch contains a number of
notes explaining this changes. For example, it creates larger
consecutive jump tables than the previous code, if possible.

nofib shows little significant overall improvement of runtime. The
rather large wobbling comes from changes in the code block order
(see #8082, not much we can do about it). But the decrease in code size
alone makes this worthwhile.

```
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
            Min          -1.8%      0.0%     -6.1%     -6.1%     -2.9%
            Max          -0.7%     +0.0%     +5.6%     +5.7%     +7.8%
 Geometric Mean          -1.4%     -0.0%     -0.3%     -0.3%     +0.0%
```

Compilation time increases slightly:
```
        -1 s.d.                -----            -2.0%
        +1 s.d.                -----            +2.5%
        Average                -----            +0.3%
```

The test case T783 regresses a lot, but it is the only one exhibiting
any regression. The cause is the changed order of branches in an
if-then-else tree, which makes the hoople data flow analysis traverse
the blocks in a suboptimal order. Reverting that gets rid of this
regression, but has a consistent, if only very small (+0.2%), negative
effect on runtime. So I conclude that this test is an extreme outlier
and no reason to change the code.

Differential Revision: https://phabricator.haskell.org/D720

comment:11 Changed 5 years ago by nomeata

Resolution: fixed
Status: patchclosed

Pushed this now, ready to tackle any fallout. Now on to implementing a fix for #10124.

comment:12 Changed 4 years ago by thomie

Milestone: 8.0.1

comment:13 Changed 4 years ago by nomeata

I don’t plan to look into this now, but whoever touches this code next and feels like improving the heuristics that decide whether to do a table or a binary search, see 9159#comment:5 for what the Java compiler does.

comment:14 Changed 4 years ago by nomeata

comment:15 Changed 4 years ago by thomie

Type of failure: None/UnknownRuntime performance bug
Note: See TracTickets for help on using tickets.