Opened 4 years ago

Last modified 4 years ago

#10124 new bug

Simple case analyses generate too many branches

Reported by: bgamari Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.8.4
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #6135, #9661,#10137 Differential Rev(s):
Wiki Page:

Description

Take the following example,

f :: Int -> Bool
f a = case a of
        1  -> True
        5  -> True
        8  -> True
        9  -> True
        11 -> True
        19 -> True
        _  -> False
{-# NOINLINE f #-}

main = print $ f 8

This gets lowered to the following Core (by GHC 7.8),

f
f =
  \ a_s3EH ->
    case a_s3EH of _ { I# ds_s3EJ ->
    case ds_s3EJ of _ {
      __DEFAULT -> False;
      1 -> True;
      5 -> True;
      8 -> True;
      9 -> True;
      11 -> True;
      19 -> True
    }
    }

I have expected GHC to lower this into a nice string of logical operations, with perhaps a couple of branches at the end to determine the result.

Unfortunately, this is not what happens. Instead the C-- is a sea of branches,

       c3F7:
           _s3EK::I64 = I64[R1 + 7];
           if (%MO_S_Lt_W64(_s3EK::I64, 9)) goto c3Fz; else goto c3FA;
       c3Fz:
           if (%MO_S_Lt_W64(_s3EK::I64, 5)) goto c3Fq; else goto c3Fr;
       c3Fq:
           if (_s3EK::I64 != 1) goto c3Ff; else goto c3Fg;
       c3Fr:
           if (%MO_S_Lt_W64(_s3EK::I64, 8)) goto c3Fn; else goto c3Fo;
       c3Fn:
           if (_s3EK::I64 != 5) goto c3Ff; else goto c3Fg;
       c3Fo:
           if (_s3EK::I64 != 8) goto c3Ff; else goto c3Fg;
       c3FA:
           if (%MO_S_Lt_W64(_s3EK::I64, 11)) goto c3Fw; else goto c3Fx;
       c3Fw:
           if (_s3EK::I64 != 9) goto c3Ff; else goto c3Fg;
       c3Fx:
           if (%MO_S_Lt_W64(_s3EK::I64, 19)) goto c3Ft; else goto c3Fu;
       c3Ft:
           if (_s3EK::I64 != 11) goto c3Ff; else goto c3Fg;
       c3Fu:
           if (_s3EK::I64 != 19) goto c3Ff; else goto c3Fg;
       c3Ff:
           R1 = False_closure+1;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c3Fg:
           R1 = True_closure+2;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
...

Which gets turned into the branchy assembler that you would expect. To my surprise even the LLVM backend isn't able to bring this back into a branchless form.

Change History (20)

comment:1 Changed 4 years ago by bgamari

Type of failure: None/UnknownRuntime performance bug

comment:2 Changed 4 years ago by bgamari

This is very closely related to the unboxed bool work mentioned in #6135 and https://ghc.haskell.org/trac/ghc/wiki/PrimBool

Last edited 4 years ago by bgamari (previous) (diff)

comment:3 Changed 4 years ago by simonpj

What did you mean by "a nice string of logical operations with a couple of branches"? What code do you want from this case expression?

Simon

comment:4 Changed 4 years ago by bgamari

I would hope that the example above would get lowered to something like the following,

let !p = a ==# 1 `orI#` a ==# 5 `orI#` a ==# 8 `orI#` a ==# 9 `orI#` a ==# 11 `orI#` a ==# 19
case p of 
  1 -> do_something
  0 -> do_something_else

The idea is that I'd want to avoid emitting branches in this case.

comment:5 Changed 4 years ago by jstolarek

comment:6 Changed 4 years ago by simonpj

Ah, I see. So maybe you want a rule like this:

isTrue# a || isTrue# b = isTrue# (a `orI#` b)

NB that the (==) method for Int says

(I# x) `eqInt` (I# y) = isTrue# (x ==# y)

You'd have delay inlining (||) a bit, but you could do that.

Simon

comment:7 Changed 4 years ago by nomeata

Would could go further and expect

f =
  \ a_s3EH ->
    case a_s3EH of _ { I# ds_s3EJ ->
    case ds_s3EJ of _ {
      __DEFAULT -> x;
      1 -> y;
      5 -> y;
      8 -> y;
      9 -> y;
      11 -> y;
      19 -> y
    }
    }

to be replaced by

f =
  \ a_s3EH ->
    case a_s3EH of _ { I# ds_s3EJ ->
    let p! = some_branchless_formula_involving ds_s3EJ
    case p of _ {
      __DEFAULT -> x;
      1 -> y;
    }
    }

where some_branchless_formula_involving could be any expression that is 1 for 1,5,8,9,11,19 – whether it is a disjunction of equalities or some fancy bit-fiddling magic.

Even more general, how about turning

f =
  \ a_s3EH ->
    case a_s3EH of _ { I# ds_s3EJ ->
    case ds_s3EJ of _ {
      __DEFAULT -> x;
      1 -> y;
      5 -> y;
      8 -> z;
      9 -> y;
      11 -> z;
      19 -> z
    }
    }

into

f =
  \ a_s3EH ->
    case a_s3EH of _ { I# ds_s3EJ ->
    let p! = some_branchless_formula_involving ds_s3EJ
    case p of _ {
      __DEFAULT -> x;
      1 -> y;
      2 -> z;
    }
    }

This would generate one branch per unique right-hand-side of a ->, instead of one branch per literal matched against.

Not sure if Core is the right place for this, though – it feels more like instruction selection in the code generator.

And indeed, emitCmmSwitch and emitCmmLitSwitch in StgCmmUtils is where currently, fancy if-then-else trees are generated. Maybe that’s where one could try to generate branchless code? (At that stage, it would involve detecting which branches actually contain the same code; not sure if that is desirable.)

Last edited 4 years ago by nomeata (previous) (diff)

comment:8 Changed 4 years ago by simonpj

Well Joachim's stuff depends on spotting common RHSs. I'd rather avoid producing them in the first place. But one could imagine doing both.

Simon

comment:9 Changed 4 years ago by bgamari

nomeata, indeed LLVM transforms the C analogue to the example given here into only 9 instructions.

I'd agree that it seems better to avoid breaking up the case expression to begin with.

This weekend I focused on giving the user the ability to write the branchless implementation explicitly by fixing #9661. It would be nice to have a story for how we could transform cases into branch-less form without user intervention, although I'm not sure how to know when this transform will pay off.

comment:10 Changed 4 years ago by nomeata

I'd agree that it seems better to avoid breaking up the case expression to begin with.

Not sure if that helps, after all, we want to have good code even when the user writes a case statement to begin with:

myIsSpace ' ' = True
myIsSpace '\n' = True
myIsSpace '\t' = True
myIsSpace _ = False

I would expect the compiler to (try to) create optimal code from this specification, whether it’s a linear list of condition, an if-then-else-tree or a jump table.

I looked at the CG, which even has a pass to unify the common branches, but we need CMM first, so we have to generate some assembly. We currently unconditionally generate an if-then-else tree there, so it is hard to rewrite that into something smarter.

Maybe the way forward would be to implement this todo from 11 years ago:

emitCmmLitSwitch :: CmmExpr                    -- Tag to switch on
               -> [(Literal, CmmAGraphScoped)] -- Tagged branches
               -> CmmAGraphScoped              -- Default branch (always)
               -> FCode ()                     -- Emit the code
-- Used for general literals, whose size might not be a word,
-- where there is always a default case, and where we don't know
-- the range of values for certain.  For simplicity we always generate a tree.
--
-- ToDo: for integers we could do better here, perhaps by generalising
-- mk_switch and using that.  --SDM 15/09/2004

For emitCmmSwitch there is already a case that generates a CmmSwitch statement, used when compiling to C. So we might want to try to create a switch statement also in emitCmmLitSwitch when compiling via LLVM, to leave it to the compiler.

If we emit a switch statement in emitCmmLitSwitch unconditionally, then we can run the common block transformation first, and later, in a separate step replace the CmmSwitch by a tree of if-then-else statements, or by some branchless code, or whatever looks best by then. For LLVM, we might simply leave it as a switch statement.

In order to do so, the type of CmmSwitch would have to support sparse switch statements better – it currently takes a [Maybe Label].

(Disclaimer: I don’t know much about code generation. If I’m not helpful here, just tell me :-))

comment:11 Changed 4 years ago by simonpj

I'm all for doing both!

  • Try to avoid duplication in the first place (comment:6)
  • Recover sharing even if duplication happens, or the programmer wrote it (comment:10)

Re comment:10, one could do it in Cmm, but I suspect it'd be better done in Core. Could be a variant of Note [Combine identical alternatives] in SimplUtils.

Over to you, Ben.

comment:12 Changed 4 years ago by nomeata

I believe that at least some of this can and should be handled in the code generator, where the code generation for branching on integers can be improved anyways. I describe a possible plan in #10137.

comment:13 Changed 4 years ago by Joachim Breitner <mail@…>

In de1160be047790afde4ec76de0a81ba3be0c73fa/ghc:

Refactor the story around switches (#10137)

This re-implements the code generation for case expressions at the Stg →
Cmm level, both for data type cases as well as for integral literal
cases. (Cases on float are still treated as before).

The goal is to allow for fancier strategies in implementing them, for a
cleaner separation of the strategy from the gritty details of Cmm, and
to run this later than the Common Block Optimization, allowing for one
way to attack #10124. The new module CmmSwitch contains a number of
notes explaining this changes. For example, it creates larger
consecutive jump tables than the previous code, if possible.

nofib shows little significant overall improvement of runtime. The
rather large wobbling comes from changes in the code block order
(see #8082, not much we can do about it). But the decrease in code size
alone makes this worthwhile.

```
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
            Min          -1.8%      0.0%     -6.1%     -6.1%     -2.9%
            Max          -0.7%     +0.0%     +5.6%     +5.7%     +7.8%
 Geometric Mean          -1.4%     -0.0%     -0.3%     -0.3%     +0.0%
```

Compilation time increases slightly:
```
        -1 s.d.                -----            -2.0%
        +1 s.d.                -----            +2.5%
        Average                -----            +0.3%
```

The test case T783 regresses a lot, but it is the only one exhibiting
any regression. The cause is the changed order of branches in an
if-then-else tree, which makes the hoople data flow analysis traverse
the blocks in a suboptimal order. Reverting that gets rid of this
regression, but has a consistent, if only very small (+0.2%), negative
effect on runtime. So I conclude that this test is an extreme outlier
and no reason to change the code.

Differential Revision: https://phabricator.haskell.org/D720

comment:14 Changed 4 years ago by nomeata

With the above refactoring in place, it should be easier now to implement this on the Cmm level. I had a quick look, but I’m a bit lost with Cmm here (conditionals vs. boolean ops etc.). If someone can tell me what the correct Cmm code should be for, say, the example from comment:10, that’d be helpful.

comment:15 Changed 4 years ago by nomeata

Hmm, given the example from comment:10, we currently generate this code:

block_c3S3_info:
_c3S3:
	movq 7(%rbx),%rax
	cmpq $10,%rax
	jb _u3Si
_u3Sj:
	cmpq $11,%rax
	jb _c3Sh
_u3Sk:
	cmpq $32,%rax
	jne _c3Se
_c3Sh:
	movl $GHC.Types.True_closure+2,%ebx
	addq $8,%rbp
	jmp *(%rbp)
_c3S7:
	movl $myIsSpace_rkB_closure,%ebx
	jmp *-8(%r13)
_c3Se:
	movl $GHC.Types.False_closure+1,%ebx
	addq $8,%rbp
	jmp *(%rbp)
_u3Si:
	cmpq $9,%rax
	jb _c3Se
	jmp _c3Sh

while my code now generates

_c3S3:
	movq 7(%rbx),%rax
	cmpq $32,%rax
	setne %bl
	movzbl %bl,%ebx
	cmpq $10,%rax
	setne %cl
	movzbl %cl,%ecx
	andq %rbx,%rcx
	cmpq $9,%rax
	setne %al
	movzbl %al,%eax
	andq %rcx,%rax
	testq %rax,%rax
	jne _c3Se
_c3Sh:
	movl $GHC.Types.True_closure+2,%ebx
	addq $8,%rbp
	jmp *(%rbp)
_c3S7:
	movl $myIsSpace_rkB_closure,%ebx
	jmp *-8(%r13)
_c3Se:
	movl $GHC.Types.False_closure+1,%ebx
	addq $8,%rbp
	jmp *(%rbp)
	.size myIsSpace_rkB_info, .-myIsSpace_rkB_info

But not even this simple microbenchmark

main :: IO ()
main = x `seq` return ()
  where x = length $ filter myIsSpace $ concatMap (replicate 100000000) $ ['\001'..'z']

shows a change in performance. bgamari, is that expected? Is it just that the assembly (generated from

if ((_s3Rr::I64 != 9) & (_s3Rr::I64 != 10) & (_s3Rr::I64 != 32) != 0) goto c3Se; else goto c3Sh;

is too bad, or is this not a test case where this will help a lot?

comment:16 Changed 4 years ago by rwbarton

This benchmark does quite a lot of allocation, unless that changes in HEAD (I tested with 7.10.1). It would be better for microbenchmarking purposes to avoid all the allocation.

Anyways I would expect the original to be a bit faster for this particular microbenchmark, since the branches will all be predicted correctly and a predicted branch is basically free. It might be a different matter with concat (replicate 100000000 ['\001'..'z']).

comment:17 Changed 4 years ago by rwbarton

Oh, I needed a {-# NOINLINE myIsSpace #-} to avoid the allocation, interesting...

comment:18 Changed 4 years ago by rwbarton

That branchless assembly is certainly not the greatest though it doesn't look too terrible to me either. What happens if you use the LLVM backend?

comment:19 Changed 4 years ago by nomeata

It might be a different matter with concat (replicate 100000000 ['\001'..'z']).

Eh, that’s of course the code I intended, thanks for noticing!

... unfortunately, not even that code gets faster:

$ python -m timeit -s 'import os' 'os.system("./IsSpace-before")'
10 loops, best of 3: 4.9 sec per loop
$ python -m timeit -s 'import os' 'os.system("./IsSpace-after")'
10 loops, best of 3: 4.9 sec per loop

comment:20 Changed 4 years ago by nomeata

I think these numbers might be wrong due to mistakes on my side. Now I get

1 loops, best of 10: 27.1 sec per loop
1 loops, best of 10: 41.5 sec per loop

so the branchless code is actually much more expensive.

(This is testing length $ filter myIsSpace $ concatMap (const ['\001'..'z']) $ [0..50000000::Int] which produces better code than replicate.)

Also in light of Sven Pannes’s valuable comments (https://mail.haskell.org/pipermail/ghc-devs/2015-April/008858.html) I’m inclined to abandon this idea.

Note: See TracTickets for help on using tickets.