#14644 closed task (fixed)

Improve cmm/assembly for pattern matches with two constants.

Reported by: AndreasK
Owned by: AndreasK
Priority: normal
Milestone: 8.6.1
Component: Compiler (CodeGen)
Version: 8.2.2
Keywords: Codegen, CMM, Patterns, Pattern Matching
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Differential Rev(s): Phab:D4294

Description

For a pattern like:

f_test :: Int# -> Int#
f_test a = case a of
    1# -> 33#
    7# -> 34#
    _  -> -1#

GHC currently generates code that works best if the default branch is taken most often.

Pseudo-code:

if (a >= 8)
    return -1;
else {
    if (a < 7) {
        if (a != 1)
            return -1;
        else
            return 33;
    } 
    else
        return 34;
}

CMM:

       c1cr: // global
           if (%MO_S_Ge_W64(R2, 8)) goto c1co; else goto u1cu;
       u1cu: // global
           if (%MO_S_Lt_W64(R2, 7)) goto u1cv; else goto c1cq;
       u1cv: // global
           if (R2 != 1) goto c1co; else goto c1cp;
       c1co: // global
           R1 = (-1);
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c1cp: // global
           R1 = 33;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c1cq: // global
           R1 = 34;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
     }

Wouldn't the following be better?

if (a == 1)
    return 33;
else if (a == 7)
    return 34;
else
    return -1;

I would expect that to:

  • Improve the cases:
    • a = 1
    • a < 7
  • Be the same for:
    • a = 7
  • Be worse for:
    • a >= 8

This would be especially nice for the cases where the default branch raises a pattern-match exception, which is a code path I wouldn't expect to be taken often or to be very performance sensitive.

Even if we keep the current logic, there is room for improvement. GHC currently generates the following assembly:

_c1cr:
	cmpq $8,%r14
	jge _c1co
_u1cu:
	cmpq $7,%r14
	jl _u1cv
_c1cq:
	movl $34,%ebx
	jmp *(%rbp)
_u1cv:
	cmpq $1,%r14
	jne _c1co
_c1cp:
	movl $33,%ebx
	jmp *(%rbp)
_c1co:
	movq $-1,%rbx
	jmp *(%rbp)

It would be nice if we could remove at least one comparison:

_c1cr:
	cmpq $7,%r14
	jg _c1co
_u1cu:
	; no longer necessary: cmpq $7,%r14
	jl _u1cv
_c1cq:
	movl $34,%ebx
	jmp *(%rbp)
_u1cv:
	cmpq $1,%r14
	jne _c1co
_c1cp:
	movl $33,%ebx
	jmp *(%rbp)
_c1co:
	movq $-1,%rbx
	jmp *(%rbp)

Change History (16)

comment:1 Changed 21 months ago by simonpj

Looks plausible to me. I wonder how we'd measure any potential benefit?

comment:2 Changed 21 months ago by AndreasK

It should also be possible to see the difference using criterion or instruction counters when running affected code in a tight loop.

I will take a stab at changing the logic to if ... else if ... else for these cases. I think that's a good starting point to learn how Cmm works inside GHC.

comment:3 Changed 21 months ago by svenpanne

I am not sure if things are that easy: to do these kinds of switches in a really good way, one would need a probability distribution of how often the individual cases actually happen.

If we e.g. assume that all Ints happen equally often in the example above, it would be best to check for <1 and >7 first, so we would get roughly 1.5 comparisons on average. Depending on the number of registers available, you can even get away with 1 subtraction and 1 unsigned comparison for this range check, a classic C hack (ab)using wrap-around for unsigned ints.
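
For concreteness, here is a minimal C sketch of that wrap-around trick, applied to the 1/7 example at the top of the ticket. This is only an illustration of the idea, not code GHC emits:

#include <stdint.h>

long f_test(long a) {
    /* "1 <= a && a <= 7" done with one subtraction and one unsigned
     * comparison: if a < 1, the subtraction wraps around to a huge
     * unsigned value, which fails the <= 6 test. */
    if ((uint64_t)a - 1u <= 6u) {
        if (a == 1) return 33;
        if (a == 7) return 34;
    }
    return -1;               /* default branch */
}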

If we have some hints, e.g. raising a pattern matching exception, we could do better, e.g. assume 0% probability for this case. If we have more detailed (estimated) probabilities, we could do a Huffman-like decision tree. This is where profile-guided optimization shines.
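
As a toy illustration of using such probabilities (the numbers here are made up): if we somehow knew that p(a = 1) = 0.8, p(a = 7) = 0.15 and p(default) = 0.05, testing the most likely value first minimises the expected number of comparisons:

/* Toy sketch with assumed probabilities; expected cost is
 * 0.8 * 1 + (0.15 + 0.05) * 2 = 1.2 comparisons. */
long f_test(long a) {
    if (a == 1) return 33;   /* 80% of calls stop here */
    if (a == 7) return 34;   /* another 15% stop here  */
    return -1;               /* rare default           */
}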

Additional things to consider: Performance in tight loops is often vastly different, because branch prediction/caching will most likely kick in visibly. Correctly predicted branches will cost you almost nothing, while unknown/incorrectly predicted branches will be much more costly. In the absence of more information from their branch predictor, quite a few processors assume that backward branches are taken and forward branches are assumed to be not taken. So code layout has a non-trivial performance impact.

Instruction counts are quite misleading nowadays...

comment:4 in reply to:  3 Changed 21 months ago by AndreasK

Replying to svenpanne:

I am not sure if things are that easy: to do these kinds of switches in a really good way, one would need a probability distribution of how often the individual cases actually happen.

This is by no means meant as a "this is optimal" idea.

As far as I'm aware, there is currently no way to propagate probabilities through GHC's various stages. So even in cases where we can make a judgement about the probability, we can't really use it at the Cmm level.

In the end I suggested it because:

  • I think it has a good chance of being faster
  • It leads to smaller code
  • It seems to be what gcc/clang do (using conditional moves instead of jumps, but still)

But then again, a conditional move doesn't stall the pipeline. I will see if an improvement is measurable, I guess.

Additional things to consider: Performance in tight loops is often vastly different, because branch prediction/caching will most likely kick in visibly. Correctly predicted branches will cost you almost nothing, while unknown/incorrectly predicted branches will be much more costly. In the absence of more information from their branch predictor, quite a few processors assume that backward branches are taken and forward branches are assumed to be not taken. So code layout has a non-trivial performance impact.

Instruction counts are quite misleading nowadays...

Indeed :(

Last edited 21 months ago by AndreasK

comment:5 Changed 20 months ago by AndreasK

Differential Rev(s): Phab:D4294
Owner: set to AndreasK

I ran a simple benchmark and the results seem almost too good, with >10% gains in almost all cases.

If it really is that good then, well, I guess good for us.

Assuming I didn't introduce an error somewhere, the combination of smaller code and fewer instructions on some paths seems to just be better in general.

If we e.g. assume that all Ints happen equally often in the example above, it would be best to check for <1 and >7 first, so we would get roughly 1.5 comparisons on average. Depending on the number of registers available, you can even get away with 1 subtraction and 1 unsigned comparison for this range check, a classic C hack (ab)using wrap-around for unsigned ints.

I thought about this a bit. While true, it also means the first comparison is hard to predict if the numbers are randomly distributed over the range. So two 99.9%-correctly-predicted comparisons might still be better, even without considering code size.
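
(As a rough illustration, assuming a misprediction penalty of about 15 cycles: a first comparison that is mispredicted half the time costs about 0.5 × 15 ≈ 7.5 extra cycles on average, while two comparisons that are each predicted correctly 99.9% of the time add only about 2 × 0.001 × 15 ≈ 0.03 extra cycles.)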

A sequence like range check, jg, je, cmp, je, j <default> could still be better; see the sketch below. But I'm not sure if I will look into that, and it might warrant its own ticket, as GHC also uses this kind of range check for jump tables, I think.
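
For illustration, here is one possible reading of that instruction sequence as C-like code, in the style of the pseudo-code at the top of the ticket. This is only a sketch of the idea, not what the patch generates; the comments show a rough mapping to instructions, with the first je reusing the flags of the range check:

long f_test(long a) {
    if (a > 7)  return -1;   /* range check: cmpq $7 ; jg <default>     */
    if (a == 7) return 34;   /*              je <case 7> (flags reused) */
    if (a == 1) return 33;   /*              cmpq $1 ; je <case 1>      */
    return -1;               /*              jmp <default>              */
}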

Code used for the test:

import System.Environment (getArgs)

f_test :: Int -> Int
f_test 111 = 1111
f_test 222 = 2222
f_test _   = -1

run :: [Int] -> IO ()
run = print . sum . map f_test

--main = print . sum . map f_test $ ([(-1500000000::Int) .. 0])
main :: IO ()
main = do
    args <- getArgs
    if null args
        then putStrLn "Usage: pos|neg|both"
        else case head args of
            "pos"    -> run [0 .. (1500000000::Int)]
            "neg"    -> run [-1500000000::Int .. 0]
            "both"   -> run [-750000000::Int .. 750000000::Int]
            "negrep" -> run . concat . replicate 100000000 $ [-311, -322, -333, -444]
            "posrep" -> run . concat . replicate 100000000 $ [311, 322, 333, 444]
            "unpre"  -> run . concat . replicate 100000000 $ [111, 322, -333, 222]
            _        -> putStrLn "Usage: pos|neg|both"

I get these results for the current approach (range) and mine (if) on an i7-6700K.

With inlining:

benchmarking execute: range pos
mean                 1.196 s    (1.196 s .. 1.198 s)
benchmarking execute: if pos
mean                 806.0 ms   (803.5 ms .. 807.3 ms)

benchmarking execute: range neg
mean                 2.384 s    (2.383 s .. 2.385 s)
benchmarking execute: if neg
mean                 1.841 s    (1.841 s .. 1.842 s)

benchmarking execute: range negrep
mean                 840.1 ms   (838.1 ms .. 841.3 ms)
benchmarking execute: if negrep
mean                 728.6 ms   (728.3 ms .. 728.8 ms)

benchmarking execute: range unpre
mean                 852.2 ms   (851.1 ms .. 853.1 ms)
benchmarking execute: if unpre
mean                 789.7 ms   (789.3 ms .. 789.9 ms)

With inlining disabled on f_test:

benchmarking execute: range pos
mean                 2.385 s    (2.383 s .. 2.386 s)
benchmarking execute: if pos
mean                 2.383 s    (2.382 s .. 2.384 s)

benchmarking execute: range neg
mean                 2.845 s    (2.839 s .. 2.848 s)
benchmarking execute: if neg
mean                 2.047 s    (2.041 s .. 2.053 s)

benchmarking execute: range negrep
mean                 1.204 s    (1.201 s .. 1.205 s)
benchmarking execute: if negrep
mean                 829.4 ms   (828.4 ms .. 830.1 ms)

benchmarking execute: range unpre
mean                 1.165 s    (1.164 s .. 1.166 s)
benchmarking execute: if unpre
mean                 1.118 s    (1.117 s .. 1.119 s)

comment:6 Changed 20 months ago by simonpj

A 10% improvement is amazing.

If you commit this, please include a Note explaining the strategy, and pointing to the ticket.

comment:7 in reply to:  6 Changed 20 months ago by AndreasK

Replying to simonpj:

A 10% improvement is amazing.

Indeed! Nothing in nofib gets up to 10%, but wheel-sieve1 comes close.

  • 17 programs are changed (contain that kind of case statement)
  • 15 show no significant runtime difference. (It's fair to assume it's an improvement, but it's not measurable with the noise on my machine.)
  • k-nucleotide improves by ~1.2%
  • wheel-sieve1 has the code in an inner loop. It improves by ~4.3% in mode=normal. Increasing the problem size makes the difference even bigger, up to somewhere around 8-9%.

If you commit this, please include a Note explaining the strategy, and pointing to the ticket.

Will do.

The patch is up on Phab. If someone can run this on an older Intel (or, even better, AMD) processor and check the result, that would be great, to make sure it's not just a peculiarity of my i7-6700K.

comment:8 Changed 20 months ago by AndreasK

Status: new → patch

comment:9 Changed 20 months ago by simonpj

Didn't you say "I ran a simple benchmark and the results seem almost too good, with >10% gains in almost all cases"? But now you say "mostly no change; one gets 5-9%". Were the earlier measurements wrong?

Anyway the patch looks good. Thanks

comment:10 in reply to:  9 Changed 20 months ago by AndreasK

Replying to simonpj:

Didn't you say "I ran a simple benchmark and the results seem almost too good, with >10% gains in almost all cases"? But now you say "mostly no change; one gets 5-9%". Were the earlier measurements wrong?

No, those are right. But the simple benchmark was running the code I posted above in comment 5; "all cases" referred to the pos/neg/... branches there.

Nofib seems to be in a permanent state of breakdown on Windows, so I didn't use it for my initial benchmarks. And that was the right call, as it took some work to get it running.

comment:11 in reply to:  3 Changed 20 months ago by AndreasK

Replying to svenpanne:

Additional things to consider: Performance in tight loops is often vastly different, because branch prediction/caching will most likely kick in visibly. Correctly predicted branches will cost you almost nothing, while unknown/incorrectly predicted branches will be much more costly. In the absence of more information from their branch predictor, quite a few processors assume that backward branches are taken and forward branches are assumed to be not taken. So code layout has a non-trivial performance impact.

I went over Agner's guide and it seems like this is only true for NetBurst CPUs, the last of which was released in 2001, so I'm not too worried about these. And even if you have one of these, according to Agner:

It is rarely worth the effort to take static prediction into account. Almost any branch that is executed sufficiently often for its timing to have any significant effect is likely to stay in the BTB so that only the dynamic prediction counts.

All other architectures he lists default to not taken if they use static prediction at all.


What might help explain the difference is that not-taken jumps should be faster than taken jumps on both modern Intel and AMD CPUs.

If someone wants to dig deeper, Agner probably has enough info in his guides to explain the change completely, based on the generated assembly. But I don't think that is necessary.

comment:12 Changed 20 months ago by simonpj

It would be great to ensure that there is a Note to explain the thinking before you tie the bow on this. Thanks!

Simon

comment:13 Changed 20 months ago by simonpj

Are we good to go on this? I.e. commit?

comment:14 in reply to:  13 Changed 20 months ago by AndreasK

Replying to simonpj:

Are we good to go on this? I.e. commit?

From my side the patch is good to go as it is.

comment:15 Changed 20 months ago by Ben Gamari <ben@…>

In 7ff6023/ghc:

cmm: Use two equality checks for two alt switch with default

For code like:
f 1 = e1
f 7 = e2
f _ = e3

We can treat it as a sparse jump table, check if we are outside of the
range in one direction first and then start checking the values.

GHC currently does this by checking for x>7, then x <= 7 and at last x
== 1.

This patch changes this such that we only compare for equality against
the two values and jump to the default if none are equal.

The resulting code is both faster and smaller.
wheel-sieve1 improves by 4-8% depending on problem size.

This implements the idea from #14644

Reviewers: bgamari, simonmar, simonpj, nomeata

Reviewed By: simonpj, nomeata

Subscribers: nomeata, simonpj, rwbarton, thomie, carter

Differential Revision: https://phabricator.haskell.org/D4294

comment:16 Changed 20 months ago by bgamari

Milestone: 8.6.1
Resolution: fixed
Status: patch → closed