Opened 5 years ago

Last modified 4 years ago

#9342 new feature request

Branchless arithmetic operations

Reported by: hvr Owned by:
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.8.3
Keywords: Cc: simonmar, greg@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

While working on #9281 I noticed sub-optimal code generation for common arithmetic operations on Int. Below are some examples of the generated code; the branchless versions may be desirable on modern CPUs:

abs

-- base version
absI# :: Int# -> Int#
absI# i# = i'#
  where
    !(I# i'#) = abs (I# i#)

-- optimized version
optAbsI# :: Int# -> Int#
optAbsI# i# = (i# `xorI#` nsign) -# nsign
  where
    -- nsign is 0# when i# is non-negative, and -1# (all bits set) when it
    -- is negative; WORD_SIZE_IN_BITS comes from MachDeps.h (needs CPP)
    nsign = uncheckedIShiftRA# i# (WORD_SIZE_IN_BITS# -# 1#)

results in

absI#_info:
_c1To:
        testq %r14,%r14
        jge _c1Tx
_c1Ty:
        movq %r14,%rbx
        negq %rbx
        jmp *(%rbp)
_c1Tx:
        movq %r14,%rbx
        jmp *(%rbp)


optAbsI#_info:
_c1SX:
        movq %r14,%rax
        sarq $63,%rax
        movq %r14,%rbx
        xorq %rax,%rbx
        subq %rax,%rbx
        jmp *(%rbp)
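
The trick here: nsign is 0# for non-negative inputs and -1# (all bits set) for negative ones, so the xor-then-subtract either leaves the value unchanged or computes its two's-complement negation, with no branch. The same identity at the boxed level, purely for illustration (not part of the proposal):

import Data.Bits (xor, shiftR, finiteBitSize)

-- shiftR on Int is an arithmetic shift, so s is 0 for i >= 0 and -1 for i < 0.
-- (i `xor` 0) - 0 == i, and (i `xor` (-1)) - (-1) == complement i + 1 == negate i.
absBranchless :: Int -> Int
absBranchless i = (i `xor` s) - s
  where s = i `shiftR` (finiteBitSize i - 1)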

signum

sgnI# :: Int# -> Int#
sgnI# i# = i'#
  where
    !(I# i'#) = signum (I# i#)

optSgnI# :: Int# -> Int#
optSgnI# x# = (x# ># 0#) -# (x# <# 0#)

results in

sgnI#_info:
_c27W:
        testq %r14,%r14
        jl _c283
_c284:
        testq %r14,%r14
        jne _c28a
_c28b:
        xorl %ebx,%ebx
        jmp *(%rbp)
_c283:
        movq $-1,%rbx
        jmp *(%rbp)
_c28a:
        movl $1,%ebx
        jmp *(%rbp)


optSgnI#_info:
_c26P:
        testq %r14,%r14
        setl %al
        movzbl %al,%eax
        testq %r14,%r14
        setg %bl
        movzbl %bl,%ebx
        subq %rax,%rbx
        jmp *(%rbp)
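
The branchless signum leans on the comparison primops ># and <# returning 0# or 1# (as they do in GHC 7.8 and later). A boxed-level sketch of the same idea, for illustration only:

-- fromEnum on Bool yields 0 or 1, mirroring the unboxed comparison results.
sgnBranchless :: Int -> Int
sgnBranchless x = fromEnum (x > 0) - fromEnum (x < 0)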

compare

cmpI# :: Int# -> Int# -> Int#
cmpI# x# y# = dataToTag# (compare (I# x#) (I# y#))

-- returns 0#, 1# or 2#, which can be fed to tagToEnum# to produce an Ordering
optCmpI# :: Int# -> Int# -> Int#
optCmpI# x# y# = 1# +# (x# ># y#) -# (x# <# y#)

results in

cmpI#_info:
_c25s:
        cmpq %rsi,%r14
        jl _c25z
_c25A:
        cmpq %rsi,%r14
        je _c25M
_c25N:
        movl $2,%ebx
        jmp *(%rbp)
_c25z:
        xorl %ebx,%ebx
        jmp *(%rbp)
_c25M:
        movl $1,%ebx
        jmp *(%rbp)


optCmpI#_info:
_c24N:
        cmpq %rsi,%r14
        setl %al
        movzbl %al,%eax
        movl $1,%ebx
        subq %rax,%rbx
        cmpq %rsi,%r14
        setg %al
        movq %rbx,%rcx
        movzbl %al,%ebx
        addq %rcx,%rbx
        jmp *(%rbp)
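
At the boxed level the same expression yields the Ordering constructor tag directly, which is what the tagToEnum# remark above alludes to. A sketch, again only for illustration:

-- 0, 1 or 2, matching the tags of LT, EQ and GT.
cmpTag :: Int -> Int -> Int
cmpTag x y = 1 + fromEnum (x > y) - fromEnum (x < y)

-- toEnum on Ordering plays the role of tagToEnum# at the boxed level.
cmpBranchless :: Int -> Int -> Ordering
cmpBranchless x y = toEnum (cmpTag x y)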

Attachments (1)

bench.hs (959 bytes) - added by strake888 4 years ago.
benchmark code


Change History (13)

comment:1 Changed 5 years ago by schyler

+1, but please benchmark. I would be skeptical that the branching versions are slower in real tight loops, since I've often noticed that well-predicted branches are very close to "free", while that arithmetic most certainly is not.

Last edited 5 years ago by schyler

comment:2 Changed 5 years ago by jstolarek

Same here: +1 but benchmark. Based on my experience from #6135 I don't expect to see much performance improvement.

comment:3 Changed 5 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:4 Changed 4 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

Changed 4 years ago by strake888

Attachment: bench.hs added

benchmark code

comment:5 Changed 4 years ago by strake888

On my AMD FX-8320, branchless abs and sgn seem speedier but branchless cmp seems slower:

$ ./bench 'abs '
benchmarking abs
time                 37.06 ms   (36.60 ms .. 37.56 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 37.99 ms   (37.68 ms .. 38.37 ms)
std dev              676.9 μs   (490.1 μs .. 926.6 μs)

$ ./bench 'abs'''
benchmarking abs'
time                 34.41 ms   (33.97 ms .. 34.77 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 34.08 ms   (33.95 ms .. 34.25 ms)
std dev              307.3 μs   (243.0 μs .. 418.2 μs)

$ ./bench 'sgn '
benchmarking sgn
time                 36.70 ms   (36.33 ms .. 37.10 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 36.93 ms   (36.79 ms .. 37.03 ms)
std dev              216.7 μs   (127.8 μs .. 343.1 μs)

$ ./bench 'sgn'''
benchmarking sgn'
time                 34.60 ms   (34.24 ms .. 34.92 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 34.18 ms   (34.02 ms .. 34.36 ms)
std dev              348.1 μs   (263.4 μs .. 461.5 μs)

$ ./bench 'cmp '
benchmarking cmp
time                 35.80 ms   (35.04 ms .. 36.80 ms)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 36.79 ms   (36.32 ms .. 37.15 ms)
std dev              875.3 μs   (665.6 μs .. 1.178 ms)

$ ./bench 'cmp'''
benchmarking cmp'
time                 36.41 ms   (35.72 ms .. 37.08 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 36.44 ms   (35.96 ms .. 36.74 ms)
std dev              726.9 μs   (513.8 μs .. 1.044 ms)

$ 
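
For context, a minimal criterion harness along these lines would suffice to reproduce this kind of measurement; this is a hypothetical sketch, not the attached bench.hs:

import Criterion.Main (bench, defaultMain, whnf)
import Data.List (foldl')

-- Hypothetical driver: fold each function over a large range so the
-- per-element arithmetic dominates the measurement.
sumWith :: (Int -> Int) -> Int
sumWith f = foldl' (\acc i -> acc + f i) 0 [-1000000 .. 1000000]

main :: IO ()
main = defaultMain
  [ bench "abs " $ whnf sumWith abs
  , bench "sgn " $ whnf sumWith signum
  ]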

comment:6 Changed 4 years ago by svenpanne

I'd like to point out my previous email on a similar topic: https://mail.haskell.org/pipermail/ghc-devs/2015-April/008858.html

In a nutshell: it's far from clear that code with branches is "sub-optimal", toy benchmarks are totally meaningless (they don't stress e.g. the very limited number of shifting units in CPUs), and benchmarks on a single (micro-)architecture are even less meaningful (what's good for one piece of hardware makes things worse on another). Applying rules of thumb from previous decades doesn't really work anymore today; several other compilers and JITs have already learned this the hard way. I just wanted to warn about this again...

comment:7 Changed 4 years ago by tibbe

As a data point the hashtables package makes effective use of branchless code in practice: https://github.com/gregorycollins/hashtables/search?utf8=%E2%9C%93&q=branchless

comment:8 Changed 4 years ago by svenpanne

Re comment:7: Do we have real-world benchmarks for Atom/i3/i7/Xeon in ia32/x64 modes? ARM? ARM64? With and without branches? Without those, it's unclear whether this is "effective use". My point (again) is: being branchless in itself is a non-goal.

Picking e.g. just https://github.com/gregorycollins/hashtables/blob/9e477b825a98e4f574a0e889e5c7955a1015a430/src/Data/HashTable/Internal/CacheLine.hs#L165 : this is a perfect example of something that will probably make things *slower*, depending on the availability of spare registers. It uses a handful of intermediate values, while the branching code uses none, and a single spill caused by the higher register pressure will be much, much more costly than anything else; ia32 especially suffers here. Furthermore, if the surrounding code makes heavy use of shifts and/or complicated addressing modes, one will have to wait for the shifting unit to become available. None of this is theoretical: we had to revert to the straightforward branching code in Chrome's V8 JavaScript JIT on some platforms/CPUs in similar places. Perhaps the code patterns in GHC-generated code are different, but the only way to know is to benchmark on a wide variety of benchmarks and CPUs. Yes, that's a lot of work and needs some infrastructure, but without it, changes like this are just a shot in the dark.

comment:9 Changed 4 years ago by tibbe

Cc: greg@… added

I'm sure Greg measured. I've CCed him.

comment:10 Changed 4 years ago by gregorycollins

I measured, yes, but not across processors: when I was working on this I optimized for 64-bit i7 (probably Sandy Bridge, IIRC). The version of mask you linked to, with the funky branchless code, was definitely faster on that chip than the simpler alternative:

mask# :: Int# -> Int# -> Int#
mask# !a# !b# = let !(I# z#) = fromEnum (a# ==# b#)
                    !q#      = negateInt# z#
                in q#

(of course this is from when ==# returned Bool rather than Int#).

The difference was about 15-20% IIRC. Unfortunately I've lost the raw numbers, sorry, but as Sven points out they'd be useless anyway for determining how good the change is in aggregate. I'm quite willing to believe that code could be a pessimization on ia32.
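
For reference, on GHC 7.8 and later, where ==# returns Int# (0# or 1#) directly, roughly the same mask can be written without the fromEnum round-trip. A sketch, not taken from the hashtables package (maskEq# is a hypothetical name):

{-# LANGUAGE MagicHash #-}
import GHC.Exts

-- 0# when the arguments differ, -1# (all bits set) when they are equal.
maskEq# :: Int# -> Int# -> Int#
maskEq# a# b# = negateInt# (a# ==# b#)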

comment:11 Changed 4 years ago by thomie

Type of failure: None/Unknown → Runtime performance bug

comment:12 Changed 4 years ago by thomie

Milestone: 8.0.1 → (none)