Opened 2 years ago

Last modified 9 months ago

#13629 new bug

sqrt should use machine instruction on x86_64

Reported by: bgamari Owned by:
Priority: normal Milestone: 8.10.1
Component: Compiler (NCG) Version: 8.0.1
Keywords: Cc: michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: numeric/num009
Blocked By: Blocking:
Related Tickets: #13570 Differential Rev(s): Phab:D3508
Wiki Page:

Description (last modified by kavon)

The NCG currently implements the square root MachOp with an out-of-line call. This resulted in a large performance regression in n-body (see #13570). Fix this.

See discussion here as well.

Change History (24)

comment:1 Changed 2 years ago by bgamari

Differential Rev(s): Phab:D3265Phab:D3508

comment:2 Changed 2 years ago by bgamari

comment:3 Changed 2 years ago by simonpj

The NCG currently implements the square root MachOp with an out-of-line call

Where is this choice made? E.g. I think the sin MachOp really generates a sin instruction.

comment:4 Changed 2 years ago by bgamari

It's made in the native code generator, genCCall in compiler/nativeGen/X86/CodeGen.hs. While we use the fsin instruction on i386, we don't on x86_64 (and i386 with -msse2).

See Phab:D3508 for a quick attempt at fixing the sqrt situation. In the case of trigonometric functions I'm afraid we have little choice but to fall back on libm since SSE provides no trig instructions.

comment:5 Changed 2 years ago by kavon

Description: modified (diff)

comment:6 Changed 2 years ago by bgamari

Test Case: numeric/num009

comment:7 Changed 2 years ago by Ben Gamari <ben@…>

In 9ac22183/ghc:

nativeGen: Use SSE2 SQRT instruction

Reviewers: austin, dfeuer

Subscribers: dfeuer, rwbarton, thomie

GHC Trac Issues: #13629

Differential Revision: https://phabricator.haskell.org/D3508

comment:8 Changed 2 years ago by bgamari

Resolution: fixed
Status: newclosed

comment:9 in reply to:  4 Changed 2 years ago by kavon

Replying to bgamari:

It's made in the native code generator, genCCall in compiler/nativeGen/X86/CodeGen.hs. While we use the fsin instruction on i386, we don't on x86_64 (and i386 with -msse2).

If there is already x87 FPU instruction support in the NCG for x86-32, it might be profitable to reuse that support for x86-64 to speed up trig functions, etc.

The simplest way I see it is to expand the foreign call into an instruction sequence that moves the float from XMM registers to the x87 stack, computes the value, and moves it back to XMM registers. This way we no longer have a C call in a potentially bad place.

It's worth comparing x87 on modern processors against the assembly routine backing the C function first [2]. It seems on platforms like Skylake the x87 fsin takes 50-120 cycles [1], but I'm not sure about the library versions. If they're roughly equivalent, there's likely a benefit to eliding the C call.

[1] http://www.agner.org/optimize/instruction_tables.pdf (Page 223 for Skylake x87)

[2] http://www.muddsnyder.com/pubs/fastSSE2.pdf (This paper compared hand-crafted SSE vs x87 on a Pentium 4, which had a slower x87 FPU than, say, Skylake)

Last edited 2 years ago by kavon (previous) (diff)

comment:10 Changed 2 years ago by bgamari

I think I would prefer to make the C call cheaper (by, for instance, marking it as pure so that the sinker can better optimize code around it) than to bring x87 into the fold. x87's fsin is terribly imprecise (and consequently causes the num009 test to fail on platforms where it is used). Moreover, as you point out, x87 performance varies widely between microarchitectures. Finally, x87's architecture is really a nuisance for the native codegen, requiring a number of hacks. It would be nice if some day we could do away with x87 support entirely (although this likely won't happen any time soon).

comment:11 Changed 2 years ago by bgamari

For the record, I did a quick, rather unscientific, comparison of SSE2 and x87,

#include <time.h>
#include <math.h>
#include <stdio.h>
#include <functional>

const int N = 1e8;

template<typename Ret, typename... Args>
void time_it(const char *name, std::function<Ret(Args...)> f, Args... args) {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        f(args...);
        clock_gettime(CLOCK_MONOTONIC, &end);
        double t = (end.tv_nsec - start.tv_nsec) * 1e-9 + (end.tv_sec - start.tv_sec);
        printf("%s: %f seconds / %d = %f ns per iter\n", name, t, N, t / N / 1e-9);
}

static double test(double x) {
        double y = 0;
        for (int i=0; i<N; i++) {
                y += 1e-4;
                x += sin(y);
        }
        return x;
}

int main() {
        time_it("test", std::function<double(double)>(test), 0.0);
        return 0;
}

For x87 I compiled this with,

$ g++ -lm hi.cpp -O2 -std=c++11 -march=core2   -ffast-math -mfpmath=387

Which produced (ignoring the epilogue responsible for preparing the return value)

0000000000000b00 <_ZL4libmd>:
 b00:   f2 0f 11 44 24 f8       movsd  %xmm0,-0x8(%rsp)
 b06:   dd 44 24 f8             fldl   -0x8(%rsp)
 b0a:   b8 00 e1 f5 05          mov    $0x5f5e100,%eax
 b0f:   dd 05 53 01 00 00       fldl   0x153(%rip)        # c68 <_ZTSPFddE+0xb>
 b15:   d9 ee                   fldz   
 b17:   dd 05 53 01 00 00       fldl   0x153(%rip)        # c70 <_ZTSPFddE+0x13>
 b1d:   dd 05 55 01 00 00       fldl   0x155(%rip)        # c78 <_ZTSPFddE+0x1b>
 b23:   eb 0f                   jmp    b34 <_ZL4libmd+0x34>
 b25:   0f 1f 00                nopl   (%rax)
 b28:   dc c2                   fadd   %st,%st(2)
 b2a:   d9 ca                   fxch   %st(2)
 b2c:   d9 fe                   fsin   
 b2e:   d9 cb                   fxch   %st(3)
 b30:   d9 cc                   fxch   %st(4)
 b32:   d9 ca                   fxch   %st(2)
 b34:   d9 c2                   fld    %st(2)
 b36:   83 e8 01                sub    $0x1,%eax
 b39:   d8 c2                   fadd   %st(2),%st
 b3b:   d9 cd                   fxch   %st(5)
 b3d:   de c4                   faddp  %st,%st(4)
 b3f:   75 e7                   jne    b28 <_ZL4libmd+0x28>
...

For SSE I compiled with,

$ g++ -lm hi.cpp -O2 -std=c++11 -march=core2   -ffast-math

Which produced,

0000000000000b60 <_ZL4libmd>:
 b60:   53                      push   %rbx
 b61:   66 0f ef c9             pxor   %xmm1,%xmm1
 b65:   bb 00 e1 f5 05          mov    $0x5f5e100,%ebx
 b6a:   48 83 ec 10             sub    $0x10,%rsp
 b6e:   f2 0f 11 04 24          movsd  %xmm0,(%rsp)
 b73:   f2 0f 10 05 5d 01 00    movsd  0x15d(%rip),%xmm0        # cd8 <_ZTSPFddE+0xb>
 b7a:   00 
 b7b:   eb 1e                   jmp    b9b <_ZL4libmd+0x3b>
 b7d:   0f 1f 00                nopl   (%rax)
 b80:   f2 0f 10 05 60 01 00    movsd  0x160(%rip),%xmm0        # ce8 <_ZTSPFddE+0x1b>
 b87:   00 
 b88:   f2 0f 58 c1             addsd  %xmm1,%xmm0
 b8c:   e8 4f fd ff ff          callq  8e0 <sin@plt>
 b91:   f2 0f 10 54 24 08       movsd  0x8(%rsp),%xmm2
 b97:   66 0f 28 ca             movapd %xmm2,%xmm1
 b9b:   f2 0f 10 15 3d 01 00    movsd  0x13d(%rip),%xmm2        # ce0 <_ZTSPFddE+0x13>
 ba2:   00 
 ba3:   83 eb 01                sub    $0x1,%ebx
 ba6:   f2 0f 58 04 24          addsd  (%rsp),%xmm0
 bab:   f2 0f 58 d1             addsd  %xmm1,%xmm2
 baf:   f2 0f 11 04 24          movsd  %xmm0,(%rsp)
 bb4:   f2 0f 11 54 24 08       movsd  %xmm2,0x8(%rsp)
 bba:   75 c4                   jne    b80 <_ZL4libmd+0x20>

Note the call to sin which is sent through the PLT. Yuck.

This gave the following results on my i7-6600U (Skylake @ 2.6GHz) (looking at a variety of iteration counts to ensure that the timing isn't dominated by setup time, as well as a few measurements at each count to get a sense for variance) ,

Method iterations Time per iteration (ns)
SSE 1e8 33.91
SSE 1e8 34.08
SSE 1e8 33.08
SSE 5e7 34.47
SSE 5e7 34.76
SSE 5e7 36.36
SSE 1e7 39.69
SSE 1e7 36.45
SSE 1e7 39.40
X87 1e8 37.95
X87 1e8 39.74
X87 1e8 40.94
X87 5e7 37.29
X87 5e7 36.35
X87 5e7 36.60
X87 1e7 37.75
X87 1e7 37.34
X87 1e7 38.30

It seems to me like (in the case of this particularly terrible benchmark) the two implementations are relatively similar, with SSE perhaps having a slight edge. Intriguingly, replacing sin in the example with sqrt changes the outcome dramatically: both timings decrease markedly (as one might expect; sin isn't an easy thing to compute), but X87 is twice as fast as SSE (2.2 ns/iteration vs. 5.5 ns/iteration).

comment:12 Changed 2 years ago by kavon

This is really interesting, thanks for looking into it!

I'm not sure whether there are real programs that will benefit if we inline this library call by using x87 instructions. I'll leave that to someone who has a better intuition about whether there are graphics/mathematics programs that would appreciate the attention.

We're at least no worse than C/C++ in this regard, though we probably pay a slightly higher call-site penalty due to a register mismatch between C and GHC conventions.

comment:13 Changed 2 years ago by dfeuer

Resolution: fixed
Status: closednew

This made n-body slower and did not appreciably affect other benchmarks. It would be really good to figure out what's going on here.

comment:15 Changed 2 years ago by nomeata

If you have reasons to doubt the results, better try to reproduce them on your machine. I see performance cliffs of that size on perf.h.o that are sometimes triggered by very insignificant changes to the runtime system, for example.

comment:16 Changed 2 years ago by bgamari

I've reproduced the regression locally. While it's hard to measure the runtime difference with the default test configuration, if you increase the argument from 3000 to 10000 it becomes quite clear. Before the patch the runtime is 1.14 seconds, after it's 1.30 seconds.

This is quite surprising since there are a few bits of the generated code that get rather significantly shorter. Namely,

        ...
        subq $8,%rsp
        movsd %xmm0,%xmm4
        movss %xmm6,%xmm0
        mulss %xmm6,%xmm0
        mulss %xmm6,%xmm0
        movq %rax,%rbx
        movl $1,%eax
        movq %rcx,%r14
        movq %rdx,72(%rsp)
        movq %rsi,80(%rsp)
        movq %rdi,88(%rsp)
        movsd %xmm4,96(%rsp)
        movsd %xmm1,104(%rsp)
        movsd %xmm2,112(%rsp)
        movsd %xmm3,120(%rsp)
        call sqrtf
        addq $8,%rsp
        movss _n8uc(%rip),%xmm1
        divss %xmm0,%xmm1
        movss %xmm1,%xmm0
        movsd 112(%rsp),%xmm2
        mulss %xmm2,%xmm0
        movss %xmm1,%xmm2
        movsd 104(%rsp),%xmm3
        mulss %xmm3,%xmm2
        movsd 96(%rsp),%xmm3
        mulss %xmm3,%xmm1
        movq 64(%rsp),%rax
        leaq 1(%rax),%rcx

Whereas after we get,

        ...
        movss %xmm6,%xmm4
        mulss %xmm6,%xmm4
        mulss %xmm6,%xmm4
        sqrtss %xmm4,%xmm4
        movss _n8ud(%rip),%xmm5
        divss %xmm4,%xmm5
        movss %xmm5,%xmm4
        mulss %xmm3,%xmm4
        movss %xmm5,%xmm3
        mulss %xmm2,%xmm3
        mulss %xmm1,%xmm5
        leaq 1(%rdx),%rbx

comment:17 Changed 2 years ago by bgamari

So perf stat reveals a rather interesting picture. With the patch we retire few instructions (9,146,722,334 vs. 12,346,134,565), perform far fewer branches (908,661,442 compared to 1,308,583,802 without, the predictor missing around 0.01% in both cases), yet see a significantly higher runtime.

comment:18 Changed 2 years ago by bgamari

Perhaps I'll just start a table of the PMU counters before and after,

Metric Before patch After patch
instructions 12,346,134,565 9,146,722,334
branches 1,308,583,802 908,661,442
branch predict misses 143,844 151,980
cache-misses 83,983 57,490
L1-icache-load-misses 138,869 156,777
topdown-fetch-bubbles 218,555,294 140,545,631
topdown-recovery-bubbles 2,246,152 2,343,984
topdown-slots-issued 11,804,030,981 8,350,690,923
topdown-slots-retired 11,838,646,312 8,348,858,159
topdown-total-slots 10,129,866,070 11,487,578,236
Last edited 2 years ago by bgamari (previous) (diff)

comment:19 Changed 2 years ago by bgamari

For the record, sqrtf is the following on my machine,

   0x00007ffff78892c0 <+0>:     pxor   %xmm1,%xmm1
   0x00007ffff78892c4 <+4>:     ucomiss %xmm0,%xmm1
   0x00007ffff78892c7 <+7>:     ja     0x7ffff78892d0 <__sqrtf+16>
   0x00007ffff78892c9 <+9>:     sqrtss %xmm0,%xmm0
   0x00007ffff78892cd <+13>:    retq
   0x00007ffff78892ce <+14>:    xchg   %ax,%ax
   0x00007ffff78892d0 <+16>:    mov    0x2cccf9(%rip),%rax        # 0x7ffff7b55fd0
   0x00007ffff78892d7 <+23>:    cmpl   $0xffffffff,(%rax)
   0x00007ffff78892da <+26>:    je     0x7ffff78892c9 <__sqrtf+9>
   0x00007ffff78892dc <+28>:    movaps %xmm0,%xmm1
   0x00007ffff78892df <+31>:    mov    $0x7e,%edi
   0x00007ffff78892e4 <+36>:    jmpq   0x7ffff788ed50 <__kernel_standard_f>
Last edited 2 years ago by bgamari (previous) (diff)

comment:20 Changed 2 years ago by kavon

The verbose lowering of sqrt that was there before this patch must have stumbled upon some sort of lucky interaction in the processor (arithmetic vs memory unit ports) in that particular hot patch of code in n-body, which is not something that can be relied on in the grand scheme of things.

What we can tell is that this patch produces better code, more closely matching what LLVM produces as well. bgamari has also kindly quantified how much better it is.

Thus, I think this ~4% drop in n-body is not something to worry about, as it seems even non-functional changes can knock a few nofib runtime numbers by that amount [1]. Ticket #13570 is also a separate issue: I noticed the poor sqrt lowering when I was investigating that ticket.

[1] https://perf.haskell.org/ghc/#revision/945c45ad50ed31e3acb96fdaafb21640c4669f12

comment:21 Changed 2 years ago by michalt

Cc: michalt added

comment:22 Changed 20 months ago by bgamari

Milestone: 8.4.18.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:23 Changed 15 months ago by bgamari

Milestone: 8.6.18.8.1

These will not be addressed in GHC 8.6.

comment:24 Changed 9 months ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of low-priority tickets.

Note: See TracTickets for help on using tickets.