Opened 6 years ago

Last modified 12 months ago

#8578 new task

Improvements to SpinLock implementation

Reported by: parcs Owned by:
Priority: normal Milestone:
Component: Runtime System Version: 7.7
Keywords: Cc: simonmar, dfeuer
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:


The SpinLock implementation has a number of deficiencies. Here is a pair of patches that improves its implementation. Let me know what you think.

Attachments (2)

0001-Consolidate-common-spin-lock-code.patch (2.3 KB) - added by parcs 6 years ago.
0002-Rewrite-ACQUIRE_SPIN_LOCK.patch (2.6 KB) - added by parcs 6 years ago.


Change History (13)

Changed 6 years ago by parcs

comment:1 Changed 6 years ago by parcs

Status: new → patch

comment:2 Changed 6 years ago by parcs

According to a C microbenchmark, the new spinlock code is much more scalable than the old code. However, this scalability improvement has yet to be demonstrated in an actual GHC program. Is there a program one could test that is particularly sensitive to the scalability of the RTS's spinlocks?

comment:3 Changed 6 years ago by simonmar

The patch looks like a nice improvement, thanks! Before committing I want to run some benchmarks. The usual ones I use are nofib/parallel running on various numbers of cores, at least 8. I can try these sometime next week.

comment:4 Changed 6 years ago by thoughtpolice

Very good stuff, Patrick - this looks nice to me, and I can see how it'd help scalability.

I took a quick glance, and one question I have is: why are we using busy_wait_nop in the wait loop, which just becomes rep; nop on x86?

For spinlock implementations, Intel at least has a PAUSE instruction, which specifically informs the CPU that this is a spinlock wait loop. That allows it to optimize cache and memory accesses where possible, avoiding memory-ordering violations that would otherwise force the CPUs to synchronize. The effect can be quite dramatic on older processors, I believe (I wouldn't be surprised if rep; nop devolved into pause on some microarchitectures, but that's probably not guaranteed). PAUSE might also actually *delay* the CPU to let this optimization happen, whereas rep; nop will merely run as fast as possible. So I'd suggest replacing busy_wait_nop on x86 with something like _mm_pause from GCC:

#define _mm_pause()                             \
  ({                                            \
    __asm__ __volatile__("pause" ::: "memory"); \
  })

Finally, one thing I did for fun a while back was write an accelerated spinlock implementation using the new TSX extensions in Intel processors. In my very non-scientific experiments, this essentially eliminated any lock contention (or even any need to take a lock) in a lot of cases. I wonder if it's worth adding here to see if there's any difference in throughput or contention. You can find the code here (pinpointed to the code in question):

Note my example does *not* use lock-elision prefixes for xchg (which are backwards compatible), but instead uses the new TSX RTM instructions to wrap locks/unlocks in a transaction, allowing speculative elision. In practice this works just as well, and RTM is more flexible.

It would, however, require checking cpuid to see if TSX is available and, if so, dynamically replacing the code path as I have done. On Haswell, this indirect branch would probably be predicted so well that its overhead is basically zero, but I don't know what it might do on e.g. AMD machines in terms of prediction or throughput.

In any case - Patrick, thanks. If you'd like to test the elision thing, or the impact of _mm_pause, I have a powerful Haswell server available (8 hardware threads/32GB RAM) you can play around with for testing.

comment:5 Changed 6 years ago by simonmar

"rep; nop" is the pause instruction :-)

comment:6 Changed 6 years ago by thoughtpolice

Ah, yes, you're right - I did some more digging. The tricky bit is that rep; nop and pause actually share the same opcode (F3 90) for backwards compatibility, but on newer CPUs rep; nop is treated magically (letting other hyperthreads on the same core run), and pause is simply an alias to make that clearer.

In that case, ignore me and my inability to keep up with Intel shenanigans.

comment:7 Changed 6 years ago by simonmar

Here are my results with -N4 on an Intel Core i7-3770 (4 cores, 8 threads).

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
   blackscholes          +0.0%     +0.0%     -1.7%     -2.4%     -0.3%
          coins          +0.0%     -0.0%     +0.4%     +1.0%     -8.6%
           gray          +0.0%     +0.0%    +15.1%    +14.3%     +0.0%
         mandel          +0.0%     +0.0%     +3.3%     +3.3%     -0.8%
        matmult          +0.0%     +8.1%     -2.4%     -2.6%     +0.0%
        minimax          +0.0%     +0.0%     -1.3%     -1.1%     +0.0%
          nbody          +0.0%     -6.0%     -1.9%      0.06     +0.0%
         parfib          +0.0%     +0.1%    +16.2%    +16.2%     +0.0%
        partree          +0.0%     -0.0%     +1.0%     +0.5%     -3.0%
           prsa          +0.0%     -0.1%     +1.1%     +0.9%     +0.0%
         queens          +0.0%     -0.5%     -1.3%     -0.5%     +7.1%
            ray          +0.0%     -0.3%     -0.4%     -0.5%     +0.0%
       sumeuler          +0.0%     +0.0%     +1.0%     +1.0%     +0.0%
      transclos          +0.0%     +0.0%     +1.2%     +1.4%     +0.0%
            Min          +0.0%     -6.0%     -2.4%     -2.6%     -8.6%
            Max          +0.0%     +8.1%    +16.2%    +16.2%     +7.1%
 Geometric Mean          +0.0%     +0.1%     +2.0%     +2.3%     -0.4%

Not good! Two programs (gray and parfib) are significantly worse.

The effect is real, here is the timing info for parfib before and after:

5.70user 0.00system 0:01.43elapsed 397%CPU (0avgtext+0avgdata 20816maxresident)k
6.52user 0.00system 0:01.64elapsed 397%CPU (0avgtext+0avgdata 21568maxresident)k

I wonder whether not using a locked instruction in the spinlock might cause the loop to spin for longer, because it takes longer for the memory write to reach the core that is waiting for it.

Someone could probably dig into this further with perf. But the lesson here, as usual, is to always benchmark and don't just assume that because it looks good it will work in practice!

comment:8 Changed 6 years ago by simonmar

Owner: parcs deleted
Status: patch → new

Moving out of the patch state pending further investigation.

comment:9 Changed 4 years ago by thomie

Type of failure: None/Unknown → Runtime performance bug

comment:10 Changed 12 months ago by dfeuer

Since thoughtpolice mentioned them, I'm wondering if there are specific places we should look at as HLE (or perhaps RTM) candidates. I don't think we can rely on the system glibc to use them to implement mutexes unless we ask for that specifically in some fashion. I'd expect very small and short transactions (most but not all of what we do) to benefit. Has anyone looked into that yet?

comment:11 Changed 12 months ago by dfeuer

Cc: dfeuer added