Opened 4 years ago

Last modified 2 years ago

#10229 new bug

setThreadAffinity assumes a certain CPU virtual core layout

Reported by: nh2
Owned by: simonmar
Priority: normal
Milestone:
Component: Runtime System
Version: 7.10.1
Keywords:
Cc: nh2, thomie, simonmar, maoe, pacak
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets: #1741
Differential Rev(s):
Wiki Page:


The RTS -qa option, which sets thread affinity, is implemented by setThreadAffinity:

// Schedules the thread to run on CPU n of m.  m may be less than the
// number of physical CPUs, in which case, the thread will be allowed
// to run on CPU n, n+m, n+2m etc.
setThreadAffinity (nat n, nat m)
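
To make the comment concrete, here is a minimal Python sketch of the set of CPUs it describes. This is not the RTS code (which is C); the function name `affinity_set` and the `total_cpus` parameter are assumptions made for illustration, and CPUs are numbered from 0.

```python
def affinity_set(n, m, total_cpus):
    """Return the CPUs n, n+m, n+2m, ... that are below total_cpus.

    Models the behaviour described in the setThreadAffinity comment.
    Illustration only, not the RTS implementation.
    """
    return list(range(n, total_cpus, m))


# With 4 capabilities on 8 virtual CPUs, capability 0 may run on CPUs 0 and 4.
print(affinity_set(0, 4, 8))  # prints [0, 4]
```

Counting from 1 instead of 0, this gives the pairs (1,5), (2,6), (3,7), (4,8) for -N4 on 8 vCPUs.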

Today I discovered that on some machines, this option helps parallel performance (e.g. +RTS -N4) a lot, while on others it doesn't.

Together with thomie on #ghc, I found out the reason:

Let's assume I have 4 real cores with hyperthreading, so 8 virtual cores.

The mapping of hyperthreading cores to physical cores is different across machines.

On one of my machines (Intel i5), the layout is 11223344, meaning that the first two vCPUs (hyperthreads) that the OS announces (visible e.g. in htop) map to the first physical core in the system, and so on.

On my other machine (Intel Xeon), the layout is 12341234; here the 1st and the 5th vCPU map to the same physical core.

This layout can be (on Linux) observed by running:

cat /proc/cpuinfo|egrep "processor|physical id|core id" |sed 's/^processor/\nprocessor/g'

I do not know whether this layout is dictated by the processor, chosen by the OS, or even changing across reboots; what is clear is that the layout can vary across machines.
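
The same information can be extracted programmatically. The following is a rough sketch, assuming the x86 Linux field names used in the command above (`processor`, `physical id`, `core id`); other architectures may not report them. The function names `parse_layout` and `read_layout` are made up for this example.

```python
def parse_layout(cpuinfo_text):
    """Map each logical processor number to a (physical id, core id) pair.

    A field that is missing from the input is reported as None.
    """
    layout = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
            layout[current] = [None, None]
        elif key == "physical id" and current is not None:
            layout[current][0] = int(value)
        elif key == "core id" and current is not None:
            layout[current][1] = int(value)
    return {cpu: tuple(ids) for cpu, ids in layout.items()}


def read_layout(path="/proc/cpuinfo"):
    """Read and parse the given cpuinfo file (Linux only)."""
    with open(path) as f:
        return parse_layout(f.read())
```

On a Linux machine, `read_layout()` returns one entry per vCPU; two vCPUs with the same pair are hyperthreads of the same physical core.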

Now, as explained by thomie:

-qa will set your 4 capabilities to cores [(1,5), (2,6), (3,7), (4,8)], and then the os randomly chooses out of those tuples

This strategy is optimal for the 12341234 layout; for example, when running with -N4, it ensures that no two threads are scheduled onto vCPUs that share a physical core. The possible +RTS -qa choice 1__4_23_ is a great assignment in this case, as is 1234____ (_ means the vCPU is not chosen).

But for the 11223344 layout, the choice 1234____ isn't good, because it uses only 2 of our 4 physical cores; our program now takes twice as long to run.
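
The difference can be checked with a small Python sketch that counts how many physical cores a given choice touches. It uses the string notation from this ticket; the helper name `physical_cores_used` is an assumption for illustration.

```python
def physical_cores_used(layout, chosen):
    """Count the distinct physical cores covered by a choice of vCPUs.

    layout: one physical-core label per vCPU, e.g. "11223344".
    chosen: same length as layout; "_" marks a vCPU that is not chosen.
    """
    return len({core for core, pick in zip(layout, chosen) if pick != "_"})


for layout in ("12341234", "11223344"):
    for chosen in ("1234____", "1__4_23_"):
        print(layout, chosen, physical_cores_used(layout, chosen))
```

Under the 12341234 layout both choices cover all 4 physical cores, while under 11223344 the choice 1234____ covers only 2.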

It seems likely to me that setThreadAffinity was written on a machine with 12341234 layout, and with the assumption that all machines have this layout.

It would be great if we could change it to take the actual layout into account.
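
As a rough illustration of what "taking the layout into account" could mean, the Python sketch below picks vCPUs so that every physical core is used once before any core is used twice. This is only one possible policy, not a proposal for the RTS implementation; the function name `pick_vcpus` is hypothetical, and the layout is given in the string notation from above.

```python
def pick_vcpus(layout, count):
    """Pick up to `count` vCPU indices (0-based), spreading over physical cores.

    layout: one physical-core label per vCPU, e.g. "11223344".
    Every physical core receives one pick before any core receives a second.
    """
    # Group vCPU indices by the physical core they belong to.
    by_core = {}
    for vcpu, core in enumerate(layout):
        by_core.setdefault(core, []).append(vcpu)

    picked = []
    round_no = 0
    while len(picked) < count:
        added = False
        for siblings in by_core.values():
            if round_no < len(siblings) and len(picked) < count:
                picked.append(siblings[round_no])
                added = True
        if not added:
            break  # fewer vCPUs available than requested
        round_no += 1
    return picked


print(pick_vcpus("11223344", 4))  # prints [0, 2, 4, 6]
print(pick_vcpus("12341234", 4))  # prints [0, 1, 2, 3]
```

For both layouts, a request for 4 vCPUs lands on 4 different physical cores.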

Change History (6)

comment:1 Changed 4 years ago by fryguybob

I have a version of GHC that I use to allow explicit setting of thread affinity for GHC capabilities.

For a real patch we would want to think about the details of the RTS flag and file format, as well as friendlier command-line options for common settings and low core counts (things that, at the moment, I don't have time to do). I needed explicit settings to get consistent results on a Xeon E5-2699v3 machine with 72 threads, where we wanted to consider not only hyperthreads and sockets but also the proximity of particular cores on the same die. Without setting thread affinity for capabilities, results were quite scattered.

comment:2 Changed 4 years ago by fryguybob

Also I'll note that the optimal mapping of capabilities to threads is very workload dependent. It also seems likely that, in the near future at least, the gains from finding the best mapping over what the OS gives you will continue to increase.

comment:3 Changed 4 years ago by thomie

comment:4 Changed 4 years ago by fryguybob

comment:5 Changed 2 years ago by maoe

Cc: maoe added

comment:6 Changed 2 years ago by pacak

Cc: pacak added