Opened 12 years ago

Closed 12 years ago

Last modified 10 years ago

#1889 closed bug (wontfix)

Regression in concurrency performance from ghc 6.6 to 6.8

Reported by: dons Owned by: simonmar
Priority: high Milestone: 6.8.3
Component: Runtime System Version: 6.8.1
Keywords: threads, concurrency, performance Cc: dons@…, sjanssen@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

While most Great Language Shootout benchmarks are running noticeably (10-20+ %) faster with GHC 6.8.1, benchmarks for concurrency seem to consistently be running slower, on a variety of architectures.

Attached are some example programs from the shootout where performance has worsened.

The thread-ring benchmark:

import Control.Monad
import Control.Concurrent
import System.Environment
 
ring = 503
 
new l i = do
  r <- newEmptyMVar
  forkIO (thread i l r)
  return r
 
thread :: Int -> MVar Int -> MVar Int -> IO ()
thread i l r = go
  where go = do
          m <- takeMVar l
          when (m == 1) (print i)
          putMVar r $! m - 1
          when (m > 0) go
 
main = do
  a <- newMVar . read . head =<< getArgs
  z <- foldM new a [2..ring]
  thread 1 z a

We can benchmark this as follows:

$ ghc-6.8.1 -O2 threadring.hs -o ghc68 -no-recomp
$ ghc-6.6.1 -O2 threadring.hs -o ghc66 -no-recomp

$ time ./ghc68 30000000 ; time ./ghc66 30000000
75
./ghc68 30000000  4.98s user 0.05s system 100% cpu 5.030 total
75
./ghc66 30000000  4.58s user 0.04s system 99% cpu 4.643 total

'chameneos' is another that has got slower.

http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneos&lang=ghc&id=2

$ time ./ghc66 7000000 ; time ./ghc68 7000000 
14000000
./ghc66 7000000  3.36s user 0.05s system 99% cpu 3.425 total
14000000
./ghc68 7000000  3.75s user 0.06s system 99% cpu 3.815 total

Attachments (1)

B.hs (1.3 KB) - added by dons 12 years ago.
thread-ring code

Download all attachments as: .zip

Change History (12)

Changed 12 years ago by dons

Attachment: B.hs added

thread-ring code

comment:1 Changed 12 years ago by jedbrown

More concerning to me is this discrepancy for the chameneos benchmark.

$ ghc-6.6.1 -O chameneos.hs -no-recomp -o ghc66-O
$ ghc-6.6.1 -O2 chameneos.hs -no-recomp -o ghc66-O2
$ ghc-6.8.1 -O chameneos.hs -no-recomp -o ghc68-O
$ ghc-6.8.1 -O2 chameneos.hs -no-recomp -o ghc68-O2

$ for e in ./ghc*; do time $e 7000000; done
14000000
./ghc66-O 7000000       0:02.72 real  2.58 user  0.00 sys
14000000
./ghc66-O2 7000000      0:05.16 real  4.79 user  0.02 sys
14000000
./ghc68-O 7000000       0:05.94 real  5.62 user  0.00 sys
14000000
./ghc68-O2 7000000      0:05.99 real  5.74 user  0.01 sys

Do people see this difference on other architectures (I am on Pentium-M 1.6 GHz)?

comment:2 Changed 12 years ago by simonmar

Milestone: 6.8.3
Owner: set to simonmar

comment:3 Changed 12 years ago by j.waldmann

I noticed another strange behaviour (for threadring.hs from above):

when I compile (with 6.8) with -threaded and run with +RTS -N1, runtime roughly doubles (w.r.t. omitting -threaded).

when I run with +RTS -N2, runtime again doubles (or more).

(this is on a dual core machine, x86 opensuse-10.3)

comment:4 Changed 12 years ago by simonmar

I don't see the differences reported. threadring runs at exactly the same speed with 6.6.1 and 6.8.1 here, and chaneneos is slightly faster with 6.8.1. So we have to look at how your GHC was built: for reference the builds I'm using are

BeConservative = YES
XMLDocWays=html
PublishCp=rsync
PublishLocation=simonmar@haskell.org:/home/haskell/ghc/dist/stable
GhcStage2HcOpts=-DDEBUG -debug
GhcLibHcOpts=-O2 -fasm -dcore-lint -fgenerics
HADDOCK_DOCS=YES

The only thing that should make a difference in performance relative to the default build is the GhcLibHcOpts line.

These are the build settings used by the nightly builds, and the same settings are used to build the binary distributions we ship from haskell.org.

Can someone who is seeing a performance difference give more details: OS/architecture, GHC build settings (or where you got your binaries from), gcc version. I'll see if I can reproduce it from that.

In reply to jedbrown: here are the results I get

> for e in ./ghc-66-O ./ghc-66-O2 ./ghc-68-O ./ghc-68-O2 ; do time $e 7000000 >/dev/null; done
7.97s real   7.96s user   0.01s system   99% $e 7000000 > /dev/null
7.47s real   7.44s user   0.01s system   99% $e 7000000 > /dev/null
6.93s real   6.92s user   0.01s system   100% $e 7000000 > /dev/null
6.89s real   6.85s user   0.02s system   99% $e 7000000 > /dev/null

This is on x86_64/Linux with gcc 4.1.0.

In reply to j.waldmann: the first result is known, adding -threaded turns on atomic locking for MVar operations, see #693. The atomic operations aren't necessary with -N1, so that ticket suggests adding some conditionals to speed things up in that case.

The second result, namely that adding -N2 slows things down even more, is because this test is hard to parallelise. Unless the scheduler manages to schedule exactly half the ring on each CPU, performance goes down the drain due to the communication overhead. You can get a modest speedup using GHC.Conc.forkOnIO to fix the threads to CPUs. Here is the version of the benchmark we have in nofib/smp/threads004 for testing the scheduler:

import Control.Concurrent
import Control.Monad
import System
import GHC.Conc (forkOnIO)

thread :: MVar Int -> MVar Int -> IO ()
thread inp out = do x <- takeMVar inp; putMVar out $! x+1; thread inp out

spawn cur n = do next <- newEmptyMVar
                 forkOnIO (if (n <= 1000) then 0 else 1) $ thread cur next
                 return next

main = do n <- getArgs >>= readIO.head
          s <- newEmptyMVar
          e <- foldM spawn s [1..2000]
          f <- newEmptyMVar
          forkOnIO 1 $ replicateM n (takeMVar e) >>= putMVar f . sum
          replicateM n (putMVar s 0)
          takeMVar f

comment:5 Changed 12 years ago by simonmar

One more tip: when using -threaded, never use the main thread to do the real work. Communication between the main thread and other forkIO'd threads is much much slower than between two forkIO'd threads, because the main thread is a "bound thread" (i.e. it has an OS thread to itself). This is perhaps the main reason that turning on -threaded will make the version of threadring above go much slower.

comment:6 Changed 12 years ago by igloo

Priority: normalhigh

comment:7 Changed 12 years ago by simonmar

Type: bugrun-time performance bug

comment:8 Changed 12 years ago by simonmar

Resolution: wontfix
Status: newclosed

I've investigated this further. The small regression in performance is due to the introduction of a write barrier on MVars. There isn't much we can do to improve this, however benchmarks that have a lot of MVars (especially a lot of inactive MVars) will see a big improvement in 6.8.2. This particular benchmark is not one of the winners, sadly, but the system scales to much larger numbers of MVars better now.

Indeed, this benchmark will regress again in 6.10 with the introduction of a write barrier on TSOs, but there will be bigger winners due to that change too.

comment:9 Changed 11 years ago by simonmar

Architecture: MultipleUnknown/Multiple

comment:10 Changed 11 years ago by simonmar

Operating System: MultipleUnknown/Multiple

comment:11 Changed 10 years ago by simonmar

Type of failure: Runtime performance bug
Note: See TracTickets for help on using tickets.