Opened 12 years ago

Closed 12 years ago

Last modified 10 years ago

#1890 closed bug (duplicate)

Regression in mandelbrot benchmark due to inlining

Reported by: dons
Owned by:
Priority: high
Milestone: 6.8.3
Component: Compiler (NCG)
Version: 6.8.1
Keywords: asm, double
Cc: dons@…
Operating System: Unknown/Multiple
Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

The (rather delicate) mandelbrot benchmark on the Great Language Shootout shows a large regression between GHC 6.6 and GHC 6.8. The inlining and worker/wrapper treatment of the unfold function changes considerably in 6.8 relative to the 6.6 code.

Here's the code:

{-# OPTIONS -fexcess-precision #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Spencer Janssen, Trevor McCort, Christophe Poucet and Don Stewart
--
-- Must be compiled with the -fexcess-precision flag as a pragma. GHC
-- currently doesn't recognise the -fexcess-precision flag on the command
-- line (!).
--
-- The following flags are suggested when compiling:
--
--      -O -fglasgow-exts -optc-march=pentium4
--      -fbang-patterns -funbox-strict-fields -optc-O2 -optc-mfpmath=sse -optc-msse2
--

import System
import System.IO
import Foreign
import Foreign.Marshal.Array

main = do
    w <- getArgs >>= readIO . head
    let n      = w `div` 8
        m  = 2 / fromIntegral w
    putStrLn ("P4\n"++show w++" "++show w)
    p <- mallocArray0 n
    unfold n (next_x w m n) p (T 1 0 0 (-1))

------------------------------------------------------------------------
--
-- compiled quite differently with ghc 6.8
--
-- This function is very sensitive to inlining
--

unfold :: Int -> (T -> Maybe (Word8,T)) -> Ptr Word8 -> T -> IO ()
unfold !i !f !ptr !x0 = loop x0
  where
    loop !x = go ptr 0 x

    go !p !n !x = case f x of
        Just (w,y) | n /= i -> poke p w >> go (p `plusPtr` 1) (n+1) y
        Nothing             -> hPutBuf stdout ptr i
        _                   -> hPutBuf stdout ptr i >> loop x

------------------------------------------------------------------------

--
-- These are all compiled the same:
--

data T = T !Int !Int !Int !Double

next_x :: Int -> Double -> Int -> T -> Maybe (Word8, T)
next_x !w !iw !bw (T bx x y ci)
    | y  == w   = Nothing
    | bx == bw  = Just (loop_x w x 8 iw ci 0, T 1 0    (y+1)   (iw+ci))
    | otherwise = Just (loop_x w x 8 iw ci 0, T (bx+1) (x+8) y ci)

loop_x :: Int -> Int -> Int -> Double -> Double -> Word8 -> Word8
loop_x !w !x !n !iw !ci !b
    | x < w = if n == 0
                    then b
                    else loop_x w (x+1) (n-1) iw ci (b+b+v)
    | otherwise = b `shiftL` n
  where
    v = fractal 0 0 (fromIntegral x * iw - 1.5) ci 50

fractal :: Double -> Double -> Double -> Double -> Int -> Word8
fractal !r !i !cr !ci !k
    | r2 + i2 > 4 = 0
    | k == 0      = 1
    | otherwise   = fractal (r2-i2+cr) ((r+r)*i+ci) cr ci (k-1)
  where
    (!r2,!i2) = (r*r,i*i)

One change in 6.8 is that shiftL is inlined (yay), and the other inner loops compile to identical Core. The unfold function, however, gets moved around a lot.

Benchmarking:

$ ghc-6.6.1 -O2 -fglasgow-exts -fbang-patterns -funbox-strict-fields B.hs -o B66  -no-recomp

$ time ./B68 4000 > /dev/null ; time ./B66 4000 > /dev/null                                 
./B68 4000 > /dev/null  5.39s user 0.10s system 100% cpu 5.489 total
./B66 4000 > /dev/null  3.67s user 0.07s system 100% cpu 3.736 total

Attachments (1)

B.hs (2.3 KB) - added by dons 12 years ago.
mandelbrot.hs

Change History (10)

Changed 12 years ago by dons

Attachment: B.hs added

mandelbrot.hs

comment:1 Changed 12 years ago by simonmar

Milestone: 6.8.3

Thanks for the report, we'll look into it for 6.8.3

comment:2 Changed 12 years ago by simonpj

I had a quick go at reproducing this. I added

{-# OPTIONS -fexcess-precision -O -fglasgow-exts -fbang-patterns 
            -funbox-strict-fields -optc-O2 -optc-mfpmath=sse -optc-msse2 #-}
module Main where

to the file. (The module line unnecessarily exports everything, but it makes it easier to compare, and I got the same results with module Main(main).)

In OPTIONS I omitted -optc-march=pentium4 because I got

/tmp/ghc27259_0/ghc27259_0.hc:1:0:
     error: CPU you selected does not support x86-64 instruction set

This is on a 64-bit machine, though.

With that setup I got slightly faster execution with 6.8 and virtually identical code for unfold. So I'm puzzled.

Simon

comment:3 Changed 12 years ago by simonpj

Don? Any thoughts?

Try the HEAD too. I get 211 Mbytes of allocation, and roughly identical times, for 6.6, 6.8.1, 6.8.2, and HEAD.

Simon

comment:4 Changed 12 years ago by igloo

Priority: normal → high

Don, is this reproducible for you?

comment:5 Changed 12 years ago by dons

Component: Compiler → Compiler (NCG)
Keywords: asm double added; inlining performance removed

I think I know what this is :) -O2 doesn't enable -fvia-C now, and this does lots of Double math.

GHC 6.6 -O2

./B66 4000 > /dev/null 9.26s user 0.13s system 99% cpu 9.392 total

GHC 6.8.2 -O2 -fvia-C

./B68 4000 > /dev/null 9.35s user 0.12s system 99% cpu 9.482 total

GHC 6.8.2 -O2 -fasm

./B68 4000 > /dev/null 13.75s user 0.20s system 99% cpu 14.000 total

So it's actually just the Double math issues showing up in the native codegen, which is now on by default.

This could be closed, or used as a test case for a native codegen ticket.
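
As a possible starting point for such a test case, the sketch below isolates the Double-heavy kernel from the benchmark so that the backend comparison above can be repeated without the I/O and buffer handling in unfold. The fractal function follows the ticket's code; countInside, the 2000-point grid size, and the timing wrapper in main are assumptions made for illustration, not part of the original program.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Word (Word8)
import System.CPUTime (getCPUTime)

-- Escape-time kernel as in the ticket: 1 if the point (cr, ci) has not
-- escaped after k iterations, 0 otherwise.
fractal :: Double -> Double -> Double -> Double -> Int -> Word8
fractal !r !i !cr !ci !k
    | r2 + i2 > 4 = 0
    | k == 0      = 1
    | otherwise   = fractal (r2 - i2 + cr) ((r + r) * i + ci) cr ci (k - 1)
  where
    r2 = r * r
    i2 = i * i

-- Hypothetical driver (not in the ticket): count the points of an
-- n-by-n grid that stay bounded, using the same coordinate mapping as
-- the benchmark (real part from -1.5, imaginary part from -1).
countInside :: Int -> Int
countInside n = go 0 0 0
  where
    step = 2 / fromIntegral n :: Double
    go !y !x !acc
        | y == n    = acc
        | x == n    = go (y + 1) 0 acc
        | otherwise =
            let cr = fromIntegral x * step - 1.5
                ci = fromIntegral y * step - 1
            in go y (x + 1) (acc + fromIntegral (fractal 0 0 cr ci 50))

main :: IO ()
main = do
    start <- getCPUTime
    let inside = countInside 2000   -- grid size is an arbitrary choice
    print inside                    -- printing forces the computation
    end <- getCPUTime
    -- getCPUTime reports picoseconds
    let secs = fromIntegral (end - start) / 1e12 :: Double
    putStrLn ("CPU seconds: " ++ show secs)
```

Building this once with each of the code generator flags used in the timings above and comparing the reported CPU time should show whether the gap comes from the floating-point code alone. Because the loop allocates almost nothing, any difference is unlikely to be caused by the garbage collector.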

comment:6 Changed 12 years ago by simonmar

Type: bug → run-time performance bug

comment:7 Changed 12 years ago by simonmar

Resolution: duplicate
Status: new → closed

This will be fixed by #594 (support use of SSE2 in the x86 native codegen).

comment:8 Changed 11 years ago by simonmar

Operating System: Unknown → Unknown/Multiple

comment:9 Changed 10 years ago by simonmar

Type of failure: Runtime performance bug