Opened 8 years ago

Closed 8 years ago

#5749 closed bug (invalid)

GHC 7.0.4 Performance Regression (Possibly Vector)

Reported by: sanketr Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.0.4
Keywords: performance, vector Cc: sanket.agrawal@…, pho@…
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: 5623 Differential Rev(s):
Wiki Page:

Description

I have noticed ~100% performance degradation for my code when I switched from 6.12.3 to 7.0.4. This might be related to vector performance ticket 5623 but I noticed it was filed for performance regression of 7.2.1 relative to 7.0.4, and 6.12.1 vs 7.0.4 performance was reported as ok. So, I am filing it as new bug report.

I am attaching an edited version of my code below which reproduces the issue. It is from actual production code which is used for db driver. Relevant performance benchmarks (~95% degradation):

GHC 6.12.3 MUT Time: 0.48s

GHC 7.0.4 MUT Time: 0.95s

In actual code, performance degrades by ~100%, from ~1.3s to ~2.6s. So, I can't move from 6.12.3 to 7.0.4 or 7.4+ if I want to keep the performance :(

Code below - the comment block at the end shows how to compile, and reproduce the issue - I will be happy to provide more information to fix the issue:

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Storable as SV
import qualified Data.Vector.Storable.Mutable as MSV
import qualified Data.Vector as V
import Foreign (sizeOf)
import Foreign.C.Types (CChar)
import GHC.Int
import System.IO.Unsafe (unsafePerformIO)
import Control.Exception (evaluate)


data Elems = IV {-# UNPACK #-} !(SV.Vector Int32)
             | SV {-#UNPACK #-} !Int {-# UNPACK #-} !(SV.Vector CChar) -- Int stores the number of null-terminated C Strings
             | T {-# UNPACK #-} !Int {-# UNPACK #-} !(V.Vector Elems) -- Int stores total bytes needed to copy vectors to ByteString
             | L {-# UNPACK #-} !(V.Vector Elems) -- General list of elements
                deriving (Show)

-- | Function to return total size in bytes taken to store the data from Elems
size :: Elems -> Int
size (IV x) = 6 + (sizeOf (undefined :: Int32)) * (SV.length x)
size (SV _ x) =  6 + (sizeOf (undefined :: CChar)) * (SV.length x)
size (T n _) = n
size (L x) = V.foldl' (\x y -> x + size y) 6 x
{-# INLINE size #-}

fillS :: [[CChar]] -> Elems
fillS x = let (x',y') = createS x
            in SV x' y'
{-# INLINE fillS #-}

createS :: [[CChar]] -> (Int, SV.Vector CChar)
createS cl = unsafePerformIO $ do
            v <- MSV.new (Prelude.length . Prelude.concat $ cl)
            fill v 0 $ Prelude.concat cl
            SV.unsafeFreeze v >>= \x -> return (Prelude.length cl,x)
          where
            fill v _ [] = return ()
            fill v i (x:xs) = MSV.unsafeWrite v i x >> fill v (i + 1) xs
{-# INLINE createS #-}

-- | Constructor for T - a db table - we must always build it using this function
fillT :: V.Vector Elems -> Elems
fillT !xs = T (V.foldl' (\x y -> x + size y) 3 xs) xs -- 2 bytes for table header + 1 additional byte for dict type header => 3     bytes additional overhead
{-# INLINE fillT #-}

main = do
  let il1 = IV $ SV.enumFromN 1 50000000
      il2 = IV $ SV.enumFromN 1 50000000
      il3 = IV $ SV.enumFromN 1 50000000
      l1 = L (V.fromList [il1,il2,il3])
      sl1 = fillS [[97,0],[98,0],[99,0]]
  evaluate $ fillT (V.fromList [sl1,l1])
  return ()

{-- GHC 6.12.3:

 $ ghc -O2 --make test.hs -fforce-recomp -rtsopts -fasm && ./test +RTS -s
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...
./test +RTS -s
     600,843,536 bytes allocated in the heap
           8,336 bytes copied during GC
     200,002,504 bytes maximum residency (2 sample(s))
         793,936 bytes maximum slop
             574 MB total memory in use (9 MB lost due to fragmentation)

  Generation 0:     2 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     2 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.48s  (  0.97s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.48s  (  0.97s elapsed)

  %GC time       0.2%  (0.1% elapsed)

  Alloc rate    1,259,822,857 bytes per MUT second

  Productivity  99.6% of total user, 49.0% of total elapsed

-----
GHC 7.0.4:

 $ ghc -O2 --make test.hs -fforce-recomp -rtsopts -fasm && ./test +RTS -s
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...
./test +RTS -s
     600,836,872 bytes allocated in the heap
           7,664 bytes copied during GC
     200,002,224 bytes maximum residency (2 sample(s))
         794,216 bytes maximum slop
             574 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     2 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     2 collections,     0 parallel,  0.11s,  0.11s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.94s  (  1.01s elapsed)
  GC    time    0.11s  (  0.11s elapsed)
  EXIT  time    0.00s  (  0.11s elapsed)
  Total time    1.05s  (  1.12s elapsed)

  %GC time      10.2%  (9.7% elapsed)

  Alloc rate    638,055,951 bytes per MUT second

  Productivity  89.4% of total user, 83.4% of total elapsed
--}

Change History (5)

comment:1 Changed 8 years ago by sanketr

Cc: sanket.agrawal@… added

comment:2 Changed 8 years ago by sanketr

I just ran some more tests, including criterion benchmarking and full profiling, and they show 7.0.4 performs about the same as 6.12.3 actually. Which contradicts the RTS data I used to diagnose this issue, and this ticket! I ran basic timing test on Linux which too confirmed it:

GHC 7.0.4: ghc -O2 --make test.hs -fforce-recomp -rtsopts -fasm && time ./test
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...

real    0m1.194s
user    0m0.484s
sys     0m0.649s

GHC 6.12.3: ghc -O2 --make test.hs -fforce-recomp -rtsopts -fasm && time ./test
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...

real    0m1.331s
user    0m0.467s
sys     0m0.679s

So, actually GHC 7.0.4 seems to perform about the same in basic timing tests (and benchmarking and profiling outputs back it up). But, RTS output shows a different story as attached in the original description. I am now confused whether I should be trusting RTS output. I understand that profiling can introduce distortions, but if I understand it correctly, RTS is a basic timekeeping option, not a full-on profiling.

In case there is any doubt, all data here has been collected on the same box, and validated for consistency across multiple iterations.

I will leave the ticket open for now - please resolve as appropriate. As part of the resolution, I will appreciate some clarity into why RTS data contradicts the benchmarks and timing tests.

comment:3 Changed 8 years ago by PHO

Cc: pho@… added

comment:4 Changed 8 years ago by simonpj

difficulty: Unknown

They both allocate about the same (to within 0.01%, which is quite surprising) so I expect their runtimes are close. I'll let Simon M comment on the RTS MUT time reporting issue which indeed seems odd.

Simon

comment:5 Changed 8 years ago by simonmar

Resolution: invalid
Status: newclosed

I suspect this caused the change:

commit 5e195173cd1921e838eb977f649ef178dff00415

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Mon Sep 13 13:38:18 2010 +0000

    Use clock_gettime (if available) to measure the process CPU time
    This is much more accurate than getrusage, which was giving misleading
    results when trying to time very quick operations like a minor GC.

Your program is spending a lot of time in the kernel:

real    0m1.331s
user    0m0.467s
sys     0m0.679s

and it looks like in 6.12 the method we used for measuring CPU time did not include time spent in the kernel (sys time in the time output above), whereas now it does.

Note: See TracTickets for help on using tickets.