Opened 3 years ago

Last modified 9 months ago

#12893 new bug

Profiling defeats stream fusion when using vector library

Reported by: newhoggy Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.0.1
Keywords: stream-fusion profiling Cc: sgraf, simonpj
Operating System: MacOS X Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I have this problem where enabling profiling in Haskell undoes stream fusion and I’d like to do my profiling with stream fusion intact.

The following reproduces my problem:

$ git clone git@github.com:haskell-works/hw-tutorial-performance.git
$ cd hw-tutorial-performance/
$ stack init
$ stack build --executable-profiling
$ time $(find .stack-work/dist/ -type f -name hw-tutorial-performance-rwhe) 1e7
5000000.5

real    0m4.432s
user    0m3.484s
sys    0m0.936s

$ rm -rf .stack-work/
$ stack build
$ time $(find .stack-work/dist/ -type f -name hw-tutorial-performance-rwhe) 1e7
5000000.5

real    0m0.034s
user    0m0.018s
sys    0m0.013s
With profiling, my program takes over 4 seconds to run. Without profiling, it only takes 0.034 seconds.

The problem has nothing to do with stack.

(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1]
08:12 $ ghc -fforce-recomp --make -O2 -funbox-strict-fields -rtsopts -optc-O2 Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1]
08:12 $ time ./Main +RTS -sstderr -RTS 1e7
5000000.5
         115,888 bytes allocated in the heap
           3,480 bytes copied during GC
          44,384 bytes maximum residency (1 sample(s))
          17,056 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.005s     0.0049s    0.0049s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    0.010s  (  0.025s elapsed)
  GC      time    0.000s  (  0.005s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    0.012s  (  0.037s elapsed)

  %GC     time       1.8%  (13.1% elapsed)

  Alloc rate    11,904,262 bytes per MUT second

  Productivity  97.1% of total user, 31.2% of total elapsed


real	0m0.057s
user	0m0.012s
sys	0m0.010s
(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1]
08:12 $ ghc -prof -fforce-recomp --make -O2 -funbox-strict-fields -rtsopts -optc-O2 Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1]
08:12 $ time ./Main +RTS -sstderr -RTS 1e7
5000000.5
   3,012,099,000 bytes allocated in the heap
   4,079,698,408 bytes copied during GC
   1,162,199,352 bytes maximum residency (13 sample(s))
      19,313,368 bytes maximum slop
            1990 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4869 colls,     0 par    1.046s   1.055s     0.0002s    0.0017s
  Gen  1        13 colls,     0 par    1.230s   1.960s     0.1508s    0.5764s

  INIT    time    0.000s  (  0.004s elapsed)
  MUT     time    0.963s  (  0.825s elapsed)
  GC      time    2.276s  (  3.016s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.009s  (  0.185s elapsed)
  Total   time    3.251s  (  4.029s elapsed)

  %GC     time      70.0%  (74.8% elapsed)

  Alloc rate    3,126,345,026 bytes per MUT second

  Productivity  30.0% of total user, 24.2% of total elapsed


real	0m4.197s
user	0m3.251s
sys	0m0.883s

Change History (5)

comment:1 Changed 3 years ago by newhoggy

Here is the an example of how I reproduce the problem assuming the vector library is already installed:

(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1…1]
10:46 $ cat Main.hs
module Main (main) where

import System.Environment
import Text.Printf
import qualified Data.Vector.Unboxed as DVU

main :: IO ()
main = do
    [d] <- map read `fmap` getArgs
    printf "%f\n" (mean (DVU.enumFromTo 1 d))

mean :: DVU.Vector Double -> Double
mean xs = s / fromIntegral (n :: Int)
  where
    Pair n s       = DVU.foldl k (Pair 0 0) xs
    k (Pair m t) x = Pair (m + 1) (t + x)
{-# INLINE mean #-}

data Pair = Pair !Int !Double
(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1…1]
10:47 $ ghc -prof -rtsopts -fno-prof-count-entries  -fno-prof-cafs -fno-prof-auto -fforce-recomp -O2 -funbox-strict-fields Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1…1]
10:47 $ time ./Main +RTS -sstderr -RTS 1e7
5000000.5
   3,012,099,000 bytes allocated in the heap
   4,079,698,408 bytes copied during GC
   1,162,199,352 bytes maximum residency (13 sample(s))
      19,313,368 bytes maximum slop
            1990 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4869 colls,     0 par    1.112s   1.123s     0.0002s    0.0017s
  Gen  1        13 colls,     0 par    1.333s   2.194s     0.1688s    0.7226s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    1.027s  (  0.880s elapsed)
  GC      time    2.445s  (  3.317s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.009s  (  0.195s elapsed)
  Total   time    3.483s  (  4.395s elapsed)

  %GC     time      70.2%  (75.5% elapsed)

  Alloc rate    2,932,567,762 bytes per MUT second

  Productivity  29.8% of total user, 23.6% of total elapsed


real	0m4.565s
user	0m3.484s
sys	0m0.939s
(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1…1]
10:47 $ ghc -rtsopts -fno-prof-count-entries  -fno-prof-cafs -fno-prof-auto -fforce-recomp -O2 -funbox-strict-fields Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
(java-1.8.0 ghc-7.10.3) ✔ ~/wrk/haskell-works/hw-tutorial-performance/hw-tutorial-performance-rwhe [master|✚ 1…1]
10:47 $ time ./Main +RTS -sstderr -RTS 1e7
5000000.5
         115,888 bytes allocated in the heap
           3,480 bytes copied during GC
          44,384 bytes maximum residency (1 sample(s))
          17,056 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.005s     0.0050s    0.0050s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    0.010s  (  0.026s elapsed)
  GC      time    0.000s  (  0.005s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    0.012s  (  0.039s elapsed)

  %GC     time       1.4%  (13.0% elapsed)

  Alloc rate    11,679,903 bytes per MUT second

  Productivity  97.5% of total user, 30.5% of total elapsed


real	0m0.057s
user	0m0.012s
sys	0m0.008s

The first run with profiling is slow and uses a lot of memory. The second run without profiling is very fast and uses a negligible amount of memory.

comment:2 Changed 3 years ago by simonpj

It's extremely difficult to combine (a) cost-centre profiling with (b) lots of optimisation. It's a research question and GHC doesn't solve it.

If you want unrestricted optimisation, I'm afraid you are stuck with the (much lower level) ticky-ticky profiling system.

It'd be great to improve this, but I don't know how.

Simon

comment:3 Changed 3 years ago by bgamari

However, do keep in mind that profiling isn't an all-or-nothing proposition. If you would like to use the profiler it is still possible to (with care) manually insert cost-centers or even enable automatic cost-center insertion for entire (non-performance critical) modules by using the OPTIONS_GHC pragma.

comment:4 Changed 10 months ago by guaraqe

I think the problem that the OP is reporting is that there are no cost centers in the program, yet no optimization is done. There are no manual SCCs and compilation is done with -fno-prof-count-entries -fno-prof-cafs -fno-prof-auto -O2, so we would expect that there would be no barriers for optimization.

This can be wrong if for some reason the GHC flags are not passed to the vector library, and it is still being compiled with automatic cost centers, which would indeed block fusion. If that is the case, it would be nice to have a way to pass these flags all the way down.

comment:5 in reply to:  4 Changed 9 months ago by sgraf

Cc: sgraf simonpj added

Replying to guaraqe:

I think the problem that the OP is reporting is that there are no cost centers in the program, yet no optimization is done. There are no manual SCCs and compilation is done with -fno-prof-count-entries -fno-prof-cafs -fno-prof-auto -O2, so we would expect that there would be no barriers for optimization.

This can be wrong if for some reason the GHC flags are not passed to the vector library, and it is still being compiled with automatic cost centers, which would indeed block fusion. If that is the case, it would be nice to have a way to pass these flags all the way down.

Indeed passing -prof will use profiling versions of libraries. So functions in vector will still contribute their cost-centres.

I did some digging. It seems that with profiled vector libraries, small functions like doBoundsCheck or the specialised call to pure @Id (from Data.Vector.Fusion.Util) are no longer inlined. Especially the fact that pure @Id is no longer inlined means that SpecConstr is having a hard time extending the call pattern for the crucial Step parameter through the Id constructor in calls like this:

jump $j_s8PT
  ((Data.Vector.Fusion.Util.$fMonadId1
      @ (Data.Vector.Fusion.Stream.Monadic.Step Double Double)
      (Data.Vector.Fusion.Stream.Monadic.Yield
          @ Double
          @ Double
          (GHC.Types.D# sc_s8Ur)
          (GHC.Types.D# (GHC.Prim.+## sc_s8Ur 1.0##))))

Note that $fMonadId = pure @Id. If $fMonadId was inlined, SpecConstr could probably see through the SCC and extend the call pattern to Yield.

I'm not sure which part of the simplifier or whatever is responsible for refusing to inline $fMonadId, it's unfolding in the interface file is basically

\ @ a -> {__scc {Data.Vector.Fusion.Util.pure} True False} \ (v :: a) -> v

... calling in Simon for advice.

Last edited 9 months ago by sgraf (previous) (diff)
Note: See TracTickets for help on using tickets.