Opened 17 months ago

Last modified 8 months ago

#15176 new bug

Superclass `Monad m =>` makes program run 100 times slower

Reported by: danilo2 Owned by:
Priority: highest Milestone:
Component: Compiler Version: 8.4.2
Keywords: Cc: MikolajKonarski, alpmestan, sgraf, maoe
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by danilo2)

Hi! I've just encountered a very bizarre performance problem.

General description

The code:

class LayersFoldableBuilder__ t (layers :: [Type]) m where
    buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)

instance Monad m => LayersFoldableBuilder__ t '[] m where
    buildLayersFold__ = \_ a -> a
    {-# INLINE buildLayersFold__ #-}

instance ( MonadIO m
         , Storable.Storable (Layer.Cons l ())
         , Layer.StorableLayer l m
         , LayerFoldableBuilder__ (EnabledLayer t l) t m l
         , LayersFoldableBuilder__ t ls m )
     => LayersFoldableBuilder__ t (l ': ls) m where
    buildLayersFold__ = \ptr mr -> do
        let fs   = buildLayersFold__ @t @ls ptr'
            ptr' = Ptr.plusPtr ptr $ Layer.byteSize @l
        layerBuild__ @(EnabledLayer t l) @t @m @l ptr $! fs mr
    {-# INLINE buildLayersFold__ #-}

This is the typeclass LayersFoldableBuilder__ and ALL of its instances. Please note that every instance has a Monad m or MonadIO m constraint. The program which uses this code heavily runs in 40 ms. If we only add the constraint Monad m => to the class definition:

class Monad m => LayersFoldableBuilder__ t (layers :: [Type]) m where
    buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)

The program runs in 3.5 s, which is almost 100 times slower.

Unfortunately I do not have a minimal example, but it is reproducible. It is part of the Luna language codebase: https://github.com/luna/luna-core/blob/60bf6130691c23e52b97b067b52becb8fdb0c72e/core/src/Data/Graph/Traversal/Scoped.hs#L102

It was introduced in commit 60bf6130691c23e52b97b067b52becb8fdb0c72e on the static-layers branch. Building it is simple: stack bench luna-core. After invoking it we see the described results.

Why it's important and what we should do to fix it

  1. I am writing this because I care about the Haskell community. I want GHC and Haskell to be widely used. Right now, the one thing I hear from all the companies around ours is that unpredictable performance, even when following the rules for "how to write efficient code", is the biggest pain people have. Haskell is gathering attention - pure functional programming, immutability etc. are great. But it will not become a popular choice unless we care about predictable performance.
  2. Such performance changes are unacceptable when thinking about Haskell and GHC as production-ready systems. We need a clear way to write high-performance Haskell without having to benchmark every part of our programs whenever we refactor. GHC has enough information to discover that we want high performance here and there (even by looking at INLINE pragmas) and should warn us about a lack of optimization. We should also have a way to force GHC to apply optimizations in particular places - for example by explicitly marking code to always be specialized during compilation, so GHC would never fall back to dict-passing in such places (see the sketch after this list). Such a possibility would solve MANY related problems and user fears.
  3. Point 2 also applies to strictness. In my opinion, having clearer strictness resolution rules and tools is important. Right now the only way to know whether strictness analysis did a good job and we are not constantly boxing / unboxing things is to read Core, which is tedious, and 99% of Haskell users do not even know how to do it (we've got 10 really, really good Haskellers here; 2 of them are capable of reading Core, but not very fluently). I would love to chat more about these topics, because they are crucial for growing the Haskell community and making Haskell a more popular choice, which is what we want, right? We don't want Haskell to be just a research project with "all its users being its authors" at the same time, am I right?
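For what it's worth, GHC already exposes part of the control asked for in point 2 through the INLINABLE and SPECIALIZE pragmas. The following is only a minimal sketch with a hypothetical function (not code from luna-core) showing how a specialisation can be pinned down by hand:

module SpecSketch where   -- hypothetical module, for illustration only

-- INLINABLE keeps the unfolding in the interface file so GHC is able to
-- specialise this overloaded function at call sites in other modules.
countUp :: Monad m => Int -> m Int
countUp 0 = pure 0
countUp n = do
    r <- countUp (n - 1)
    pure (r + n)
{-# INLINABLE countUp #-}

-- SPECIALIZE forces a monomorphic copy for the IO dictionary, so calls
-- at IO never go through dictionary passing.
{-# SPECIALIZE countUp :: Int -> IO Int #-}

This still does not give the warning requested above when specialisation silently fails, which is the main point of the ticket.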

What happens in core

I inspected the Core and found that, indeed, after adding the constraint GHC does not apply all optimizations to the definitions. To be honest, I completely don't understand it, because the code uses explicit INLINE pragmas everywhere to make sure everything is optimized away at compile time:

--------------------------------------------------------------------------------
SLOW, with Monad m =>
--------------------------------------------------------------------------------

-- RHS size: {terms: 5, types: 12, coercions: 4, joins: 0/0}
buildLayersFold__ [InlPrag=INLINE]
  :: forall t (layers :: [*]) (m :: * -> *).
     LayersFoldableBuilder__ t layers m =>
     SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
[GblId[ClassOp],
 Arity=1,
 Caf=NoCafRefs,
 Str=<S,U>,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=1,unsat_ok=False,boring_ok=True)
         Tmpl= \ (@ t_ao0O)
                 (@ (layers_ao0P :: [*]))
                 (@ (m_ao0Q :: * -> *))
                 (v_B1 [Occ=Once]
                    :: LayersFoldableBuilder__ t_ao0O layers_ao0P m_ao0Q) ->
                 v_B1
                 `cast` (Data.Graph.Traversal.Scoped.N:LayersFoldableBuilder__[0]
                             <t_ao0O>_N <layers_ao0P>_N <m_ao0Q>_N
                         :: (LayersFoldableBuilder__
                               t_ao0O layers_ao0P m_ao0Q :: Constraint)
                            ~R# (SomePtr
                                 -> m_ao0Q (Fold.Result t_ao0O)
                                 -> m_ao0Q (Fold.Result t_ao0O) :: *))}]
buildLayersFold__
  = \ (@ t_ao0O)
      (@ (layers_ao0P :: [*]))
      (@ (m_ao0Q :: * -> *))
      (v_B1 :: LayersFoldableBuilder__ t_ao0O layers_ao0P m_ao0Q) ->
      v_B1
      `cast` (Data.Graph.Traversal.Scoped.N:LayersFoldableBuilder__[0]
                  <t_ao0O>_N <layers_ao0P>_N <m_ao0Q>_N
              :: (LayersFoldableBuilder__
                    t_ao0O layers_ao0P m_ao0Q :: Constraint)
                 ~R# (SomePtr
                      -> m_ao0Q (Fold.Result t_ao0O)
                      -> m_ao0Q (Fold.Result t_ao0O) :: *))



--------------------------------------------------------------------------------
FAST, without Monad m =>
--------------------------------------------------------------------------------

-- RHS size: {terms: 8, types: 25, coercions: 0, joins: 0/0}
Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__
  :: forall t (layers :: [*]) (m :: * -> *).
     LayersFoldableBuilder__ t layers m =>
     Monad m
[GblId[ClassOp],
 Arity=1,
 Caf=NoCafRefs,
 Str=<S(SL),U(U,A)>,
 RULES: Built in rule for Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__: "Class op $p1LayersFoldableBuilder__"]
Data.Graph.Traversal.Scoped.$p1LayersFoldableBuilder__
  = \ (@ t_ao0P)
      (@ (layers_ao0Q :: [*]))
      (@ (m_ao0R :: * -> *))
      (v_B1 :: LayersFoldableBuilder__ t_ao0P layers_ao0Q m_ao0R) ->
      case v_B1 of v_B1
      { Data.Graph.Traversal.Scoped.C:LayersFoldableBuilder__ v_B2
                                                              v_B3 ->
      v_B2
      }

-- RHS size: {terms: 8, types: 25, coercions: 0, joins: 0/0}
buildLayersFold__
  :: forall t (layers :: [*]) (m :: * -> *).
     LayersFoldableBuilder__ t layers m =>
     SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
[GblId[ClassOp],
 Arity=1,
 Caf=NoCafRefs,
 Str=<S(LS),U(A,U)>,
 RULES: Built in rule for buildLayersFold__: "Class op buildLayersFold__"]
buildLayersFold__
  = \ (@ t_ao0P)
      (@ (layers_ao0Q :: [*]))
      (@ (m_ao0R :: * -> *))
      (v_B1 :: LayersFoldableBuilder__ t_ao0P layers_ao0Q m_ao0R) ->
      case v_B1 of v_B1
      { Data.Graph.Traversal.Scoped.C:LayersFoldableBuilder__ v_B2
                                                              v_B3 ->
      v_B3
      }

Attachments (4)

original.ticky (16.5 KB) - added by osa1 13 months ago.
Ticky output of program in comment:10 without Monad m
patch.ticky (37.4 KB) - added by osa1 13 months ago.
Ticky output of program in comment:10 with Monad m
bench-test-orig.prof (121.3 KB) - added by osa1 13 months ago.
Prof output of program in comment:10 without Monad m
bench-test-patch.prof (128.6 KB) - added by osa1 13 months ago.
Prof output of program in comment:10 with Monad m


Change History (24)

comment:1 Changed 17 months ago by danilo2

Description: modified (diff)

comment:2 Changed 17 months ago by simonpj

I agree that predictable performance is super-important. So we give a high priority to unexpected performance lossage.

I'd love to know what is going on in your application, but I wonder if you or someone else can help characterise what is going on, and perhaps distil it into a smaller test case?

What I can say is that the difference between your two classes is the difference between

newtype LayersFoldableBuilder__ t (layers :: [Type]) m
  = MkD (SomePtr -> m (Fold.Result t) -> m (Fold.Result t))

and

data LayersFoldableBuilder__ t (layers :: [Type]) m
  = MkD (Monad m)
        (SomePtr -> m (Fold.Result t) -> m (Fold.Result t))

That is, without the superclass a LayersFoldableBuilder__ dictionary is represented just by the buildLayersFold__ function itself; when you add the superclass, it turns into a heap-allocated pair of that function and a Monad m dictionary.
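To make the cost concrete, here is a hand-rolled sketch of the two dictionary shapes with hypothetical names (illustrative desugaring only, not the actual generated Core):

{-# LANGUAGE RankNTypes #-}
module DictSketch where   -- hypothetical module, for illustration only

-- Stand-in for the Monad m dictionary in this sketch.
data MonadDict m = MonadDict
  { pureD :: forall a. a -> m a
  , bindD :: forall a b. m a -> (a -> m b) -> m b
  }

-- Single-method class, no superclass: the dictionary *is* the method,
-- so selecting the method is a zero-cost cast.
newtype FoldNoSuper m = FoldNoSuper (Int -> m () -> m ())

-- With the Monad m superclass: the dictionary is a two-field heap object,
-- and every use of the method must project the second field out of it.
data FoldSuper m = FoldSuper (MonadDict m) (Int -> m () -> m ())

selectNoSuper :: FoldNoSuper m -> Int -> m () -> m ()
selectNoSuper (FoldNoSuper f) = f   -- newtype match: no runtime work

selectSuper :: FoldSuper m -> Int -> m () -> m ()
selectSuper (FoldSuper _ f) = f     -- data match: case plus field selection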

This might change perf slightly, but a factor of 100 is ridiculous. I have literally no idea why that is happening.

The change is so egregious that profiling ought to lead you right to it. (Or use -ticky which is faster and less invasive.)

You are using 8.4, right? Does it happen with 8.2? (Or is that hard to find out?)

Also, does it happen if, instead of adding Monad m =>, you add a dummy method to the class, like this:

class LayersFoldableBuilder__ t (layers :: [Type]) m where
    buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)
    dummy :: t -> t

comment:3 Changed 16 months ago by bgamari

Owner: set to bgamari

I will try to do a bit of characterisation of this.

comment:4 Changed 14 months ago by thomie

Type of failure: Compile-time performance bug → Runtime performance bug

comment:5 Changed 14 months ago by bgamari

Milestone: 8.6.1 → 8.8.1

This is still waiting on a suitable block of my time.

comment:6 Changed 14 months ago by osa1

Owner: changed from bgamari to osa1

I'll look into this.

comment:7 Changed 14 months ago by osa1

I managed to reproduce this. The original instructions no longer work (the git repo disappeared, the branch doesn't exist in the new repo, etc.), so here is what I did to reproduce:

diff --git a/core/src/Data/Graph/Fold/Layer.hs b/core/src/Data/Graph/Fold/Layer.hs
index 28d6c6cd..45150ff1 100644
--- a/core/src/Data/Graph/Fold/Layer.hs
+++ b/core/src/Data/Graph/Fold/Layer.hs
@@ -141,7 +141,7 @@ instance Monad m => Fold.Builder (Scoped s) m (SmallVectorA t alloc n a)

 -- === FoldableLayers === --

-class LayersFoldableBuilder__ t (layers :: [Type]) m where
+class Monad m => LayersFoldableBuilder__ t (layers :: [Type]) m where
     buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)

 instance Monad m => LayersFoldableBuilder__ t '[] m where
diff --git a/core/src/Data/Graph/Fold/LayerMap.hs b/core/src/Data/Graph/Fold/LayerMap.hs
index 4b12bbf6..e5e54e45 100644
--- a/core/src/Data/Graph/Fold/LayerMap.hs
+++ b/core/src/Data/Graph/Fold/LayerMap.hs
@@ -117,7 +117,7 @@ instance Monad m => Fold.Builder (Scoped s) m (SmallVectorA t alloc n a)

 -- === FoldableLayers === --

-class LayersFoldableBuilder__ t (layers :: [Type]) m where
+class Monad m => LayersFoldableBuilder__ t (layers :: [Type]) m where
     buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)

 instance Monad m => LayersFoldableBuilder__ t '[] m where
diff --git a/core/src/Data/Graph/Fold/Scoped.hs b/core/src/Data/Graph/Fold/Scoped.hs
index 2fade0f3..2e6b51df 100644
--- a/core/src/Data/Graph/Fold/Scoped.hs
+++ b/core/src/Data/Graph/Fold/Scoped.hs
@@ -131,7 +131,7 @@ instance Monad m => Fold.Builder (Scoped t) m (SmallVectorA s alloc n a)

 -- === FoldableLayers === --

-class LayersFoldableBuilder__ t (layers :: [Type]) m where
+class Monad m => LayersFoldableBuilder__ t (layers :: [Type]) m where
     buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)

 instance Monad m => LayersFoldableBuilder__ t '[] m where
diff --git a/core/src/Data/Graph/Fold/ScopedMap.hs b/core/src/Data/Graph/Fold/ScopedMap.hs
index 217c55a6..4a3d34c8 100644
--- a/core/src/Data/Graph/Fold/ScopedMap.hs
+++ b/core/src/Data/Graph/Fold/ScopedMap.hs
@@ -129,7 +129,7 @@ instance Monad m => Fold.Builder (ScopedMap s) m (SmallVectorA t alloc n a)

 -- === FoldableLayers === --

-class LayersFoldableBuilder__ t (layers :: [Type]) m where
+class Monad m => LayersFoldableBuilder__ t (layers :: [Type]) m where
     buildLayersFold__ :: SomePtr -> m (Fold.Result t) -> m (Fold.Result t)

 instance Monad m => LayersFoldableBuilder__ t '[] m where
  • Run benchmarks again

Most of the benchmarks are not affected, but there are three benchmarks which are affected quite significantly by this change:

Before the patch:

benchmarking ir/discovery/generic/10e6
time                 61.47 ms   (61.14 ms .. 61.80 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 61.02 ms   (60.79 ms .. 61.20 ms)
std dev              367.5 μs   (224.7 μs .. 582.3 μs)

benchmarking ir/discovery/partitions single var/10e6
time                 93.94 ms   (93.22 ms .. 94.75 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 93.57 ms   (92.95 ms .. 93.94 ms)
std dev              746.7 μs   (377.0 μs .. 1.245 ms)

benchmarking ir/discovery/partitions unify/10e6
time                 518.7 ms   (508.2 ms .. 523.9 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 515.6 ms   (512.3 ms .. 516.9 ms)
std dev              2.350 ms   (717.7 μs .. 3.196 ms)
variance introduced by outliers: 19% (moderately inflated)

After the patch:

benchmarking ir/discovery/generic/10e6
time                 1.309 s    (1.283 s .. 1.326 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.320 s    (1.312 s .. 1.334 s)
std dev              13.24 ms   (767.0 μs .. 16.27 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking ir/discovery/partitions single var/10e6
time                 1.355 s    (1.351 s .. 1.359 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.357 s    (1.356 s .. 1.359 s)
std dev              1.415 ms   (1.209 ms .. 1.452 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking ir/discovery/partitions unify/10e6
time                 5.459 s    (5.438 s .. 5.501 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.444 s    (5.435 s .. 5.452 s)
std dev              11.24 ms   (7.336 ms .. 13.71 ms)
variance introduced by outliers: 19% (moderately inflated)

Summary:

  • ir/discovery/generic/10e6: 21x increase
  • ir/discovery/partitions single var/10e6: 14x increase
  • ir/discovery/partitions unify/10e6: 10x increase

No idea why yet.

comment:8 Changed 14 months ago by osa1

So, I've built two versions of the package with -ddump-simpl -ddump-to-file -dsuppress-uniques (which took about an hour to build twice on my i7-8700K desktop). Unfortunately there are 71 modules in this package and none of the files are identical between the two versions (there are at least very minor identifier changes), so this is still not easy to debug, but while checking some random files I found some relevant changes between the two versions:

This expression in the original version:

$dLayersFoldableBuilder__
  `cast` (Data.Graph.Fold.Layer.N:LayersFoldableBuilder__[0]
              <t>_N <layers>_N <m>_N
          :: (LayersFoldableBuilder__ t layers m :: Constraint)
             ~R# (SomePtr -> m (Fold.Result t) -> m (Fold.Result t) :: *))

becomes this after adding Monad m =>:

buildLayersFold__
  @ t
  @ layers
  @ m
  $dLayersFoldableBuilder__

If my understanding is correct, in the original version the typeclass dictionary is represented as its method, but with Monad m => it's not, as it now has one more field for the Monad m dictionary. This means one extra layer of indirection in the Monad m => version.

There are lots of changes similar to this. Not sure if this by itself explains the 20x increase in runtime, though.

comment:9 Changed 14 months ago by simonpj

I'd use -ticky which should rapidly identify the culprit.

comment:10 Changed 13 months ago by osa1

To be able to get more useful RTS stats and smaller Core, I made a smaller reproducer:

{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE FlexibleContexts  #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE Strict #-}

module Main where

import Prologue

import qualified Control.Monad.Exception as Exception
import qualified Data.Graph.Data.Graph.Class as Graph
import qualified Data.Graph.Fold.Partition as Partition
import qualified Luna.IR as IR
import qualified Luna.Pass as Pass
import qualified Luna.Pass.Scheduler as Scheduler

import Luna.Pass (Pass)
import Luna.Pass.Basic (Compilation)

type OnDemandPass stage pass m =
    ( MonadIO m
    , Typeable pass
    , Pass.Compile stage pass m
    , Exception.MonadException Scheduler.Error m
    )

runPass :: forall stage pass m . OnDemandPass stage pass m
        => Pass stage pass () -> m ()
runPass !pass = Scheduler.evalT $ do
    Scheduler.registerPassFromFunction__ pass
    Scheduler.runPassSameThreadByType @pass
{-# INLINE runPass #-}

runPass' :: Pass Compilation Pass.BasicPass () -> IO ()
runPass' p = Graph.encodeAndEval @Compilation (runPass p)
{-# INLINE runPass' #-}

partitionsUnify :: Int -> IO ()
partitionsUnify i = runPass' $ do
    !a <- IR.var "a"
    !b <- IR.var "b"
    !u <- IR.unify a b
    let go !0 = let !o = pure () in o
        go !j = do
            !_ <- Partition.partition u
            go $! j - 1
    go i

main :: IO ()
main = partitionsUnify (10^6)

Put this in core/test/Main.hs and add this to luna-core.cabal:

executable bench-test
    main-is: Main.hs

    hs-source-dirs: test/

    build-depends:
        ansi-terminal -any,
        base -any,
        containers -any,
        convert -any,
        deepseq -any,
        ghc -any,
        layered-state -any,
        luna-autovector -any,
        luna-core -any,
        luna-cpp-containers -any,
        luna-data-storable -any,
        luna-data-typemap -any,
        luna-exception -any,
        luna-foreign-utils -any,
        luna-generic-traversable -any,
        luna-generic-traversable2 -any,
        luna-memory-manager -any,
        luna-memory-pool -any,
        luna-tuple-utils -any,
        mtl -any,
        primitive -any,
        prologue -any,
        structs -any,
        unboxed-ref >=0.4.0.0,
        vector -any

    ghc-options: -O2 -ticky -rtsopts -Wall

Results (with and without Monad m =>):

============= With Monad => =======================================================

luna git:(master) $ time (cabal-run bench-test +RTS -s)
  27,544,258,632 bytes allocated in the heap
      19,561,928 bytes copied during GC
         205,496 bytes maximum residency (2 sample(s))
          33,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)
                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     26366 colls,     0 par    0.270s   0.268s     0.0000s    0.0008s
  Gen  1         2 colls,     0 par    0.002s   0.002s     0.0008s    0.0011s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   13.449s  ( 13.487s elapsed)
  GC      time    0.272s  (  0.269s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   13.721s  ( 13.757s elapsed)

  %GC     time       2.0%  (2.0% elapsed)

  Alloc rate    2,048,118,786 bytes per MUT second

  Productivity  98.0% of total user, 98.0% of total elapsed

( cabal-run bench-test +RTS -s; )  13,72s user 0,04s system 99% cpu 13,761 total

============= Original ============================================================

luna git:(master) $ time (cabal-run bench-test +RTS -s)
   3,952,215,688 bytes allocated in the heap
       2,071,824 bytes copied during GC
         200,320 bytes maximum residency (2 sample(s))
          33,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3790 colls,     0 par    0.042s   0.043s     0.0000s    0.0008s
  Gen  1         2 colls,     0 par    0.001s   0.002s     0.0009s    0.0010s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    1.595s  (  1.605s elapsed)
  GC      time    0.043s  (  0.044s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    1.638s  (  1.650s elapsed)

  %GC     time       2.6%  (2.7% elapsed)

  Alloc rate    2,478,513,730 bytes per MUT second

  Productivity  97.4% of total user, 97.3% of total elapsed

( cabal-run bench-test +RTS -s; )  1,64s user 0,01s system 99% cpu 1,654 total

I'll now try with -ticky.

Changed 13 months ago by osa1

Attachment: original.ticky added

Ticky output of program in comment:10 without Monad m

Changed 13 months ago by osa1

Attachment: patch.ticky added

Ticky output of program in comment:10 with Monad m

Changed 13 months ago by osa1

Attachment: bench-test-orig.prof added

Prof output of program in comment:10 without Monad m

Changed 13 months ago by osa1

Attachment: bench-test-patch.prof added

Prof output of program in comment:10 with Monad m

comment:11 Changed 13 months ago by osa1

I attached four files for the ticky and prof outputs of the program in comment:10 with and without the Monad m => patch. I can't make sense of the ticky output -- it's really hard to see what's wrong in a hundred-line Core function, but perhaps someone else can figure it out.

One other thing I tried was to test the patch with -O0, and the numbers are almost identical:

=== ORIGINAL ===================================================================

luna git:(master) $ time (cabal-run bench-test +RTS -s)
  77,264,754,848 bytes allocated in the heap
     114,241,080 bytes copied during GC
         240,688 bytes maximum residency (2 sample(s))
          33,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     74218 colls,     0 par    0.256s   0.246s     0.0000s    0.0002s
  Gen  1         2 colls,     0 par    0.001s   0.001s     0.0003s    0.0006s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   25.042s  ( 25.168s elapsed)
  GC      time    0.257s  (  0.247s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   25.298s  ( 25.415s elapsed)

  %GC     time       1.0%  (1.0% elapsed)

  Alloc rate    3,085,464,250 bytes per MUT second

  Productivity  99.0% of total user, 99.0% of total elapsed

( cabal-run bench-test +RTS -s; )  25,30s user 0,12s system 99% cpu 25,423 total


=== PATCHED  ===================================================================

luna git:(master) $ time (cabal-run bench-test +RTS -s)
  77,200,755,440 bytes allocated in the heap
     114,115,976 bytes copied during GC
         241,064 bytes maximum residency (2 sample(s))
          33,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     74218 colls,     0 par    0.263s   0.254s     0.0000s    0.0002s
  Gen  1         2 colls,     0 par    0.000s   0.001s     0.0003s    0.0006s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   25.487s  ( 25.573s elapsed)
  GC      time    0.263s  (  0.254s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   25.750s  ( 25.827s elapsed)

  %GC     time       1.0%  (1.0% elapsed)

  Alloc rate    3,029,012,929 bytes per MUT second

  Productivity  99.0% of total user, 99.0% of total elapsed

( cabal-run bench-test +RTS -s; )  25,75s user 0,08s system 100% cpu 25,831 total

So it seems to me that with the different dictionary representation we're losing some optimization opportunities. I guess we could try to enable all optimizations again (with -O2) and selectively disable single optimization passes to see which one makes these two versions more similar. That may give an idea about which optimization is not applicable with the Monad m => patch.
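One way to run that bisection, sketched here only as a suggestion (which pass to toggle first is a guess), is to keep -O2 but add per-module OPTIONS_GHC pragmas to the suspect modules, switching off one standard optimization pass at a time:

{-# OPTIONS_GHC -O2 -fno-specialise #-}          -- disable type-class specialisation
-- {-# OPTIONS_GHC -O2 -fno-strictness #-}       -- disable demand analysis
-- {-# OPTIONS_GHC -O2 -fno-full-laziness #-}    -- disable let-floating
-- {-# OPTIONS_GHC -O2 -fno-cse #-}              -- disable common-subexpression elimination

Rebuild and rerun the benchmark after each toggle; whichever flag brings the two versions close together points at the pass that the Monad m => dictionary representation is blocking.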

comment:12 Changed 13 months ago by simonpj

When generating .ticky files I do this

ghc -O -ticky -dverbose-core2core -ddump-stg Foo.hs >& Foo.stg
./Foo +RTS -rFoo.ticky

The ticky files are utterly useless without the accompanying -dverbose-core2core -ddump-stg output, produced in the very same run of GHC.

Would you like to do that and upload the results?

comment:13 Changed 13 months ago by osa1

I already have the Core outputs; they're 140+ files in total (70+ files for each version) and generating them takes about 30 minutes for one version (the program is huge). I'll start generating the verbose and STG files now, but it'll probably take a few hours (just -ddump-simpl takes 30 minutes).

comment:14 Changed 13 months ago by simonpj

No no, don't do that!! Here are the big hitters in the .ticky files

==== Without Monad m (original.ticky) =========

	   15000000  984000000          0   2 SS                   vx{v s21YX} (main:Main) (fun) in r1YhB
	    9000000  520000000          0   4 pSSS                 $w$dLayersFoldableBuilder__{v r1YhB} (main:Main) (fun)
	   24000000  320000000          0   3 SpM                  main:Main.$w$sgo1{v r19} (fun)
	   16000000  240000000          0   3 SpM                  main:Main.$w$sgo3{v r1a} (fun)
	    4000000  232000000          0   2 SS                   sat_s2229{v} (main:Main) (fun) in r1YhB
	    3000000  144000000          0   2 SS                   sat_s21Zw{v} (main:Main) (fun) in s21YX
	    9000000   72000000          0  46 ++++++++++++++++++++ luna-core-0.0.6-inplace:Luna.IR.Term.$fStorable1UniTerm10{v rihy9} (fun)
	    1000000   48000000          0   2 SS                   lvl1355{v r1Yhx} (main:Main) (fun)
	    2000000   32000000          0   2 SI                   lvl1206{v r1Yf3} (main:Main) (fun)
	    2000000   32000000          0   1 M                    luna-core-0.0.6-inplace:Luna.IR.Term.$WUniTermVar{v rihwq} (fun)
	    1000000   24000000          0   2 SI                   lvl1176{v r1Yez} (main:Main) (fun)
	    1000000   24000000          0   1 M                    luna-core-0.0.6-inplace:Luna.IR.Term.$WUniTermUnify{v rihwr} (fun)


=========== With Monad m (patched.ticky) ==============
	   60000000 2496000000          0   7 +++++M.              luna-core-0.0.6-inplace:Data.Graph.Fold.Deep.$fLayerBuilderDeepmType_$clayerBuild{v rGoW} (fun)
	   36000000 1584000000          0   3 >S.                  $s$fFunctorStateT_$cfmap{v rmuaf} (luna-core-0.0.6-inplace:OCI.Pass.Definition.Class) (fun)
	   24000000 1152000000          0   1 S                    sat_s2hor{v} (main:Main) (fun) in r2gUO
	   20000000  960000000          0   1 S                    sat_s2hqt{v} (main:Main) (fun) in r2gVh
	    6000000  480000000          0   3 pSS                  $wlvl{v r2gUO} (main:Main) (fun)
	    5000000  400000000          0   3 pSS                  $wlvl3{v r2gVh} (main:Main) (fun)
	    8000000  384000000          0   1 S                    sat_s2hoV{v} (main:Main) (fun) in r2gUQ
	    6000000  384000000          0   3 SSS                  $s$fReadertlayerm_$cread__3{v r2hmG} (main:Main) (fun)
	    6000000  384000000          0   3 SSS                  $s$fReadertlayerm_$cread__2{v r2hmw} (main:Main) (fun)
	   24000000  320000000          0   3 SpM                  main:Main.$w$sgo3{v r8e} (fun)
	    6000000  288000000          0   3 .SS                  luna-core-0.0.6-inplace:OCI.Pass.Definition.Class.$fApplicativePass5{v rmu8N} (fun)
	   12000000  288000000          0   2 .S                   lvl1{v rmual} (luna-core-0.0.6-inplace:OCI.Pass.Definition.Class) (fun)
	   16000000  240000000          0   3 SpM                  main:Main.$w$sgo1{v r8b} (fun)
	    2000000  160000000          0   3 pSS                  $wlvl1{v r2gUQ} (main:Main) (fun)
	    8000000  120000000          0   7 ++++++M              $w$cbuild1{v riLVV} (luna-core-0.0.6-inplace:Luna.IR.Term) (fun)
	    9000000   96000000          0   3 SSM                  sat_sHlw{v} (luna-core-0.0.6-inplace:Data.Graph.Fold.Deep) (fun) in rGoW
	    1000000   48000000          0   2 SS                   lvl1702{v r2hn9} (main:Main) (fun)
	    2000000   32000000          0   2 SI                   $cpeekByteOff1{v r3Y9N} (luna-core-0.0.6-inplace:Luna.IR.Term.Core) (fun)
	    2000000   32000000          0   1 M                    luna-core-0.0.6-inplace:Luna.IR.Term.$WUniTermVar{v rihWA} (fun)
	    1000000   32000000          0   1 .                    sat_siM5u{v} (luna-core-0.0.6-inplace:Luna.IR.Term) (fun) in riLVV

So we are getting a lot of allocation in

  • luna-core-0.0.6-inplace:Data.Graph.Fold.Deep.$fLayerBuilderDeepmType_$clayerBuild
  • $s$fFunctorStateT_$cfmap{v rmuaf} (luna-core-0.0.6-inplace:OCI.Pass.Definition.Class)

Also, in the original version, the big allocation is in Main.

So I'd generate Main.stg and the STG files for Data.Graph.Fold.Deep, and perhaps OCI.Pass.Definition.Class, to begin with.

comment:15 Changed 13 months ago by MikolajKonarski

Cc: MikolajKonarski added

comment:16 Changed 13 months ago by alpmestan

Cc: alpmestan added

comment:17 Changed 10 months ago by osa1

Milestone: 8.8.1 deleted

Unfortunately we won't be able to get back to this for 8.8.1, and I don't know if this is doable in its current form, so I'm removing the milestone. We should focus on finding a smaller reproducer, as the current reproducer takes forever to build and has dozens (if not hundreds) of dependencies (although the change we need to make is not in a dependency, it may still be related to the code generated for the dependencies; in any case, fewer dependencies are always better when debugging).

comment:18 Changed 10 months ago by osa1

Owner: osa1 deleted

comment:19 Changed 10 months ago by sgraf

Cc: sgraf added

comment:20 Changed 8 months ago by maoe

Cc: maoe added