Opened 18 months ago

Last modified 12 months ago

#14941 new bug

Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown)

Reported by: nh2 Owned by: davide
Priority: normal Milestone:
Component: Compiler Version: 8.2.2
Keywords: Cc: nh2
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by nh2)

selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e, VG.Mutable v ~ vm)
  => Int -> v e ->           vm (PrimState m) e -> Int -> Int -> m ()      -- slow

selectVectorDestructive2 :: (PrimMonad m, VG.Vector v e, Ord e, Show e)
  => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m ()      -- fast

These two functions are identical, except that one binds VG.Mutable v ~ vm as a constraint, while the other uses VG.Mutable v directly in the type to the right of the =>.

The second function is 10x faster.

I would expect them to be equally fast.

The code of the functions is identical, I change only the type declaration.

The slowdown happens because with the first signature, primitives like unsafeRead are not inlined, and as a result the already-unboxed Int#s are boxed back into Ints before each call to unsafeRead.

In particular, in -ddump-simpl, the slow version has

$wpartitionLoop2_rgEy
  :: forall (m :: * -> *) (vm :: * -> * -> *) e.
     (PrimMonad m, MVector vm e, Ord e) =>
     vm (PrimState m) e
     -> GHC.Prim.Int#
     -> e
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> m Int
$wpartitionLoop2_rgEy
  = \... -> 
          let {
            endIndex_a7Ay :: Int
            ...
            endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in
  ...
                                     (VGM.basicUnsafeRead
                                        ...
                                        endIndex_a7Ay)
  ...

while the fast version has

$w$spartitionLoop2_rgUN
  :: GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)

...

$spartitionLoop2_rgUP
  :: VG.Mutable VU.Vector (PrimState IO) Int
     -> Int
     -> Int
     -> Int
     -> Int
     -> Int
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)

So with VG.Mutable v (PrimState m) e in the type signature, GHC managed to inline + specialise it all the way down to concrete types (VUM.MVector, IO, and Int as the element type), and consequently inline basicUnsafeRead on unboxed Int#.

But with VG.Mutable v ~ vm, GHC keeps vm (PrimState m) e all the way through and passes around dictionaries, so eventually it cannot inline basicUnsafeRead. It re-boxes already-unboxed values, as in endIndex_a7Ay = GHC.Types.I# ww2_sfZn, before passing them into the non-inlined call of basicUnsafeRead, making a tight loop allocate that normally wouldn't allocate.

Why might rewriting the type signature in such a trivial way make this happen?


I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit cc4677c36ee (edit: for the latter I commented out the quickcheck- and criterion-related stuff because their deps don't build there yet).

Reproducer:

Trying to increase -funfolding-use-threshold or -funfolding-keeness-factor did not change the situation.

Attachments (1)

simple_case.zip (1.7 KB) - added by davide 12 months ago.
Simple case and benchmark.


Change History (11)

comment:1 Changed 18 months ago by mpickering

I can reproduce this but I had to delete the criterion dependency in the cabal file.

comment:2 Changed 18 months ago by nh2

Description: modified (diff)

comment:3 Changed 18 months ago by mpickering

I copied the core of selectVectorDestructive2 from the good and bad examples and diffed them. They looked basically the same apart from lots and lots of code like the following:

In bad:

1# ->
      case $wlvl_rnHf
             @ VUM.MVector
             @ Int
             @ IO
             Data.Vector.Unboxed.Base.$fMVectorMVectorInt
             ((Data.Vector.Primitive.Mutable.MVector
                 @ ghc-prim-0.5.2.0:GHC.Prim.RealWorld
                 @ Int
                 ww1_smP2
                 ww2_smP3
                 ww3_smP4)
              `cast` (Sym (Data.Vector.Unboxed.Base.N:R:MVectorsInt[0]
                               <ghc-prim-0.5.2.0:GHC.Prim.RealWorld>_N)
                      ; Sym (Data.Vector.Unboxed.Base.DR:MVectorsInt0[0]
                                 (Control.Monad.Primitive.D:R:PrimStateIO[0]))
                      :: (Data.Vector.Primitive.Mutable.MVector
                            ghc-prim-0.5.2.0:GHC.Prim.RealWorld
                            Int :: *)
                         ~R# (VUM.MVector (PrimState IO) Int :: *)))
             ww4_smP8
      of wild_00 {
      }

In good:

1# -> case $wlvl_rnRh ww2_smWB ww3_smWC ww4_smWG of wild_00 { }

So it looks like on every call the MVector is being constructed rather than created once and shared? Does anyone know immediately why this is such a problem?

comment:4 Changed 15 months ago by nh2

This seems to be unrelated to type families and instead a problem with ~ in general, because:

I now also found a case where using (val ~ Int) => produces slower code than substituting val with Int syntactically, or using SPECIALISE.

myfun :: forall val . (Eq val, Show val, val ~ Int) => Mytype1 val -> Mytype2 val -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO ()

is slow, but

{-# SPECIALIZE myfun :: Mytype1 Int -> Mytype2 Int -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO () #-}

myfun :: forall val . (Eq val, Show val, val ~ Int) => Mytype1 val -> Mytype2 val -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO ()

is fast (in my case, a 30% performance difference).
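For reference, here is a minimal compilable sketch of this workaround, with hypothetical Wrap1/Wrap2 standing in for the elided Mytype1/Mytype2 (and most arguments dropped); it only illustrates where the SPECIALIZE pragma goes relative to the constrained signature:

```haskell
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

-- Hypothetical stand-ins for the real Mytype1/Mytype2; only the shape matters.
newtype Wrap1 a = Wrap1 a
newtype Wrap2 a = Wrap2 a

-- Same pattern as myfun above: an equality constraint in the signature,
-- plus a SPECIALIZE pragma pinning it to the concrete instantiation.
myfun :: forall val. (Eq val, Show val, val ~ Int)
      => Wrap1 val -> Wrap2 val -> IO ()
{-# SPECIALIZE myfun :: Wrap1 Int -> Wrap2 Int -> IO () #-}
myfun (Wrap1 x) (Wrap2 y) = print (x == y)

main :: IO ()
main = myfun (Wrap1 1) (Wrap2 (1 :: Int))
```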

Diffing the output of -ddump-simpl -ddump-to-file -dsuppress-idinfo -dsuppress-coercions -dsuppress-module-prefixes -dsuppress-uniques shows how the call looks different from a module that calls myfun:

val ~ Int version:

$d~~ :: (Int :: *) ~~ (Int :: *)
$d~~ = Eq# @ * @ * @ Int @ Int @~ ...

...
                       $wmyfun
                         @ Int
                         $fEqInt
                         ($d~~ `cast` ...)
                         y
                         wild
                         mything1
                         ipv7
                         mything2
                         ww1
                         w9)
...

SPECIALISEd val ~ Int version:

                       myfun
                         y
                         wild
                         mything1
                         ipv7
                         mything2
                         ww1

I would have expected that after the simplifier has run, GHC would have used all the information it has to produce the best Core (e.g. using the fact that it knows val ~ Int even without SPECIALISE), but it did not.

comment:5 Changed 15 months ago by simonpj

I think I see. Can you offer a small example?

I think something like this is happening. We have

...(\(g :: Int ~ val).   ...f (d |> Num g)... ) ...

where d :: Num Int

Now, we can't float that call up to the definition of f because it mentions g. But we could in principle specialise f for Num Int, and then use that specialised version at the call.

I'm not certain this is exactly what is happening for you. Hence wanting a test case. (You don't need to exhibit worse perf; just show that a specialisation is created and used without the equality, but not with it.)

comment:6 Changed 12 months ago by davide

Owner: set to davide

comment:7 Changed 12 months ago by davide

I've created a simple case where this happens:

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}

module F (f) where

f :: Int -> Int                           -- FAST
-- f :: forall a. (a ~ Int) => a -> a     -- SLOW
f x = x + x
{-# NOINLINE f #-}

Compiling with:

$ ghc-8.4.3 -O -ddump-simpl -dsuppress-coercions F.hs

I get this core (note worker-wrapper transformation):

-- RHS size: {terms: 4, types: 1, coercions: 0, joins: 0/0}
F.$wf [InlPrag=NOINLINE] :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Caf=NoCafRefs, Str=<S,U>]
F.$wf = \ (ww_s1ay :: GHC.Prim.Int#) -> GHC.Prim.+# ww_s1ay ww_s1ay

-- RHS size: {terms: 10, types: 4, coercions: 0, joins: 0/0}
f [InlPrag=NOUSERINLINE[0]] :: Int -> Int
[GblId,
 Arity=1,
 Caf=NoCafRefs,
 Str=<S(S),1*U(U)>m,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
         Tmpl= \ (w_s1av [Occ=Once!] :: Int) ->
                 case w_s1av of { GHC.Types.I# ww1_s1ay [Occ=Once] ->
                 case F.$wf ww1_s1ay of ww2_s1aC { __DEFAULT ->
                 GHC.Types.I# ww2_s1aC
                 }
                 }}]
f = \ (w_s1av :: Int) ->
      case w_s1av of { GHC.Types.I# ww1_s1ay ->
      case F.$wf ww1_s1ay of ww2_s1aC { __DEFAULT ->
      GHC.Types.I# ww2_s1aC
      }
      }

Swapping to f :: forall a. (a ~ Int) => a -> a gives:

-- RHS size: {terms: 10, types: 21, coercions: 12, joins: 0/0}
f [InlPrag=NOINLINE] :: forall a. ((a :: *) ~ (Int :: *)) => a -> a
[GblId, Arity=2, Caf=NoCafRefs, Str=<S(S),1*U(1*U)><S,U>m]
f = \ (@ a_a13U)
      ($d~_a13W :: (a_a13U :: *) ~ (Int :: *))
      (eta_B1 :: a_a13U) ->
      case GHC.Types.HEq_sc
             @ * @ * @ a_a13U @ Int ($d~_a13W `cast` <Co:5>)
      of co_a14c
      { __DEFAULT ->
      (GHC.Num.$fNumInt_$c+
         (eta_B1 `cast` <Co:2>) (eta_B1 `cast` <Co:2>))
      `cast` <Co:3>
      }

I ran a benchmark to confirm the performance difference:

benchmarking Int -> Int
time                 12.47 ns   (12.43 ns .. 12.55 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 12.53 ns   (12.48 ns .. 12.59 ns)
std dev              173.4 ps   (140.2 ps .. 239.2 ps)
variance introduced by outliers: 17% (moderately inflated)

benchmarking (a ~ Int) => a -> a
time                 15.72 ns   (15.69 ns .. 15.76 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 15.80 ns   (15.74 ns .. 16.01 ns)
std dev              327.8 ps   (135.0 ps .. 691.2 ps)
variance introduced by outliers: 32% (moderately inflated)

Changed 12 months ago by davide

Attachment: simple_case.zip added

Simple case and benchmark.

comment:8 Changed 12 months ago by davide

Regarding the simple example

f :: forall a. (a ~ Int) => a -> a, the difference in performance is somewhat expected. This may be a different issue than the example given in the ticket description. In short, a ~ Int is a proof that type a is equal to type Int. In Core, a ~ Int is a regular boxed GADT, meaning that it could be bottom, i.e. an invalid proof (this is the main mechanism behind -fdefer-type-errors). Unboxing a ~ b corresponds to checking the proof, which is required to coerce the input binding from a to Int. Normally the (a ~ Int) would be optimized away (as described here in section 7.3), but that requires a worker-wrapper transformation that never happens. Removing NOINLINE allows f to be optimized across modules, which closes the performance gap.
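The evidence-passing view can be sketched in source Haskell using the (:~:) GADT from Data.Type.Equality, which plays the role of the boxed proof; matching on Refl is the "proof check" that unboxing the constraint corresponds to:

```haskell
{-# LANGUAGE GADTs #-}
import Data.Type.Equality ((:~:) (Refl))

-- A boxed proof that a equals Int, analogous to the evidence GHC passes
-- for an (a ~ Int) constraint. Being an ordinary boxed value, it could in
-- principle be bottom; pattern matching on Refl checks the proof, after
-- which GHC knows a = Int and can use Int operations directly.
double :: (a :~: Int) -> a -> Int
double Refl x = x + x

main :: IO ()
main = print (double Refl (21 :: Int))
```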

Regarding the original example

Unlike my simple example, all the code is in one module, so I expect the equality proof VG.Mutable v ~ vm to be optimized away (again see here, section 7.3). With GHC 8.2.2, when compiling the slow version, I see that selectVectorDestructive2 is specialized to $sselectVectorDestructive2 :: Int -> Vector Int -> MVector (PrimState IO) Int -> Int -> Int -> IO () (pass 2). This is good, but for some reason myread and partitionLoop2 are not specialized even though they are used by $sselectVectorDestructive2:

$sselectVectorDestructive2 =
...
    let 
        
        $dMVector =
          Data.Vector.Generic.Base.$p1Vector
            @Vector
            @Int
            Data.Vector.Unboxed.Base.$fVectorVectorInt
    in
...
          (Main.myread
                @IO
                @MVector
                @Int
                Control.Monad.Primitive.$fPrimMonadIO
                $dMVector
                GHC.Classes.$fOrdInt
                GHC.Show.$fShowInt
                v
                begin)
...
          (Main.partitionLoop2
            @IO
            @MVector
            @Int
            Control.Monad.Primitive.$fPrimMonadIO
            $dMVector
            GHC.Classes.$fOrdInt
            GHC.Show.$fShowInt
            v
            begin
            pivot
            (GHC.Types.I# ...)

In the fast version, myread and partitionLoop2 are specialized in this pass. I noticed 2 other differences:

  • The fast version floats $dMVector to a top-level binding.
  • The fast version specializes to Mutable Vector (PrimState IO) Int instead of MVector (PrimState IO) Int. Note that Mutable is a type family and Mutable Vector = MVector.

comment:9 Changed 12 months ago by simonpj

Thanks for the simpler example. I think I now understand what is going on. It's all to do with strictness analysis.

Consider this function, and suppose it is strict in x:

f1 :: Int -> blah
f1 x = ...(case x of I# y -> ...)...

The strictness analysis sees that f1 is strict in x and we get a w/w split

f1 :: Int -> blah
f1 x = case x of I# y -> $wf1 y

$wf1 :: Int# -> blah
$wf1 y = let x = I# y in ...original body of f...

Now suppose that we have

type family F a
type instance F Bool = Int

f2 :: F Bool -> blah
f2 x = ...same as before...

In fact the strictness analysis still sees that f2 is strict in x, and worker wrapper works too -- all by using the family instances. We get this

f2 :: F Bool -> blah
f2 x = case (x |> g) of I# y -> $wf2 y

$wf2 :: Int# -> blah
$wf2 y = let x = (I# y) |> sym g
         in ..as before...

Here g is a coercion g :: F Bool ~ Int, constructed from the family instances. This coercion is generated by topNormaliseType_maybe, itself called from deepSplitCprType_maybe in WwLib, during worker/wrapper generation.

But it's harder if the coercion is purely local. Let's start with this (yes, I know you can't write (~#) in source Haskell, but imagine this is Core):

f3 :: forall a. (a ~# Int) => a -> blah
f3 a (g :: a ~# Int) (x :: a) = ...as before...

What we'd like is this:

f3 :: forall a. (a ~# Int) => a -> blah
f3 a (g :: a ~# Int) (x :: a) = case (x |> g) of I# y -> $wf3 a g y

$wf3 :: forall a. (a ~# Int) => Int# -> blah
$wf3 a g y = let x = (I# y) |> sym g
            in ...as before...

but alas, neither the demand analyser nor the worker/wrapper generator is clever enough to do this. We need a kind of normaliseType that can take some local "givens" as well as the top-level family instance envs.

This has the same ring as something Ryan wanted in the coverage checker.

It's not obvious exactly what "should work". Eg what about (Int ~# a), or even ([Int] ~# [a])?

Finally, there is the question of (~) vs (~#). I think we are very aggressive about unboxing (~) constraints, so I'd like this:

f4 :: forall a. (a ~ Int) => a -> blah
f4 a (g :: a ~ Int) (x :: a)
  = case g of MkTwiddle (g2 :: a ~# Int) ->
    case (x |> g2) of I# y -> $wf4 a g2 y

$wf4 :: forall a. (a ~# Int) => Int# -> blah
$wf4 a g2 y = let x = (I# y) |> sym g2
                  g = MkTwiddle g2
            in ...as before...

Making all this happen will take some work though.

How important is it? I'm tempted to say "don't write types like that"! (Although the type checker goes to some trouble to support them well.)

Last edited 12 months ago by simonpj (previous) (diff)

comment:10 in reply to:  9 Changed 12 months ago by nh2

Replying to simonpj:

How important is it? I'm tempted to say "don't write types like that"!

From the user's perspective, I can put forward two use cases:

  1. Type-level let bindings

(..., VG.Mutable v ~ vm) => ... -> vm (PrimState m) e -> ...

From the issue description. This is essentially a let binding at the type-signature level, allowing me to write the type signature more clearly and legibly, especially if vm is used multiple times.

  2. Refactorings

For the code in comment 4, (val ~ Int) =>, this arose from a refactoring where I replaced Int by val across a large code base, and then added val ~ Int to the top-level-ish function to check whether performance was unaffected.

I'd probably be OK without this being fixed, now that I know what's going on. But if it's not fixed, we somehow have to do a much better job of giving warnings or educating people that you can't just apply a substitution principle when "cleaning up" type signatures with ~. I incorrectly assumed this change would be type-level only, and as a result had to spend a multi-day investigation on where my performance went, because I expected any mistake on my side but not this. Especially when writing code using vector, whose type-family-heavy API, in my opinion, almost begs for ~ to be used to achieve readable signatures.
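The first use case can be sketched self-containedly; the Mutable family, Vec, and MVec below are hypothetical stand-ins for VG.Mutable and the vector types, and only the shape of the signature matters:

```haskell
{-# LANGUAGE TypeFamilies #-}

-- Hypothetical stand-ins for the vector API.
type family Mutable v :: * -> *
data Vec e = Vec [e]
newtype MVec e = MVec [e] deriving Show

type instance Mutable (Vec e) = MVec

-- "Type-level let": bind the family application to vm once in the
-- context and reuse vm in the signature, instead of repeating
-- Mutable (Vec e) at every occurrence.
fill :: (Mutable (Vec e) ~ vm) => e -> Int -> vm e
fill x n = MVec (replicate n x)

main :: IO ()
main = print (fill 'a' 3)
```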
