Opened 22 months ago
Last modified 16 months ago
#14941 new bug
Switching direct type family application to EqPred (~) prevents inlining in code using vector (10x slowdown)
Reported by: nh2
Owned by: davide
Priority: normal
Component: Compiler
Version: 8.2.2
Cc: nh2
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Description
```haskell
-- slow
selectVectorDestructive2
  :: (PrimMonad m, VG.Vector v e, Ord e, Show e, VG.Mutable v ~ vm)
  => Int -> v e -> vm (PrimState m) e -> Int -> Int -> m ()

-- fast
selectVectorDestructive2
  :: (PrimMonad m, VG.Vector v e, Ord e, Show e)
  => Int -> v e -> VG.Mutable v (PrimState m) e -> Int -> Int -> m ()
```
These two functions are identical, except that one has `VG.Mutable v ~ vm` as a constraint, while the other applies `VG.Mutable v` directly in the type to the right of the `=>`.

The second function is 10x faster. I would expect them to be equally fast. The code of the functions is identical; I change only the type declaration.

The slowness happens because with the first function, inlining of primitives like `unsafeRead` does not happen, and thus the `Int#`s are boxed back into `Int`s when calling `unsafeRead`.
In particular, in the `-ddump-simpl` output, the slow version has
```haskell
$wpartitionLoop2_rgEy
  :: forall (m :: * -> *) (vm :: * -> * -> *) e.
     (PrimMonad m, MVector vm e, Ord e)
  => vm (PrimState m) e
  -> GHC.Prim.Int# -> e -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
  -> m Int
$wpartitionLoop2_rgEy
  = \... ->
    let { endIndex_a7Ay :: Int
          ...
          endIndex_a7Ay = GHC.Types.I# ww2_sfZn } in
    ... (VGM.basicUnsafeRead ... endIndex_a7Ay) ...
```
while the fast version has
```haskell
$w$spartitionLoop2_rgUN
  :: GHC.Prim.Int# -> GHC.Prim.Int#
  -> GHC.Prim.MutableByteArray# GHC.Prim.RealWorld
  -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
  -> GHC.Prim.Int#
  -> GHC.Prim.State# GHC.Prim.RealWorld
  -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
...
$spartitionLoop2_rgUP
  :: VG.Mutable VU.Vector (PrimState IO) Int
  -> Int -> Int -> Int -> Int -> Int
  -> GHC.Prim.State# GHC.Prim.RealWorld
  -> (# GHC.Prim.State# GHC.Prim.RealWorld, Int #)
```
So with `VG.Mutable v (PrimState m) e` in the type signature, GHC managed to inline and specialise it all the way down to concrete types (`VUM.MVector`, `IO`, and `Int` as the element type), and consequently inlined `basicUnsafeRead` on unboxed `Int#`.
But with `VG.Mutable v ~ vm`, GHC keeps `vm (PrimState m) e` all the way down and passes dictionaries around, so it eventually cannot inline `basicUnsafeRead`. It re-boxes already-unboxed values, as in `endIndex_a7Ay = GHC.Types.I# ww2_sfZn`, before passing them to the non-inlined call of `basicUnsafeRead`, making a tight loop allocate that normally wouldn't allocate.
Why might rewriting the type signature in such a trivial way make this happen?
I have tested this on GHC 8.0.2, GHC 8.2.2, and GHC 8.5 HEAD commit cc4677c36ee (edit: for the latter I commented out the quickcheck- and criterion-related stuff because their deps don't build there yet).
Reproducer:
- https://github.com/nh2/haskell-quickselect-median-of-medians/blob/0efd6293e779fda2d864ec3d75329fb16b8af6d9/Main.hs#L506
- Running instructions are in the commit message; in short:

  ```
  stack exec -- ghc -O --make Main.hs -rtsopts -ddump-simpl -dsuppress-coercions -fforce-recomp -ddump-to-file -fno-full-laziness && ./Main +RTS -sstderr
  ```
- For that file, I have pregenerated `-dverbose-core2core` output here: https://github.com/nh2/haskell-quickselect-median-of-medians/tree/0efd6293e779fda2d864ec3d75329fb16b8af6d9/slowness-analysis

I originally just wanted to write a fast median-of-medians implementation at ZuriHac 2017, but got totally derailed by this performance problem. The version linked here is a total mess because of my mauling it to track down this performance issue.
Trying to increase `-funfolding-use-threshold` or `-funfolding-keeness-factor` did not change the situation.
Attachments (1)
Change History (11)
comment:1 Changed 22 months ago by
comment:2 Changed 22 months ago by
Description: modified (diff)
comment:3 Changed 22 months ago by
I copied the core of `selectVectorDestructive2` from the good and bad examples and diffed them. They looked basically the same, apart from lots and lots of code like the following:
In bad:
```haskell
1# -> case $wlvl_rnHf
           @ VUM.MVector @ Int @ IO
           Data.Vector.Unboxed.Base.$fMVectorMVectorInt
           ((Data.Vector.Primitive.Mutable.MVector
               @ ghc-prim-0.5.2.0:GHC.Prim.RealWorld @ Int
               ww1_smP2 ww2_smP3 ww3_smP4)
            `cast` (Sym (Data.Vector.Unboxed.Base.N:R:MVectorsInt[0]
                           <ghc-prim-0.5.2.0:GHC.Prim.RealWorld>_N)
                    ; Sym (Data.Vector.Unboxed.Base.DR:MVectorsInt0[0]
                             (Control.Monad.Primitive.D:R:PrimStateIO[0]))
                    :: (Data.Vector.Primitive.Mutable.MVector
                          ghc-prim-0.5.2.0:GHC.Prim.RealWorld Int :: *)
                       ~R#
                       (VUM.MVector (PrimState IO) Int :: *)))
           ww4_smP8
      of wild_00 { }
```
In good:
```haskell
1# -> case $wlvl_rnRh ww2_smWB ww3_smWC ww4_smWG of wild_00 { }
```
So it looks like on every call the `MVector` is being constructed, rather than created once and shared. Does anyone know immediately why this is such a problem?
comment:4 Changed 19 months ago by
This seems to be unrelated to type families and to be a problem with `~` in general, because I have now also found a case where using `(val ~ Int) =>` produces slower code than substituting `val` with `Int` syntactically, or than using `SPECIALISE`.
```haskell
myfun
  :: forall val . (Eq val, Show val, val ~ Int)
  => Mytype1 val -> Mytype2 val -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO ()
```
is slow, but
```haskell
{-# SPECIALIZE myfun :: Mytype1 Int -> Mytype2 Int -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO () #-}
myfun
  :: forall val . (Eq val, Show val, val ~ Int)
  => Mytype1 val -> Mytype2 val -> Mytype3 -> Mytype4 -> Mytype5 -> Mytype6 -> IO ()
```
is fast (in my case, a 30% performance difference).
Diffing the output of `-ddump-simpl -ddump-to-file -dsuppress-idinfo -dsuppress-coercions -dsuppress-module-prefixes -dsuppress-uniques` shows how the call looks different from a module that calls `myfun`:
The `val ~ Int` version:

```haskell
$d~~ :: (Int :: *) ~~ (Int :: *)
$d~~ = Eq# @ * @ * @ Int @ Int @~ ...
...
$wmyfun @ Int $fEqInt ($d~~ `cast` ...) y wild mything1 ipv7 mything2 ww1 w9) ...
```
The `SPECIALISE`d `val ~ Int` version:

```haskell
myfun y wild mything1 ipv7 mything2 ww1
```
I would have expected that after the simplifier has run, GHC would use all the information it has to produce the best Core (e.g. the fact that it knows `val ~ Int`, even without `SPECIALISE`), but it did not.
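For reference, the `SPECIALISE` idiom described in this comment can be sketched with a toy self-contained function; the name `scale` and its type are hypothetical stand-ins for the ticket's `myfun`/`Mytype*` code, not the actual code:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}  -- enables (~) constraints

-- A polymorphic function pinned to Int via an equality constraint,
-- mirroring the shape of 'myfun' above. The SPECIALIZE pragma asks
-- GHC to also emit a monomorphic Int copy for call sites to use.
scale :: forall val. (Num val, val ~ Int) => val -> [val] -> [val]
{-# SPECIALIZE scale :: Int -> [Int] -> [Int] #-}
scale k = map (* k)
```

Whether the specialised copy actually wins back the performance depends on the optimiser choosing it at call sites, which is what the Core diff above is checking.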
comment:5 Changed 19 months ago by
I think I see. Can you offer a small example?
I think something like this is happening. We have
```haskell
...(\(g :: Int ~ val). ...f (d |> Num g)... ) ...
  where d :: Num Int
```
Now, we can't float that call up to the definition of `f`, because it mentions `g`. But we could in principle specialise `f` for `Num Int`, and then use that specialised version at the call.
I'm not certain this is exactly what is happening for you. Hence wanting a test case. (You don't need to exhibit worse perf; just that a specialisation is created and used without the equality, but not with.)
comment:6 Changed 16 months ago by
Owner: set to davide
comment:7 Changed 16 months ago by
I've created a simple case where this happens:
```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}

module F (f) where

f :: Int -> Int                        -- FAST
-- f :: forall a. (a ~ Int) => a -> a  -- SLOW
f x = x + x
{-# NOINLINE f #-}
```
Compiling with:
```
$ ghc-8.4.3 -O -ddump-simpl -dsuppress-coercions F.hs
```
I get this core (note worker-wrapper transformation):
```haskell
-- RHS size: {terms: 4, types: 1, coercions: 0, joins: 0/0}
F.$wf [InlPrag=NOINLINE] :: GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=1, Caf=NoCafRefs, Str=<S,U>]
F.$wf = \ (ww_s1ay :: GHC.Prim.Int#) -> GHC.Prim.+# ww_s1ay ww_s1ay

-- RHS size: {terms: 10, types: 4, coercions: 0, joins: 0/0}
f [InlPrag=NOUSERINLINE[0]] :: Int -> Int
[GblId, Arity=1, Caf=NoCafRefs, Str=<S(S),1*U(U)>m,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
         Tmpl= \ (w_s1av [Occ=Once!] :: Int) ->
                 case w_s1av of { GHC.Types.I# ww1_s1ay [Occ=Once] ->
                 case F.$wf ww1_s1ay of ww2_s1aC { __DEFAULT ->
                 GHC.Types.I# ww2_s1aC } }}]
f = \ (w_s1av :: Int) ->
      case w_s1av of { GHC.Types.I# ww1_s1ay ->
      case F.$wf ww1_s1ay of ww2_s1aC { __DEFAULT ->
      GHC.Types.I# ww2_s1aC } }
```
Swapping to `f :: forall a. (a ~ Int) => a -> a` gives:
```haskell
-- RHS size: {terms: 10, types: 21, coercions: 12, joins: 0/0}
f [InlPrag=NOINLINE] :: forall a. ((a :: *) ~ (Int :: *)) => a -> a
[GblId, Arity=2, Caf=NoCafRefs, Str=<S(S),1*U(1*U)><S,U>m]
f = \ (@ a_a13U)
      ($d~_a13W :: (a_a13U :: *) ~ (Int :: *))
      (eta_B1 :: a_a13U) ->
      case GHC.Types.HEq_sc @ * @ * @ a_a13U @ Int ($d~_a13W `cast` <Co:5>)
      of co_a14c { __DEFAULT ->
        (GHC.Num.$fNumInt_$c+ (eta_B1 `cast` <Co:2>) (eta_B1 `cast` <Co:2>))
        `cast` <Co:3>
      }
```
I ran a benchmark to confirm the performance difference:
```
benchmarking Int -> Int
time                 12.47 ns   (12.43 ns .. 12.55 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 12.53 ns   (12.48 ns .. 12.59 ns)
std dev              173.4 ps   (140.2 ps .. 239.2 ps)
variance introduced by outliers: 17% (moderately inflated)

benchmarking (a ~ Int) => a -> a
time                 15.72 ns   (15.69 ns .. 15.76 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 15.80 ns   (15.74 ns .. 16.01 ns)
std dev              327.8 ps   (135.0 ps .. 691.2 ps)
variance introduced by outliers: 32% (moderately inflated)
```
comment:8 Changed 16 months ago by
Regarding the simple example

For `f :: forall a. (a ~ Int) => a -> a`, the difference in performance is somewhat expected; this may be a different issue than the example given in the ticket description. In short, `a ~ Int` is a proof that type `a` is equal to type `Int`. In Core, `a ~ Int` is a regular boxed GADT, meaning that it could be bottom, i.e. an invalid proof (this is the main mechanism behind `-fdefer-type-errors`). Unboxing an `a ~ b` corresponds to checking the proof, which is required to coerce the input binding from `a` to `Int`. Normally the `(a ~ Int)` argument would be optimized away (as described here in section 7.3), but that requires a worker-wrapper transformation that never happens here. Removing `NOINLINE` allows `f` to be optimized across modules, which closes the performance gap.
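The "boxed proof that must be checked" can be seen at the source level with the user-facing equality witness, as in `Data.Type.Equality`; a minimal re-definition is sketched here so the snippet is self-contained:

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE TypeOperators #-}

-- A boxed equality witness, analogous to the lifted (~) evidence
-- that GHC passes at runtime. Like any lifted value it could be
-- bottom, so it must be pattern-matched (the proof "checked")
-- before the coercion it carries may be used.
data a :~: b where
  Refl :: a :~: a

castWith :: a :~: b -> a -> b
castWith Refl x = x  -- matching on Refl forces (checks) the proof
```

If the witness were `undefined`, `castWith` would diverge before ever touching `x` — which is why the unboxing step described above is semantically a proof check rather than a no-op.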
Regarding the original example

Unlike my simple example, all the code is in one module, so I expect the equality proof `VG.Mutable v ~ vm` to be optimized away (again, see here, section 7.3). With GHC 8.2.2, when compiling the slow version, I see that `selectVectorDestructive2` is specialized to

```haskell
$sselectVectorDestructive2
  :: Int -> Vector Int -> MVector (PrimState IO) Int -> Int -> Int -> IO ()
```

(pass 2). This is good, but for some reason `myread` and `partitionLoop2` are not specialized, even though they are used by `$sselectVectorDestructive2`:

```haskell
$sselectVectorDestructive2
  = ...
    let $dMVector
          = Data.Vector.Generic.Base.$p1Vector @Vector @Int
              Data.Vector.Unboxed.Base.$fVectorVectorInt
    in ...
       (Main.myread @IO @MVector @Int
          Control.Monad.Primitive.$fPrimMonadIO $dMVector
          GHC.Classes.$fOrdInt GHC.Show.$fShowInt v begin)
       ...
       (Main.partitionLoop2 @IO @MVector @Int
          Control.Monad.Primitive.$fPrimMonadIO $dMVector
          GHC.Classes.$fOrdInt GHC.Show.$fShowInt v begin pivot
          (GHC.Types.I# ...)
```
In the fast version, `myread` and `partitionLoop2` are specialized in this pass. I noticed two other differences:

- The fast version floats `$dMVector` to a top-level binding.
- The fast version specializes to `Mutable Vector (PrimState IO) Int` instead of `MVector (PrimState IO) Int`. Note that `Mutable` is a type family and `Mutable Vector = MVector`.
comment:9 follow-up: 10 Changed 16 months ago by
Thanks for the simpler example. I think I now understand what is going on. It's all to do with strictness analysis.
Consider this function, and suppose it is strict in `x`:

```haskell
f1 :: Int -> blah
f1 x = ...(case x of I# y -> ...)...
```
The strictness analysis sees that `f1` is strict in `x`, and we get a w/w split:

```haskell
f1 :: Int -> blah
f1 x = case x of I# y -> $wf1 y

$wf1 :: Int# -> blah
$wf1 y = let x = I# y in ...original body of f1...
```
Now suppose that we have
```haskell
type family F a
type instance F Bool = Int

f2 :: F Bool -> blah
f2 x = ...same as before...
```
In fact the strictness analysis still sees that `f2` is strict in `x`, and worker/wrapper works too, all by using the family instances. We get this:

```haskell
f2 :: F Bool -> blah
f2 x = case (x |> g) of I# y -> $wf2 y

$wf2 :: Int# -> blah
$wf2 y = let x = (I# y) |> sym g in ...as before...
```
Here `g` is a coercion `g :: F Bool ~ Int`, constructed from the family instances. This coercion is generated by `topNormaliseType_maybe`, itself called from `deepSplitCprType_maybe` in `WwLib`, during worker/wrapper generation.
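A runnable source-level counterpart of the `f2` sketch above, with `blah` instantiated to `Int` (an assumption made here for concreteness):

```haskell
{-# LANGUAGE TypeFamilies #-}

type family F a
type instance F Bool = Int

-- 'F Bool' normalises to 'Int' via the top-level family instance,
-- so the demand analyser sees a strict Int argument and the usual
-- worker/wrapper split on Int# goes through.
f2 :: F Bool -> Int
f2 x = x + x
```

Since `F Bool` and `Int` are equal types, callers can pass a plain `Int` directly.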
But it's harder if the coercion is purely local. Let's start with this (yes, I know you can't write `(~#)` in source Haskell, but imagine this is Core):

```haskell
f3 :: forall a. (a ~# Int) => a -> blah
f3 a (g :: a ~# Int) (x :: a) = ...as before...
```
What we'd like is this:
```haskell
f3 :: forall a. (a ~# Int) => a -> blah
f3 a (g :: a ~# Int) (x :: a)
  = case (x |> g) of I# y -> $wf3 a g y

$wf3 :: forall a. (a ~# Int) => Int# -> blah
$wf3 a g y = let x = (I# y) |> sym g in ...as before...
```
but alas, neither the demand analyser nor the worker/wrapper generator is clever enough to do this. We need a kind of `normaliseType` that can take some local "givens" as well as the top-level family instance envs.
This has the same ring as something Ryan wanted in the coverage checker.
It's not obvious exactly what "should work". E.g. what about `(Int ~# a)`, or even `([Int] ~# [a])`?
Finally, there is the question of `(~)` vs `(~#)`. I think we are very aggressive about unboxing `(~)` constraints, so I'd like this:
```haskell
f4 :: forall a. (a ~ Int) => a -> blah
f4 a (g :: a ~ Int) (x :: a)
  = case g of MkTwiddle (g2 :: a ~# Int) ->
      case (x |> g2) of I# y -> $wf4 a g2 y

$wf4 :: forall a. (a ~# Int) => Int# -> blah
$wf4 a g2 y = let x = (I# y) |> sym g2
                  g = MkTwiddle g2
              in ...as before...
```
Making all this happen will take some work though.
How important is it? I'm tempted to say "don't write types like that"! (Although the type checker goes to some trouble to support them well.)
comment:10 Changed 16 months ago by
Replying to simonpj:
How important is it? I'm tempted to say "don't write types like that"!
From the user's perspective, I can put forward two use cases:
- Type-level let bindings

  `(..., VG.Mutable v ~ vm) => ... -> vm (PrimState m) e -> ...`

  From the issue description. This is essentially a type-signature `let`-binding, allowing me to write the type signature more clearly/legibly, especially if `vm` is used multiple times.
- Refactorings

  For the code in comment 4, `(val ~ Int) =>`: this arose from a refactoring where I replaced `Int` by `val` across a large code base, and then in the top-level-ish function set `val ~ Int` to check whether the performance was unimpacted.
I'd probably be OK without this being fixed, now that I know what's going on. But if it's not fixed, we somehow have to do a much better job of warning or educating people that you can't just apply a substitution principle when "cleaning up" type signatures with `~`. I incorrectly assumed this would be type-level only, and as a result had to start a multi-day investigation of where my performance went, because I expected any mistake on my side but not this. This is especially true when writing code against vector, whose type-family-heavy API, in my opinion, almost begs for `~` to be used to achieve readable signatures.
I can reproduce this but I had to delete the criterion dependency in the cabal file.
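The "type-level let" use case above can be illustrated with a small self-contained type family; the names here (`Token`, `consToken*`) are hypothetical stand-ins for vector's `Mutable`:

```haskell
{-# LANGUAGE TypeFamilies #-}

-- A hypothetical family standing in for vector's 'Mutable':
type family Token s
type instance Token [c] = c

-- Direct family application in the signature (the "fast" style):
consTokenDirect :: s -> Token s -> [Token s]
consTokenDirect _ x = [x]

-- Naming 'Token s' once via (~), a type-level let (the "slow" style):
consTokenLet :: (Token s ~ t) => s -> t -> [t]
consTokenLet _ x = [x]
```

Both signatures are accepted and mean the same thing to the type checker; this ticket is precisely about them not being equivalent to the optimiser.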