Opened 12 years ago
Last modified 12 months ago
#1600 new task
Optimisation: CPR the results of IO
Reported by: | simonmar | Owned by: | |
---|---|---|---|
Priority: | lowest | Milestone: | |
Component: | Compiler | Version: | 6.6.1 |
Keywords: | CPRAnalysis | Cc: | pho@…, mail@…, maoe@…, akio, osa1, maurerl@…, nomeata, michalt |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | Runtime performance bug | Test Case: | |
Blocked By: | Blocking: | ||
Related Tickets: | #8598 | Differential Rev(s): | Phab:D4244 |
Wiki Page: |
Description
GHC currently cannot unbox the result of a function in the IO monad. For example:
```haskell
facIO :: Int -> IO Int
facIO n = if n < 2 then return 1
                   else do n' <- facIO (n-1); return (n*n')
```
The `Int` argument is unboxed fine, but not the result. It ought to be possible to do this: the CPR analysis needs to somehow look inside the unboxed pair returned by IO-monadic code.
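To make the goal concrete, here is a sketch (mine, not from the ticket) of the worker one would hope CPR could produce, written as user code with unboxed tuples; the name `facIO#` and the use of modern primop types (comparisons returning `Int#`) are assumptions:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (Int#, State#, RealWorld, (<#), (-#), (*#))

-- Hypothetical worker: both the Int argument and the Int result are
-- unboxed, so the recursive loop allocates no Int boxes at all.
facIO# :: Int# -> State# RealWorld -> (# State# RealWorld, Int# #)
facIO# n s =
  case n <# 2# of
    1# -> (# s, 1# #)                     -- n < 2: return 1, unboxed
    _  -> case facIO# (n -# 1#) s of      -- recurse on n-1
            (# s', r #) -> (# s', n *# r #)
```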
Change History (84)
comment:1 Changed 11 years ago by
Architecture: | Unknown → Unknown/Multiple |
---|
comment:2 Changed 11 years ago by
Operating System: | Unknown → Unknown/Multiple |
---|
comment:3 Changed 11 years ago by
Milestone: | 6.10 branch → 6.12 branch |
---|
comment:4 Changed 11 years ago by
comment:5 Changed 10 years ago by
Cc: | pho@… added |
---|
comment:6 Changed 10 years ago by
difficulty: | Moderate (1 day) → Moderate (less than a day) |
---|
comment:7 Changed 10 years ago by
Milestone: | 6.12 branch → 6.12.3 |
---|
comment:8 Changed 10 years ago by
Type of failure: | → Runtime performance bug |
---|
comment:9 Changed 9 years ago by
Milestone: | 6.12.3 → 6.14.1 |
---|---|
Priority: | normal → low |
comment:10 Changed 9 years ago by
Milestone: | 7.0.1 → 7.0.2 |
---|
comment:11 Changed 9 years ago by
Milestone: | 7.0.2 → 7.2.1 |
---|
comment:12 Changed 8 years ago by
Milestone: | 7.2.1 → 7.4.1 |
---|
comment:13 Changed 8 years ago by
Milestone: | 7.4.1 → 7.6.1 |
---|---|
Priority: | low → lowest |
comment:14 Changed 7 years ago by
Milestone: | 7.6.1 → 7.6.2 |
---|
comment:15 Changed 7 years ago by
Time to revisit this optimization.
Is this more of a feature request, though?
comment:16 Changed 6 years ago by
Cc: | mail@… added |
---|
comment:17 Changed 6 years ago by
Looking for examples of code where nested CPR could, would, and should work, I found this ticket. At first glance, this looks good. But isn't it the case that a usage like

```haskell
main = do _ <- facIO 1000
```

will not do any calculation (besides counting down from 1000, and allocating a lot of thunks)?
This shows that the transformation wanted here cannot be had without either offering two variants of the function (which would inflate code size), or an analysis establishing that calculating the factorial, even when it is not needed, will still always be better than allocating the thunks.
comment:18 Changed 6 years ago by
Maybe this code is a slightly better example for something where we want to avoid the allocations:
```haskell
facIO :: Int -> IO Int
facIO n = if n < 2 then return 1
                   else do n' <- facIO (n-1); return $! n*n'
```
The worker Core currently (7.6.3) looks like this:
```haskell
FactIO.$wa [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Types.Int #)
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
FactIO.$wa =
  \ (ww_snv :: GHC.Prim.Int#)
    (w_snx :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    case GHC.Prim.<# ww_snv 2 of _ {
      GHC.Types.False ->
        case FactIO.$wa (GHC.Prim.-# ww_snv 1) w_snx of _ {
          (# ipv_amL, ipv1_amM #) ->
            case ipv1_amM of _ {
              GHC.Types.I# y_amd ->
                (# ipv_amL, GHC.Types.I# (GHC.Prim.*# ww_snv y_amd) #)
            }
        };
      GHC.Types.True -> (# w_snx, lvl_roe #)
    }
```
and there is a very obvious packing/unpacking going on.
Nested CPR should probably first be able to fix this case, and then venture into detecting cheap-and-total code (as required for the variant without `$!`) beyond constructors.
comment:19 Changed 6 years ago by
I'm working on nested CPR now. There are a few related tickets (#1885 and #2289), but I'll report my progress here, mostly because this ticket has the lowest number and the fine `factIO` example.
Building on a patch from SPJ, I now have a branch that works in simple cases. I have some doubts about correctness (e.g. `bothDmdType` is used both for type applications and cases, but I believe these need to treat the question of convergence differently).
Also, it does not work for the `factIO` example yet, probably because there is an unboxed tuple to begin with. Need to investigate.
The testsuite goes through (reporting, as it seems, only differences in, well, demand signatures), but nofib's `typecheck` fills the memory quickly, so the analysis clearly is not conservative yet. Need to investigate.
comment:20 Changed 6 years ago by
Analysing the `typecheck` divergence yields a very interesting example of how nested CPR can go wrong, which I'd like to document here.
It completely wreaks havoc with this innocent function:
```haskell
repeat :: x -> [x]
repeat x = x : repeat x
```
(This requires nesting of sum types, but for this example just imagine that `[]` were a stream data type with only one constructor, if you want; a sketch follows below.)
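To make the parenthetical concrete, here is a minimal stand-in (my illustration, not from the ticket): a single-constructor stream type for which the same question arises without any sum-type nesting:

```haskell
-- Hypothetical single-constructor replacement for [], as the
-- parenthetical above suggests; repeatS has the same shape as repeat.
data Stream a = Cons a (Stream a)

repeatS :: a -> Stream a
repeatS x = Cons x (repeatS x)
```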
The analyser figures out that its demand signature is `DmdType <L,U>tm2(d,tm2(d,tm2(d,d)))`, i.e. it uses its argument lazily and will, with guaranteed convergence, produce a `:` constructor; evaluating the second parameter thereof will also converge to a `:`, and so on. The signature could actually be infinite; my code cuts it off at a certain depth. This signature is certainly correct.
So it seems this is eligible for a worker/wrapper transformation. But when we do it, we end up with (using non-Core patterns for clarity):
```haskell
$wrepeat :: x -> (# x, x, x, [x] #)
$wrepeat x = case x : repeat x of
               (x1:x2:x3:r) -> (# x1, x2, x3, r #)

repeat :: x -> [x]
repeat x = case $wrepeat x of
             (# x1, x2, x3, r #) -> (x1:x2:x3:r)
```
and now this diverges.
I’m not entirely sure who is at fault here. Since the analysis yields a correct result, most likely the w/w transformation. But that does not seem to have enough information to prevent this. Needs more thinking.
comment:21 Changed 6 years ago by
One solution, which seems to make sense, is to make the analysis a little less precise and more careful with recursive definitions. If I make sure that an `Id` that is a `LoopBreaker` never has a `DmdResult` that `Converges` (by lub'ing it with `Diverges`), both the example above and nofib's typecheck program compile.
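A self-contained toy model of that trimming step (the types here are simplified stand-ins, not GHC's actual `DmdResult` machinery from the branch):

```haskell
-- Toy model: a result either definitely converges or may diverge.
data DmdResult = Converges | MayDiverge
  deriving (Eq, Show)

-- The least upper bound is only "definitely converges" when both
-- sides are.
lubDmdResult :: DmdResult -> DmdResult -> DmdResult
lubDmdResult Converges Converges = Converges
lubDmdResult _         _         = MayDiverge

-- For a loop breaker, forget any convergence claim by lub'ing with
-- divergence, so the recursive knot cannot justify eager unboxing.
trimLoopBreaker :: Bool  -- ^ is the binder a loop breaker?
                -> DmdResult -> DmdResult
trimLoopBreaker isLoopBreaker res
  | isLoopBreaker = res `lubDmdResult` MayDiverge
  | otherwise     = res
```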
comment:22 follow-up: 23 Changed 6 years ago by
Great example! Graham Hutton would be interested, since he wrote a paper about worker/wrapper (JFP 2009).
A neat approach might be to say that, when doing the fixpoint on a recursive group, trim off any nested CPR info from the demand signatures we use when processing the RHSs.
Simon
comment:23 Changed 6 years ago by
Replying to simonpj:
A neat approach might be to say that, when doing the fixpoint on a recursive group, trim off any nested CPR info from the demand signatures we use when processing the RHSs.
Wouldn't that be far too imprecise? That would completely prevent nested CPR in loops, like `factIO` above...
Never mind; you mean we should trim the signature on the binders when analysing their RHSs. But isn't that almost what I suggest with the `LoopBreaker`?
comment:24 Changed 6 years ago by
Ran nofib for the first time; mixed results:
```
binary-trees     -1.3%  +0.6%  +7.7%  +7.5%  +0.0%
hidden           -1.3%  -0.5%  -0.3%  -0.5%  +0.0%
mandel           -1.2%  -9.2%   0.10   0.10  +0.0%
nucleic2         -1.0%  -2.6%   0.11   0.11  +0.0%
reverse-complem  -1.3%  +4.9%   0.21   0.22  +0.0%
--------------------------------------------------------------------------------
Min              -1.4%  -9.2%  -6.3%  -6.6%  -4.5%
Max              -1.0%  +4.9%  +7.7%  +7.5%  +0.0%
Geometric Mean   -1.3%  -0.0%  +0.5%  +0.6%  -0.1%
```
The gain in mandel seems to come from a few more unboxed results in `Data.Complex`; in particular, the return types of `$w$sphase` and `$w$smagnitude` turn from `Double` to `Double#` (and likewise the `Float` specialisations).
And why the new opportunity for that? Because `$fFloatingComplex` (which is a CAF for `D# 0.0`) and other constants are inlined (and thus no longer shared). Not sure why that part changed; it seems this is not related to nested CPR...
comment:25 Changed 6 years ago by
Owner: | set to nomeata |
---|
Interesting corner case: with nested CPR enabled, `GHC.TopHandler` in base fails to compile with:

```
ghc-stage1: panic! (the 'impossible' happened)
  (GHC version 7.7.20131203 for x86_64-unknown-linux):
        mkWWcpr: non-algebraic or open body type a{tv a28d} [tv] but CPR type tm1()
```
The problem is this code:
```haskell
case {__pkg_ccall_GC base shutdownHaskellAndExit
        GHC.Prim.Int#
        -> GHC.Prim.Int#
        -> GHC.Prim.State# GHC.Prim.RealWorld
        -> (# GHC.Prim.State# GHC.Prim.RealWorld #)}_d2VI
     255 ds_d2VF eta_XG
of _ [Occ=Dead, Dmd=<L,A>] {
  (# ds_d2VG [OS=OneShot] #) -> (# ds_d2VG, GHC.Tuple.() #)
```
stemming from
```haskell
foreign import ccall "shutdownHaskellAndSignal"
  shutdownHaskellAndSignal :: CInt -> CInt -> IO ()
```
which is then `unsafeCoerce#`'d to `a`, with this explanatory note:

```haskell
-- we have to use unsafeCoerce# to get the 'IO a' result type, since the
-- compiler doesn't let us declare that as the result type of a foreign export.
```
There are a few ways to attack this issue:

1. We allow `a` in the return type of a foreign export (but maybe overkill to do that just for this code).
2. We try to forget CPR information when things pass through `unsafeCoerce#` (but that is just a `cast` in Core, and may be hard to detect reliably. Maybe every coercion that has a `UnivCo representational` inside?)
3. We rewrite that code in base, e.g. instead of `unsafeCoerce` use `... >> error "I’m still alive?"` (a sketch follows below).
4. We simply turn the `panic` into a warning.
I guess 2 makes the most sense, because we don't want other instances of `unsafeCoerce#` to cause this error either. For now I'll do 4, so as not to get stuck compiling base.
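For what option 3 might look like, here is a sketch (mine, assuming the foreign call never returns; the function name and its plumbing are illustrative, not the actual base code):

```haskell
import Foreign.C.Types (CInt)

foreign import ccall "shutdownHaskellAndExit"
  shutdownHaskellAndExit :: CInt -> CInt -> IO ()

-- Hypothetical rewrite: keep the honest 'IO ()' foreign result type
-- and produce the polymorphic result with 'error', instead of
-- unsafeCoerce#'ing 'IO ()' to 'IO a'.
exitAndDie :: CInt -> IO a
exitAndDie code = do
  shutdownHaskellAndExit code 0
  error "shutdownHaskellAndExit returned unexpectedly"
```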
comment:26 Changed 6 years ago by
Related Tickets: | → #8598 |
---|
Also see ticket #8598 (CPR after IO action), which is not directly related to nested CPR, but will become more relevant when we want to do nested CPR inside `IO`, such as in this example.
comment:27 Changed 6 years ago by
Finally I got the code into a shape good enough for controlled experiments.
I started with a merge of the branch `better-ho-cardinality` and master (at [4025d66/ghc]). This is my baseline.
The next measurement is with the nested CPR analysis, but without changing the worker/wrapper code, i.e. the CPR information is only used as far as before. Here is the result (skipping lines with `+0.0%`):
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          +0.0%     +0.6%     +0.3%     +0.6%     -3.4%
--------------------------------------------------------------------------------
            Min          +0.0%     -0.0%     -8.9%     -8.7%     -3.4%
            Max          +0.1%     +0.6%     +9.1%     +9.1%     +3.0%
 Geometric Mean          +0.0%     +0.0%     -0.4%     -0.4%     -0.0%
```
I did not look at `cacheprof`, but I am confident that I did not break anything serious by introducing nested CPR to the analysis.
The next measurement is with the worker/wrapper code handling nested CPR, but using `-fnested-cpr-off` to make sure that all CPR information is actually what we had before. I used this to ensure that my `mkWWcpr` implementation did not regress over the old one (it turned out that I did mess up in various ways, so this was important):
```
--------------------------------------------------------------------------------
      cacheprof          +0.0%     -0.6%     -0.3%     -0.6%     +1.8%
--------------------------------------------------------------------------------
            Min          +0.0%     -0.6%     -1.4%     -1.7%     +0.0%
            Max          +0.0%     +0.0%     +2.3%     +2.4%     +1.8%
 Geometric Mean          -0.0%     -0.0%     +0.1%     +0.1%     +0.0%
```
And now the exciting part: the same code, but now without `-fnested-cpr-off`:
```
--------------------------------------------------------------------------------
           anna          +0.5%     +0.1%      0.18      0.19     +0.0%
         boyer2          +0.5%     +0.4%      0.01      0.01     +0.0%
           bspt          +1.1%     +0.2%      0.01      0.01     +0.0%
      cacheprof          +0.5%     -0.7%     +0.6%     +1.2%     -2.6%
       calendar          +0.5%     +0.1%      0.00      0.00     +0.0%
  comp_lab_zift          +0.5%     +0.1%     +0.0%     +0.0%     +0.0%
          infer          +0.4%     -1.2%      0.09      0.09     +0.0%
           para          +0.4%     +0.2%     -1.5%     -1.5%     +0.0%
         prolog          +0.5%     +0.2%      0.00      0.00     +0.0%
        reptile          +0.7%     +0.3%      0.02      0.02     +0.0%
--------------------------------------------------------------------------------
            Min          +0.4%     -1.2%     -8.0%     -8.0%     -4.4%
            Max          +1.1%     +0.4%     +6.9%     +6.9%     +4.2%
 Geometric Mean          +0.5%     -0.0%     +0.0%     -0.0%     -0.1%
```
So most of the 101 benchmarks are not affected at all by nested CPR (in its current, preliminary form), besides a quite reliable increase in binary sizes of +0.5% (does Size include `.hi` files?). Where it changes the Allocs number, the effect is small and indecisive.
It may be that fixing #8598 would improve matters (or at least allow nested CPR to occur more often; whether for the better or for the worse, I don't know). I might look at that next, but probably off master, as it is an independent feature.
Of course it is also quite likely that the nested CPR analysis needs more tuning by looking at code where we want it to fire.
comment:28 Changed 6 years ago by
The change in allocations in `cacheprof` is weird: the output of `-ddump-simpl` is identical (ignoring unique numbers and strictness annotations). So there is a change hidden in the libraries somewhere... hopefully nothing to worry about.
comment:29 Changed 6 years ago by
Found one possible reason for the allocation increase: boyer2 was losing a let-no-escape, because the insight from `Note [CPR for sum types]` was not applied inside a nested CPR result. Fixed that now; nofib runs are in progress.
comment:30 Changed 6 years ago by
Unfortunately, that was not it. The only visible change left is that `addtoLUT` in `Lisplikefns` is getting nestedly CPR'ed, without real gain (but also without loss, one would think).
There is one other lead: with nested CPR, lots of `show` methods get a nested CPR property. E.g. the show method for `(,)` now has a return demand of `tm2(tm(d),d)` (instead of `m2` as before), so the worker returns `(# GHC.Prim.Char#, [GHC.Types.Char] #)` instead of `(# GHC.Types.Char, [GHC.Types.Char] #)`, likely causing re-boxing of the `'('` character.
I’ll disable nested CPR information inside a sum type to confirm this theory.
comment:31 Changed 6 years ago by
It helps, i.e. it removes all increases in Allocs (I don't trust `cacheprof`; the allocations there seem to vary even from run to run):
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          +0.0%     +0.2%     +1.5%     +0.9%     -1.8%
         gamteb          +0.1%     -0.2%      0.06      0.06     +0.0%
          infer          +0.0%     -1.2%      0.09      0.09     +0.0%
            pic          +0.0%     -0.6%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -1.2%     -5.2%     -5.2%     -4.0%
            Max          +0.6%     +0.2%     +3.3%     +3.0%     +9.1%
 Geometric Mean          +0.0%     -0.0%     +0.1%     +0.1%     +0.0%
```
So on the one hand: nice, no regression due to nested CPR, and some improvements (although very minor – is that even significant?).
But the “fix” is not well-targeted; it is rather a heuristic. Unfortunately, I don't see anything smarter to do here unless we do whole-program compilation, or provide multiple implementations of the same function with varying degrees of CPR'ness.
An alternative, equally ill-aimed fix would be to zap the CPR property for all top-level constants (and not just for thunks). This would make CPR much more robust against allocation increases, but it would prevent a lot of CPR where we really want it, i.e. where the constant is on the cold path and the unshared `Int#` cell is a price worth paying for an unboxed result type (as in a naive `fac` function).
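For reference, the naive `fac` meant here is presumably something like the following (my reconstruction, not code from the ticket); the constant `1` sits on the rarely taken base-case path, so an unboxed `Int#` result is a win even though the boxed `1` is no longer shared:

```haskell
-- Naive factorial: CPR gives the worker an unboxed Int# result; the
-- base case's constant 1 is on the cold path, so losing its sharing
-- is a price worth paying.
fac :: Int -> Int
fac n = if n < 2 then 1 else n * fac (n - 1)
```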
comment:32 Changed 6 years ago by
Here is another reason why nested CPR is not very successful: the requirement of definite termination is not easy to meet. One would think that the extended Euclidean algorithm is a good candidate for nested CPR:
```haskell
extendedEu2 :: Int -> Int -> (Int, Int)
extendedEu2 a 0 = (1, 0)
extendedEu2 a b = (t, s - q * t)
  where (q, r) = quotRem a b
        (s, t) = extendedEu2 b r
```
but it is not: with a return type of `dm(d,dm(d))` we cannot do more than unbox the tuple. With a few iterations between the code and the Core, one can find a strictness annotation that makes it work: with
```haskell
extendedEu :: Int -> Int -> (Int, Int)
extendedEu a 0 = (1, 0)
extendedEu a b = let b' = s - q * t
                 in b' `seq` (t, b')
  where (q, r) = quotRem a b
        (s, t) = extendedEu b r
```
we infer `dm(tm(d),tm(d))` and the worker gets type `GHC.Types.Int -> GHC.Prim.Int# -> (# GHC.Prim.Int#, GHC.Prim.Int# #)`.
So nested CPR will likely help those who are careful to use strictness annotations and strict data types (which some people do almost exclusively; see the sketch below), but not much with the usual lazy programming.
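To illustrate that last point, here is a strict-field variant (my sketch, not from the ticket): with a strict pair type, the forcing comes from the data type itself and no manual `seq` is needed:

```haskell
-- Hypothetical strict result pair: both fields are forced at
-- construction, so a signature comparable to dm(tm(d),tm(d)) should
-- be derivable without the seq trick above.
data P = P !Int !Int

extendedEuP :: Int -> Int -> P
extendedEuP a 0 = P 1 0
extendedEuP a b = P t (s - q * t)
  where (q, r) = quotRem a b
        P s t  = extendedEuP b r
```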
comment:33 Changed 6 years ago by
The numbers are a bit more interesting if I enable nested CPR inside unboxed tuples, i.e. in code involving IO or ST:
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
         banner          +2.1%     +0.1%      0.00      0.00     +0.0%
      compress2          +2.0%     +0.1%    +11.5%     +9.8%    +20.8%
         expert          +2.1%     +0.1%      0.00      0.00     +0.0%
       fibheaps          +2.0%     -0.3%      0.05      0.05     +0.0%
          fluid          +2.5%     +0.1%      0.01      0.01     +0.0%
         gamteb          +2.1%     -0.2%      0.06      0.06     +0.0%
           grep          +2.0%     +0.1%      0.00      0.00     +0.0%
          infer          +2.0%     -1.2%      0.09      0.09     +0.0%
   k-nucleotide          +1.5%     -6.9%     +0.1%     +0.2%     +0.0%
       maillist          +2.1%     -0.3%      0.10      0.12     +0.0%
            pic          +2.2%     -0.5%      0.01      0.01     +0.0%
           rfib          +2.1%     +0.1%      0.03      0.03     +0.0%
            scs          +2.3%     +0.2%     +1.0%     +1.4%     +0.0%
            tak          +2.1%     -0.1%      0.02      0.02     +0.0%
       treejoin          +2.1%     +0.1%     +0.0%     +0.0%     +0.0%
      wave4main          +2.1%    +11.3%     -0.5%     +0.0%     -7.1%
--------------------------------------------------------------------------------
            Min          +1.5%     -6.9%    -13.2%    -13.3%    -33.3%
            Max          +2.5%    +11.3%    +16.5%    +16.0%    +20.8%
 Geometric Mean          +2.0%     +0.0%     +0.2%     +0.3%     -0.3%
```
One particularly good result (`k-nucleotide`), one bad one (`wave4main`), and otherwise a slight general improvement. The change in `k-nucleotide`'s Core is too large to spot the reason for the improvement.
Diffing the `-ddump-simpl` of `wave4main` shows only one change. It stems from this function:
```haskell
tabulate :: (Int -> x) -> (Int, Int) -> Array Int x
tabulate f (l,u) = array (l,u) [(i, f i) | i <- [l..u]]
```
where in the (inlined) `array` a worker for `go` gets its return type changed from `(# GHC.Prim.State# s_aTM, GHC.Arr.Array GHC.Types.Int x_aqE #)` to `(# GHC.Prim.State# s_aTM, GHC.Prim.Int#, GHC.Prim.Int#, GHC.Prim.Int#, GHC.Prim.Array# x_aqE #)`. This looks good, but the worker is tail-recursive and the boxing in the wrapper is not cancelled at the use site of `go`, so there is nothing gained by moving the constructor applications from the worker to the wrapper.
But some isolated testing indicates that this costs 96 bytes of allocation per run, so I doubt that this is the main cause for the 11% increase; there might be something hidden in the libraries.
comment:34 Changed 6 years ago by
Trac seems to think I write too much; it just lost a somewhat long text of mine :-(.
I was explaining why nested CPR kills a let-no-escape in this code, taken from `scs`'s `LinearAlgebra`:
```haskell
v_zipWith :: (a -> b -> c) -> Array Int a -> Array Int b -> Array Int c
v_zipWith f a b
  | compatible = listArray (bounds a) (zipWith f (elems a) (elems b))
  | otherwise  = error "error"
  where compatible = bounds a == bounds b
```
Point taken, I'll be brief: CPR can kill the let-no-escape property; hence more CPR kills more of it. The problem occurs if the `$j` gets a more detailed CPR type than the expression it is part of, or if the expression is somewhere CPR w/w cannot happen (e.g. in an argument to `runSTRep`). This problem is not new, and some work-arounds for it exist in the current code (`Note [CPR for sum types]`). But maybe this needs a generally better story.
(Sidenote: inlining `runSTRep` would have helped here, but it was disabled by simonmar in 920dbbddf57ff02e0734943bb93dd4cecc5568e0/base.)
comment:35 Changed 6 years ago by
(Update to the sidenote: inlining `runSTRep` may allow it to get a CPR property, but this does not magically help with join points; also see NestedCPR/wave4main.)
comment:36 Changed 6 years ago by
(Removed post; it looks like one of the trees had a dirty `libraries/base` from my experiments with inlining `runSTRep`. Re-running nofib...)
comment:37 Changed 6 years ago by
New numbers, after uprooting a bug where things unrelated to CPR (namely things already returning an unboxed tuple, with nothing inside to be CPR'ed) would suddenly get an `INLINE` flag, including some thunks. Finally a measurable positive change in the geometric mean!
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +0.3%     -0.1%      0.00      0.00     +0.0%
      compress2          +0.6%     -0.8%      0.11      0.11     -8.0%
       fibheaps          +0.3%     -0.3%      0.03      0.03     +0.0%
         gamteb          +0.4%     -0.2%      0.04      0.04     +0.0%
           grep          +0.3%     -0.1%      0.00      0.00     +0.0%
            hpg          +0.4%     -3.0%      0.13      0.13     +0.0%
          infer          +0.3%     -1.2%      0.03      0.03     +0.0%
   k-nucleotide          +0.2%     -6.9%     -1.6%     -1.4%     +0.0%
       maillist          +0.4%     -0.8%      0.04      0.04     +0.8%
        mkhprog          +0.4%     -0.3%      0.00      0.00     +0.0%
            pic          +0.3%     -0.7%      0.00      0.00     +0.0%
         pretty          +0.4%     -0.1%      0.00      0.00     +0.0%
           rfib          +0.3%     -0.2%      0.01      0.01     +0.0%
            scc          +0.3%     -0.1%      0.00      0.00     +0.0%
  spectral-norm          +0.4%     -0.1%     +0.3%     +0.3%     +0.0%
         sphere          +0.4%     -4.7%      0.04      0.04     +0.0%
         symalg          +0.3%     -0.1%      0.01      0.01     +0.0%
            tak          +0.3%     -0.3%      0.01      0.01     +0.0%
      transform          +0.3%     +0.2%     -2.2%     -2.2%     +0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -6.9%     -5.3%    -28.2%    -11.2%
            Max          +0.7%     +0.2%     +2.1%     +2.1%    +50.0%
 Geometric Mean          +0.3%     -0.2%     -0.8%     -2.3%     +0.2%
```
Surprisingly to me, the bug fix also fixed a +11% increase in wave4main’s allocations, which I thought were caused by join-point losses.
A quick glance into `transform` shows that `f_list_cmp` lost its (non-nested) CPR property, accounting for most of the increase according to ticky. Will investigate tomorrow.
comment:38 Changed 6 years ago by
Fixed the `transform` issue (when analysing a complex case expression, we analyse the scrutinee first to feed nested CPR information into the pattern-match variables; I was not making sure that the case binder gets at least a flat CPR property even if the scrutinee has none), and we are finally where Simon expected nested CPR to be: no increased allocations any more (or at least none that are not surpassed by the gains). Still a very small effect, though.
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +0.3%     -0.1%      0.00      0.00     +0.0%
         awards          +0.3%     -0.1%      0.00      0.00     +0.0%
      compress2          +0.6%     -0.8%      0.11      0.11     -8.0%
       fibheaps          +0.3%     -0.3%      0.03      0.03     +0.0%
         gamteb          +0.4%     -0.2%      0.04      0.04     +0.0%
           grep          +0.3%     -0.1%      0.00      0.00     +0.0%
            hpg          +0.4%     -3.0%      0.13      0.13     +0.0%
          infer          +0.3%     -1.2%      0.03      0.03     +0.0%
   k-nucleotide          +0.2%     -6.9%     -0.7%     -0.5%     +0.0%
       maillist          +0.4%     -0.8%      0.04      0.04     -4.2%
        mkhprog          +0.4%     -0.3%      0.00      0.00     +0.0%
            pic          +0.3%     -0.7%      0.00      0.00     +0.0%
         pretty          +0.4%     -0.1%      0.00      0.00     +0.0%
           rfib          +0.3%     -0.2%      0.01      0.01     +0.0%
            scc          +0.3%     -0.1%      0.00      0.00     +0.0%
  spectral-norm          +0.4%     -0.1%     +0.0%     +0.0%     +0.0%
         sphere          +0.4%     -4.7%      0.04      0.04     +0.0%
         symalg          +0.3%     -0.1%      0.01      0.01     +0.0%
            tak          +0.3%     -0.3%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -6.9%     -5.3%    -27.3%     -8.0%
            Max          +0.7%     +0.0%     +1.4%     +1.4%    +50.0%
 Geometric Mean          +0.3%     -0.2%     -0.7%     -2.3%     +0.2%
```
comment:39 Changed 6 years ago by
Bummer. While finding out what happened to `wave4main`, I noticed that I had introduced a bug (well, an imprecision) where `Converges` information was lost. This prevented nested CPR in `wave4main`, and everything looked good. Fixing that bug gives us back the +11.3% for `wave4main`, and also a +0.2% for `scs`. So it all remains very heuristic...
The change in `scs` is hard to pinpoint, as there is quite a bit of CPR'ing going on which moves allocations between different ticky-ticky counters. But in the summary we see an increase in `ALLOC_FUN_gds` again, and there are `$wgo`s popping up where there were none, so this indicates lost join points. (Also interesting: `ALLOC_CON_ctr` goes up, but `ALLOC_CON_gds` goes down.)
comment:40 Changed 6 years ago by
Small status update:
My work on nested CPR has now reached a somewhat conclusive state. The performance changes are not great, but it is still nice to have, especially since users who would otherwise rewrite their code for performance would no longer have to do so.
It should not go in for 7.8 – the gains are too small, and there might still be corner-case bugs. So I plan to merge it sometime after 7.8. Until then, the branch `wip/nested-cpr` contains the changes as mostly cleaned-up, individually validated patches. I might occasionally rebase that branch to keep the conflicts small.
comment:41 Changed 6 years ago by
Cc: | maoe@… added |
---|
comment:43 follow-up: 46 Changed 5 years ago by
Owner: | nomeata deleted |
---|
I think the conclusion was that Nested CPR was not worth it. The branch is still there if someone wants to pick it up. Unassigning.
comment:44 Changed 5 years ago by
Milestone: | 7.10.1 → 7.12.1 |
---|
Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
comment:45 Changed 5 years ago by
Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
comment:46 Changed 4 years ago by
Replying to nomeata:
I think the conclusion was that Nested CPR was not worth it. The branch is still there if someone wants to pick it up. Unassigning.
Why is it not worth it?
If there is a chance that the branch can be improved to the point that it can be merged, I'm interested in trying. The lack of nested CPR has been the most frequent reason that I have to use unboxed types manually.
comment:47 Changed 4 years ago by
Cc: | akio added |
---|
comment:48 Changed 4 years ago by
I do not know how much the code changed in master... just give it a try!
comment:49 Changed 4 years ago by
Yes, do rebase `wip/nested-cpr` on master and do a like-for-like `nofib` comparison. I'd quite like to see this land.
Simon
comment:51 Changed 4 years ago by
Milestone: | 8.0.1 |
---|
comment:52 Changed 3 years ago by
I've rebased nomeata's branch: https://github.com/takano-akio/ghc/compare/afdde48...f8eb150
Some nofib results:
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
         mandel          +1.9%    -25.0%     0.047     0.047      0.0%
            gcd          +2.0%    -21.6%     0.030     0.031      0.0%
   k-nucleotide          +1.7%    -18.0%     +3.4%     +2.7%      0.0%
      wave4main          +2.0%    +12.3%     0.179     0.179      0.0%
          kahan          +2.0%     +7.8%     +1.4%     +1.3%      0.0%
          solid          +2.0%     -6.4%     0.081     0.081      0.0%
         sphere          +2.0%     +5.8%     0.040     0.040      0.0%
            hpg          +2.0%     +4.1%     0.102     0.102      0.0%
         symalg          +1.9%     +3.6%     0.007     0.007      0.0%
         gamteb          +2.4%     -4.8%     0.026     0.026      0.0%
--------------------------------------------------------------------------------
            Min          +1.5%    -25.0%     -0.6%     -0.4%    -20.0%
            Max          +2.8%    +12.3%    +12.7%    +12.8%    +12.0%
 Geometric Mean          +2.0%     -0.5%     +5.0%     +4.7%     +0.2%
```
I'll look into the allocation regressions next.
comment:53 follow-up: 54 Changed 3 years ago by
Oh, that's an old branch. I'm happy that a second pair of eyes is looking at it. How tedious was it to rebase?
comment:54 Changed 3 years ago by
Replying to nomeata:
How tedious was it to rebase?
It took me a few days, but in the process I learned a lot about nested CPR and about the demand analyzer in general.
comment:55 Changed 3 years ago by
Cc: | osa1 added |
---|
I wonder why there's a consistent 2% increase in binary size; I was not expecting that.
I have not looked at the code... when you are ready, seeing it all as one Phab diff (not lots) would be helpful.
Omer may be thinking about nested CPR, and has an implementation on a branch somewhere. This is a particular instance of nested CPR, so it'd be good to sync up with him (cc'd).
Simon
comment:56 Changed 3 years ago by
Regarding `wave4main`, the main problem seems to be the destruction of a join point.
The program contains an expression that looks like this:
```haskell
runRW# (\ s ->
  letrec go s1 = let ... in (# s2, Array a b c #)
  in go s)
```
After a nested CPR transformation, it becomes:
```haskell
runRW# (\ s ->
  letrec $wgo s1 = let ... in (# s2, a, b, c #)
  in case $wgo s of
       (# s3, a', b', c' #) -> (# s3, Array a' b' c' #))
```
destroying the join-point property of `go`. This is basically the same as what @nomeata found 2 years ago.
I see two ways to move forward:

1. Wait for SequentCore to be implemented.
2. Have special cases for `runRW#` in the CPR analyser and in the simplifier, as discussed in https://ghc.haskell.org/trac/ghc/ticket/10678#comment:10

I'm mildly interested in doing (2), but I don't think I fully understand the idea. In particular, the above comment mentions an example expression `case (runRW e) of BN# farr -> farr` that the simplifier would be able to transform. However, since `runRW#` returns an unboxed pair rather than a single boxed value, it seems that it would have to spot a much more complicated pattern to accomplish this transformation.
comment:57 Changed 3 years ago by
Cc: | maurerl@… added |
---|
Let's do (1); it's nearly ready!
Akio, Luke (now in cc), can you work together on this? Luke has a pretty stable join-point branch which never destroys join points. Best just to build on that I think.
Simon
comment:58 Changed 3 years ago by
I've rebased the branch on top of the join points work. In nofib, allocations are generally down but the difference is not very big:
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           bspt          +0.0%     +0.1%     0.007     0.007      0.0%
      compress2          +1.5%     -0.9%     0.158     0.158     -3.7%
          fluid          +1.6%     -0.2%     0.007     0.007      0.0%
          infer          +0.0%     -1.2%     0.039     0.039      0.0%
            pic          +0.1%     -0.6%     0.005     0.005      0.0%
          solid          +0.0%     -6.6%     0.159     0.159      0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -6.6%    -28.7%    -28.8%     -3.7%
            Max          +1.6%     +0.1%     +4.3%     +4.3%     +6.7%
 Geometric Mean          +0.1%     -0.1%    -14.7%    -14.5%     +0.0%
```
I'll look more carefully to see why the improvement is so small. I may have messed up something while rebasing.
comment:59 Changed 3 years ago by
Indeed. Start with one or two poster-child examples to show what the goal is; and check they work as expected. Only then go in with nofib.
Also, this is a long ticket. It would be good to write a description somewhere (on the wiki page?) of what exactly the patch does.
comment:60 Changed 3 years ago by
It seems the above results were bogus because they used libraries compiled with the wrong compiler (I had `stage=2` in build.mk; I removed it and typed `make`. Is that not sufficient to cause a full build?).
After a clean build, the allocation numbers look very good:
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      compress2          +2.0%     -0.9%     0.170     0.170     -3.7%
           fft2          +1.0%     -0.2%     0.078     0.078      0.0%
          fluid          +2.1%     -0.4%     0.006     0.006      0.0%
         gamteb          +0.4%     -4.5%     0.032     0.032      0.0%
            gcd          +0.7%    -21.4%     0.046     0.046      0.0%
          infer          -0.1%     -1.2%     0.054     0.054      0.0%
        integer          +1.0%     -1.5%    -15.0%    -15.0%      0.0%
         mandel          +1.0%    -24.4%     0.051     0.051      0.0%
           rfib          +1.0%     -0.3%     0.012     0.012      0.0%
          solid          +0.4%     -6.6%     0.130     0.131      0.0%
--------------------------------------------------------------------------------
            Min          -0.2%    -24.4%    -21.1%    -21.3%     -3.7%
            Max          +2.1%     +0.1%    +39.3%    +39.3%    +17.3%
 Geometric Mean          +0.2%     -0.7%     -4.9%     -4.9%     +0.1%
```
Should I worry about the increase in the code size? If not I'll clean up the branch and submit a Diff.
comment:61 follow-up: 62 Changed 3 years ago by
(I had stage=2 in build.mk. I removed it and typed make. Is it not sufficient to cause a full build?).
make won't necessarily rebuild your libraries, due to GHC's recompilation avoidance. It knows about changes in the source code and the flags, but it doesn't know about changes in GHC itself. So when I'm making measurements like this, I always delete the library object files manually:
```
find libraries/*/dist-install | grep 'o$' | xargs rm
```
comment:62 Changed 3 years ago by
Replying to simonmar:
make won't necessarily rebuild your libraries, due to GHC's recompilation avoidance.
Oh I see... Thank you for the tip!
comment:63 Changed 3 years ago by
Good-looking numbers, especially for mandel and gcd. Can you dig in to see what is really happening there? What does the inner loop look like?
For code size, it's worth investigating to get insight. Look at the "module sizes" in the nofib-analyse log, and take a look at modules that got significantly bigger. Is the code-size increase because of an actual optimisation, or is it just accidental?
comment:64 Changed 3 years ago by
mandel and gcd
The main difference seems to come from this change, which makes `integer-gmp` allocate less by inlining wrappers for `S#`, `Jp#`, etc. Actually, this small change seems to be responsible for much of the allocation improvement in this branch:
```
            gcd          +0.8%    -21.4%     0.039     0.040      0.0%
        integer          +1.1%     -1.5%     +5.0%     +5.1%      0.0%
         mandel          +1.1%    -24.4%     0.070     0.070      0.0%
--------------------------------------------------------------------------------
            Min          -0.0%    -24.4%    -38.7%    -38.5%      0.0%
            Max          +1.1%      0.0%    +13.7%    +13.6%    +14.4%
 Geometric Mean          +0.3%     -0.6%     -6.9%     -6.9%     +0.1%
```
Perhaps I should try to get this change merged separately from the rest of the nested CPR work.
Binary size increase
I had a brief look at `nofib/real/fluid/Jcb_method.hs`, which showed a +60% increase in binary size, but I wasn't able to figure out why the change happened. SpecConstr now seems to duplicate a big function, but I don't know what caused this (yet).
Wiki page
I think the content under wiki:NestedCPR is mostly valid, because all I have done so far is essentially just to rebase nomeata's branch on top of the join-point commit.
comment:65 Changed 3 years ago by
The main difference seems to come from this change...
Interesting! There are two changes in that commit:
- Making data con wrappers have `ug_unsat_ok = unSaturatedOk`. I buy this. Can you try the effect of this change alone? It's entirely unrelated to nested CPR. It would be good to isolate that perf bump from the perf changes due to nested CPR.
- Making data con wrappers have `ug_boring_ok = boringCxtOk`. I'm not sure I buy this in full. In `Note [Inline data constructor wrappers aggresively]` there's a claim that we get better nested-CPR info. That's true if we have a polymorphic bang argument (e.g. `data T a = MkT !a`), but for a monomorphic one like `data S = MkS !Int` we'll unpack it.
So I think the benefit comes only for polymorphic bangs. And there is a cost to aggressive inlining, so we don't want to do it gratuitously.
comment:66 Changed 3 years ago by
There are two changes in that commit:
Actually the only change is the latter one. The former change is already in HEAD, as of 2be364ac8c.
Making data con wrapper have ug_boring_ok = boringCxtOk. I'm not sure I buy this in full. In Note [Inline data constructor wrappers aggresively] there's a claim that we get better nested-CPR info.
I don't think the improvement has anything to do with nested CPR, because I'm testing this change in isolation, without any other changes from the nested CPR branch.
The big difference in nofib comes from the `GHC.Integer.Type` module in `integer-gmp`. This module has this definition:
```haskell
data Integer
  = S# !Int#
      -- ^ iff value in @[minBound::'Int', maxBound::'Int']@ range
  | Jp# {-# UNPACK #-} !BigNat
      -- ^ iff value in @]maxBound::'Int', +inf[@ range
  | Jn# {-# UNPACK #-} !BigNat
      -- ^ iff value in @]-inf, minBound::'Int'[@ range
```
Note the redundant bang on the `S#` constructor. This causes a wrapper for the constructor to be created:
```haskell
GHC.Integer.Type.$WS# =
  \ (dt [Occ=Once] :: Int#) ->
    case dt of dt { __DEFAULT -> GHC.Integer.Type.S# dt }
```
And this wrapper remains un-inlined in various places like:
```haskell
quotRemInteger =
  \ (n :: Integer) (ds :: Integer) ->
    join {
      fail1 [Dmd=<L,1*C1(U(U,U))>] :: Void# -> (# Integer, Integer #)
      [LclId[JoinId(1)], Arity=1, Str=<L,A>, Unf=OtherCon []]
      fail1 _ [Occ=Dead, OS=OneShot]
        = case n of n1 { __DEFAULT ->
          join {
            fail2 [Dmd=<L,1*C1(U(U,U))>] :: Void# -> (# Integer, Integer #)
            [LclId[JoinId(1)], Arity=1, Str=<L,A>, Unf=OtherCon []]
            fail2 _ [Occ=Dead, OS=OneShot]
              = case n1 of wild {
                  S# ds3 ->
                    case ds3 of ds4 { __DEFAULT ->
                    case ds of {
                      S# d# ->
                        case quotRemInt# ds4 d# of { (# ipv, ipv1 #) ->
                        (# GHC.Integer.Type.$WS# ipv,
                           GHC.Integer.Type.$WS# ipv1 #)
                        ...
```
leading to unnecessary allocation of thunks. It should be possible to fix this problem by doing one of the following:

1. Inline constructor wrappers like `$WS#` aggressively, at least when they are fully applied.
2. Do not create a redundant wrapper in a case like this.
3. Remove this particular bang from integer-gmp (sketched after this list).
I have tried (1) and (3), and they seem to have very similar effects. (3) is definitely the easiest to implement, but I wonder if (2) is more correct. I'd like to hear some suggestions about this.
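For concreteness, here is what option 3 amounts to (my sketch of the change, not the actual patch): on an unlifted field like `Int#` the bang is semantically redundant, but its presence is what makes GHC generate the `$WS#` wrapper seen above.

```haskell
-- Hypothetical change to GHC.Integer.Type: drop the bang on S#'s
-- field. Int# is unlifted, so the field is strict either way, and
-- without the bang no $WS# wrapper needs to be created.
data Integer
  = S#  Int#                     -- was: S# !Int#
  | Jp# {-# UNPACK #-} !BigNat
  | Jn# {-# UNPACK #-} !BigNat
```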
comment:68 follow-up: 69 Changed 3 years ago by
Akio, I've fixed that wrapper issue.
What's next on this ticket?
comment:69 Changed 3 years ago by
Replying to simonpj:
Akio, I've fixed that wrapper issue.
Thank you!
What's next on this ticket?
After rebasing the branch on top of your fix, the nofib improvements are again consistent but very small:
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            hpg          +0.2%     -1.1%     0.212     0.212      0.0%
          infer          +0.3%     -1.2%     0.054     0.054      0.0%
          solid          +0.3%     -6.6%     0.181     0.181      0.0%
         sphere          +0.3%     -1.7%     0.086     0.086      0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -6.6%    -23.2%    -23.4%    -19.2%
            Max          +0.7%     +0.1%    +16.3%    +16.2%      0.0%
 Geometric Mean          +0.3%     -0.1%     -0.1%     -0.1%     -0.3%
```
The branch can perform nested CPR in simple cases, but it fails to transform some cases I would like it to. Currently I'm trying to make `extendSigsWithLam` a bit more aggressive by allowing it to generate nested CPR information.
comment:70 Changed 3 years ago by
it fails to transform some cases I would like it to.
Can you give some specific programs that you would like to be optimised and aren't? A list of specific examples would help to make sure that everyone on this thread was talking about the same things.
comment:71 Changed 3 years ago by
I made a wiki page with a few specific examples: NestedCPR/Akio2017. I think I'm now done with updating `extendSigsWithLam`. I'm going to run nofib again and look for more improvements to make.
comment:72 Changed 3 years ago by
OK, great, thank you! Neither example on your status page uses I/O, and I think it may be there that the main gains are to be found.
comment:73 Changed 3 years ago by
simonpj: I've added two examples that use IO.
After a bit of work, I managed to make a big improvement on `x2n1`:
```
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           bspt          -0.1%     +0.1%     0.007     0.007      0.0%
      cacheprof          -0.1%     +0.1%    +28.4%    +28.4%     +0.9%
           grep          -0.1%     -0.2%     0.000     0.000      0.0%
       maillist          -0.1%     -0.3%     0.053     0.053      0.0%
            hpg          -0.1%     -0.4%     0.205     0.205      0.0%
            pic          -0.1%     -0.6%     0.007     0.007      0.0%
      compress2          +0.2%     -0.9%     0.196     0.196    -19.2%
          infer          -0.1%     -1.2%     0.069     0.069      0.0%
         sphere          -0.1%     -1.7%     0.049     0.049      0.0%
          solid          -0.1%     -6.6%     0.127     0.169      0.0%
           x2n1          -0.1%    -84.7%     0.006     0.006      0.0%
--------------------------------------------------------------------------------
            Min          -0.1%    -84.7%    -37.7%    -35.7%    -19.2%
            Max          +0.2%     +0.1%    +34.3%    +34.2%     +0.9%
 Geometric Mean          -0.1%     -1.9%     -2.6%     -1.3%     -0.2%
```
I wonder if this improvement justifies the added complexity to the compiler (about 400 lines of code added under `compiler/`). Right now I don't see any way to make the analysis/transformation more effective.
comment:74 Changed 2 years ago by
What is the status of this, akio?
I wonder if this improvement justifies the added complexity to the compiler (about 400 lines of code added under `compiler/`). Right now I don't see any way to make the analysis/transformation more effective.
Well, 2.6% improvement in runtime is quite sizeable. This sounds quite worthwhile to me. Good work!
Simon, do you agree?
comment:75 Changed 2 years ago by
I'm currently cleaning up the branch before submitting a Diff.
Well, 2.6% improvement in runtime is quite sizeable.
The runtime numbers are probably unreliable because I didn't try to keep the machine unloaded while running nofib.
comment:76 Changed 2 years ago by
Cc: | nomeata added |
---|
I still feel that this is the Right Thing To Do. The performance results are not so exciting... but it would not surprise me if there was more to be had.
I'm currently cleaning up the branch before submitting a Diff.
Great. I'd like Joachim to review your code, if he is willing; after all, he had the first go at this. And sometimes a review leads to improvements and clean ups.
On your wiki page NestedCPR/Akio2017 you speak of "nested strictness", "more aggressive worker/wrapper", "correct handling of strict constructor fields", etc., but in each case I have little idea of what you mean. Could you expand with more specifics?
Really I'd like to have a page describing the design. (E.g. do you have a convergence analyser? Could we use that for other things?) Having such a description would make it much easier to read the code.
There's a paper in here somewhere.
comment:77 Changed 2 years ago by
Let me know when the branch is cleaned up. I can have a look (although I have probably forgotten everything since then…).
comment:79 Changed 2 years ago by
I rebased the branch here: https://github.com/mpickering/ghc/tree/nested-cpr
I get 7 test failures when running the simplCore tests.
There is also a Core Lint error when running a fresh build. I put it up on Phab as well for easier viewing.
comment:80 Changed 21 months ago by
Differential Rev(s): | → Phab:D4244 |
---|
comment:81 Changed 19 months ago by
Cc: | michalt added |
---|
comment:82 Changed 13 months ago by
Sebastian says that this patch got parked for a few reasons. Anyone interested in picking up this ticket can find a source of ideas in the linked differential: https://phabricator.haskell.org/D4244
He says:

- I wasn't sure I understood everything that was going on in that patch.
- There were several 'improvements' independent of nested CPR that we wanted to test in isolation (#15001, D4563, D4565).
- I felt strongly against complicating demand analysis any further. I remember there was this horrible special case for `case` expressions (link) that IMO shows that CPR analysis, as opposed to Usage/Strictness analysis, is a forward analysis. Also there's this 'virgin run', which basically means we run the demand analyser twice anyway, because CPR relies on strictness info being available. That's why I strongly argue that we should split off CPR analysis first before implementing nested CPR.
comment:83 Changed 13 months ago by
Thanks Matthew. If anyone wants to pick this up I'd be happy to help. But I'm too swamped to lead on it.
comment:84 Changed 12 months ago by
Keywords: | CPRAnalysis added |
---|
See also #2387 and #2289