#7994 closed bug (fixed)
Make foldl into a good consumer
Reported by: | simonpj | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Compiler | Version: | 7.6.3 |
Keywords: | Cc: | tkn.akio@… | |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | None/Unknown | Test Case: | |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Rev(s): | ||
Wiki Page: |
Description
I really want foldl
to be a good consumer, but our arity/cardinality analysis still isn't up to it. Here's a test case, derived from nofib's x2n1
:
module Foo( foo ) where import Data.Complex foo x = sum [f n | n <- [1 .. x]] f :: Int -> Complex Double {-# NOINLINE f #-} f n = mkPolar 1 ((2*pi)/fromIntegral n) ^ n
With the patch below (which is what I'd like to use), we get very obviously bad code:
Foo.foo :: GHC.Types.Int -> Data.Complex.Complex GHC.Types.Double Foo.foo = \ (x_aia :: GHC.Types.Int) -> case x_aia of _ { GHC.Types.I# y_aye -> case GHC.Prim.># 1 y_aye of _ { GHC.Types.False -> letrec { go_aD8 [Occ=LoopBreaker] :: GHC.Prim.Int# -> Data.Complex.Complex GHC.Types.Double -> Data.Complex.Complex GHC.Types.Double go_aD8 = \ (x_aD9 :: GHC.Prim.Int#) -> let { ds_doR [Lbv=OneShot] :: Data.Complex.Complex GHC.Types.Double -> Data.Complex.Complex GHC.Types.Double ds_doR = case GHC.Prim.==# x_aD9 y_aye of _ { GHC.Types.False -> go_aD8 (GHC.Prim.+# x_aD9 1); GHC.Types.True -> GHC.Base.id @ (Data.Complex.Complex GHC.Types.Double) } } in let { ds_aCs :: Data.Complex.Complex GHC.Types.Double ds_aCs = Foo.f (GHC.Types.I# x_aD9) } in \ (ds2_aCu :: Data.Complex.Complex GHC.Types.Double) -> ds_doR (Data.Complex.$fFloatingComplex_$s$c+ ds2_aCu ds_aCs); } in go_aD8 1 (Data.Complex.:+ @ GHC.Types.Double (GHC.Types.D# 0.0) Data.Complex.$fFloatingComplex1); GHC.Types.True -> Data.Complex.:+ @ GHC.Types.Double (GHC.Types.D# 0.0) Data.Complex.$fFloatingComplex1 } }
The local go
function should have arity 2.
The patch below is the one I'd like to apply to base
:
simonpj@cam-05-unx:~/code/HEAD/libraries/base$ git diff diff --git a/Data/List.hs b/Data/List.hs index e7e8602..a2e7ac0 100644 --- a/Data/List.hs +++ b/Data/List.hs @@ -1,5 +1,5 @@ {-# LANGUAGE Trustworthy #-} -{-# LANGUAGE CPP, NoImplicitPrelude, MagicHash #-} +{-# LANGUAGE CPP, NoImplicitPrelude, ScopedTypeVariables, MagicHash #-} ----------------------------------------------------------------------------- -- | @@ -995,11 +995,15 @@ unfoldr f b = -- ----------------------------------------------------------------------------- -- | A strict version of 'foldl'. -foldl' :: (b -> a -> b) -> b -> [a] -> b +foldl' :: forall a b. (b -> a -> b) -> b -> [a] -> b #ifdef __GLASGOW_HASKELL__ +{-# INLINE foldl' #-} +foldl' k z xs = foldr (\(v::a) (fn::b->b) (z::b) -> z `seq` fn (k z v)) (id :: b -> b) xs z +{- foldl' f z0 xs0 = lgo z0 xs0 where lgo z [] = z lgo z (x:xs) = let z' = f z x in z' `seq` lgo z' xs +-} #else foldl' f a [] = a foldl' f a (x:xs) = let a' = f a x in a' `seq` foldl' f a' xs @@ -1022,6 +1026,17 @@ foldl1' _ [] = errorEmptyList "foldl1'" -- ----------------------------------------------------------------------------- -- List sum and product +-- | The 'sum' function computes the sum of a finite list of numbers. +sum :: (Num a) => [a] -> a +-- | The 'product' function computes the product of a finite list of numbers. +product :: (Num a) => [a] -> a + +{-# INLINE sum #-} +sum = foldl (+) 0 +{-# INLINE product #-} +product = foldl (*) 1 + +{- {-# SPECIALISE sum :: [Int] -> Int #-} {-# SPECIALISE sum :: [Integer] -> Integer #-} {-# INLINABLE sum #-} @@ -1048,6 +1063,7 @@ product l = prod l 1 prod [] a = a prod (x:xs) a = prod xs (a*x) #endif +-} -- ----------------------------------------------------------------------------- -- Functions on strings diff --git a/GHC/List.lhs b/GHC/List.lhs index 049aa2a..87c93ae 100644 --- a/GHC/List.lhs +++ b/GHC/List.lhs @@ -1,6 +1,6 @@ \begin{code} {-# LANGUAGE Trustworthy #-} -{-# LANGUAGE CPP, NoImplicitPrelude, MagicHash #-} +{-# LANGUAGE CPP, NoImplicitPrelude, ScopedTypeVariables, MagicHash #-} {-# OPTIONS_HADDOCK hide #-} ----------------------------------------------------------------------------- @@ -179,11 +179,17 @@ filterFB c p x r | p x = x `c` r -- can be inlined, and then (often) strictness-analysed, -- and hence the classic space leak on foldl (+) 0 xs +foldl :: forall a b. (b -> a -> b) -> b -> [a] -> b +{-# INLINE foldl #-} +foldl k z xs = foldr (\(v::a) (fn::b->b) (z::b) -> fn (k z v)) (id :: b -> b) xs z + +{- foldl :: (b -> a -> b) -> b -> [a] -> b foldl f z0 xs0 = lgo z0 xs0 where lgo z [] = z lgo z (x:xs) = lgo (f z x) xs +-} -- | 'scanl' is similar to 'foldl', but returns a list of successive -- reduced values from the left:
Change History (23)
comment:1 Changed 6 years ago by
Cc: | tkn.akio@… added |
---|
comment:2 Changed 6 years ago by
comment:3 Changed 6 years ago by
So, this is what we want to happen, somehow.
We want to transform code of the form
let f = λ x. let h = f (g1 x) in λ y... h (g2 y) ... in ...f 1 2..... f 3 4 ...
into
let f = λ x y. let h = f (g1 x) in ... h (g2 y) ... in ...f 1 2..... f 3 4 ...
If g1
is cheap, this is already done (see [Arity analysis]
in CoreArity
). But in order to implement foldl
in terms of foldr
and get list fusion for it (which would be nice), this needs to work also when g1
is expensive.
In a slightly more general setting, the transformation would not be ok, e.g. in
let f = λ x. let h = f (g1 x) in λ y... h (g2 y) ... in ...map (f 1)....
or
let f = λ x. let h = f (g1 x) in λ y... map h... in ...f 1 2..... f 3 4 ...
because the expensive g1
would no longer be shared.
So we want an analysis that
- looks at the call-sites of
f
- notices that the demand put on
f
from the body of the let,C(C¹(S))
, is actually better than the vanilla demandC(S)
, - makes sure that with this demand put on
f
, its definition (the rhs) now also has this demand (i.e. no sharing lost in the right hand sid), - uses this to mark the
λ y
as one-shot, which will presumably allow the desired optimization.
The demand analyser currently analyses the function before the body. One could do a second round, but there is a risk of exponential blow-up.
comment:4 Changed 6 years ago by
The demand analyser currently analyses the function before the body. One could do a second round, but there is a risk of exponential blow-up.
In this case, the let is a recursive let anyways, so we are going to do multiple iterations. Do we really need to avoid re-analysis so hard?
comment:5 Changed 6 years ago by
I gave it a shot, analysing the body and the rhs of a let in alternation until a fixed point is found with regard to
- demand on the rhs
- rhs’s demand type
- body’s demand type
and as expected, the resulting core looks good:
$wgo_s1oi [Occ=LoopBreaker] :: GHC.Prim.Int# -> GHC.Prim.Double# -> GHC.Prim.Double# -> Data.Complex.Complex GHC.Types.Double [LclId, Arity=3, Str=DmdType <S,U><L,U><L,U>] $wgo_s1oi = \ (w_s1o3 :: GHC.Prim.Int#) (ww1_s1oa :: GHC.Prim.Double#) (ww2_s1of :: GHC.Prim.Double#) -> case Foo.f (GHC.Types.I# w_s1o3) of _ [Occ=Dead] { Data.Complex.:+ ww4_s1nS ww5_s1nX -> case ww4_s1nS of _ [Occ=Dead] { GHC.Types.D# ww7_s1re -> case ww5_s1nX of _ [Occ=Dead] { GHC.Types.D# ww9_s1rg -> case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.==# w_s1o3 ww_s1om) of _ [Occ=Dead] { GHC.Types.False -> $wgo_s1oi (GHC.Prim.+# w_s1o3 1) (GHC.Prim.+## ww1_s1oa ww7_s1re) (GHC.Prim.+## ww2_s1of ww9_s1rg); GHC.Types.True -> Data.Complex.:+ @ GHC.Types.Double (GHC.Types.D# (GHC.Prim.+## ww1_s1oa ww7_s1re)) (GHC.Types.D# (GHC.Prim.+## ww2_s1of ww9_s1rg)) } } } }; } in $wgo_s1oi 1 0.0 0.0;
But also as expected, for other code, with more nested lets, compilation time becomes unacceptably large.
Has anyone considered doing the fixed-pointing not locally for every let, but across the whole top-level-binding? I.e. initialize all demands and strictness signatures with their strongest possible value (absDmd
and botDmdType
), and then analyse the whole expression in repeated passes over all of it, until nothing changes any more. (Or, a variant: Do that initialisation, but also do local fixed-pointing, starting from the stored values, so that when an inner let is re-analysed, but nothing has changed, then there is only one iteration.)
comment:6 Changed 6 years ago by
(Or, a variant: Do that initialisation, but also do local fixed-pointing, starting from the stored values, so that when an inner let is re-analysed, but nothing has changed, then there is only one iteration.)
Ah, so that is what ae_virigin
is all about... doing the same trick in my code helps in keeping compilation times down; but I must be introducing bugs somewhere; compiled with my code GHC itself goes into infinite loops – but that is a problem for another day.
comment:7 Changed 6 years ago by
Ok, here is a very different approach to solving the original problem of this ticket, namely making foldl
a good consumer.
Instead of trying hard to make the compiler sufficiently smart, why not help him a little bit? So here is what I did:
I created a new magic function oneShot
(similar to lazy
etc.). Semantically, oneShot = λ f λ x. f x
, but the unfolding will put a OneShot
annotation on the binder for x. So essentially, it allows the programmer to tell the compiler: I promise I will only call this function once (and if I break my promise, I won’t blame you).
So now one would define
foldl k z xs = foldr (\v fn -> oneShot (\z -> fn (k z v))) id xs z
(because at this point, we know that the result of foldr ... id xs
is not going to be used multiple times) and voilà – good code:
Foo.$wfoo :: GHC.Prim.Int# -> (# GHC.Types.Double, GHC.Types.Double #) [GblId, Arity=1, Str=DmdType <L,U>, Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=1, Value=True, ConLike=True, WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 249 30}] Foo.$wfoo = \ (ww_s1ox :: GHC.Prim.Int#) -> case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.># 1 ww_s1ox) of _ [Occ=Dead] { GHC.Types.False -> letrec { $wgo_s1ot [Occ=LoopBreaker] :: GHC.Prim.Int# -> GHC.Prim.Double# -> GHC.Prim.Double# -> (# GHC.Types.Double, GHC.Types.Double #) [LclId, Arity=3, Str=DmdType <S,U><L,U><L,U>] $wgo_s1ot = \ (w_s1oc :: GHC.Prim.Int#) (ww1_s1oj :: GHC.Prim.Double#) (ww2_s1oo :: GHC.Prim.Double#) -> case Foo.f (GHC.Types.I# w_s1oc) of _ [Occ=Dead] { Data.Complex.:+ ww8_aYl ww9_aYm -> case ww8_aYl of _ [Occ=Dead] { GHC.Types.D# ww11_s1sa -> case ww9_aYm of _ [Occ=Dead] { GHC.Types.D# ww13_s1sc -> case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.==# w_s1oc ww_s1ox) of _ [Occ=Dead] { GHC.Types.False -> $wgo_s1ot (GHC.Prim.+# w_s1oc 1) (GHC.Prim.+## ww1_s1oj ww11_s1sa) (GHC.Prim.+## ww2_s1oo ww13_s1sc); GHC.Types.True -> (# GHC.Types.D# (GHC.Prim.+## ww1_s1oj ww11_s1sa), GHC.Types.D# (GHC.Prim.+## ww2_s1oo ww13_s1sc) #) } } } }; } in $wgo_s1ot 1 0.0 0.0; GHC.Types.True -> (# Foo.foo1, Data.Complex.$fFloatingComplex1 #) }
comment:8 Changed 6 years ago by
Hmm, it seems that the oneShot-information is thrown away before generating the interface, so the use of oneShot
does not help across module boundaries for now; this is the unfolding
foldl :: forall a b. (b -> a -> b) -> b -> [a] -> b {- Arity: 3, HasNoCafRefs, Strictness: <L,C(C1(U))><L,U><S,1*U>, Inline: INLINE (sat-args=3), Unfolding: InlineRule (3, False, False) (\ @ a @ b k :: b -> a -> b z :: b xs :: [a] -> GHC.Base.foldr @ a @ (b -> b) (\ ds :: a ds1 :: b -> b tpl :: b -> ds1 (k tpl ds)) (GHC.Base.id @ b) xs z) -}
comment:9 Changed 6 years ago by
The trick with oneShot
is neat, and it works for foo x = sum $ [f i | i <- [1 .. x]]
and foo x = sum $ [f i | i <- [1 .. x] , odd i ]
(note the filter), but does not yield optimal code for nested iterations like foo x = sum $ concat [[f i | i <- [1 .. n]]| n <- [1..x]]
, where we get:
letrec { go_a1mU [Occ=LoopBreaker] :: GHC.Prim.Int# -> Data.Complex.Complex GHC.Types.Double -> Data.Complex.Complex GHC.Types.Double [LclId, Arity=1, Str=DmdType <L,U>] go_a1mU = \ (x_a1mV :: GHC.Prim.Int#) -> let { n_X1nv :: Data.Complex.Complex GHC.Types.Double -> Data.Complex.Complex GHC.Types.Double [LclId, Str=DmdType] n_X1nv = case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.==# x_a1mV ww_s1ro) of _ [Occ=Dead] { GHC.Types.False -> go_a1mU (GHC.Prim.+# x_a1mV 1); GHC.Types.True -> GHC.Base.id @ (Data.Complex.Complex GHC.Types.Double) } } in case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.># 1 x_a1mV) of _ [Occ=Dead] { GHC.Types.False -> letrec { go1_X1nG [Occ=LoopBreaker] :: GHC.Prim.Int# -> Data.Complex.Complex GHC.Types.Double -> Data.Complex.Complex GHC.Types.Double [LclId, Arity=2, Str=DmdType <L,U><L,1*U(U(U),U(U))>] go1_X1nG = \ (x1_X1nI :: GHC.Prim.Int#) (eta_B1 :: Data.Complex.Complex GHC.Types.Double) -> let { a_s1pu [Dmd=<L,U(U(U),U(U))>] :: Data.Complex.Complex GHC.Types.Double [LclId, Str=DmdType] a_s1pu = case eta_B1 of _ [Occ=Dead] { Data.Complex.:+ ww2_a10M ww3_a10N -> case ww2_a10M of _ [Occ=Dead] { GHC.Types.D# ww5_s1ub -> case ww3_a10N of _ [Occ=Dead] { GHC.Types.D# ww7_s1ud -> case Foo.f (GHC.Types.I# x1_X1nI) of _ [Occ=Dead] { Data.Complex.:+ ww9_a10Z ww10_a110 -> case ww9_a10Z of _ [Occ=Dead] { GHC.Types.D# ww12_s1uf -> case ww10_a110 of _ [Occ=Dead] { GHC.Types.D# ww14_s1uh -> Data.Complex.:+ @ GHC.Types.Double (GHC.Types.D# (GHC.Prim.+## ww5_s1ub ww12_s1uf)) (GHC.Types.D# (GHC.Prim.+## ww7_s1ud ww14_s1uh)) } } } } } } } in case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.==# x1_X1nI x_a1mV) of _ [Occ=Dead] { GHC.Types.False -> go1_X1nG (GHC.Prim.+# x1_X1nI 1) a_s1pu; GHC.Types.True -> n_X1nv a_s1pu }; } in go1_X1nG 1; GHC.Types.True -> n_X1nv }; } in go_a1mU 1 Foo.foo1;
I guess we’d want n_X1nv
to have arity one here, and be eta-expanded, so that it turns into a join-point, do we?
comment:10 Changed 6 years ago by
Interesting: Even without the oneShot
marker surviving the unfolding, nofib shows some extreme changes in allocations:
Program Size Allocs Runtime Elapsed TotalMem -------------------------------------------------------------------------------- bernouilli +0.1% -4.2% 0.12 0.12 +0.0% fft2 +0.0% +125.4% 0.04 0.05 +0.0% gen_regexps +0.0% -9.4% 0.00 0.00 +0.0% integrate +0.1% +144.2% 0.14 0.14 +1.0% minimax +0.0% -3.8% 0.00 0.00 +0.0% simple +0.1% -7.5% 0.15 0.15 -14.7% x2n1 +0.1% -45.2% 0.00 0.00 -50.0% -------------------------------------------------------------------------------- Min -0.9% -45.2% -27.5% -27.7% -50.0% Max +0.2% +144.2% +3.1% +3.1% +100.0% Geometric Mean +0.0% +0.8% -5.9% -5.7% -0.3%
comment:11 Changed 6 years ago by
Heh, in the previous numbers, I passed the arguments to nofib-analyze
in the wrong order. So please swap signs.
I now made sure that the oneShot
makes it into the unfolding for foldl
, which is a clear win. Now, relative to master
again, I get this very nice result:
-------------------------------------------------------------------------------- Program Size Allocs Runtime Elapsed TotalMem bernouilli -0.1% +1.5% 0.14 0.14 +0.0% cryptarithm2 -0.0% -0.5% 0.01 0.01 +0.0% fem -0.0% -2.8% 0.02 0.02 +0.0% fft2 -0.1% -55.9% 0.04 0.04 +0.0% gen_regexps -0.0% -33.6% 0.00 0.00 +0.0% integrate -0.1% -59.4% 0.13 0.13 -1.0% minimax -0.0% -15.6% 0.00 0.00 +0.0% nucleic2 +0.9% +1.7% 0.05 0.05 +0.0% scs -0.0% -0.9% +1.0% +0.5% +0.0% simple -0.1% -9.4% 0.14 0.14 +0.0% x2n1 -0.1% -74.5% 0.01 0.01 +0.0% -------------------------------------------------------------------------------- Min -0.2% -74.5% -3.0% -3.0% -50.0% Max +0.9% +1.7% +4.0% +3.4% +10.4% Geometric Mean -0.0% -3.7% -0.2% -0.3% -0.6%
So it seems to be well worth turning foldl
into a good consumer, even if the resulting code is not perfect for complicated cases like foldl .. .. $ concat $ ..
.
The increase for bernouilli
comes from this code line:
sum $ zipWith (*) powers (tail $ tail combs)
where originally, because sum
is not a good consumer, no list fusion happens and a sum
specialized to Integer
is used, while after the change, sum
is fused with zipWith
, but not further, so now we have
foldr2 (\x y r eta -> r (eta + (x * y))) id powers (tail $ tail combs) 0.
This means that we are have elimiated one intermediate list, but we are allocating PAP ’s on each iteration, which is more expensive (three arguments instead of two!). This is verified by looking at ticky numbers. Maybe foldr2 should have been inlined here. Or we just live with it.
Is this (explicit oneShot
ness-annotations using a magic function) a path we want to go?
comment:12 Changed 6 years ago by
I have come up with a stand-alone analysis that tries to find out the “caller arity” of a function. Quite a bit like the usage demand, but different from the demand analyser in a few aspects: It does not bother about the demand put on function arguments by a function (so much less fixed-pointing due to that), and it currently looks only for tail-calls (or almost-tail-calls; some post-processing of return values a after the “tail call” is ok, as long as it does not do any relevant calls).
It is an analysis phase that puts the result in the IdInfo, and the simplifier will eta-expand functions where the „caller arity” is higher than the manifest arity.
It has zero effect on the current code base.
If I implement foldl
via foldr
(but without the use of oneShot
), I get essentially the same as the result as without the analysis, but with the explicit oneShot
:
Min -0.2% -74.5% -3.0% -3.0% -50.0% Max +0.9% +1.8% +2.5% +2.5% +14.8% Geometric Mean -0.0% -3.7% -0.3% -0.3% -0.5%
It still does not help with bernoulli
, and probably nothing will: Implementing foldl
as foldr
will always yield bad code if list fusion does not work all the way and the build
(or in this case, build2
) is not completely gone. But if list fusion works, the results can be very good
The oneShot
change is much simpler, has little risk of introducing bugs and does not increase compilation times. The “caller arity” pass, OTOH, may optimise other code as well (but at least nofib does not seem to contain much such code), and can likely be made more precise. But the conclusion is that foldl
can be made a good consumer!
comment:13 Changed 6 years ago by
JFTR: Simon discussed this with me and encouraged me to pursue the analysis (but don’t merge it before 7.8 has been branched off).
But even more so I’d like to record a few things about the oneShot
approach:
- Would also be nice, and can still be revived if we find cases where the analysis fails.
- It should always be inlined when passed two arguments.
- Could be made to inline very very late, but then it should have a strictness signature.
- In that case, one would have to make sure it does not get in the way of CPR or other analysis.
comment:15 Changed 6 years ago by
With the patch above, the benefits from foldl-via-buildr and the call arity analysis are
Min -0.2% -74.5% -9.6% -10.6% -50.0% Max +0.9% +1.7% +4.1% +4.1% +6.3% Geometric Mean -0.0% -4.0% -0.6% -0.7% -0.7%
with only one increase: nucleic2
. It comes from a use of maximum
, and it goes away when that is changed from NOINLINE [1]
to INLINE [1]
(GHC was inlining maximum
while foldl
was implemented naively, but not any more, so let's give it a push). Then we get the very pleasing result of no single benchmark with allocation increase:
Min -0.1% -74.5% -6.8% -8.3% -50.0% Max +0.2% 0.0% +38.5% +38.5% 0.0% Geometric Mean -0.0% -4.1% +7.7% +7.7% -0.8%
This is the current state in the branch wip/T7994-calledArity
of ghc, and the branch wip/T7994
of base.
The effect on compilation times needs to be measured, and quite possibly improved (e.g. not a full simplifier run after it, but only eta-expansion where possible and then a gentle simplification of the RHS).
comment:16 Changed 6 years ago by
Previously I said that my analysis makes no change in nofib if one does not also change foldl
to use foldr
. That is not true anymore with the latest code (but almost so), there are a few allocation improvemts, up to -1.3%
(in anna
, stemming from take 30 [... | ...]
) – not a surprise, take
is defined with a continuation passing takeFB :: (a -> b -> b) -> b -> a -> (Int# -> b) -> Int# -> b
.
comment:17 Changed 6 years ago by
Despite my initial worries, this analysis is actually quite cheap: A 1.8% increase in compile time (with unchanged libraries), according to nofib (is that the best way to measure it)?
Is it worth improving that at the expense of more code in GHC? E.g replace the simplifier pass after CallArity by custom one that just does eta-expansion, and only gently simplifies the expression therein? Or should I just leave it like this?
comment:18 Changed 6 years ago by
Another measurement of compile time changes, this time of running make
in a clean an configured GHC tree: changes from 31m34 to 32m7 an increase of 1,75%, confirming the numbers measured in nofib.
comment:19 Changed 6 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Heh, forgot to reference the ticket in the push. This has now been implemented in b63facef165b957183b65604ef99b2b8574747a5/base and requires the calla arity analysis (cdceadf365335fdee5ffa8e6364f7fb00bc4b0f7/ghc).
This should not stop us from using superior approaches like Takano’s when they are ready.
comment:20 Changed 6 years ago by
JFTR: I’m currently working on a presentation and short paper on this, and did some measurements, and found that while making foldl
a good consumer has a nice effect on allocations (-4.1%
, the overall effect on the runtime is negligible).
Effect of Call Arity (with the original foldl):
min mean max Allocations -1.3% -0.1% 0.0% Runtime -10.8% -0.9% +2.0%
Effect of Call Arity + defining foldl with foldr:
min mean max Allocations -74.5% -4.1% 0.0% Runtime -10.4% -0.9% +1.7%
(Wishful thinking: Maybe there are real-world programs that would benefit more noticeably, but they are just not included in nofib)
comment:21 Changed 6 years ago by
Updated numbers, now running nofib in slow mode. Now the results look slightly better. Guess that is what benchmarking is about: Tweaking the tests until the results are what we want them to be:
Effect of Call Arity (with the original foldl):
min mean max Allocations -1.3% -0.0% 0.0% Runtime -4.0% -0.0% +4.9%
Effect of Call Arity + defining foldl with foldr:
min mean max Allocations -79.0% -5.2% 0.0% Runtime -47.4% -1.9% +3.0%
Yesterday it looked as if this has magically improved, but at least with the example from the ticket, put into self-contained as follows:
I still get bad code: