Opened 4 years ago
Closed 4 years ago
#10711 closed bug (invalid)
Defining mapM_ in terms of traverse_ causes substantial blow-up in ByteCodeAsm
Reported by: | bgamari | Owned by: |
---|---|---|---
Priority: | normal | Milestone: |
Component: | Compiler | Version: |
Keywords: | | Cc: | core-libraries-committee@…
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple
Type of failure: | Compile-time performance bug | Test Case: | ghcirun004
Blocked By: | | Blocking: |
Related Tickets: | | Differential Rev(s): |
Wiki Page: | | |
Description
Phab:D924 has proposed that we redefine `mapM_`, currently

```haskell
mapM_ f = foldr ((>>) . f) (return ())
```

as

```haskell
mapM_ f = foldr ((*>) . f) (return ())   -- i.e. mapM_ = traverse_
```

as part of the AMP proposal.
However, this appears to have severe effects on the performance characteristics of the `Assembler` monad defined in `ByteCodeAsm`. In particular, the `mapM_` use in `ByteCodeAsm.assembleBCO` blows up severely, increasing the runtime of the `ghcirun004` testcase from 4 seconds to over 5 minutes.

Intriguingly, defining `(*>) = (>>)` in `Assembler`'s `Applicative` instance (as done in Phab:D1097) restores reasonable runtime.
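For context, here is a minimal sketch of what such a monad looks like, reconstructed from the traces and definitions in the comments below (my own approximation, not the actual `ByteCodeAsm` code): a free-monad-like type whose `(>>=)` recursively rebuilds its left-hand side, with `Functor` and `Applicative` derived from the `Monad` methods.

```haskell
import Control.Monad (ap, liftM, (>=>))

data Assembler a
  = Pure a                          -- a finished computation
  | Thing Int (Int -> Assembler a)  -- an instruction plus its continuation

instance Functor Assembler where
  fmap = liftM                      -- Monad-derived, as discussed below

instance Applicative Assembler where
  pure  = Pure
  (<*>) = ap                        -- Monad-derived; the default
                                    -- a1 *> a2 = (id <$ a1) <*> a2 then applies

instance Monad Assembler where
  Pure a    >>= f = f a
  Thing i k >>= f = Thing i (k >=> f)  -- (>>=) traverses/rebuilds its LHS

run :: Assembler a -> a
run (Pure a)    = a
run (Thing i k) = run (k i)
```

With, say, `thing n = Thing n (\_ -> Pure ())`, `run (mapM_ thing [1..n])` stays linear, while a `traverse_`-style fold goes through the default `(*>)` and blows up as described above.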
Change History (13)
comment:1 Changed 4 years ago by
comment:2 Changed 4 years ago by
Indeed, one can go from the slow code to the efficient one by just applying all three monad laws in the right order :-)
comment:3 Changed 4 years ago by
BTW, here are the relevant definitions that add up to three `>>=` for one `*>`:

```haskell
a1 *> a2 = (id <$ a1) <*> a2
(<$) = fmap . const
fmap = liftM
liftM f m1 = do { x1 <- m1; return (f x1) }           -- here is one
(<*>) = ap
ap m1 m2 = do { x1 <- m1; x2 <- m2; return (x1 x2) }  -- here are two
```
It seems that performance-sensitive code should simply not implement `Functor` and `Applicative` via the `Monad`-derived methods, or at least also add `(*>) = (>>)` to make things less bad in this case.
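The Phab:D1097 workaround then amounts to one extra line in the `Applicative` instance (a sketch, replacing the instance in the reconstruction above):

```haskell
instance Applicative Assembler where
  pure  = Pure
  (<*>) = ap
  (*>)  = (>>)  -- bypass the costly default a1 *> a2 = (id <$ a1) <*> a2
```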
comment:4 Changed 4 years ago by
Of course, one might expect the compiler to derive the good implementation from the bad. The transformation would be valid, assuming the monad laws hold (an assumption we generally do not make):
```haskell
a1 *> a2
== (id <$ a1) <*> a2                                                       -- inline *>
== fmap (const id) a1 <*> a2                                               -- inline <$
== liftM (const id) a1 <*> a2                                              -- inline fmap
== (a1 >>= (\x1 -> return (const id x1))) <*> a2                           -- liftM
== (a1 >>= (\_ -> return id)) <*> a2                                       -- inline const
== (a1 >>= (\_ -> return id)) >>= (\x2 -> a2 >>= (\x3 -> return (x2 x3))) -- inline <*> and ap
== a1 >>= (\_ -> return id >>= (\x2 -> a2 >>= (\x3 -> return (x2 x3))))   -- assoc monad law
== a1 >>= (\_ -> a2 >>= (\x3 -> return (id x3)))                          -- first monad law
== a1 >>= (\_ -> a2 >>= return)                                           -- inline id, eta-contract
== a1 >>= (\_ -> a2)                                                      -- second monad law
```
Maybe a bit out of reach for our current simplification infrastructure, where for example the methods `>>=` and `return` will quickly be replaced by their concrete implementations.
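The chain of equalities can at least be spot-checked on a concrete law-abiding monad. A hedged sketch (the property and its name are mine, not from the ticket), using the list monad:

```haskell
import Test.QuickCheck (quickCheck)

-- For lists, where the monad laws do hold, the fully inlined form of
-- a1 *> a2 from the derivation above coincides with a1 >>= \_ -> a2.
prop_derivation :: [Int] -> [Int] -> Bool
prop_derivation a1 a2 =
  ((a1 >>= \_ -> return id) >>= \x2 -> a2 >>= \x3 -> return (x2 x3))
    == (a1 >>= \_ -> a2)

main :: IO ()
main = quickCheck prop_derivation
```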
comment:5 follow-up: 6 Changed 4 years ago by
I have to say that I still don't understand exactly why

```haskell
a1 *> a2 == (a1 >>= (\_ -> return id)) >>= (\x2 -> a2 >>= (\x3 -> return (x2 x3)))
```

is more than a constant factor (say 10 times) slower than `a1 >> a2` for this `Assembler` monad.
Experimentally, bgamari's test program does ~n^2 allocations and takes ~n^3 total time in the Applicative version, while the Monad version runs in linear allocations and time.
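A sketch of how such numbers might be reproduced (my own harness, not bgamari's actual test program; it reuses the `Assembler` reconstruction from the description, and `mapA_` follows the `foldr`/`(*>)` definition from Phab:D924):

```haskell
import Control.Monad (forM_)
import System.CPUTime (getCPUTime)

-- traverse_-style driver, per the proposed redefinition
mapA_ :: (a -> Assembler b) -> [a] -> Assembler ()
mapA_ f = foldr ((*>) . f) (pure ())

thing :: Int -> Assembler ()
thing n = Thing n (\_ -> Pure ())

timeIt :: String -> Assembler () -> IO ()
timeIt label m = do
  t0 <- getCPUTime
  run m `seq` return ()   -- force the result
  t1 <- getCPUTime
  putStrLn (label ++ ": " ++ show ((t1 - t0) `div` 1000000) ++ " us")

main :: IO ()
main = forM_ [1000, 2000, 4000, 8000] $ \n -> do
  timeIt ("mapM_ " ++ show n) (mapM_ thing [1 .. n])
  timeIt ("mapA_ " ++ show n) (mapA_ thing [1 .. n])
```

If the analysis here is right, doubling `n` should roughly double the `mapM_` times while the `mapA_` times grow much faster.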
comment:6 Changed 4 years ago by
Replying to rwbarton:

> I have to say that I still don't understand exactly why
>
> ```haskell
> a1 *> a2 == (a1 >>= (\_ -> return id)) >>= (\x2 -> a2 >>= (\x3 -> return (x2 x3)))
> ```
>
> is more than a constant factor (say 10 times) slower than `a1 >> a2` for this `Assembler` monad.
Me neither! An articulate explanation from someone would be very helpful.
comment:7 Changed 4 years ago by
> Experimentally, bgamari's test program does ~n^2 allocations and takes ~n^3 total time in the Applicative version, while the Monad version runs in linear allocations and time.
That, and the fact that in comment:4 I need the associativity law, sounds like there is quadratic behavior due to wrongly associated binds.
Let's see if I can evaluate my way through it by hand.
```haskell
-- These fragments are from bgamari's test case
let t n = Thing n cr2
let cr2 = const $ return 2

run (t 1 >> (t 2 >> t 3))
== run (Thing 1 (cr2 >=> (\_ -> (t 2 >> t 3))))
== run ((cr2 >=> (\_ -> (t 2 >> t 3))) 1)
== run (cr2 1 >>= (\_ -> (t 2 >> t 3)))
== run (return 2 >>= (\_ -> (t 2 >> t 3)))
== run ((\_ -> (t 2 >> t 3)) 2)
== run (t 2 >> t 3)
== run (Thing 2 (cr2 >=> (\_ -> t 3)))
== run ((cr2 >=> (\_ -> t 3)) 2)
== run (cr2 2 >>= (\_ -> t 3))
== run (return 2 >>= (\_ -> t 3))
== run ((\_ -> t 3) 2)
== run (t 3)
== run (Thing 3 cr2)
== run (cr2 3)
== run (return 2)
== 2
```
For the applicative version, based on the observed behavior, I assume that some parts of the code are kept alive for too long, and possibly traversed multiple times. So here we go:
```haskell
let cri = \_ -> return id
let ri  = \x -> return (id x)

run (t 1 *> (t 2 *> t 3))
== run ((t 1 >>= cri) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run ((Thing 1 (cr2 >=> cri)) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run (Thing 1 ((cr2 >=> cri) >=> (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))))
== run (((cr2 >=> cri) >=> (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))) 1)
== run ((cr2 >=> cri) 1 >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run ((cr2 1 >>= cri) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run ((return 2 >>= cri) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run (cri 2 >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run (return id >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run ((\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))) id)
== run ((t 2 *> t 3) >>= ri)                                      -- *¹
== run (((t 2 >>= cri) >>= (\x2 -> t 3 >>= (\x3 -> return (x2 x3)))) >>= ri)
== ...                                                            -- as before
== run ((t 3 >>= ri) >>= ri)
== run (Thing 3 (cr2 >=> ri) >>= ri)
== run (Thing 3 ((cr2 >=> ri) >=> ri))                            -- *²
== run (((cr2 >=> ri) >=> ri) 3)
== run ((cr2 >=> ri) 3 >>= ri)
== run ((cr2 3 >>= ri) >>= ri)
== run ((return 2 >>= ri) >>= ri)
== run (ri 2 >>= ri)
== run (return 2 >>= ri)
== run (ri 2)
== run (return 2)
== 2
```
*¹: I think this is interesting. Whereas above, `run (a >> b)` will eventually reduce to `run b`, here we get `run (b >>= complex_return)`, with one `complex_return` for every element in the list. This explains at least some of the blow-up: we build this chain, and then we have to eventually tear it down again.
*²: And again we traverse this whole chain of pointless `ri`'s.
Hmm, I am not sure if and how this exposition explains the quadratic blow-up in allocations, though. Do we traverse the stack of `>=> ri` once per element somehow, similar to a wrongly associated `++`? I don't see it right now.
comment:8 Changed 4 years ago by
Another factoid:
- It does not help to implement `<$` (used by the default implementation of `*>`) manually.
- It does not help to implement `<*>` manually.
- So `a1 *> a2 = (id <$ a1) <*> a2` indeed seems to be the root of the problem.

Do we need to educate people to write `(*>) = (>>)` whenever they write `(<*>) = ap`?
comment:9 Changed 4 years ago by
Indeed, I went through a similar exercise last night and came to a similar conclusion: it smells fishy, but I don't necessarily see anything clearly quadratic.
comment:10 Changed 4 years ago by
I think the quadratic behavior comes from the fact that `>>=` linearly traverses its LHS.
Let's define the size of an `Assembler` value with:
```haskell
size :: Assembler a -> Integer
size (Pure _)    = 1
size (Thing i k) = 1 + size (k i)
```
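For concreteness, a worked example (mine, using `t` and `cr2` from the traces in comment:7):

```haskell
-- size (t 1)        == 2   -- Thing 1 cr2, and cr2 1 = return 2 = Pure 2
-- size (t 1 >> t 2) == 3   -- t 1 >> t 2 = Thing 1 (cr2 >=> \_ -> t 2)
-- in general, a chain of n things built with (>>) has size n + 1
```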
`m >>= f` performs `O(size(m))` operations because `>>=` is recursive on its LHS. Most of the cost is hidden inside a lambda, but you will have to pay it eventually, namely when `run` is finally applied.
Now we can analyze the cost of other operations:
- `m >> n` needs `O(size(m))` operations, because it's `m >>= \_ -> n`.
- `f <$> m` needs `O(size(m))` operations, because it's `m >>= return . f`.
- `m <*> n` needs `O(size(m) + size(n))` operations, because it's `m >>= \v -> n >>= \w -> return (v w)`.
- `m *> n` needs `O(size(m) + size(n))` operations, because it's `(id <$ m) <*> n`.
If you use `mapM_` with an `N`-element list of 2-sized `Assembler`s, each application of `>>` costs `O(1)`, so the total cost is `O(N)`.
If you use `mapA_` instead, the LHS of each application of `*>` is `O(1)` large but the RHS is `O(N)` large on average. This means each application needs `O(N)` operations on average, so the total cost is `O(N*N)`.
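The two fold shapes make this visible; a sketch (my own expansion of the `foldr` definitions from the description):

```haskell
-- mapM_ f [x1,x2,x3] = f x1 >> (f x2 >> (f x3 >> return ()))
--   each (>>) pays only for its O(1)-sized LHS              => O(N) total
--
-- mapA_ f [x1,x2,x3] = f x1 *> (f x2 *> (f x3 *> pure ()))
--   each (*>) pays for both sides, and the RHS of the
--   outer applications has size O(N)                        => O(N^2) total
```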
comment:11 Changed 4 years ago by
Cc: | core-libraries-committee@… added |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Thanks akio for the analysis!
It seems pretty clear that rewriting `mapM_` in terms of `traverse_` does not preserve the performance characteristics of user code in general. I'm going to close this and hand the matter off to the Core Libraries Committee to decide what implications it has for their future plans (likely few, given that they already suspected this change could regress performance).
comment:12 Changed 4 years ago by
Owner: | bgamari deleted |
---|---|
Resolution: | fixed |
Status: | closed → new |
comment:13 Changed 4 years ago by
Resolution: | → invalid |
---|---|
Status: | new → closed |
Here is a minimal testcase demonstrating the difference.

The efficient definition of `mapM_` (using `(>>)`) produces one piece of Core, whereas the slower `Applicative`-based definition produces another; note the three `(>>=)` uses in the applicative version, compared to the single invocation in the monadic version.