Opened 4 years ago

Closed 4 years ago

#10711 closed bug (invalid)

Defining mapM_ in terms of traverse_ causes substantial blow-up in ByteCodeAsm

Reported by: bgamari Owned by:
Priority: normal Milestone:
Component: Compiler Version:
Keywords: Cc: core-libraries-committee@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case: ghcirun004
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Phab:D924 has proposed that we redefine mapM_, currently,

mapM_ = foldr ((>>) . f) (return ())

as,

mapM_ = foldr ((*>) . f) (return ()) = traverse_

as part of the AMP proposal.

However, this appears to have severe effects on the performance characteristics of the Assembler monad defined in ByteCodeAsm. In particular, the mapM_ use in ByteCodeAsm.assembleBCO blows up severely, increasing the runtime of the ghcirun004 testcase from 4 seconds to over 5 minutes.

Intriguingly, defining (*>) = (>>) in Assembler's Applicative instance (as done in Phab:D1097) restores reasonable runtime.

Change History (13)

comment:1 Changed 4 years ago by bgamari

Here is a minimal testcase demonstrating the difference.

The efficient definition of mapM_ (using (>>)) produces this Core,

mapA_3 :: Assembler ()
mapA_3 = Main.Pure ()

doTestM_go :: [Assembler Integer] -> Assembler ()
doTestM_go =
  \ ds_a1XT ->
    case ds_a1XT of _ {
      [] -> mapA_3;
      : y_a1XY ys_a1XZ ->
        let {
          k_a1VW
          k_a1VW = doTestM_go ys_a1XZ } in
        $c>>= y_a1XY (\ _ -> k_a1VW)
    }

doTestM :: Assembler ()
doTestM = doTestM_go test

Whereas the slower Applicative-based definition produces,

mapA_3 :: Assembler ()
mapA_3 = Main.Pure ()

mapA_2 :: Assembler (() -> ())
mapA_2 = Main.Pure id

lvl4_r3JE :: Integer -> Assembler (() -> ())
lvl4_r3JE = \ _ -> Main.mapA_2

doTestA_go :: [Assembler Integer] -> Assembler ()
doTestA_go =
  \ ds_a1XT ->
    case ds_a1XT of _ {
      [] -> mapA_3;
      : y_a1XY ys_a1XZ ->
        let {
          m2_a1W7
          m2_a1W7 = doTestA_go ys_a1XZ } in
        $c>>=
          ($c>>= y_a1XY lvl4_r3JE)
          (\ x1_a1W8 -> $c>>= m2_a1W7 (\ x2_a1W9 -> Pure (x1_a1W8 x2_a1W9)))
    }

doTestA :: Assembler ()
doTestA = doTestA_go test

Note the three (>>=) uses in the applicative version, compared to the single invocation in the monadic version.

comment:2 Changed 4 years ago by nomeata

Indeed, one can go from the slow code to the efficient, by just applying all three of the monad laws in the right order :-)

comment:3 Changed 4 years ago by nomeata

BTW, here are the relevant definitions that add up to three >>= for one *>:

a1 *> a2   = (id <$ a1) <*> a2
(<$)       = fmap . const
fmap       = liftM
liftM f m1 = do { x1 <- m1; return (f x1) } -- here is one
(<*>)      = ap
ap m1 m2   = do { x1 <- m1; x2 <- m2; return (x1 x2) } -- here are two

It seems that performance-worrying code should simply not implement Functor and Applicative via the Monad-derived methods. Or at least also add (*>) = (>>) to make things less worse in this case.

Last edited 4 years ago by nomeata (previous) (diff)

comment:4 Changed 4 years ago by nomeata

Of course, one might expect the compiler to derive the good implementation from the bad. It would be valid, assuming the monad laws hold (which we generally do not do):

   a1 *> a2
== (id <$ a1) <*> a2            -- inline *>
== fmap (const id) a1 <*> a2    -- inline <$
== liftM (const id) a1 <*> a2   -- inline fmap
== (a1 >>= (\x1 -> return (const id x1))) <*> a2 -- liftM
== (a1 >>= (\_ -> return id)) <*> a2 -- inline const
== (a1 >>= (\_ -> return id)) >>= (\x2 -> a2 >>= (\x3 -> return (x2 x3))) -- inline <*> and ap
== a1 >>= (\_ -> return id >>= (\x2 -> a2 >>= (\x3 -> return (x2 x3)))) -- assoc monad law
== a1 >>= (\_ -> a2 >>= (\x3 -> return (id x3))) -- first monad law
== a1 >>= (\_ -> a2 >>= return) -- inline id, eta-contract
== a1 >>= (\_ -> a2) -- second monad law

Maybe a bit out of reach for our current simplification infrastructure, where for example the methods >>= and return will quickly be replaced by their concrete implementations

comment:5 Changed 4 years ago by rwbarton

I have to say that I still don't understand exactly why

   a1 *> a2
== (a1 >>= (\_ -> return id)) >>= (\x2 -> a2 >>= (\x3 -> return (x2 x3)))

is more than a constant (say 10 times) slower than a1 >> a2 for this Assembler monad.

Experimentally bgamari's test program does ~n2 allocations and takes ~n3 total time in the Applicative version, while the Monad version runs in linear allocations and time.

comment:6 in reply to:  5 Changed 4 years ago by simonpj

Replying to rwbarton:

I have to say that I still don't understand exactly why

   a1 *> a2
== (a1 >>= (\_ -> return id)) >>= (\x2 -> a2 >>= (\x3 -> return (x2 x3)))

is more than a constant (say 10 times) slower than a1 >> a2 for this Assembler monad.

Me neither! An articulate explanation from someone would be v helpful

comment:7 Changed 4 years ago by nomeata

Experimentally bgamari's test program does ~n2 allocations and takes ~n3 total time in the Applicative version, while the Monad version runs in linear allocations and time.

That and the fact that in comment:4 I need the associativity law, sounds like there is a quadratic behavior due to wrongly associated binds.

Let’s see if I can evaluate my way by hand through it.

-- These fragments from bgamari’s test case
let t n = Thing n cr2
let cr2 = const $ return 2

   run (t 1 >> (t 2 >> t 3))
== run (Thing 1 (cr2 >=> (\_ -> (t 2 >> t 3))))
== run ((cr2 1 >=> (\_ -> (t 2 >> t 3))) 1)
== run (cr2 1 >>=  (\_ -> (t 2 >> t 3)))
== run (return 2 >>= (\_ -> (t 2 >> t 3)))
== run ((\_ -> (t 2 >> t 3)) 2)
== run (t 2 >> t3)
== run (Thing 2 (cr2 >=> (\_ -> t 3)))
== run ((cr2 2 >=> (\_ -> t 3)) 1)
== run (cr2 2 >>=  (\_ -> t 3))
== run (return 2 >>= (\_ -> t 3))
== run ((\_ -> t 3) 2)
== run (t 3)
== run (Thing 3 cr2)
== run (cr2 3)
== run (return 2)
== 2

For the applicative version, based on the empirical implementation, I assume that some parts of the code are kept alive for too long, and possibly be traversed multiple times. So here we go:

let cri = \_ -> return id)
let ri  = (\x -> return (id x))

   run (t 1 *> (t 2 *> t 3))
== run ((t 1 >>= cri) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run ((Thing 1 (cr2 >=> cri)) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))))
== run (Thing 1 ((cr2 >=> cri) >=> (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))))
== run (((cr2 >=> cri) >=> (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))) 1)
== run (((cr2 >=> cri) >=> (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))) 1)
== run (((cr2 >=> cri) 1 >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))))
== run (((cr2 1 >>= cri) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))))
== run (((return 2 >>= cri) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))))
== run (((cri 2) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))))
== run (((return id) >>= (\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3)))))
== run ((\x2 -> (t 2 *> t 3) >>= (\x3 -> return (x2 x3))) id)
== run ((t 2 *> t 3) >>= ri) -- *¹
== run (((t 2 >>= cri) >>= (\x2 -> t3) >>= (\x3 -> return (x2 x3)))) >>= ri)
== ... -- as before
== run ((t 3 >>= ri) >>= ri)
== run (Thing 3 (cr2 >=> ri) >>= ri)
== run (Thing 3 ((cr2 >=> ri) >=> ri)) -- *²
== run (((cr2 >=> ri) >=> ri) 3)
== run ((cr2 >=> ri) 3 >>= ri)
== run ((cr2 3 >>= ri) >>= ri)
== run ((return 2 >>= ri) >>= ri)
== run (ri 2 >>= ri)
== run (return 2 >>= ri)
== run (ri 2)
== run (return 2)
== 2

: I think this is interesting. Whereas above, run (a >> b) will eventually reduce to run b, here we get run (b >>= complex_return), with one complex_return for every element in the list. This explains at least some of the blow up: We build this chain, and then we have to eventually tear it down again.

And again we traverse this whole chain of pointless ri’s.

Hmm, not sure if and how this exposition explains the quadratic blow up in allocations, though. Do we traverse the stack of >=> ri once per element somehow, similar to a wrongly associated ++? But I don’t see it right now.

comment:8 Changed 4 years ago by nomeata

Another factoid:

  • It does not help to implement <$ (used by the default implementation of *>) manually.
  • It does not help to implement <*> manually
  • So a1 *> a2 = (id <$ a1) <*> a2 indeed seems to be the root of the problem.

Do we need to educate people to write (*>) == (>>) if they write (<*>) = ap?

comment:9 Changed 4 years ago by bgamari

Indeed I went through a similar exercise last night and came to a similiar conclusion: smells fishy but I don't necessarily see anything clearly quadratic.

comment:10 Changed 4 years ago by akio

I think the quadratic behavior comes from the fact that the >>= linearly traverses its LHS.

Let's define the size of an Assembler value with:

size :: Assembler a -> Integer
size (Pure _) = 1
size (Thing i k) = 1 + size (k i)

m >>= f performs O(size(m)) operations because >>= is recursive on its LHS. Most of the cost is hidden inside a lambda, but you will have to pay it eventually, namely when run is finally applied.

Now we can analyze the cost of other operations:

  • m >> n needs O(size(m)) operations, because it's m >>= \_ -> n.
  • f <$> m needs O(size(m)) operations, because it's m >>= return . f.
  • m <*> n needs O(size(m) + size(n)) operations, because it's m >>= \v -> n >>= \w -> return v w.
  • m *> n needs O(size(m) + size(n)) operations, because it's const <$> m <*> n.

If you use mapM_ with a N-element list of 2-sized Assemblers, each application of >> costs O(1), so the total cost is O(N).

If you use mapA_ instead, the LHS of an application of *> is O(1) large but the RHS is O(N) large on average. This means it needs O(N) operations on average, so the total cost is O(N*N).

comment:11 Changed 4 years ago by bgamari

Cc: core-libraries-committee@… added
Resolution: fixed
Status: newclosed

Thanks akio for the analysis!

It seems pretty clear that rewriting mapM_ in terms of traverse_ does not preserve the performance characteristics of user code in general. I'm going to close this and push the matter off to the Core Libraries Committee to decide what implications this might have on their future plans (likely few given they already suspected there was the potential for performance regressions with this change).

comment:12 Changed 4 years ago by bgamari

Owner: bgamari deleted
Resolution: fixed
Status: closednew

comment:13 Changed 4 years ago by bgamari

Resolution: invalid
Status: newclosed
Note: See TracTickets for help on using tickets.