Opened 3 years ago

Closed 3 years ago

#11795 closed bug (fixed)

Performance issues with replicateM_

Reported by: snoyberg
Owned by: snoyberg
Priority: normal
Milestone: 8.0.1
Component: libraries/base
Version: 7.10.3
Keywords:
Cc:
Operating System: MacOS X
Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s): Phab:D2086
Wiki Page:

Description

When working on optimizing a program by minimizing allocations, I ran into an issue with replicateM_. Consider the following code:

import Control.Monad (replicateM_)
import Foreign.C.String (withCString)
import Foreign.Storable (peek)

main :: IO ()
main = withCString "foo" $ replicateM_ 10000000 . peek

When I run this program, I get:

160,042,656 bytes allocated in the heap
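
For reference, that figure is the kind of summary GHC's RTS statistics report produces; a minimal way to reproduce it, assuming the program above is saved as Main.hs, is:

ghc -O2 Main.hs -rtsopts
./Main +RTS -s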

The result is the same whether I compile with -O0, -O, or -O2, and, as expected, the total allocation increases or decreases with the number of times I replicate the action. On the other hand, replacing replicateM_ with the hand-written version below brings the program's total allocations down to only 42KB, a figure that does not grow with the number of replications.

replicateM_ :: Monad m => Int -> m a -> m ()
replicateM_ cnt0 f =
    loop cnt0
  where
    loop cnt
        | cnt <= 0  = return ()
        | otherwise = f >> loop (cnt - 1)

By contrast, Control.Monad.replicateM_ looks like:

replicateM_       :: (Monad m) => Int -> m a -> m ()
{-# INLINEABLE replicateM_ #-}
{-# SPECIALISE replicateM_ :: Int -> IO a -> IO () #-}
{-# SPECIALISE replicateM_ :: Int -> Maybe a -> Maybe () #-}
replicateM_ n x   = sequence_ (replicate n x)

I can't see an advantage to this implementation over the more direct one I've provided above. Unless there are objections, I'll send a patch switching to the direct version. (Since master already uses Applicative, I'll also generalize the function signature accordingly.)
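
A sketch of the generalized version I have in mind (the exact wording and pragmas are illustrative; the final patch may differ):

replicateM_ :: Applicative m => Int -> m a -> m ()
replicateM_ cnt0 f =
    loop cnt0
  where
    loop cnt
        | cnt <= 0  = pure ()    -- negative counts behave like 0
        | otherwise = f *> loop (cnt - 1)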

Change History (12)

comment:1 Changed 3 years ago by snoyberg

Owner: set to snoyberg

comment:2 Changed 3 years ago by simonpj

Michael, this is (sadly) a well known issue. #1168 has a list of related tickets, and #9388 has ideas and preliminary work on how to limit the scope of the hack.

I think it'd be fine to switch implementations of replicateM_, but can you put a prominent notice on the master ticket #1168 so that whoever looks into it doesn't think "oh the problem has gone away". Your fix will (helpfully) cure the symptom but not the disease.

comment:3 Changed 3 years ago by snoyberg

Thanks for the quick response, Simon. I'm definitely not deeply familiar with the state hack, but from what I can see, the issue I'm reporting here may be orthogonal. Specifically, if I pass -fno-state-hack, I still see the large-allocation behavior. I believe the issue here is twofold:

  • The SPECIALISE pragma is getting in the way of the inlining firing
  • Generating and then consuming a list is less efficient than the simpler loop I've provided here (sketched below)

I'll be happy to add comments to any other issues as you see fit, but I don't want to add confusion to an already complicated issue if this is in fact separate from the state hack.
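
For concreteness, a sketch of the list-based shape being compared against the direct loop above (the expansion of sequence_ is approximate, not the exact base source, and listBased is an illustrative name):

-- Builds an n-element list of actions and then folds it away;
-- for lists, sequence_ is roughly foldr (>>) (return ()).
listBased :: Monad m => Int -> m a -> m ()
listBased n x = sequence_ (replicate n x)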

comment:4 Changed 3 years ago by snoyberg

Actually, it would appear that Ben has already modified the implementation of replicateM_ to something different from the version I quoted above:

replicateM_ 0 _   = pure ()
replicateM_ n x   = x *> replicateM_ (pred n) x

The implementation I've given here still seems to perform better, thanks to its use of the worker/wrapper transform. It also keeps the GHC 7.10.3 semantics for negative values.
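
To illustrate the negative-count point, here is a hedged side-by-side sketch (replicatePred and replicateGuard are hypothetical names for the two shapes):

-- Recursing towards a 0 base case keeps decrementing past 0 for a
-- negative count instead of stopping immediately.
replicatePred :: Applicative m => Int -> m a -> m ()
replicatePred 0 _ = pure ()
replicatePred n x = x *> replicatePred (pred n) x

-- Guarding on (cnt <= 0) treats negative counts like 0, matching the
-- GHC 7.10.3 behaviour of sequence_ (replicate n x).
replicateGuard :: Applicative m => Int -> m a -> m ()
replicateGuard cnt0 f = loop cnt0
  where
    loop cnt
        | cnt <= 0  = pure ()
        | otherwise = f *> loop (cnt - 1)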

comment:5 Changed 3 years ago by snoyberg

Differential Rev(s): Phab:D2086

comment:6 Changed 3 years ago by bgamari

On Phab:D2086 Simon posed the very reasonable question of why these two differ so significantly in performance.

In my original reading of the patch I had assumed that the reason was that the "fast" version would enable unboxing of the Int accumulator while the "slow" version would not. Looking at the Core, however, this does not appear to be the difference: this accumulator is unboxed in both cases.

The actual difference is that the self-recursive nature of the slow version inhibits inlining, which means we never get to specialise to the action. Instead we end up with,

$w$sreplicateM_noWW
  :: forall a. Int# -> IO a -> State# RealWorld -> (# State# RealWorld, () #)
$w$sreplicateM_noWW =
  \ (@ a) (ww :: Int#) (w :: IO a) (w1 :: State# RealWorld) ->
    case ww of ds {
      __DEFAULT ->
        case (w `cast` ...) w1 of _ { (# ipv, ipv1 #) ->
        $w$sreplicateM_noWW @ a (-# ds 1#) w ipv
        };
      0# -> (# w1, () #)
    }

lvl2 :: Ptr CChar -> State# RealWorld -> (# State# RealWorld, () #)
lvl2 =
  \ (x :: Ptr CChar) (eta :: State# RealWorld) ->
    $w$sreplicateM_noWW
      @ CChar
      100000000#
      (($fStorableInt21 (x `cast` ...)) `cast` ...)
      eta

Which gives us O(n) slow calls.

This stands in stark contrast to the fast variant, which compiles down to this beautiful allocation-free loop,

$wgo12 :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
$wgo12 =
  \ (ww :: Int#) (w :: State# RealWorld) ->
    case tagToEnum# @ Bool (<=# ww 0#) of _ {
      False ->
        case getForeignEncoding1
        of _ { (getForeignEncoding5, setForeignEncoding1) ->
        case (getForeignEncoding5 `cast` ...) w of _ { (# ipv, ipv1 #) ->
        case charIsRepresentable3 @ () ipv1 lvl1 (lvl2 `cast` ...) ipv
        of _ { (# ipv2, ipv3 #) ->
        case seq# @ () @ RealWorld ipv3 ipv2 of _ { (# ipv4, ipv5 #) ->
        $wgo12 (-# ww 1#) ipv4
        } } } };
      True -> (# w, () #)
    }
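
(For anyone wanting to reproduce this comparison themselves, one way to inspect the generated Core, assuming the test program from the description, is:)

ghc -O2 -ddump-simpl -dsuppress-all Main.hs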

comment:7 Changed 3 years ago by snoyberg

Should I update my changes to refer to the above comment by Ben, which gives a pretty thorough account of what's going on here?

comment:8 Changed 3 years ago by simonpj

Great. So it inlines bodily and specialises to the particular call site. That is precisely what I speculated in my Phab comment. So yes, it'd be good to make a Note in the code that says what my comment says and points to this ticket for more detail.

I've learned that it's extremely helpful to leave breadcrumbs like this in the code, so that the next time someone is fiddling with that code, they do so in the light of earlier learning.
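
A hedged sketch of the kind of Note being requested (the title and wording are illustrative, not the text that actually landed):

-- Note [replicateM_ and inlining]  (illustrative sketch)
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- replicateM_ is written as a non-recursive wrapper around a local
-- recursive loop, so the wrapper can inline at each call site and the
-- loop can then be specialised to the concrete action and monad there.
-- A directly self-recursive definition inhibits that inlining and
-- leaves one unknown ("slow") call per iteration.  See #11795.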

comment:9 Changed 3 years ago by bgamari

Status: new → merge

comment:10 Changed 3 years ago by bgamari

Milestone: 8.0.1

comment:11 Changed 3 years ago by Ben Gamari <ben@…>

In c4a7520e/ghc:

Provide an optimized replicateM_ implementation #11795

In my testing, the worker/wrapper transformation applied here
significantly decreases the number of allocations performed when using
replicateM_. Additionally, this version of the function behaves
correctly for negative numbers (namely, it will behave the same as
replicateM_ 0, which is what previous versions of base have done).

Reviewers: bgamari, simonpj, hvr, austin

Reviewed By: bgamari, simonpj, austin

Subscribers: nomeata, simonpj, mpickering, thomie

Differential Revision: https://phabricator.haskell.org/D2086

GHC Trac Issues: #11795

comment:12 Changed 3 years ago by bgamari

Resolution: fixed
Status: merge → closed