Opened 6 years ago

Closed 5 years ago

#8345 closed bug (fixed)

A more efficient atomicModifyIORef'

Reported by: parcs Owned by:
Priority: normal Milestone: 7.10.1
Component: Core Libraries Version: 7.6.1
Keywords: IORef Cc: joeyadams, hvr, ekmett, dfeuer, core-libraries-committee@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #5926 Differential Rev(s):
Wiki Page:

Description

atomicModifyIORef' is currently defined as:

atomicModifyIORef' ref f = do
    b <- atomicModifyIORef ref
            (\x -> let (a, b) = f x
                    in (a, a `seq` b))
    b `seq` return b

This doesn't seem to be the best we can do, though.

The following implementation exhibits a speedup of 1.7x (vanilla RTS) / 1.4x (threaded RTS) over the current implementation:

atomicModifyIORef' ref f = do
    b <- atomicModifyIORef ref $ \a ->
            case f a of
                v@(a',_) -> a' `seq` v
    b `seq` return b

The differences between this version and the current version are:

  1. the lambda is now strict in f a, and
  2. a new result tuple is no longer being allocated.

Is there a good reason why atomicModifyIORef' is not already defined this way?

Change History (19)

comment:1 Changed 6 years ago by simonmar

Cc: hvr added

It looks plausible, but before committing someone should check that it doesn't regress any of the issues that the original version was designed to avoid, see #5926. Ideally we should have tests for these.

comment:2 Changed 6 years ago by hvr

difficulty: UnknownModerate (less than a day)
Keywords: IORef added
Milestone: 7.10.1
Type: taskbug
Type of failure: None/UnknownRuntime performance bug

comment:3 Changed 6 years ago by carter

huh, i'll try to find some cycles to think about the semantics of this

comment:4 Changed 5 years ago by hvr

Cc: ekmett added

Just a thought, what about:

atomicModifyIORef' ref f =
    evaluate =<< atomicModifyIORef ref (\a -> case f a of v@(!a',_) -> v)

comment:5 Changed 5 years ago by dfeuer

Cc: dfeuer added

comment:6 Changed 5 years ago by ekmett

This looks sensible to me based on a quick sanity check and smoke test, but I don't have a good sense of how to test it for correctness robustly.

comment:7 in reply to:  6 Changed 5 years ago by dfeuer

The big difference I see between the original and hvr's version is that the original installs a thunk in the IORef and then immediately forces it, whereas hvr's version forces it, and then installs it in the IORef. I don't think this difference can be observed, but I'm no expert. I think parcs's modification is just a milder step on the way to hvr's, which looks likely to be the simplest/best, but that would take benchmarking.

comment:8 Changed 5 years ago by dfeuer

Actually, one possible complication (maybe): the current version guarantees allocation (and therefore an opportunity to switch threads). I don't think the proposed one does this. I also don't know if that's a problem.

comment:9 Changed 5 years ago by carter

a version that doesn't (in general) trigger the heap check would probably be better, at least in some cases I can probably manufacture.

comment:10 Changed 5 years ago by carter

Reason being that the definition of atomicModify does a CAS retry loop internally. (though maybe yielding to the scheduler could be a good thing! )

comment:11 in reply to:  10 Changed 5 years ago by dfeuer

Replying to carter:

Reason being that the definition of atomicModify does a CAS retry loop internally. (though maybe yielding to the scheduler could be a good thing! )

If it's CAS, then yeah, it's probably reasonably safe to avoid the heap check, since CAS can't be interrupted by anything else on the same OS thread. The only exception is if a program (knowingly or unknowingly) uses atomicModifyIORef' to allow a thread switch. But I want to emphasize that I don't know enough about these inner workings to know if this change could actually cause this sort of problem. I just raised the issue so better-informed people like you can come up with a good answer.

comment:12 Changed 5 years ago by dfeuer

It looks like my fears were unjustified. If I run this:

silly = \current -> (True, True)

{-# INLINE flipper #-}
stupid = newIORef False >>= \r ->
  let
    flop = atomicModifyIORef' r silly
  in forever flop

main = do
         flippy <- forkIO stupid
         mapM_ print [x | x <- [1..], x `rem` 10000000 == 0]

with just one thread, and try either of the above implementations, it prints things just fine. I don't see any obvious signs of allocation anywhere in the stupid thread, so I would speculate that maybe atomicModifyMutVar# always gives the scheduler a go. This looks like a real good catch by parcs.

comment:13 Changed 5 years ago by carter

ok, my remarks about the heapcheck stuff were wrong and such. please ignore them :)

comment:14 Changed 5 years ago by simonmar

@hvr's version is basically just a more concise way to write @parcs' proposed version. I'm happy with either. Someone want to put up a diff?

comment:15 in reply to:  14 Changed 5 years ago by dfeuer

Replying to simonmar:

@hvr's version is basically just a more concise way to write @parcs' proposed version. I'm happy with either. Someone want to put up a diff?

I'll be happy to do that. I realized today that if I replace evaluate with its specification, I do indeed get the same Core for parcs's and hvr's versions, which of course (conditionally) proves that what you say is true. But using evaluate itself gives somewhat different core, because it uses seq# to order things properly instead of case. I haven't inspected the Cmm or asm output, so I don't know if this makes any difference.

comment:16 Changed 5 years ago by dfeuer

Status: newpatch
Last edited 5 years ago by dfeuer (previous) (diff)

comment:17 Changed 5 years ago by thoughtpolice

Component: libraries/baseCore Libraries

Moving over to new owning component 'Core Libraries'.

comment:18 Changed 5 years ago by Herbert Valerio Riedel <hvr@…>

In 9e2cb00e5af9d86546f82a74c3d0382e65704d56/ghc:

Optimise atomicModifyIORef' implementation (#8345)

This forces the new value before installing it in the IORef.

This optimisation was originally suggested by Patrick Palka
and "exhibits a speedup of 1.7x (vanilla RTS) / 1.4x (threaded RTS)"
according to #8345

Reviewed By: austin, simonmar

Differential Revision: https://phabricator.haskell.org/D315

comment:19 Changed 5 years ago by hvr

Cc: core-libraries-committee@… added
Resolution: fixed
Status: patchclosed
Version: 7.6.37.6.1
Note: See TracTickets for help on using tickets.