Opened 6 years ago

Closed 4 years ago

Last modified 4 years ago

#8793 closed task (fixed)

Improve GHC.Event.IntTable performance

Reported by: cdk Owned by:
Priority: normal Milestone: 8.0.1
Component: Core Libraries Version: 7.6.3
Keywords: Cc: bos, hvr, ekmett, jstolarek
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D1742
Wiki Page:

Description

The performance of GHC.Event.IntTable can be improved. I've managed to get some nice increases across the board. Benchmarking using criterion shows:

function, % faster than current impl. insert: 4% lookup: 26% update: 11% delete: 5%

There is one strange thing I noted. In updateWith, there is an inner loop that looks like this:

data Bucket a = Empty | Bucket Int a (Bucket a)

go _ Empty = (False, Nothing, Empty)
go cont (Bucket key val next)
    | key == k = case f val of
        Nothing -> (True, Just val, cont next)
        Just v  -> (False, Just val, cont (Bucket key v next))
    | otherwise = go (\x -> cont (Bucket key val x)) next

which returns a tuple that is immediately consumed like so:

(delete_occurred, old_val, new_bkt) <- go id ...
when (isJust old_val) $ do
    <updateIntTable>
    when delete_occurred <decIntTableSize>
return old_val

I expected that inlining the <updateIntTable> and <decIntTableSize> code blocks directly into go would result in better code than creating a tuple and then pattern matching on it afterwards. ie.

go _ Empty = return Nothing
go cont (Bucket key val next)
    | key == k = do
        case f val of
            Nothing -> <updateIntTable> (cont next) >> <decIntTableSize>
            Just v  -> <updateIntTable> (cont (Bucket key v next)
        return (Just val)
    | otherwise = go (\x -> cont (Bucket key val x)) next

which has the exact same semantics. To my suprise, this code is almost 2x slower! The core generated in both cases is exactly what I'd expect; if anything, the second version seems tighter. I'm not sure why the first version is faster, but perhaps the original author, Bryan O'Sullivan, can shed some light as he used the tupled method in the first version.

I'll attach my patch, criterion's html output for the benchmarks as well as the benchmarking code, and the core for the oddity I discussed above.

Attachments (10)

IntTable.diff (8.4 KB) - added by cdk 6 years ago.
bench.hs (1.4 KB) - added by cdk 6 years ago.
benchmark code
report.html (223.5 KB) - added by cdk 6 years ago.
criterion html output
no-tuple.dump-simpl (2.0 KB) - added by cdk 6 years ago.
slow core - no tuple (2nd method)
tuple.dump-simpl (1.8 KB) - added by cdk 6 years ago.
fast core - tuple (1st method)
IOLoop.hs (327 bytes) - added by jscholl 4 years ago.
Example loop in IO
IOLoop.simpl (867 bytes) - added by jscholl 4 years ago.
Simplifier output (no suprises here)
PureLoop.cmm (1.2 KB) - added by jscholl 4 years ago.
Generated cmm for pure version
IOLoop.cmm (1.4 KB) - added by jscholl 4 years ago.
Generated cmm for IO version
bench.tar.gz (3.8 KB) - added by jscholl 4 years ago.
Benchmark with original implementation, my changes and cdks changes

Download all attachments as: .zip

Change History (31)

Changed 6 years ago by cdk

Attachment: IntTable.diff added

Changed 6 years ago by cdk

Attachment: bench.hs added

benchmark code

Changed 6 years ago by cdk

Attachment: report.html added

criterion html output

Changed 6 years ago by cdk

Attachment: no-tuple.dump-simpl added

slow core - no tuple (2nd method)

Changed 6 years ago by cdk

Attachment: tuple.dump-simpl added

fast core - tuple (1st method)

comment:1 Changed 6 years ago by cdk

Status: newpatch

comment:2 Changed 6 years ago by hvr

Cc: hvr added
Milestone: 7.8.17.10.1
Type of failure: None/UnknownRuntime performance bug

comment:3 Changed 6 years ago by tibbe

cdk, thanks for the patch. It would be much easier to review the patch if it didn't also reformat all the code.

Please try to respect the original author's style. We don't want "edit wars" where people keep reformatting code back and forth to fit their style.

comment:4 Changed 6 years ago by bos

Hi, cdk, I'll echo Johan's feedback: your patch is not reviewable. I have read it for several minutes, and I cannot tell what or where your real changes are, as opposed to all of the gratuitous reformatting you did. (As Johan mentions, the reformatting is discourteous, as well as obscuring the substance of the change.)

If you would like this change to go in, please resubmit it as one or two minimal patches that do not reformat code except as necessary for semantic purposes, and that make it easy for a reviewer to tell what is going on.

Thanks.

comment:5 Changed 5 years ago by thoughtpolice

Cc: ekmett added
Status: patchinfoneeded

comment:6 Changed 5 years ago by thoughtpolice

Component: libraries/baseCore Libraries

Moving over to new owning component 'Core Libraries'.

comment:7 Changed 5 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:8 Changed 4 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:9 Changed 4 years ago by jscholl

I tried to collect the relevant changes from the patch, but in most cases could not reproduce the speedups cdk reported. Only for the lookup function the runtime changed measurable (it improved even more, around 78% faster than the current implementation). Pulling the ForeignPtr out of the IORef could make sense, but I could not measure an impact on performance.

A thing I find surprising is the first argument in updateWith.go, which seems to be totally unused. Is this the correct behavior? Anyway, I created a second patch which removes the unused argument (and changes the first element of the return value to a Bool, as it is just tested for being Nothing), if it is just a leftover from an earlier version of the function. The change has no impact on performance, though.

comment:10 Changed 4 years ago by jscholl

Status: infoneededpatch

Okay, I just get IndexError: pop from empty list if I try to attach a file, so I put it here...

Improving lookup:

  • GHC/Event/IntTable.hs

    a b  
    4545lookup :: Int -> IntTable a -> IO (Maybe a)
    4646lookup k (IntTable ref) = do
    4747  let go Bucket{..}
    48         | bucketKey == k = return (Just bucketValue)
     48        | bucketKey == k = Just bucketValue
    4949        | otherwise      = go bucketNext
    50       go _ = return Nothing
     50      go _ = Nothing
    5151  it@IT{..} <- readIORef ref
    52   go =<< Arr.read tabArr (indexOf k it)
     52  bkt <- Arr.read tabArr (indexOf k it)
     53  return (go bkt)
    5354
    5455new :: Int -> IO (IntTable a)
    5556new capacity = IntTable `liftM` (newIORef =<< new_ capacity)

Cleaning up updateWith:

  • GHC/Event/IntTable.hs

    a b  
    1313
    1414import Data.Bits ((.&.), shiftL, shiftR)
    1515import Data.IORef (IORef, newIORef, readIORef, writeIORef)
    16 import Data.Maybe (Maybe(..), isJust, isNothing)
     16import Data.Maybe (Maybe(..), isJust)
    1717import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtr, withForeignPtr)
    1818import Foreign.Storable (peek, poke)
    1919import GHC.Base (Monad(..), (=<<), ($), const, liftM, otherwise, when)
     
    123123updateWith f k (IntTable ref) = do
    124124  it@IT{..} <- readIORef ref
    125125  let idx = indexOf k it
    126       go changed bkt@Bucket{..}
    127         | bucketKey == k =
    128             let fbv = f bucketValue
    129                 !nb = case fbv of
    130                         Just val -> bkt { bucketValue = val }
    131                         Nothing  -> bucketNext
    132             in (fbv, Just bucketValue, nb)
    133         | otherwise = case go changed bucketNext of
     126      go bkt@Bucket{..}
     127        | bucketKey == k = case f bucketValue of
     128            Just val -> let !nb = bkt { bucketValue = val } in (False, Just bucketValue, nb)
     129            Nothing  -> (True, Just bucketValue, bucketNext)
     130        | otherwise = case go bucketNext of
    134131                        (fbv, ov, nb) -> (fbv, ov, bkt { bucketNext = nb })
    135       go _ e = (Nothing, Nothing, e)
    136   (fbv, oldVal, newBucket) <- go False `liftM` Arr.read tabArr idx
     132      go e = (True, Nothing, e)
     133  (del, oldVal, newBucket) <- go `liftM` Arr.read tabArr idx
    137134  when (isJust oldVal) $ do
    138135    Arr.write tabArr idx newBucket
    139     when (isNothing fbv) $
     136    when del $
    140137      withForeignPtr tabSize $ \ptr -> do
    141138        size <- peek ptr
    142139        poke ptr (size - 1)
Last edited 4 years ago by bgamari (previous) (diff)

comment:11 Changed 4 years ago by thomie

Thanks for picking this up. Could you maybe try submitting your patches to Phabricator, so the build bot can validate them.

A thing I find surprising is the first argument in updateWith.go, which seems to be totally unused. Is this the correct behavior?

Maybe you could tell us? This code was added in 28cf2e004da0fc809ce9efff0802b125b3501e91 (#8158).

comment:12 Changed 4 years ago by simonpj

  • Do we have any (reproducible) evidence showing that this change makes things go faster? Perhaps a benchmark test of the relevant functions themselves? That would be helpful. Once we have that evidence, we can go ahead and commit.
  • Secondly, are the changes something that a reasonable person might expect the optimiser should be able to do by itself? If so, why doesn't it? Let's not close the ticket until we know about this.

comment:13 Changed 4 years ago by jscholl

I did take another look at the lookup function and compared the core and the generated cmm in both cases. There was nothing special to see in the core as both versions compile to simple loops without any unexpected things like a dictionary floating around. The IO version only packs the result into an unboxed tuple with the state token while the pure version lacks this (of course).

The cmm of both versions sheds more light on the reason for the speedup: GHC compiles the pure version to a nice loop which does not jump back to the beginning of the function, but behind the stack check (the stack is needed to evaluate unevaluated buckets), while the IO version just calls itself recursively (i.e. jumps before the stack check). Otherwise they seemed pretty identical as far as I could tell.

And sadly my measurement of the speedup was wrong as I only took the original benchmark from cdk and modified it as necessary. The original version consumes all my memory as criterion runs benchmarks multiple times and mutable data structures are somewhat sensitive to such a behavior, especially if the insert operation extends already existing elements. So the first thing I did was replacing the lists with sets, which are less sensitive if an existing element is inserted a second time. Another important thing was the order in which benchmarks are run: If we first insert in all competing implementations, the second implementation suffers from a heap with more live data, meaning longer GC delays. So I changed the order in such a way that first one implementation is run, then another one while the first one is already no longer reachable and thus will be GCed. A third thing I noticed (but only after looking at the core) was that my change actually increases the laziness of lookup. The original implementation would determine whether the result is a Just or Nothing while my implementation returns a thunk. I though I was safe by using nfIO in the benchmark to evaluate my result, but actually I did not read that code carefully and first missed the actual place the result is evaluated - or not. Fixing this yields a more reasonable speedup of something around 10-15% (which also fits the observed differences in the cmm much better).

I can (try to) attach the modified benchmark, if this helps.

I did also take another look at the unused argument in updateWith.go and am now quite sure that it is something leftover from the development of the function which has no actual function anymore and can safely be removed. I guess it would have never been in the original commit, but it seems hard or impossible to correctly identity such unused arguments (I mean, it is used, but only in a function which does not use it...). Especially, if such an argument would be used to pass a type to a recursive function, which just passes it on and uses ScopedTypeVariables to access the type, so it never touches the argument directly...

And yes, I can hopefully try to submit the changes to Phabricator tomorrow.

comment:14 in reply to:  13 ; Changed 4 years ago by simonpj

Thank you for taking the time to get more insight. Insight is what we need to take sensible action!

The cmm of both versions sheds more light on the reason for the speedup: GHC compiles the pure version to a nice loop which does not jump back to the beginning of the function, but behind the stack check (the stack is needed to evaluate unevaluated buckets), while the IO version just calls itself recursively (i.e. jumps before the stack check).

Interesting, though I don't yet understand the details. Could you boil out a standalone example that demonstrates just this single issue? I.e. two versions of a function, one of which repeats the stack check and one of which doesn't, and show the code side by side?

but it seems hard or impossible to correctly identity such unused arguments (I mean, it is used, but only in a function which does not use it...).

Well GHC's strictness analyser should find exactly this case. I'm puzzled why it does not. Again, could you spare a moment to make a standalone reproducer for just this issue? Or at least a smallish function I can compile in isolation to see this argument not disappearing.

Thanks

Simon

Changed 4 years ago by jscholl

Attachment: IOLoop.hs added

Example loop in IO

Changed 4 years ago by jscholl

Attachment: IOLoop.simpl added

Simplifier output (no suprises here)

Changed 4 years ago by jscholl

Attachment: PureLoop.cmm added

Generated cmm for pure version

Changed 4 years ago by jscholl

Attachment: IOLoop.cmm added

Generated cmm for IO version

comment:15 in reply to:  14 Changed 4 years ago by jscholl

Interesting, though I don't yet understand the details. Could you boil out a standalone example that demonstrates just this single issue? I.e. two versions of a function, one of which repeats the stack check and one of which doesn't, and show the code side by side?

If I understand cmm correctly, we jump to the start of $wa in line 34, so the IO version repeats the stack check. The corresponding instruction in the pure version is in line 33, here we jump to cCT, so behind our stack check. This should also be possible in the IO version as the function only uses constant stack space and if it was available once, it should stay available until we deallocate it, right?

but it seems hard or impossible to correctly identity such unused arguments (I mean, it is used, but only in a function which does not use it...).

Well GHC's strictness analyser should find exactly this case. I'm puzzled why it does not. Again, could you spare a moment to make a standalone reproducer for just this issue? Or at least a smallish function I can compile in isolation to see this argument not disappearing.

No, I did not mean the optimizer in this case. The unused argument is optimized out, I was just thinking whether GHC could warn if an argument is never used (as it does with unused variables), as this sometimes indicated unfinished code which should either be removed (or documented as such) or finished. But I think this comes with too many cases where you want an unused argument, either to pass in a type via proxy or to satisfy another functions expectations. And if a function is recursive, unused arguments have to be passed on.

Changed 4 years ago by jscholl

Attachment: bench.tar.gz added

Benchmark with original implementation, my changes and cdks changes

comment:16 Changed 4 years ago by jscholl

I attached the modified benchmark from cdk which compares the current implementation found in base with cdks and my changes.

Note that the loopV_ function does not evaluate the returned results, so if someone wants to use that and makes insertWith somehow return its result lazy instead of in WHNF, these will be thrown away.

comment:17 Changed 4 years ago by simonpj

Cc: jstolarek added

OK I've had a look.

  • I have not looked at, or tried to reproduce, the benchmark. Probably we can just trust you. Let's just commit your improved code to the library. Ben?
  • That leaves the question of pure vs IO loop, which I think you are saying is the cause of the performance difference. Have you benchmarked that separately? (Your IOLoop.hs program, that is.)
  • It is indeed very odd that the "IO loop" does not generate as good code as the "pure loop". I believe that for the pure loop we are getting a C-- optimisation that turns a tail call into a jump to a label, so called "loopification". But this isn't happening for IO loop.

I've collected info about loopification here: Commentary/Compiler/Loopification. I'm copying Jan Stolarek who was the last person to work on this. It'd be great to return to it.

It'd be good to pull out this particular loopification issue into a new ticket, if you could.

comment:18 Changed 4 years ago by jscholl

Differential Rev(s): Phab:D1742

I opened #11372 for the loopification issue. And added the differential to this ticket (which I think I should have done earlier, but somehow I thought this would be done automatically if I mention this ticket on phabricator).

comment:19 Changed 4 years ago by Ben Gamari <ben@…>

In 1abb7005/ghc:

Improve GHC.Event.IntTable performance

Speed up GHC.Event.IntTable.lookup by removing the IO context from the
go helper function. This generates a little bit better code as we can
avoid repeating the stack check.

Remove unused parameter from GHC.Event.IntTable.updateWith.go and
directly return a bool instead of a maybe and then checking that whether
it is a Nothing.

Test Plan: validate

Reviewers: austin, hvr, bgamari

Reviewed By: bgamari

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D1742

GHC Trac Issues: #8793

comment:20 Changed 4 years ago by bgamari

Resolution: fixed
Status: patchclosed

comment:21 Changed 4 years ago by bgamari

This has been merged into both master and ghc-8.0.

Note: See TracTickets for help on using tickets.