Opened 9 years ago

Closed 9 years ago

#4928 closed feature request (fixed)

Add primops for copying/cloning an array in the new codegen

Reported by: tibbe Owned by: ezyang
Priority: normal Milestone: 7.2.1
Component: Runtime System Version: 7.0.2
Keywords: Cc: pumpkingod@…, johan.tibell@…, daniel.is.fischer@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Daniel Peebles's benchmarks show that copying arrays using a indexArray#/writeArray# loop is much slower than using a (new) primop that calls memcpy. In addition, cloning arrays is slower than it need to be due to newArray# filling the array with a default element, just to have all the elements be overwritten by the contents of the array being cloned.

I suggest we add the following primops:

copyArray#        ::        Array# a -> Int# -> MutableArray# s a -> Int# -> Int# -> State# s -> State# s
copyMutableArray# :: MutableArray# a -> Int# -> MutableArray# s a -> Int# -> Int# -> State# s -> State# s

cloneArray#        ::        Array#   a -> Int# -> Int# -> State# s -> (# State# s, Array# a #)
cloneMutableArray# :: MutableArray# s a -> Int# -> Int# -> State# s -> (# State# s, MutableArray# s a #)

freezeArray# :: MutableArray# s a -> Int# -> Int# -> State# s -> (# State# s, Array# a #)
thawArray#   ::        Array#   a -> Int# -> Int# -> State# s -> (# State# s, MutableArray# s a #)

Note that ByteArray# versions can be created using FFI calls to memcpy and thus don't need to be primops.

Attachments (13)

copyclone.dpatch (78.4 KB) - added by pumpkin 9 years ago.
arraycopy.dpatch (85.6 KB) - added by pumpkin 9 years ago.
arraycopy-tests.dpatch (122.6 KB) - added by tibbe 9 years ago.
Tests for primops
arraycopy-complete.dpatch (89.9 KB) - added by tibbe 9 years ago.
Fixed version of arraycopy.dpatch
arraycopy-complete.2.dpatch (93.6 KB) - added by pumpkin 9 years ago.
copyarray-clonearray.dpatch (94.3 KB) - added by pumpkin 9 years ago.
copyarray-clonearray-test.dpatch (124.5 KB) - added by pumpkin 9 years ago.
unitfix.dpatch (96.6 KB) - added by pumpkin 9 years ago.
0001-Add-array-copy-clone-primops.patch (9.7 KB) - added by tibbe 9 years ago.
Squashed and converted to Git
0001-Add-test-for-array-copy-clone-primops.patch (12.0 KB) - added by tibbe 9 years ago.
Squashed and converted to Git
ArrayCopyBench.hs (1.9 KB) - added by tibbe 9 years ago.
0001-Make-assignTemp_-less-pessimistic.patch (1.6 KB) - added by tibbe 9 years ago.
0001-Always-run-cgrun068.patch (3.0 KB) - added by tibbe 9 years ago.

Download all attachments as: .zip

Change History (43)

comment:1 Changed 9 years ago by daniel.is.fischer

Cc: daniel.is.fischer@… added

Changed 9 years ago by pumpkin

Attachment: copyclone.dpatch added

comment:2 Changed 9 years ago by pumpkin

The patch is incomplete. I've written out what I think should be in the copyArray# and copyMutableArray# primops but the clones and freeze/thaw are still empty stubs (so don't call them!). I'm mostly looking for comments about the card marking adjustments, which are new and untested since my original patch.

It would be slightly smarter to set the boundary cards to the conjunction of the source and destination cards at that point, but setting them both to 1 unconditionally was simpler.

Changed 9 years ago by pumpkin

Attachment: arraycopy.dpatch added

comment:3 Changed 9 years ago by pumpkin

The new patch has all the primops listed above, but apart from some trivial testing I did in ghci, it still needs some testing love.

Current issues with the implementation lie mostly in the card copying logic. Since we allow different source and destination offsets to be specified in the primop call, and these offsets may not be aligned mod 128 (or whatever the card width is), a single source card may overlap two destination cards. So the current logic simply optimizes for the current case, and if the two offsets are aligned mod 128, it uses a memcpy/memmove, and otherwise it simply memsets the cards all to 1 (so the GC may do some extra work).

For future work, it's worth seeing how expensive the cmm loop would be that does the "right" thing for copying the subcard-shifted cards would be. Also, for tibbe's common case, we're cloning 32-element arrays, which only have one card. Calling out to memcpy for a single word (it gets rounded up to a multiple of a word) seems silly, but only profiling can tell us when it makes sense to call out to memcpy and when it's faster to just do a (partially unrolled?) memory copy loop in cmm. I expect tibbe's 1-word copy would be a bit faster in cmm, but I don't think any of his loops are tight enough to care about a single function call.

Changed 9 years ago by tibbe

Attachment: arraycopy-tests.dpatch added

Tests for primops

comment:4 Changed 9 years ago by tibbe

I've attached a patch (that depends on arraycopy.dpatch) that adds some basic tests for the primops.

Changed 9 years ago by tibbe

Attachment: arraycopy-complete.dpatch added

Fixed version of arraycopy.dpatch

comment:5 Changed 9 years ago by tibbe

Status: newpatch

All tests now pass. Please merge arraycopy-complete.dpatch to ghc and arraycopy-tests.dpatch to testsuite.

comment:6 Changed 9 years ago by tibbe

Status: patchnew

Reopening as there's some more work needed e.g. marking the arrays as frozen/unfrozen and not copying the cards to newly allocated arrays. Also, some function, like cloneArray# should probably not take a state token.

Changed 9 years ago by pumpkin

Attachment: arraycopy-complete.2.dpatch added

comment:7 Changed 9 years ago by pumpkin

Status: newpatch

This one should fix the issues tibbe mentioned, except I'm still not sure about the State# parameters so I left the types as they were.

Changed 9 years ago by pumpkin

Attachment: copyarray-clonearray.dpatch added

Changed 9 years ago by pumpkin

comment:8 Changed 9 years ago by pumpkin

These last two patches change the signature of cloneArray# to not pass around an unnecessary State# token, and update the associated test to work again with the new signature.

comment:9 Changed 9 years ago by tibbe

This should now be ready to merge. Please merge copyarray-clonearray.dpatch (ghc) and copyarray-clonearray-test.dpatch (testsuite).

comment:10 Changed 9 years ago by simonmar

Owner: set to simonmar

comment:11 Changed 9 years ago by simonmar

On the whole, the patch looks good (especially if you squash it into one patch, which I'll do before pushing unless you do). I think there might still be some problems in the implementation though:

+    I8[dst_cards_start + mutArrPtrsCardWords(n)] = I8[dst_cards_start + mutArrPtrsCardWords(dst_start + n)] | I8[src_cards_start + mutArrPtrsCardWords(src_start + n)]; \

this can't be right, because all offsets in Cmm are in bytes, and you're adding a word index (mutArrPtrsCardWords(n)).

comment:12 Changed 9 years ago by pumpkin

Good point, I'm uploading a patch that fixes that.

Changed 9 years ago by pumpkin

Attachment: unitfix.dpatch added

comment:13 Changed 9 years ago by igloo

Milestone: 7.2.1

comment:14 Changed 9 years ago by tibbe

Can this patch be merged into the 7.2.1 branch now? Are we waiting for the Git migration?

comment:15 Changed 9 years ago by tibbe

Since we've now switched to Git, should pumpkin change this into a Git patch before you can merge it Ian?

comment:16 Changed 9 years ago by igloo

The ticket's assigned to simonmar, so I wasn't looking at it. Should I be taking a look?

Changed 9 years ago by tibbe

Squashed and converted to Git

Changed 9 years ago by tibbe

Squashed and converted to Git

comment:17 Changed 9 years ago by tibbe

I've squashed and converted both patches to Git. Please review 0001-Add-array-copy-clone-primops.patch and 0001-Add-test-for-array-copy-clone-primops.patch.

comment:18 Changed 9 years ago by tibbe

To merge a patch run git am -3 <patch file>.

Changed 9 years ago by tibbe

Attachment: ArrayCopyBench.hs added

comment:19 Changed 9 years ago by tibbe

I've attached the beginnings of a benchmark harness. For now it only exercises copyArray# and thawArray#. Usage:

$ ./inplace/bin/ghc-stage2 -O2 ArrayCopyBench.hs -o bench
$ ./bench 
Usage: bench BENCHMARK ARRAYSIZE ITERS
$ ./bench thawArray# 32 1000000
thawArray#: 0.040091s

Using this benchmark and Linux's perf events we now have a better idea of where the time is spent:

$ perf record -f ./bench thawArray# 32 10000000
thawArray#: 0.408859s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.095 MB perf.data (~4160 samples) ]
$ perf report
# Samples: 4100
#
# Overhead          Command        Shared Object  Symbol
# ........  ...............  ...................  ......
#
    37.05%            bench  /lib/libc-2.11.1.so  [.] memcpy
    36.78%            bench  ./bench              [.] stg_thawArrayzh
     6.56%            bench  /lib/libc-2.11.1.so  [.] __GI_memset
     5.27%            bench  ./bench              [.] allocate
     4.34%            bench  ./bench              [.] s224_info
     2.02%            bench  ./bench              [.] s24C_info
     1.17%            bench  ./bench              [.] memcpy@plt
     1.05%            bench  ./bench              [.] memset@plt
     1.02%            bench  ./bench              [.] clearNurseries
     0.27%            bench  [kernel]             [k] sigprocmask

comment:20 Changed 9 years ago by tibbe

copyArray# spends more time in the C-- code so perhaps there's room for optimization there:

$ perf record -f ./bench copyArray# 32 10000000
copyArray#: 0.354839s
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.083 MB perf.data (~3624 samples) ]
$ perf report
# Samples: 3564
#
# Overhead          Command        Shared Object  Symbol
# ........  ...............  ...................  ......
#
    58.89%            bench  ./bench              [.] stg_copyArrayzh
    34.37%            bench  /lib/libc-2.11.1.so  [.] memcpy
     3.11%            bench  ./bench              [.] s233_info
     1.71%            bench  ./bench              [.] s235_info
     1.43%            bench  ./bench              [.] memcpy@plt
     0.03%  perf_2.6.32-gg4  [kernel]             [k] audit_syscall_exit
     0.03%  perf_2.6.32-gg4  [kernel]             [k] format_decode
     0.03%  perf_2.6.32-gg4  [kernel]             [k] memcpy_c
     0.03%            bench  [kernel]             [k] up_read
     0.03%            bench  [kernel]             [k] filemap_fault

comment:21 Changed 9 years ago by tibbe

stg_copyArrayzh and stg_copyMutableArrayzh both call MAYBE_GC but neither of them allocates. Is this correct?

comment:22 Changed 9 years ago by simonmar

Milestone: 7.4.17.2.1
Version: 7.0.17.0.2

I just pushed tibbe's patches. There are a few things still to do:

  • add the primops to the new code generator (ezyang is going to look into this)
  • enable the cgrun068 test (tibbe)
  • optimise assignTemp_ a bit (tibbe)

Changed 9 years ago by tibbe

comment:23 Changed 9 years ago by tibbe

I've attached two patches that address my two ToDos. Please commit.

comment:24 Changed 9 years ago by tibbe

I've validate and pushed the two patches. The only remaining ToDo is to implement the primop in the new codegen.

comment:25 Changed 9 years ago by tibbe

Owner: changed from simonmar to ezyang

Assigning this to ezyang as I'm done with my ToDos. Feel free to unassign if you don't have time to work on this.

comment:26 Changed 9 years ago by tibbe

Owner: ezyang deleted
Status: patchnew

comment:27 Changed 9 years ago by tibbe

Owner: set to ezyang

Convincing Trac that the status is no longer "patch".

comment:28 Changed 9 years ago by tibbe

Summary: Add primops for copying/cloning an arrayAdd primops for copying/cloning an array in the new codegen

comment:29 Changed 9 years ago by tibbe

I believe this is done in 5b1053897fa16ced293e749447e9c027d15d29f5 and 8c2ddecb4d58bc22b3bed220c514ec54091ffcbc. Can we close the ticket?

comment:30 Changed 9 years ago by simonmar

Resolution: fixed
Status: newclosed

closing, thanks.

Note: See TracTickets for help on using tickets.