Opened 8 years ago

Closed 8 years ago

#5536 closed bug (fixed)

IO performance regression in 7.2 vs 7.0

Reported by: simonmar
Owned by: igloo
Priority: high
Milestone: 7.4.1
Component: libraries/base
Version: 7.2.1
Keywords:
Cc: batterseapower
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case: T5536
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

The following program:

main = interact id

runs about a third slower in 7.2.1 compared to 7.0.3 (and current 7.3 is about the same as 7.2.1).

With 7.2.1:

$ ls -l stuff                             
-rw-rw-r-- 1 simonmar GHC 128464904 2011-10-07 14:59 stuff
$ ./cat <stuff >/dev/null +RTS -s      
   8,357,007,912 bytes allocated in the heap
     793,616,736 bytes copied during GC
          83,984 bytes maximum residency (1 sample(s))
          24,936 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     15966 colls,     0 par    0.94s    0.94s     0.0001s    0.0001s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0004s    0.0004s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    5.71s  (  5.71s elapsed)
  GC      time    0.94s  (  0.94s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    6.65s  (  6.65s elapsed)

and with 7.0.3:

$ ./cat <stuff >/dev/null +RTS -s      
./cat +RTS -s 
   5,299,570,680 bytes allocated in the heap
     250,350,336 bytes copied during GC
          53,232 bytes maximum residency (1 sample(s))
          30,776 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 10086 collections,     0 parallel,  0.50s,  0.50s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.38s  (  4.38s elapsed)
  GC    time    0.50s  (  0.50s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.88s  (  4.88s elapsed)

I strongly suspect this has to do with the changes to the encoding machinery in 7.2.1.

In binary mode, the slowdown is even worse. The program:

import System.IO
main = do
  hSetBinaryMode stdin True
  hSetBinaryMode stdout True
  getContents >>= putStr

With 7.0.3 this runs in 3.35s, with 7.2.1 it takes 5.49s, a slowdown of 63%.

After fixing this, we need to add a performance regression test.

Change History (13)

comment:1 Changed 8 years ago by igloo

Owner: set to batterseapower

Hmm, I don't know if more work needs to be done in text mode now, but I'd hope that binary mode at least would be just as fast.

Max, do you know what's going on, or do you think you'll have a chance to look into it in the next month or so? If not, just let us know, and Simon or I will see if we can work out what's happening.

comment:2 Changed 8 years ago by simonmar

After this commit:

commit ea4bb083c6222e7c774fdb554f24f17a53ae350d

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Wed Jul 27 08:39:14 2011 +0100

    Eliminate a thunk in the inner loop of 'unpack'
    
    This makes quite a difference to allocation in some
    programs.

The performance of cat improved to 5.81s (19% slower than 7.0), and the binary cat improved to 4.24s (27% slower than 7.0). This is a lot better, but I hope there are more improvements to be had. This is still a pretty big penalty.

comment:3 Changed 8 years ago by batterseapower

It could definitely be to do with the change to writeBlocks/unpack: in writeBlocks (invoked from putStr) we have to do an extra test on each character to see if we have to map it to a lone surrogate in the CharBuf. Likewise, in unpack (invoked from getContents) we have to test each element of the CharBuf to see if it is a lone surrogate we should map to our private-use area.

This is a consequence of the decision to use private-use characters for encoding roundtrip characters, as opposed to using lone surrogates directly (like Python does).

One possible avenue for improvement is that in this invocation of unpack we know that the CharBuf can't contain any lone surrogates (since it came from a binary-mode handle), so we could avoid the test. Actually implementing this might be a bit tricky, though.
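The two conversions described above can be sketched like this (hypothetical names and simplified logic; the actual definitions are surrogatifyRoundtripCharacter and desurrogatifyRoundtripCharacter in GHC.IO.Encoding.Failure, and the exact code-point ranges are an assumption here):

```haskell
-- Sketch only: escaped bytes sit in the CharBuf as lone surrogates
-- U+DC80..U+DCFF, but are exposed to user code as private-use
-- characters U+EF80..U+EFFF (assumed ranges).

-- putStr direction: map a private-use character to a lone surrogate.
surrogatify :: Char -> Char
surrogatify c
  | '\xEF80' <= c && c <= '\xEFFF' = toEnum (fromEnum c - 0xEF80 + 0xDC80)
  | otherwise                      = c

-- getContents direction: map a lone surrogate to a private-use character.
desurrogatify :: Char -> Char
desurrogatify c
  | '\xDC80' <= c && c <= '\xDCFF' = toEnum (fromEnum c - 0xDC80 + 0xEF80)
  | otherwise                      = c
```

The per-character range check in each function is exactly the extra test that both hot loops now have to pay for.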

I did eyeball the unpack loop and noticed that surrogatifyRoundtripCharacter was returning a boxed character. This is pretty much a disaster, so I added INLINE pragmas to all of the small combinators in Failure.hs, which moved the runtime of a binary "cat" from 2.2s to 2.0s on my machine (7.0.4 clocks in at 1.5s). Is this the change Simon made in ea4bb083c6222e7c774fdb554f24f17a53ae350d? I can't actually see that commit in my repo, for some reason.

comment:4 Changed 8 years ago by batterseapower

Other optimisations:

  • Reassociate the arithmetic in (de)surrogatifyRoundtripCharacter so that GHC can constant-fold a subtraction. No change to runtime, but the Core looks better.
  • Ensure that unpack returns a list with evaluated elements (thanks to pointer tagging, this should prevent ping-ponging between the list consumer's code and the code of desurrogatifyRoundtripCharacter). Reduces binary cat to 1.7s.
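Both tweaks can be illustrated on a simplified, hypothetical stand-in for desurrogatifyRoundtripCharacter (the names, code-point ranges, and list-based loop are illustrative only; the real unpack works over a mutable CharBuf):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Before: the variable sits between the two literals, so GHC cannot
-- fold anything at compile time.
desurrogatify0 :: Char -> Char
desurrogatify0 c
  | '\xDC80' <= c && c <= '\xDCFF' = toEnum ((fromEnum c - 0xDC80) + 0xEF80)
  | otherwise                      = c

-- After: reassociating exposes the constant (0xEF80 - 0xDC80) for folding.
desurrogatify1 :: Char -> Char
desurrogatify1 c
  | '\xDC80' <= c && c <= '\xDCFF' = toEnum (fromEnum c + (0xEF80 - 0xDC80))
  | otherwise                      = c

-- Forcing each element before consing means every cell of the result
-- holds a tagged pointer to an evaluated Char, so the list consumer
-- never has to enter a thunk for the head.
unpackStrict :: [Char] -> [Char]
unpackStrict []     = []
unpackStrict (c:cs) = let !c' = desurrogatify1 c in c' : unpackStrict cs
```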

comment:5 Changed 8 years ago by batterseapower

I've pushed those changes as 60907a3. The fact that it runs in ~1.75s vs 1.45s still indicates a ~20% perf penalty, though.

comment:6 Changed 8 years ago by batterseapower

Removing the extra tests gets us back to 7.0.4 performance (in fact, a bit better), so this is indeed the source of the problem. I can't see any further easy improvements. Perhaps we could avoid doing the surrogatify/desurrogatify steps in writeBlocks/unpack when the corresponding Handle is in binary mode. Then, when switching between binary and non-binary mode, we would need to apply the appropriate transformation to fix up any buffered characters.

comment:7 Changed 8 years ago by simonmar

I managed to fail to commit that change somehow, sorry about that. FWIW it was just a seq in the unpack loop, no INLINEs. Your version has improved performance a little for me; the perf drop is now 17% on both the text and binary versions. (64-bit Linux)

So maybe we just have to eat this. It seems like a high price to pay for something that is almost never needed, though.

comment:8 Changed 8 years ago by batterseapower

Yes it is a bit annoying. I do wonder, though, why two simple arithmetic comparisons are penalising the loops so much, especially considering that those loops already do a number of *other* tests.

I wonder if the fact that with this test the Core looks something like:

case (case x of A -> C# ...; B -> C# ...; C -> C# ...) of c {
  loop (c:cs) x
}

is causing a problem. Even though we can't improve the Core by doing a case-of-case transformation, doing so would avoid some stack manipulation at the Cmm level.

Perhaps the right thing to do is look at the generated ASM and see what is so different in the two versions.

comment:9 Changed 8 years ago by simonpj

That would be a great idea, thanks. The code from a case-of-case should be pretty good; if it isn't, we should fix something.

comment:10 Changed 8 years ago by simonmar

Oh, I hadn't realised the Core looked like that. We can do much better!

The code generated for this:

case (case x <# y of True -> C# ...; False -> C# ...) of c {
  loop (c:cs) x
}

makes a case continuation for the outer case. Each branch of the inner case builds a C#, and then jumps directly to the continuation. That's not bad, but we don't need a continuation at all. We'd much rather generate

case (case x <# y of True -> e1; False -> e2) of z {
  loop (C# z : cs) x

So this would all be within a single function, no continuations.

But I don't see how to get that without writing it explicitly - that is, making desurrogatifyRoundtripCharacter work with Char#. GHC would have to notice that all the branches of the inner case return an explicit C# constructor. Simon, do we have anything that does that?

I did manage to get GHC to do the case-of-case transformation though, which results in better code. By changing unpack to look like this:

              case desurrogatifyRoundtripCharacter c of { C# c# ->
              unpackRB (C# c#:acc) (i-1) }

GHC now does case-of-case, duplicating the final tail call in each branch. This might be the best we can do here in fact, since speed is much more important than saving a bit of code size. It's bad that this gives different code than just using seq or bang-patterns, though.
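The two spellings can be contrasted in a stripped-down, self-contained loop (convert is a hypothetical stand-in for desurrogatifyRoundtripCharacter; MagicHash is needed just to name the C# constructor):

```haskell
{-# LANGUAGE MagicHash, BangPatterns #-}
import GHC.Exts (Char (C#))

-- Stand-in for the surrogate test; 0x1300 = 0xEF80 - 0xDC80 (assumed ranges).
convert :: Char -> Char
convert c
  | '\xDC80' <= c && c <= '\xDCFF' = toEnum (fromEnum c + 0x1300)
  | otherwise                      = c

-- Spelling 1: seq/bang-pattern.  Evaluating (convert c) goes through a
-- shared case continuation; the cons is built afterwards, in the join point.
unpackSeq :: [Char] -> [Char]
unpackSeq []     = []
unpackSeq (c:cs) = let !c' = convert c in c' : unpackSeq cs

-- Spelling 2: match the C# constructor explicitly.  GHC can now do
-- case-of-case, so each branch of convert's test ends in its own cons
-- and tail call, with no shared continuation.
unpackUnboxed :: [Char] -> [Char]
unpackUnboxed []     = []
unpackUnboxed (c:cs) = case convert c of C# c# -> C# c# : unpackUnboxed cs
```

Both produce the same list; the difference is only in the code GHC generates for the inner loop.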

I'll validate and commit the change (with a comment).

comment:11 Changed 8 years ago by simonmar

commit 56254d7bee159eaa76701a0088fd598f84faba11

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Thu Oct 13 11:10:42 2011 +0100

    tweak unpack/unpack_nl to generate better Core (#5536)

The results are much better now: indistinguishable from the 7.0.3 performance.

I want to add a perf regression test for this before I close it.

comment:12 Changed 8 years ago by igloo

Owner: changed from batterseapower to igloo

comment:13 Changed 8 years ago by igloo

difficulty: Unknown
Resolution: fixed
Status: new → closed
Test Case: T5536

Perf test added.
