Opened 5 years ago

Last modified 20 months ago

#10012 new bug

Cheap-to-compute values aren't pushed into case branches, inducing unnecessary register pressure

Reported by: bgamari Owned by: bgamari
Priority: normal Milestone:
Component: Compiler Version: 7.8.4
Keywords: CodeGen Cc: michalt, bollu
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #8048 Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

When characterizing bytestring's Builder interface I noticed that some benchmarks involving repeated appends (e.g. Host endian/1MB of Word8 in chunks of 16) perform inexplicably worse than others. A glance at the assembly revealed that a large number of Word8s are evaluated and saved to registers, only to be stored later. E.g.,

# First we evaluate a number of Word8s, saving them in registers
movzbl %bl,%ebx
leaq 1(%r14),%rcx
movzbl %cl,%ecx
leaq 2(%r14),%rdx
movzbl %dl,%edx
leaq 3(%r14),%rsi
movzbl %sil,%esi
leaq 4(%r14),%r9
movzbl %r9b,%r9d
leaq 5(%r14),%r10
movzbl %r10b,%r10d
...
# Eventually we run out of registers and start spilling to the stack
movq %rax,64(%rsp)
leaq 7(%r14),%rax
movzbl %al,%eax
movq %rbx,72(%rsp)
...

# Only after evaluating all of the needed words do we actually consume
# them
movq %rax,-152(%r12)
movq 72(%rsp),%rax
movq %rax,-144(%r12)
movq 80(%rsp),%rax
movq %rax,-136(%r12)
movq 88(%rsp),%rax
movq %rax,-128(%r12)
...
movq %rsi,-56(%r12)
movq %r9,-48(%r12)
movq %r10,-40(%r12)
movq %r11,-32(%r12)
movq %r14,-24(%r12)

This is because the Words are bound outside of the case analysis and GHC is reluctant to push them inside the branches. The float-in pass will only float a binding into a case if the binding is "small" and at least one branch doesn't use it. Unfortunately, the case expression in question has only two branches, both of which use the bindings.

This is demonstrated in the attached testcase.
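Schematically, the problematic shape is something like the following (a minimal sketch rather than the attached test case; the names are invented for illustration):

import Data.Word (Word8)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (poke)

-- 'Nothing' means the write happened; 'Just k' hands back a delayed write.
writeTwo :: Bool -> Ptr Word8 -> Word8 -> IO (Maybe (Ptr Word8 -> IO ()))
writeTwo enoughRoom p w =
    let x = w + 1                                     -- cheap to compute
        y = w + 2                                     -- cheap to compute
        doWrite q = poke q x >> poke (q `plusPtr` 1) y
    in case enoughRoom of
         True  -> doWrite p >> return Nothing         -- room: write immediately
         False -> return (Just doWrite)               -- no room: write later

Both branches use x and y (via doWrite), so float-in will not push them into the branches; with many such bindings, all of them end up live at once before the branch is taken.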

Also of interest, https://www.haskell.org/pipermail/ghc-devs/2015-January/007997.html

Attachments (1)

TestCase.hs (3.8 KB) - added by bgamari 5 years ago.
Test case


Change History (21)

Changed 5 years ago by bgamari

Attachment: TestCase.hs added

Test case

comment:1 Changed 5 years ago by bgamari

Description: modified (diff)

comment:2 Changed 5 years ago by bgamari

Description: modified (diff)

comment:3 Changed 5 years ago by bgamari

Description: modified (diff)

comment:4 Changed 5 years ago by simonpj

Here's some more dialogue between Ben and me:

I think you are talking about code like this:

let t1=e1; t2=e2; ...; tn=en in
if ... then
  do { write r1a t1; ...; write rna tn }
else
  do { write r1b t1; ...; write rnb tn }

What you would prefer is to *duplicate* the ti=ei bindings, and sink them really close to their consumers. (NB: That can sometimes *increase* register pressure if ei have lots of free variables.)
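In the notation above, the duplicated form would be something like (a sketch):

if ... then
  do { write r1a e1; write r2a e2; ...; write rna en }
else
  do { write r1b e1; write r2b e2; ...; write rnb en }

so that each ei only needs to be live from its evaluation to the adjacent write.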

Correct.

Question: the ti=ei bindings are used in both branches of the conditional. Did they begin duplicated, and become shared by CSE? Or did they start shared? If the latter, it becomes delicate indeed: duplicating work to (hopefully) improve (but possibly worsen) register pressure seems dodgy.

Unfortunately they started shared. This code is sadly a bit difficult to optimize. The case analysis in question examines whether there is enough room in the buffer to accommodate the data being written; the bindings being shared are the values to be written to the buffer. If there is room in the buffer then we perform the write immediately; otherwise we return a closure so that the user can continue writing into a new buffer.

This situation affects performance in a rather unfortunate way: in principle, the more data we statically know needs to be written, the better we should be able to perform. Unfortunately, as it stands, the more data we write, the more performance suffers due to register pressure.

Last edited 5 years ago by simonpj (previous) (diff)

comment:5 Changed 4 years ago by bgamari

Owner: set to bgamari

comment:6 Changed 3 years ago by bgamari

I was thinking about this a bit more recently; to recap, the issue here is that GHC is unwilling to push bindings into case analyses when this would imply duplication. This leaves GHC no other option but to allocate the shared binding on the heap.

To see an example of this let's look at bytestring's Builder. The key branch here is Data.ByteString.Builder.Internal.ensureFree,

-- | The result of a build action
data BuildSignal a =
    -- | We wrote all of the content we were asked to
    Done !(Ptr Word8) a
    -- | The buffer we were provided is full, but here's a continuation
    -- BuildStep to pick up where we left off
  | BufferFull !Int !(Ptr Word8) (BuildStep a)

newtype Builder = Builder (forall r. BuildStep r -> BuildStep r)
instance Monoid Builder where ...
type BuildStep a = BufferRange -> IO (BuildSignal a)

ensureFree :: Int -> Builder
ensureFree minFree =
    builder step
  where
    step :: BuildStep r -> BuildStep r
    step k br@(BufferRange op ope)
      | ope `minusPtr` op < minFree = return $ bufferFull minFree op k
      | otherwise                   = k br

The idea here is that we are filling a pre-allocated buffer (described by the BufferRange) with bytes; ensureFree verifies that the buffer has at least minFree free bytes remaining.
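(For context, and paraphrasing the bytestring internals rather than quoting them: appending two Builders essentially composes the underlying BuildStep transformers,

Builder b1 <> Builder b2  =  Builder (b1 . b2)

so after inlining, ensureFree's bounds check ends up sitting directly in front of the write steps it guards.)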

The trouble comes when we write something like,

twoWord64s :: Word64 -> Word64 -> Builder
twoWord64s a b = word64 (f a) <> word64 (f b)
  where
    -- just some cheap to evaluate function that we really don't want 
    -- to build a thunk for
    f = (+1)

which in STG turns into something like,

twoWord64s' :: Word64 -> Word64
            -> forall r. BuildStep r -> BufferRange -> IO (BuildSignal r)
twoWord64s' a b cont br =
    let fa, fb :: Word64
        fa = a + 1
        fb = b + 1
    in case br of
         BufferRange op ope ->
           case ope `minusPtr` op < 16 of
             True  -> return $ bufferFull 16 {- bytes -} op $ do
                          -- paraphrasing here
                          writeWord (advanceBufferRange br 0) fa
                          writeWord (advanceBufferRange br 8) fb
                          cont (advanceBufferRange br 16)
             False -> do  -- paraphrasing here
                          writeWord (advanceBufferRange br 0) fa
                          writeWord (advanceBufferRange br 8) fb
                          cont (advanceBufferRange br 16)
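
What we would like to end up with instead is something like the following (again paraphrasing), with the additions duplicated into each alternative so that each sum is consumed as soon as it is computed:

twoWord64s' a b cont br =
    case br of
      BufferRange op ope ->
        case ope `minusPtr` op < 16 of
          True  -> return $ bufferFull 16 {- bytes -} op $ do
                       -- recomputing a+1 and b+1 here is fine: they are cheap
                       writeWord (advanceBufferRange br 0) (a + 1)
                       writeWord (advanceBufferRange br 8) (b + 1)
                       cont (advanceBufferRange br 16)
          False -> do  writeWord (advanceBufferRange br 0) (a + 1)
                       writeWord (advanceBufferRange br 8) (b + 1)
                       cont (advanceBufferRange br 16)

Here nothing needs to be computed before the bounds check, and on the fast path each addition can be consumed immediately by the write that follows it.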

It seems that one interesting (albeit slightly inelegant) way of addressing this issue is to provide a means for the library author to indicate that a given case should be considered "cheap" to push through. That is, you might define ensureFree as,

ensureFree :: Int -> Builder
ensureFree minFree =
    builder step
  where
    step k br@(BufferRange op ope) =
      case ope `minusPtr` op < minFree of
        True    -> return $ bufferFull minFree op k
        False   -> k br
      {-# INLINE_THROUGH #-}
      -- Telling GHC "it's okay if we lose sharing across 
      -- the branches of this case, I would far prefer code duplication
      -- to allocation"

I don't believe there are too many places where this sort of pragma would be useful, but when it is needed I suspect it would be very useful indeed.

Last edited 3 years ago by bgamari (previous) (diff)

comment:7 Changed 3 years ago by nomeata

In comment:4 simon (or ben?) says “duplicating work”, but it seems that the question at hand is not about loss of sharing, but rather code size, right? Floating a let binding into the branches of a case does not duplicate any work, only code.

A {-# INLINE_THROUGH #-} pragma seems to be a bit unspecific. Do we really want the compiler to inline through everything, no matter how large? Or should this just shift the heuristics a bit? But what is so special about this case that makes this a worthwhile thing here, and not in other cases (sic)? Maybe the heuristics can be improved in general?

It seems to be more a general question of whether the user wants more speed or smaller code.

comment:8 Changed 3 years ago by simonpj

There's something wrong with the example: far from being used in both branches, fa and fb are dead bindings.

comment:9 in reply to:  8 Changed 3 years ago by bgamari

Replying to simonpj:

There's something wrong with the example: far from being used in both branches, fa and fb are dead bindings.

Oh dear, yes, sorry about that. I was a bit hurried in submitting. This should be fixed now.

comment:10 in reply to:  7 Changed 3 years ago by bgamari

Replying to nomeata:

In comment:4 simon (or ben?) says “duplicating work”, but it seems that the question at hand is not about loss of sharing, but rather code size, right? Floating a let binding into the branches of a case does not duplicate any work, only code.

Correct, the primary concern is code size.

A {-# INLINE_THROUGH #-} pragma seems to be a bit unspecific. Do we really want the compiler to inline through everything, no matter how large? Or should this just shift the heuristics a bit?

Indeed it is intentionally unspecific because I haven't yet found enough other potential use cases for such a pragma to know what semantics would make the most sense. My first inclination was to essentially render the case invisible to the simplifier and float freely into the branches, but simply shifting the float-in heuristics would also be a reasonable option (albeit perhaps a bit harder to reason about).

But what is so special about this case that makes this a worthwhile thing here, and not in other cases (sic)? Maybe the heuristics can be improved in general?

Unfortunately I haven't been able to pin this down very clearly beyond it occurring in an "inner loop".

It seems to be more a general question of whether the user wants more speed or smaller code.

Right, but not all case analyses are created equal: examples like this one which occur in an inner loop deserve more inlining than others, yet I'm not sure GHC is in a position to be able to discern this. This is why I proposed simply providing a tool for nudging the compiler in the right direction.

Clearly following this approach of adding source-level knobs to guide arbitrary simplifier heuristics to its logical conclusion is a path to madness; unfortunately I don't see a more general way to robustly address the issue at hand.

comment:11 Changed 3 years ago by bgamari

But what is so special about this case that makes this a worthwhile thing here, and not in other cases (sic)? Maybe the heuristics can be improved in general?

I'd also like to point out that we already decide to inline the BuildStep k itself into ensureRoom/step (although I admittedly don't yet understand why). It seems odd that we would inline this yet not its free variables.

comment:12 Changed 3 years ago by bgamari

To supplement the attached freestanding example, here is a smaller example depending upon bytestring,

module Hi where

import qualified Data.ByteString.Builder as B
import Data.Monoid
import Data.Word

hello :: Word64 -> Word64 -> B.Builder
hello a b = B.word64LE (f a) <> B.word64LE (f b)
  where f = (+1)

comment:13 Changed 3 years ago by bgamari

Here's a rather bare-bones example,

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE RankNTypes #-}
module TestCase3 where

import Data.Char (chr)

data Buffer = Buffer
data BufferRange = BufferRange Buffer Int
data BuildSignal a = BufferFull (BuildStep a)
                   | Done a
type BuildStep a = BufferRange -> IO (BuildSignal a)
newtype Builder = Builder (forall r. BuildStep r -> BuildStep r)

write2Words :: Int -> Int -> Builder
write2Words a b = Builder $ \cont rng@(BufferRange Buffer rem) ->
    let !fa = f a
        !fb = f b
        -- doWrite :: BufferRange -> IO (BuildSignal r)
        doWrite rng' = writeInt fa >> writeInt fb >> cont rng'
    in if rem < 16
         then return $ BufferFull doWrite
         else doWrite rng
  where
    f x = x + 42

writeInt :: Int -> IO ()
writeInt n = putStr [chr n]

The STG that this produces exhibits the issue that this ticket describes: both fa and fb are evaluated outside of the case generated by the if, meaning that we must keep both of these values in registers (or, worse, spill to the stack) at once,

-- We first unbox the arguments
case b_s3ep of _ {
  I# x_s3ex ->
    case a_s3eo of _ {
        I# x1_s3ez ->
            -- then we perform the additions
            case +# [x_s3ex 42#] of ipv_s3eA {
              __DEFAULT ->
                  case +# [x1_s3ez 42#] of ipv1_s3eB {
                    __DEFAULT ->
                        -- then we do our bounds check
                        case <# [dt1_s3ev 16#] of sat_s3eC {
                          __DEFAULT ->
                              case tagToEnum# [sat_s3eC] of _ {
                                -- now we finally use the results of the additions
                                False -> ...
                                True -> ...
  } } } } } }

While in this small case this doesn't pose much of a burden, in cases with larger argument counts the register pressure induced here is significant.

Note how, if one replaces the if expression with simply doWrite rng, the code produced is significantly better: the simplifier is able to float the additions in right next to their users, meaning that no more than one intermediate result has to be stored at a time.
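Concretely, the variant being compared against looks something like this (same module as above; the name write2WordsNoCheck is mine):

write2WordsNoCheck :: Int -> Int -> Builder
write2WordsNoCheck a b = Builder $ \cont rng@(BufferRange Buffer _) ->
    let !fa = f a
        !fb = f b
        doWrite rng' = writeInt fa >> writeInt fb >> cont rng'
    in doWrite rng
  where
    f x = x + 42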

What we want is for GHC to do this floating even in the presence of the case analysis. Simon and I discussed this on Thursday. He suggested that we may want something like float-in but later in compilation (e.g. during code generation when we know how many registers we have).

One way to accomplish this would be to make the Cmm sinking pass a bit more eager to sink past branches. However, you likely don't want to do this indiscriminately as this would result in unnecessary code bloat. Instead perhaps you want to make this conditional on some measure of register pressure.

Last edited 3 years ago by bgamari (previous) (diff)

comment:14 Changed 3 years ago by bgamari

Description: modified (diff)

comment:15 Changed 3 years ago by bgamari

Description: modified (diff)

comment:16 Changed 3 years ago by bgamari

It's worth pointing out that #8048 describes a rather similar, and perhaps more general, problem with GHC's treatment of register-heavy code.

comment:17 Changed 3 years ago by bgamari

comment:18 Changed 2 years ago by simonpj

Keywords: CodeGen added

comment:19 Changed 2 years ago by michalt

Cc: michalt added

comment:20 Changed 20 months ago by bollu

Cc: bollu added

CCing myself because I have been running into related issues with unnecessary closure creation on the stack: https://github.com/bollu/exploration/tree/master/dec-22-asm-diff
