Opened 5 years ago

Last modified 5 years ago

#10120 new bug

Unnecessary code duplication from case analysis

Reported by: bgamari Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.10.1-rc2
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

Consider a case analysis like this,

predicate c =
       c == '-' || c == '_' || c == '.' || c == '~' || c == ':'
    || c == '@' || c == '&' || c == '=' || c == '+' || c == '$' || c == ','

f x | predicate x = do_something
f x = do_something_else
main = f 'a'

GHC in some cases appears to produce Core for this example by constructing a case analysis on x, replicating do_something in every branch (generated by GHC 7.10),

$wa_r3UQ =
  \ ww_s3TC w_s3Tz ->
    case ww_s3TC of _ {
      __DEFAULT -> (# w_s3Tz, () #);
      '$' -> hPutStr2 stdout lvl2_r3Uv True w_s3Tz;
      '&' -> hPutStr2 stdout lvl4_r3Ux True w_s3Tz;
      ...
      '_' -> hPutStr2 stdout lvl20_r3UN True w_s3Tz;
      '~' -> hPutStr2 stdout lvl22_r3UP True w_s3Tz
    }

In some cases where do_something is large this can lead to a substantial increase in code size.

There are really two issues here,

  1. That a branch-y case analysis is being used for a boolean condition
  2. That do_something is being inlined into each branch

This seems to be sub-optimal for instruction cache efficiency, the number of branches in generated machine code, and understandability of the resulting Core.

I would have expected something like,

f x =
    let p = pred x
    in case p of
          True  -> do_something
          False -> do_something_else

Change History (13)

comment:1 Changed 5 years ago by bgamari

Description: modified (diff)

comment:2 Changed 5 years ago by bgamari

Description: modified (diff)

comment:3 Changed 5 years ago by bgamari

Description: modified (diff)

comment:4 Changed 5 years ago by bgamari

Interestingly enough GHC 7.8 is a bit more conservative in how it inlines: while it still produces the unneccessarily branch-y case expression, it at least doesn't inline do_something into each branch,

$wa_r3Fx =
  \ ww_s3EZ w_s3EW ->
    let {
      $j_s1xX
      $j_s1xX =
        \ _ ->
          hPutStr2
            stdout
            (case ww_s3EZ of ds1_a1xG {
               __DEFAULT -> : shows21 ($wshowLitChar ds1_a1xG lvl_r3Fw);
               '\'' -> shows20
             })
            True
            w_s3EW } in
    case ww_s3EZ of _ {
      __DEFAULT -> (# w_s3EW, () #);
      '$' -> $j_s1xX void#;
      '&' -> $j_s1xX void#;
      ...
      '_' -> $j_s1xX void#;
      '~' -> $j_s1xX void#

comment:5 Changed 5 years ago by bgamari

Description: modified (diff)

comment:6 Changed 5 years ago by bgamari

Description: modified (diff)

comment:7 Changed 5 years ago by nomeata

The GHC 7.8 code looks good to me – I would expect later stages in the pipeline to make good code from such a case on an enumeration type. Did you have a look at the actual generated code for that?

comment:8 Changed 5 years ago by nomeata

Oh, and did you check whether in 7.10, GHC would duplicate code of arbitrary size into the case alternatives? That would of course be quite wrong.

Can you attach a nice small example to play around with?

comment:9 Changed 5 years ago by bgamari

nomeata,

This is my toy example,

predicate c =
       c == '-' || c == '_' || c == '.' || c == '~' || c == ':'
    || c == '@' || c == '&' || c == '=' || c == '+' || c == '$'
    || c == ','

f x | predicate x = print x
f x = return ()
{-# NOINLINE f #-}

main = f 'a'

This was derived from https://github.com/aristidb/http-types/blob/4c9fc3d136b2bc18fe126adc3902dd7e7685ae62/Network/HTTP/Types/URI.hs (in particular urlEncodeBuilder's check of unreservedPI'. Here GHC doesn't inline the bodies but it does create a unique body for each branch of the case,

case ipv1_ijS6 of wild3_X1h {
    __DEFAULT -> ...
    __word 36 -> lvl_rkUl `cast` ...;
    __word 38 -> lvl1_rkUm `cast` ...;
    __word 43 -> lvl2_rkUn `cast` ...;
    __word 44 -> lvl3_rkUo `cast` ...;
    __word 45 -> lvl4_rkUp `cast` ...;
    __word 46 -> lvl5_rkUq `cast` ...;
    __word 58 -> lvl6_rkUr `cast` ...;
    __word 61 -> lvl7_rkUs `cast` ...;
    __word 64 -> lvl8_rkUt `cast` ...;
    __word 95 -> lvl9_rkUu `cast` ...;
    __word 126 -> lvl10_rkUv `cast` ...
}

Where each of the branches looks like this,

lvl2_rkUn
  :: forall r_ij1U.
     Data.ByteString.Builder.Internal.BuildStep r_ij1U
     -> Data.ByteString.Builder.Internal.BufferRange
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld,
           Data.ByteString.Builder.Internal.BuildSignal r_ij1U #)
lvl2_rkUn =
  \ (@ r_ij1U)
    (eta_ij1V :: Data.ByteString.Builder.Internal.BuildStep r_ij1U)
    (eta1_ij1W :: Data.ByteString.Builder.Internal.BufferRange)
    (eta2_ij1X :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    case eta1_ij1W
    of _
    { Data.ByteString.Builder.Internal.BufferRange dt_ij27 dt1_ij28 ->
    case GHC.Prim.tagToEnum#
           (GHC.Prim.<# (GHC.Prim.minusAddr# dt1_ij28 dt_ij27) 1)
    of _ {
      False ->
        case GHC.Prim.writeWord8OffAddr# dt_ij27 0 (__word 43) eta2_ij1X
        of s2_ij40 { __DEFAULT ->
        ((eta_ij1V
            (Data.ByteString.Builder.Internal.BufferRange
               (GHC.Prim.plusAddr# dt_ij27 1) dt1_ij28))
         `cast` ...)
          s2_ij40
        };
      True ->
        (# eta2_ij1X,
           Data.ByteString.Builder.Internal.BufferFull
             1
             dt_ij27
             ((\ (ds_ij2P :: Data.ByteString.Builder.Internal.BufferRange)
                 (eta3_ij2Q :: GHC.Prim.State# GHC.Prim.RealWorld) ->
                 case ds_ij2P
                 of _
                 { Data.ByteString.Builder.Internal.BufferRange dt3_ij2T dt4_ij2U ->
                 case GHC.Prim.writeWord8OffAddr# dt3_ij2T 0 (__word 43) eta3_ij2Q
                 of s2_ij40 { __DEFAULT ->
                 ((eta_ij1V
                     (Data.ByteString.Builder.Internal.BufferRange
                        (GHC.Prim.plusAddr# dt3_ij2T 1) dt4_ij2U))
                  `cast` ...)
                   s2_ij40
                 }
                 })
              `cast` ...) #)
    }
    }

differing only in the (__word 43) terms.

These functions do in fact end up in the final C-- more or less unperturbed.

I think the real issue here is that GHC is expanding the number of alternatives that arise from this case expression (issue (1) in the description). I may be wrong but it doesn't seem like this sort of analysis would often be a win.

Last edited 5 years ago by bgamari (previous) (diff)

comment:10 Changed 5 years ago by bgamari

I'm unsure of whether it's reasonable to expect the compiler to know that it's not worth specializing the case analysis on the character c. For instance, do_something may also branch on the character; if the compiler specialized then it could eliminate the branch inside of do_something.

However, it would be nice if the compiler gave us the tools to force the sort of behavior we want. For instance, consider an identity-like function evalThis :: a -> a. This would prevent the compiler from "breaking up" the term in the course of optimization. For instance,

f c = let pred = evalThis $ c == 'a' || c == 'b' || c == 'c' :: Bool
      in case pred of
             True -> ...
             False -> ...

would prevent the compiler from using the "insides" of pred in the case analysis, ensuring that somewhere in the resulting executable there would be boolean value pred which would then be matched against. This is obviously a pretty big hammer, but it might be nice to have every once in a while.

Perhaps such a facility already exists?

comment:11 Changed 5 years ago by bgamari

Well, of course there's always a {-# NOINLINE pred #-} pragma but this doesn't quite do what you would like. Take for instance this,

predicate c =
       c == '-' || c == '_' || c == '.' || c == '~' || c == ':'
    || c == '@' || c == '&' || c == '=' || c == '+' || c == '$'
    || c == ','

f x | pred = print x
  where
    pred = predicate x
    {-# NOINLINE pred #-}
f x = return ()
{-# NOINLINE f #-}

main = f 'a'

which gets turned into this Core,

$wa_r3Fs
$wa_r3Fs =
  \ ww_s3EA w_s3Ex ->
    let {
      pred_s1bu
      pred_s1bu =
        case ww_s3EA of _ {
          __DEFAULT -> False;
          '$' -> True;
          ...
          '~' -> True
        } } in
    case pred_s1bu of _ {
      False -> (# w_s3Ex, () #);
      True ->
        hPutStr2 stdout (case ww_s3EA of ds1_a1yc {
                           __DEFAULT -> : shows21 ($wshowLitChar ds1_a1yc lvl_r3BR);
                          '\''       -> shows20
                         })
                 True
                 w_s3Ex
    }

Which looks like it has the potential to produce reasonably efficient code. Unfortunately the C-- is a soup of branches,

$wa_r3Fs_entry() //  [R2]
         { info_tbl: [(c3Jw,
                       label: $wa_r3Fs_info
                       rep:HeapRep static { Fun {arity: 2 fun_type: ArgSpec 4} })]
           stack_info: arg_space: 8 updfr_space: Just 8
         }
     {offset
       c3Jw:
           _s3Hu::I64 = R2;
           goto c3Ie;
       c3Ie:
           goto c3If;
       c3If:
           if ((old + 0) - <highSp> < SpLim) goto c3JK; else goto c3JL;
       c3JK:
           R2 = _s3Hu::I64;
           R1 = $wa_r3Fs_closure;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       c3JL:
           _s3Hx::I64 = _s3Hu::I64;
           if (_s3Hx::I64 < 46) goto c3IX; else goto c3IY;
       c3IX:
           if (_s3Hx::I64 < 43) goto c3IF; else goto c3IG;
       c3IF:
           if (_s3Hx::I64 < 38) goto c3Iw; else goto c3Ix;
       c3Iw:
           if (_s3Hx::I64 != 36) goto c3Ij; else goto c3Ik;
       c3Ik:
           _s3Hw::P64 = True_closure+2;
           goto c3Ii;
       c3Ix:
           if (_s3Hx::I64 != 38) goto c3Ij; else goto c3Il;
       c3Il:
           _s3Hw::P64 = True_closure+2;
           goto c3Ii;
       c3IG:
           if (_s3Hx::I64 < 44) goto c3IC; else goto c3ID;
       ...

Which generates a string of cmpq/jne/jmps,

_c3Jw:
        leaq -16(%rbp),%rax
        cmpq %r15,%rax
        jb _c3JK
_c3JL:
        cmpq $46,%r14
        jb _c3IX
_c3IY:
        cmpq $64,%r14
        jb _c3IU
_c3IV:
        cmpq $95,%r14
        jb _c3IR
_c3IS:
        cmpq $126,%r14
        jb _c3IO
_c3IP:
        movq %r14,%rax
        cmpq $126,%r14
        jne _c3Ij
_c3Ik:
        movl $True_closure+2,%ebx
_c3Ii:
        movq $block_c3J4_info,-16(%rbp)
        movq %rax,-8(%rbp)
        addq $-16,%rbp
        testq $7,%rbx
        jne _c3J4
_c3J5:
        jmp *(%rbx)
_c3IR:
        movq %r14,%rax
        cmpq $64,%r14
        jne _c3Ij
        jmp _c3Ik
...
Last edited 5 years ago by bgamari (previous) (diff)

comment:12 Changed 5 years ago by bgamari

It seems like we really should be able to do better generating code for the case analyses above. I've opened #10124 to track this.

comment:13 Changed 5 years ago by simonpj

I don't yet understand #10124, but I've left a comment there. Meanwhile, what remains of this ticket? What do you expect/want GHC to do, under what circumstances?

Simon

Note: See TracTickets for help on using tickets.