Opened 2 years ago

Last modified 2 years ago

#14222 new bug

Simple text fusion example results in rather duplicative code

Reported by: bgamari Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.2.1
Keywords: CSE Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D3990
Wiki Page:

Description

Consider this program,

module T14221 where

import Data.Text as T

isNumeric :: Text -> Bool
isNumeric t =
    T.all isNumeric' t && T.any isNumber t
  where
    isNumber c = '0' <= c && c <= '9'
    isNumeric' c = isNumber c
                || c == 'e'
                || c == 'E'
                || c == '.'
                || c == '-'
                || c == '+'

Compiling with -O results in over 1 kilobyte of assembler. After looking at the simplified Core the issue becomes apparent: the case analysis of isNumeric' is duplicated six times,

case ww4_X2zB of {
  __DEFAULT -> GHC.Types.False;
  '+'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#);
  '-'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#);
  '.'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#);
  'E'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#);
  'e'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#)
};

It seems to me that we would ideally try to share this bit. Of course, this may be quite tricky to do in practice.

Attachments (1)

T14221.dump-simpl (11.2 KB) - added by bgamari 2 years ago.
Simplified Core from example given in ticket description

Download all attachments as: .zip

Change History (10)

comment:1 Changed 2 years ago by bgamari

By contrast, a relatively naive translation to C compiled with gcc -O0 requires less than half the code,

#include <stdint.h>
#include <stdbool.h>

bool isNumber(uint16_t c) {
  return '0' <= c && c <= '9';
}

bool isNumeric(uint16_t c) {
  if (isNumber(c)) return true;
  switch (c) {
  case 'e':
  case 'E':
  case '.':
  case '-':
  case '+':
    return true;
  default:
    return false;
  }
}

bool test(uint16_t *buf, int len) {
  for (int i=0; i<len; i++) {
    if (!isNumeric(buf[i])) return false;
  }
  for (int i=0; i<len; i++) {
    if (isNumber(buf[i])) return true;
  }
  return false;
}

comment:2 Changed 2 years ago by nomeata

Shouldn’t the common block elimination on the CMM level take care of that kind of duplication?

comment:3 in reply to:  2 Changed 2 years ago by bgamari

Replying to nomeata:

Shouldn’t the common block elimination on the CMM level take care of that kind of duplication?

Indeed I would have thought so but that is not what I observe.

Initially I thought the problem was that the blocks differed in their source note ticks (which arise through unfoldings since I'm using the DWARF-enabled 8.2.1 binary distribution). I suspect that CBE (and indeed perhaps CSE in Core and STG) should learn to deal properly with such ticks.

However, after switching back to a non-DWARF-enabled 8.0.2 build I found that I could still easily spot duplicate blocks. For instance, with the program above I see six blocks of the form,

block_c6K1_info:
_c6K1:
        movq 7(%rbx),%r14
        movq 8(%rbp),%rbx
        addq $16,%rbp
        jmp $wloop_all_s6CQ_info

If these were common'd up then that would also allow for a number of additional blocks which are currently distinct to be themselves common'd up (assuming our CBE pass runs to a fixed point). All-in-all it looks like something is broken in CBE. I've opened #14226 to track this. After this is fixed I should also make sure that CBE isn't defeated by the presence of source notes.

comment:4 Changed 2 years ago by simonpj

I agree that CBE should nail it, so let's do #14226. But still it'd be even better to do it in Core.

As you'll see in Note [Combine identical alternatives] in CoreUtils we do some combining of identical alternatives, but only in the very convenient case where one of them is the DEFAULT, which is not the case here. I suppose we'd like

case ww4_X2zB of
  __DEFAULT -> GHC.Types.False;
  '+'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#);
  '-'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#);
  '.'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#);
  'E'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#);
  'e'# -> jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#)

===>

join j = jump $wloop_all_s2z2 (GHC.Prim.+# ww3_s2z0 2#)
in case ww4_X2zB of
  __DEFAULT -> GHC.Types.False;
  '+'# -> jump j
  '-'# -> jump j
  '.'# -> jump j
  'E'# -> jump j
  'e'# -> jump j

You could imagine trying that.

It's a variant of something I've wanted to try for some time to get better CSE. Suppose we have

f (x+y) (g (x+y))

Currently we don't CSE the two (x+y) expressions. Suppose we did this:

  1. Convert to ANF, by let-binding every sub-expression
    f (let a1 = x+y in a1)
      (let a2 = let a3 = x+y in g a3 in a2)
    
  1. Float out aggressively
    let a1 = x+y in
    let a3 = x+y in
    let a2 = g a3 in
    f a1 a2
    
  1. Do CSE exactly as now
    let a1 = x+y in
    let a3 = a1 in
    let a2 = g a3 in
    f a1 a2
    
  1. Now inline as usual
    let a1 = x+1 in
    f a1 (g a1)
    
    which is what we want.

I like this 4-step plan because only step 1 is new: the other passes exist already. It's a bit brutal but it would do the job.

And (provided we did this for join-points too) it would nail the example nicely.

Anyone want to have a go?

comment:5 Changed 2 years ago by simonpj

Keywords: CSE added

comment:6 Changed 2 years ago by bgamari

I should have been a bit more clear; the duplication that I was pointing out wasn't in the case alternatives (although that duplication is perhaps also worth thinking about). Rather, I was pointing out that the entire case expression is duplicated six times at various points in the program. I'll attach the simplified Core in a moment.

Changed 2 years ago by bgamari

Attachment: T14221.dump-simpl added

Simplified Core from example given in ticket description

comment:7 Changed 2 years ago by bgamari

Last night I looked further into why we end up duplicating the expession in question: There are two ways by which we end up duplicating the case analysis.

First Duplication

First, relatively early in simplification we end up rewriting isNumeric' to (after inlining (||)),

isNumeric'_s2z6 :: Char -> Bool
isNumeric'_s2z6
  = \ (eta_B1 :: Char) ->
      case eta_B1 of { GHC.Types.C# c2_a2sP ->
      case GHC.Prim.leChar# '0'# c2_a2sP of {
        __DEFAULT ->
          case c2_a2sP of {
            __DEFAULT -> GHC.Types.False;
            '+'# -> GHC.Types.True;
            '-'# -> GHC.Types.True;
            '.'# -> GHC.Types.True;
            'E'# -> GHC.Types.True;
            'e'# -> GHC.Types.True
          };
        1# ->
          case GHC.Prim.leChar# c2_a2sP '9'# of {
            __DEFAULT ->
              case c2_a2sP of {
                __DEFAULT -> GHC.Types.False;
                '+'# -> GHC.Types.True;
                '-'# -> GHC.Types.True;
                '.'# -> GHC.Types.True;
                'E'# -> GHC.Types.True;
                'e'# -> GHC.Types.True
              };
            1# -> GHC.Types.True
          }
      }
      }

Note the two instances of case c2_a2sP of {...}; this comes about via unconditional post-inlining since the scrutinee is trivial (being a Var).

This makes me wonder whether there is any benefit for postInlineUnconditional to inline join points.

Second Duplication

Then, later in simplification we duplicate isNumeric' three times (resulting in six instances of the case analysis). This happens when we start with this Core after worker/wrapper,

                join {
                  $w$j_s2Cy :: GHC.Prim.Char# -> Int -> Bool
                  $w$j_s2Cy (ww_s2Cw :: GHC.Prim.Char#) (w_s2Ct :: Int)
                    = let {
                        w_s2Cs :: Char
                        w_s2Cs = GHC.Types.C# ww_s2Cw } in
                      let {
                        x_a2vL :: Char
                        x_a2vL = w_s2Cs } in
                      let {
                        s'_a2vM :: Int
                        s'_a2vM = w_s2Ct } in
                      case x_a2vL of { GHC.Types.C# c2_a2t0 ->
                      join {
                        $j_s2yJ :: Bool
                        $j_s2yJ
                          = case c2_a2t0 of {
                              __DEFAULT -> GHC.Types.False;
                              '+'# -> jump loop_all_a2vd s'_a2vM;
                              '-'# -> jump loop_all_a2vd s'_a2vM;
                              '.'# -> jump loop_all_a2vd s'_a2vM;
                              'E'# -> jump loop_all_a2vd s'_a2vM;
                              'e'# -> jump loop_all_a2vd s'_a2vM
                            } } in
                      case GHC.Prim.leChar# '0'# c2_a2t0 of {
                        __DEFAULT -> jump $j_s2yJ;
                        1# ->
                          case GHC.Prim.leChar# c2_a2t0 '9'# of {
                            __DEFAULT -> jump $j_s2yJ;
                            1# -> jump loop_all_a2vd s'_a2vM
                          }
                      }
                      } } in
                join {
                  -- the wrapper for the above worker
                  $j_s2At :: Char -> Int -> Bool
                  $j_s2At (w_s2Cs :: Char) (w_s2Ct :: Int)
                    = case w_s2Cs of ww_s2Cv { GHC.Types.C# ww_s2Cw ->
                      jump $w$j_s2Cy ww_s2Cw w_s2Ct
                      } } in
                {- ... a large expression with three call-sites of $j_s2At -}

In the post-worker-wrapper simplification step we then immediately inline the wrappers into all of the call-sites,

Considering inlining: $j_s2At
  arg infos [ValueArg, ValueArg]
  interesting continuation BoringCtxt
  some_benefit True
  is exp: True
  is work-free: True
  guidance ALWAYS_IF(arity=2,unsat_ok=True,boring_ok=False)
  ANSWER = YES
Inlining done: $j_s2At
    Inlined fn:  \ (w_s2Cs :: GHC.Types.Char)
                   (w_s2Ct :: GHC.Types.Int) ->
                   case w_s2Cs of { GHC.Types.C# ww_s2Cw -> jump $w$j ww_s2Cw w_s2Ct }
    Cont:   ApplyToVal nodup (GHC.Types.C#
                                (GHC.Prim.chr# (GHC.Prim.word2Int# r#)))
            ApplyToVal nodup (GHC.Types.I# (GHC.Prim.+# ipv 1#))
            Stop[BoringCtxt] GHC.Types.Bool

and in each case inline the workers soon thereafter,

Considering inlining: $w$j_s2Cy
  arg infos [NonTrivArg, ValueArg]
  interesting continuation BoringCtxt
  some_benefit True
  is exp: True
  is work-free: True
  guidance IF_ARGS [126 120] 190 0
  discounted size = -35
  ANSWER = YES
Inlining done: $w$j
    Inlined fn:  \ (ww_s2Cw :: GHC.Prim.Char#)
                   (w_s2Ct :: GHC.Types.Int) ->
                   join {
                     $j_s2yJ :: GHC.Types.Bool
                     $j_s2yJ
                       = case ww_s2Cw of {
                           __DEFAULT -> GHC.Types.False;
                           '+'# ->
                             case w_s2Ct of { GHC.Types.I# ww_X2Dh -> jump $wloop_all ww_X2Dh };
                           '-'# ->
                             case w_s2Ct of { GHC.Types.I# ww_X2Dh -> jump $wloop_all ww_X2Dh };
                           '.'# ->
                             case w_s2Ct of { GHC.Types.I# ww_X2Dh -> jump $wloop_all ww_X2Dh };
                           'E'# ->
                             case w_s2Ct of { GHC.Types.I# ww_X2Dh -> jump $wloop_all ww_X2Dh };
                           'e'# ->
                             case w_s2Ct of { GHC.Types.I# ww_X2Dh -> jump $wloop_all ww_X2Dh }
                         } } in
                   case GHC.Prim.leChar# '0'# ww_s2Cw of {
                     __DEFAULT -> jump $j_s2yJ;
                     1# ->
                       case GHC.Prim.leChar# ww_s2Cw '9'# of {
                         __DEFAULT -> jump $j_s2yJ;
                         1# ->
                           case w_s2Ct of { GHC.Types.I# ww_X2Dk -> jump $wloop_all ww_X2Dk }
                       }
                   }
    Cont:   ApplyToVal nodup ww_s2Cw
            ApplyToVal nodup w_s2Ct
            Stop[BoringCtxt] GHC.Types.Bool

We consequently end up with two identical copies and one nearly identical copy of the same rather large block of code. This is a tricky situation due to the one nearly identical copy; perhaps it's worth the duplication, but perhaps not. There may be relatively little that we can do about this.

Last edited 2 years ago by bgamari (previous) (diff)

comment:8 Changed 2 years ago by bgamari

simonpj, regarding comment:4:

It seems to me like there is a real threat of introducing inappropriate sharing in such a scheme, given that full laziness already tends to produce quite some headaches. Do you think CSE is currently cautious enough to avoid these sorts of blowups? Perhaps we would want to be slightly less aggressive in floating out bindings which the compiler introduced during ANF-ization?

Of course, this is all hypothetical; I suppose the best way to find out is to try.

comment:9 Changed 2 years ago by simonpj

Differential Rev(s): Phab:D3990

Phab:D3990 is start on comment:4

Note: See TracTickets for help on using tickets.