Opened 2 years ago

Last modified 9 months ago

#14226 new bug

Common Block Elimination pass doesn't eliminate common blocks

Reported by: bgamari Owned by: michalt
Priority: high Milestone: 8.10.1
Component: Compiler (CodeGen) Version: 8.2.1
Keywords: newcomer, CodeGen Cc: michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #9157, #14754, #14989 Differential Rev(s): Phab:D3973, Phab:D3999
Wiki Page:

Description (last modified by bgamari)

In #14222 it was noted that something appears to be broken in CmmCommonBlockElim. Consider the program from that ticket,

module T14221 where

import Data.Text as T

isNumeric :: Text -> Bool
isNumeric t =
    T.all isNumeric' t && T.any isNumber t
  where
    isNumber c = '0' <= c && c <= '9'
    isNumeric' c = isNumber c
                || c == 'e'
                || c == 'E'
                || c == '.'
                || c == '-'
                || c == '+'

This program produces six copies of a block of the form,

      c6JT:
          R2 = I64[R1 + 7];
          R1 = P64[Sp + 8];
          Sp = Sp + 16;
          call $wloop_all_s6CQ_info(R2, R1) args: 8, res: 0, upd: 8;

in the -ddump-opt-cmm output, which are manifest in the assembler as,

block_c6JT_info:
_c6JT:
        movq 7(%rbx),%r14
        movq 8(%rbp),%rbx
        addq $16,%rbp
        jmp $wloop_all_s6CQ_info

CBE really ought to be catching these.

Change History (37)

comment:1 Changed 2 years ago by bgamari

Description: modified (diff)

I had a quick look at this; it turns out that the reason that CBE doesn't work is 73f836f5d57a3106029b573c42f83d2039d21d89, which modifies the hash function to include local registers. This may sound familiar to you, nomeata, as you wrote it to address #10397.

Sadly this means that our ability to CBE is quite limited. It seems like we should likely revisit this decision.

comment:2 Changed 2 years ago by bgamari

Oh, actually, never mind. On further reflection I suppose CBE actually must look at local registers since they may be live across blocks. Indeed nomeata's patch was merely a change in the hash, which is an optimization, not a change in the behavior of the pass.

comment:3 Changed 2 years ago by simonpj

OK, so now we are back to: why doesn't CBE eliminate the common blocks?

comment:4 Changed 2 years ago by bgamari

Well, because we have an extremely narrow definition of "common". Namely, to be the same two blocks must be syntactically identical, including local register names. I don't know how often this happens in practice, but I'd imagine not terribly often.

I think the criterion that you would ideally want is equivalence up to alpha renaming of local registers live only in the block; unfortunately this would likely be considerably more costly to check for. This makes me wonder whether we shouldn't just try to avoid duplicating the blocks to begin with (e.g. addressing #14222 earlier in the compilation pipeline).

Last edited 2 years ago by bgamari (previous) (diff)

comment:5 Changed 2 years ago by bgamari

One option would be to run CBE after register allocation; this would make it significantly more likely that two structurally similar blocks would look similar to CBE.

comment:6 Changed 2 years ago by bgamari

To make my objection in comment:4 more concrete, consider that you have the following two blocks (from the example in #14226),

  c2VZ: // global
      _c2Yd::I64 = _s2Se::I64 + 1;
      _s2Sx::I64 = _c2Yd::I64;
      _s2Se::I64 = _s2Sx::I64;
      goto c2TE;
  c2VY: // global
      _c2Yb::I64 = _s2Se::I64 + 1;
      _s2Sw::I64 = _c2Yb::I64;
      _s2Se::I64 = _s2Sw::I64;
      goto c2TE;

They differ in only two local register names, both of which are dead outside of their respective blocks. Ignoring the fact that the sinking pass should eliminate both intermediate locals (e.g. _c2Yd and _s2Sx in the first block), it seems to me like CBE should really consider these blocks to be duplicates.

Last edited 2 years ago by bgamari (previous) (diff)

comment:7 Changed 2 years ago by simonpj

OK, got it. Yes, totally. CBE should be insensitive to alpha-renaming of local register.

One way to do this would be for the hashing function to be sensitive to binding sites. Concretely, suppose that we had

hashInstrs :: HashEnv -> [Instruction] -> HashCode
type HashEnv = (Map LocalReg HashCode, Int)

Now we could have

hashInstrs (env, next) (Load reg expr : instrs)
  = hashInstrs (extendEnv env reg next, next+1) instrs
    `hashPlus` hashExpr env expr

...

hashExpr env (LocalReg r)
  | Just hc <- lookupEnv env r = hc
  | otherwise = hashLocalReg r  -- as now

The mapping tells what hashcode to use for a local register. We deBruijn-number all binding sites as we come across them, mapping them to 1, 2, 3, etc. Now we'll be insensitive to what name is chosen.

I think we already have a function that, for an instruction, says what local regs it assigns to.

comment:8 Changed 2 years ago by bgamari

Differential Rev(s): Phab:D3973
Status: newpatch

Phab:D3973 implements the equivalence-up-to-alpha-renaming idea described above.

comment:9 Changed 2 years ago by bgamari

It turns out this was also previously reported as #9157.

comment:10 Changed 2 years ago by Ben Gamari <ben@…>

In 7920a7d/ghc:

cmm/CBE: Collapse blocks equivalent up to alpha renaming of local registers

As noted in #14226, the common block elimination pass currently
implements an extremely strict equivalence relation, demanding that two
blocks are equivalent including the names of their local registers. This
is quite restrictive and severely hampers the effectiveness of the pass.

Here we allow the CBE pass to collapse blocks which are equivalent up to
alpha renaming of locally-bound local registers. This is completely safe
and catches many more duplicate blocks.

Test Plan: Validate

Reviewers: austin, simonmar, michalt

Reviewed By: michalt

Subscribers: rwbarton, thomie

GHC Trac Issues: #14226

Differential Revision: https://phabricator.haskell.org/D3973

comment:11 Changed 2 years ago by simonpj

Ben I forgot to mention this. In CmmExpr you'll find foldLocalRegsDefd, which is, I think, precisely what you need to answer the question "which local regs does this node define". It would be much better to use it than to duplicate the logic into hash_node and eqMiddleWith.

Might you look at doing that?

comment:12 Changed 2 years ago by bgamari

Differential Rev(s): Phab:D3973Phab:D3973, Phab:D3999

Suggestion in comment:11 done in Phab:D3999.

comment:13 Changed 2 years ago by Ben Gamari <ben@…>

In 9aa73892/ghc:

cmm/CBE: Use foldLocalRegsDefd

Simonpj suggested this as a follow-on to #14226 to avoid code
duplication. This also gives us the ability to CBE cases involving
foreign calls for free.

Test Plan: Validate

Reviewers: austin, simonmar, simonpj

Reviewed By: simonpj

Subscribers: michalt, simonpj, rwbarton, thomie

GHC Trac Issues: #14226

Differential Revision: https://phabricator.haskell.org/D3999

comment:14 Changed 2 years ago by bgamari

Resolution: fixed
Status: patchclosed

This should now be resolved.

I have confirmed that the duplication reported in the original example is now eliminated. However, things can arguably still be improved. What remains is three largely-identical groups of branches. I'll quote two of them:

      c2RY: // global
          _s2Qv::I64 = %MO_UU_Conv_W16_W64(I16[(_s2PR::P64 + 16) + (_s2Qt::I64 << 1)]);
          if (_s2Qv::I64 < 55296) goto c2Sc; else goto c2Sd;
      c2Sc: // global
          _s2Qx::I64 = _s2Qv::I64;
          if (48 > _s2Qv::I64) goto c2U4; else goto c2U3;   // chr 48 == '0'
      c2U3: // global
          if (_s2Qx::I64 > 57) goto c2U4; else goto c2Tw;   // chr 57 == '9'
      c2U4: // global
          if (_s2Qx::I64 < 47) goto u2XU; else goto u2XX;   // chr 48 == '0'
      u2XU: // global
          if (_s2Qx::I64 >= 46) goto c2Tw; else goto u2XV;  // chr 46 == '.'
      u2XV: // global
          if (_s2Qx::I64 >= 45) goto c2Tw; else goto u2XW;  // chr 45 == '-'
      u2XW: // global
          if (_s2Qx::I64 == 43) goto c2Tw; else goto c2Uu;  // chr 43 == '+'
      u2XX: // global
          if (_s2Qx::I64 >= 102) goto c2Uu; else goto u2XY; // chr 102 == 'f'
      u2XY: // global
          if (_s2Qx::I64 >= 101) goto c2Tw; else goto u2XZ; // chr 101 == 'e'
      u2XZ: // global
          if (_s2Qx::I64 == 69) goto c2Tw; else goto c2Uu;  // chr 69 == 'E'
      c2Sd: // global
          if (_s2Qv::I64 > 56319) goto c2SN; else goto c2SO;

      // next group
      c2SN: // global
          _s2QP::I64 = _s2Qv::I64;
          if (48 > _s2Qv::I64) goto c2Tv; else goto c2Tu;
      c2Tu: // global
          if (_s2QP::I64 > 57) goto c2Tv; else goto c2Tw;
      c2Tv: // global
          if (_s2QP::I64 < 47) goto u2XI; else goto u2XL;
      u2XI: // global
          if (_s2QP::I64 >= 46) goto c2Tw; else goto u2XJ;
      u2XJ: // global
          if (_s2QP::I64 >= 45) goto c2Tw; else goto u2XK;
      u2XK: // global
          if (_s2QP::I64 == 43) goto c2Tw; else goto c2Uu;
      u2XL: // global
          if (_s2QP::I64 >= 102) goto c2Uu; else goto u2XM;
      u2XM: // global
          if (_s2QP::I64 >= 101) goto c2Tw; else goto u2XN;
      u2XN: // global
          if (_s2QP::I64 == 69) goto c2Tw; else goto c2Uu;
      c2Tw: // global
          _s2Qt::I64 = _s2Qt::I64 + 1;
          goto c2RT;

I don't believe we can eliminate the duplication here: while these groups are structurally similar, they are not identical.

Rather, I think we could improve the switch lowering: currently we lower this as a chain of branches. I suspect this is the sort of case where we might benefit from instead using a jump table.

Measuring this idea and, if it pays off, implementing it might be an interesting project for someone looking to get their hands dirty in the NCG.

Last edited 2 years ago by bgamari (previous) (diff)

comment:15 Changed 2 years ago by bgamari

Keywords: newcomer added

comment:16 Changed 23 months ago by nomeata

I see these two blocks not being commoned up:

  c4bx: // global
      I64[_s3Yp::P64] = stg_MUT_ARR_PTRS_FROZEN0_info;   // CmmStore
      _s3YJ::P64 = _s3Yp::P64;   // CmmAssign
      _s3YJ::P64 = _s3YJ::P64;   // CmmAssign
      I64[Hp - 48] = GHC.Types.I#_con_info;   // CmmStore
      I64[Hp - 40] = _s3Yj::I64;   // CmmStore
      _c4bX::P64 = Hp - 47;   // CmmAssign
      I64[Hp - 32] = GHC.Arr.Array_con_info;   // CmmStore
      P64[Hp - 24] = Database.idatabase1_closure+1;   // CmmStore
      P64[Hp - 16] = _c4bX::P64;   // CmmStore
      P64[Hp - 8] = _s3YJ::P64;   // CmmStore
      I64[Hp] = _s3Yq::I64;   // CmmStore
      _c4bY::P64 = Hp - 31;   // CmmAssign
      _s3YW::P64 = _c4bY::P64;   // CmmAssign
      goto c4b9;   // CmmBranch

and

  c4bD: // global
      I64[_s3Yp::P64] = stg_MUT_ARR_PTRS_FROZEN0_info;   // CmmStore
      _s3Yy::P64 = _s3Yp::P64;   // CmmAssign
      _s3Yy::P64 = _s3Yy::P64;   // CmmAssign
      I64[Hp - 48] = GHC.Types.I#_con_info;   // CmmStore
      I64[Hp - 40] = _s3Yj::I64;   // CmmStore
      _c4bZ::P64 = Hp - 47;   // CmmAssign
      I64[Hp - 32] = GHC.Arr.Array_con_info;   // CmmStore
      P64[Hp - 24] = Database.idatabase1_closure+1;   // CmmStore
      P64[Hp - 16] = _c4bZ::P64;   // CmmStore
      P64[Hp - 8] = _s3Yy::P64;   // CmmStore
      I64[Hp] = _s3Yq::I64;   // CmmStore
      _c4c0::P64 = Hp - 31;   // CmmAssign
      _s3YW::P64 = _c4c0::P64;   // CmmAssign
      goto c4b9;   // CmmBranch

Should they? Does maybe the _s3YJ::P64 = _s3YJ::P64; throw CBE off?

(This is from nofib’s fem with commit d871321ce20e – not sure how useful this commit ID is, as it is from a rebasing branch and so far it only lives on Phabricator.)

comment:17 in reply to:  16 Changed 23 months ago by michalt

Replying to nomeata:

[...] Should they? Does maybe the _s3YJ::P64 = _s3YJ::P64; throw CBE off?

Hmm.. Interesting. I'd expect that with the recent patch from bgamari this should be commoned up. Is this the final cmm or is it the input to CmmCommonBlockElim? (asking because sometimes the sinking pass can expose more opportunities for CmmCommonBlockElim, but we run sinking pass *after* eliminating common blocks; example: #12915)

comment:18 Changed 23 months ago by bgamari

Indeed it's hard to say much without knowing with certainty that this is the code that CBE looks at. That being said, I believe that the CBE implementation should be able to deal with CmmAssigns of the form x=x since it compares the RHSs up to alpha equivalence. It would of course be good to check this, however.

comment:19 Changed 23 months ago by nomeata

It’s the output of -ddump-cmm-cbe

Last edited 23 months ago by nomeata (previous) (diff)

comment:20 Changed 23 months ago by bgamari

Interesting; would you like to investigate or should I?

comment:21 in reply to:  14 Changed 20 months ago by AndreasK

Replying to bgamari:

This should now be resolved.

Rather, I think we could improve the switch lowering: currently we lower this as a chain of branches. I suspect this is the sort of case where we might benefit from instead using a jump table.

Measuring this idea and, if it pays off, implementing it might be an interesting project for someone looking to get their hands dirty in the NCG.

According to the comments in CmmSwitch.hs jump tables only make sense when the gaps between entries are smaller than 7 and the table has at least 5 elements which isn't the case here.

comment:22 Changed 20 months ago by simonpj

Keywords: CodeGen added

What's the status here?

comment:23 Changed 20 months ago by bgamari

Resolution: fixed
Status: closednew

I've not yet investigated but I can certainly do so.

comment:24 Changed 20 months ago by michalt

Cc: michalt added

comment:25 Changed 20 months ago by bgamari

Milestone: 8.4.18.6.1
Owner: set to bgamari

I'll need to take a look at this. Unfortunately it likely won't happen before 8.4.1.

comment:26 Changed 20 months ago by bgamari

Sadly I'll need to revert the above patches due to #14754. I'll have another look at this post-8.4.

comment:27 Changed 20 months ago by Ben Gamari <ben@…>

In 50adbd7c/ghc:

cmm: Revert more aggressive CBE due to #14226

Trac #14226 noted that the C-- CBE pass frequently fails to
common up semantically identical blocks due to the differences in local
register naming. These patches fixed this by making the pass consider
equality up to alpha-renaming.

However, the new logic failed to consider the possibility that local
register naming *may* matter across multiple blocks. This lead to the
regression #14754. I'll need to do a bit of thinking on a proper
solution to this but in the meantime I'm reverting all four patches.

This reverts commit a27056f9823f8bbe2302f1924b3ab38fd6752e37.
This reverts commit 6f990c54f922beae80362fe62426beededc21290.
This reverts commit 9aa73892e10e90a1799b9277da593e816a827364.
This reverts commit 7920a7d9c53083b234e060a3e72f00b601a46808.

comment:28 Changed 20 months ago by michalt

:(

I can currently think of two options:

1) Fix the approach using alpha renaming by actually checking liveness on entry and exit for each variable. (to only consider variables that are not live across blocks) The disadvantage is that this would be quite a bit more costly.

2) Don't improve CBE itself, but run it a second time after the sinking pass - this might get rid of some of the cases from this ticket (e.g., the one from comment:6) The advantage is that this would also fix #12915; the disadvantage: this might not be always as effective as option 1).

comment:29 Changed 20 months ago by bgamari

1) Fix the approach using alpha renaming by actually checking liveness on entry and exit for each variable. (to only consider variables that are not live across blocks) The disadvantage is that this would be quite a bit more costly.

Right, I briefly considered implementing this. I don't expect it will be absurdly expensive: all we need to know is which variables are mentioned in only one block. However, I do wonder how much this will actually improve code. Presumably there are cases where this will be *too* conservative.

2) Don't improve CBE itself, but run it a second time after the sinking pass - this might get rid of some of the cases from this ticket (e.g., the one from comment:6) The advantage is that this would also fix #12915; the disadvantage: this might not be always as effective as option 1).

Mmm, interesting idea. This would certainly be worth a try.

comment:30 in reply to:  29 Changed 20 months ago by michalt

Replying to bgamari:

1) Fix the approach using alpha renaming by actually checking liveness on entry and exit for each variable. (to only consider variables that are not live across blocks) The disadvantage is that this would be quite a bit more costly.

Right, I briefly considered implementing this. I don't expect it will be absurdly expensive: all we need to know is which variables are mentioned in only one block. However, I do wonder how much this will actually improve code. Presumably there are cases where this will be *too* conservative.

2) Don't improve CBE itself, but run it a second time after the sinking pass - this might get rid of some of the cases from this ticket (e.g., the one from comment:6) The advantage is that this would also fix #12915; the disadvantage: this might not be always as effective as option 1).

Mmm, interesting idea. This would certainly be worth a try.

I did a small experiment with this: https://ghc.haskell.org/trac/ghc/ticket/12915#comment:5 (I'm wondering if we could improve the CBE to be a bit cheaper, e.g., by improving hashing)

comment:31 Changed 20 months ago by bgamari

Owner: changed from bgamari to michalt

Happily, michalt may be looking into this later this week.

comment:32 Changed 19 months ago by Ben Gamari <ben@…>

In 4e513bf/ghc:

CBE: re-introduce bgamari's fixes

During some recent work on CBE we discovered that `zipWith` is used to
check for equality, but that doesn't quite work if lists are of
different lengths! This was fixed by bgamari, but unfortunately the fix
had to be rolled back due to other changes in CBE in
50adbd7c5fe5894d3e6e2a58b353ed07e5f8949d. Since I wanted to have another
look at CBE anyway, we agreed that the first thing to do would be to
re-introduce the fix.

Sadly I don't have any actual test case that would exercise this.

Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>

Test Plan: ./validate

Reviewers: bgamari, simonmar

Reviewed By: bgamari

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #14226

Differential Revision: https://phabricator.haskell.org/D4387

comment:33 Changed 18 months ago by Ben Gamari <ben@…>

In d5c4d46a/ghc:

CmmPipeline: add a second pass of CmmCommonBlockElim

The sinking pass often gets rid of unnecessary registers
registers/assignements exposing more opportunities for CBE, so this
commit adds a second round of CBE after the sinking pass and should
fix #12915 (and some examples in #14226).

Nofib results:
* Binary size:         0.9% reduction on average
* Compile allocations: 0.7% increase on average
* Runtime:             noisy, two separate runs of nofib showed a tiny
                       reduction on average, (~0.2-0.3%), but I think
                       this is mostly noise
* Compile time:        very noisy, but generally within +/- 0.5% (one
                       run faster, one slower)

One interesting part of this change is that running CBE invalidates
results of proc-point analysis. But instead of re-doing the whole
analysis, we can use the map that CBE creates for replacing/comparing
block labels (maps a redundant label to a useful one) to update the
results of proc-point analysis. This lowers the overhead compared to the
previous experiment in #12915.

Signed-off-by: Michal Terepeta <michal.terepeta@gmail.com>

Test Plan: ./validate

Reviewers: bgamari, simonmar

Reviewed By: simonmar

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #12915, #14226

Differential Revision: https://phabricator.haskell.org/D4417

comment:34 Changed 15 months ago by bgamari

Milestone: 8.6.18.8.1

This won't be fixed in 8.6. Bumping to 8.8.

comment:35 Changed 9 months ago by AndreasK

For anyone wondering this has been reverted because of issues with ProcPoint invariants.

See #14989

comment:36 Changed 9 months ago by sgraf

Note that I posted a solution in https://ghc.haskell.org/trac/ghc/ticket/14989#comment:2: Just recompute proc points after the transformation. There's no reason we couldn't have this now other than that it will probably eat more CPU time than the quick (but broken) fixup.

comment:37 Changed 9 months ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of some high and highest priority tickets.

Note: See TracTickets for help on using tickets.