Opened 2 years ago

Last modified 19 months ago

#14372 new feature request

CMM contains a bunch of tail-merging opportunities

Reported by: heisenbug Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.2.1
Keywords: CodeGen Cc: nomeata, michalt, AndreasK, sjakobi
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #14666 Differential Rev(s):
Wiki Page:

Description (last modified by simonpj)

Compile this code to CMM

data Small = S1 | S2 | S3 | S4 deriving (Show, Enum)

data Big = B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 | B10 deriving (Show, Enum)

{-# NOINLINE quux #-}
quux B1 = 'a'
quux B2 = 'b'
quux B3 = 'c'
quux B4 = 'd'
quux B5 = 'e'
quux B6 = 'f'
quux B7 = 'g'
quux B8 = 'h'
quux B9 = 'i'
quux B10 = 'j'

{-# NOINLINE qaax #-}
qaax B1 = 'a'
qaax B2 = 'b'
qaax B3 = 'c'
qaax B4 = 'd'
qaax B5 = 'e'

qaax B7 = 'g'
qaax B8 = 'h'
qaax B9 = 'i'
qaax B10 = 'j'


{-# NOINLINE foo #-}
foo B1 = S1
foo B2 = S2
foo B3 = S3
foo B4 = S4


{-# NOINLINE bar #-}
bar S1 = B1
bar S2 = B2
bar S3 = B3
bar S4 = B4

main = do print $ take 100000 (repeat (foo <$> [B1 .. B4]))
          print $ take 100000 (repeat (bar <$> [S1 .. S4]))
          print $ take 100000 (repeat (quux <$> [B1 .. B10]))
          print $ qaax B1

When Char or enum-like ADT is returned, I see lots of case branches, which only differ in the first instruction.

E.g.

       c30l: // global
           R1 = stg_CHARLIKE_closure+1649;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c30m: // global
           R1 = stg_CHARLIKE_closure+1665;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       u30Z: // global
           if (_c30p::I64 < 9) goto c30n; else goto c30o;
       c30n: // global
           R1 = stg_CHARLIKE_closure+1681;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c30o: // global
           R1 = stg_CHARLIKE_closure+1697;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;

It would be nice to factor out the common tails, e.g. by branching to the first tail already emitted.

Bonus points for rewriting switch tables to contain the above numbers and compile to a lookup + common code.

This is what I am talking about:

       c307: // global
           _s2ON::P64 = R1;
           _c30j::P64 = _s2ON::P64 & 7;
           switch [1 .. 7] _c30j::P64 {
               case 1 : goto c30d;
               case 2 : goto c30e;
               case 3 : goto c30f;
               case 4 : goto c30g;
               case 5 : goto c30h;
               ...
           }

...
       c30h: // global
           R1 = stg_CHARLIKE_closure+1617;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c30g: // global
           R1 = stg_CHARLIKE_closure+1601;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c30f: // global
           R1 = stg_CHARLIKE_closure+1585;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c30e: // global
           R1 = stg_CHARLIKE_closure+1569;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c30d: // global
           R1 = stg_CHARLIKE_closure+1553;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;

There should be an array [1553, 1569, 1585, ...] and each case should be the same:

           R1 = stg_CHARLIKE_closure;
           R1 = R1 + array[tag];
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;

See also #14666

Change History (22)

comment:1 Changed 2 years ago by heisenbug

See also #14226.

comment:2 Changed 2 years ago by nomeata

It would be nice to factor out the common tails, e.g. by branching to the first tail already emitted.

Branches can be expensive, so this might not be a clear win. Have you measured the effect? (You can do that without modifying GHC: Just get your hands on the assembly file, make the changes you want to see, and compare.)

The other idea, detecting blocks that are equal up-to a constant, sounds fun. But again, the question is: Is it worth it?

comment:3 Changed 2 years ago by bgamari

Poorly predicted branches can indeed be expensive. However, I think here we are just taking about jumps which, as far as I know, are quite cheap assuming they don't boot you out of $I since they can be predicted perfectly.

comment:4 Changed 2 years ago by simonpj

Across an entire module I think there are quite a lot of code blocks that are byte-for-byte identical; i.e. no extra branch or "equal up to constant" required.

I wonder if we could simply gather stats on the top 100 code-blocks, put them into the RTS, and use them whenever we come across one? I bet that'd be highly effective; but I do not have data to back up my claim.

comment:5 Changed 2 years ago by bgamari

I wonder what prior art exists in this area; I'm sure other compilers have considered this in the past.

An interesting and loosely related bit of work that I have seen in the past is the work in the LLVM community on a technique they call outlining. See the talk for details. This was pursued as a means of reducing code size and is no doubt more sophisticated than what Simon was looking for in comment:4, but I thought I'd leave the reference here nevertheless.

comment:6 in reply to:  2 Changed 2 years ago by heisenbug

Replying to nomeata:

It would be nice to factor out the common tails, e.g. by branching to the first tail already emitted.

Branches can be expensive, so this might not be a clear win. Have you measured the effect? (You can do that without modifying GHC: Just get your hands on the assembly file, make the changes you want to see, and compare.)

These will probably be short, known branches. Since more relevant code now can reside in the instruction cache, I expect less misses and no prediction failures.

The other idea, detecting blocks that are equal up-to a constant, sounds fun. But again, the question is: Is it worth it?

Well, it makes code straight-line and branchless. Eliminates jump-tables. Shrinks code size. Which one of these is a penalty? ;-)

comment:7 Changed 2 years ago by heisenbug

-fcmm-elim-common-blocks helps a bit, but many equal ones are not caught:

These hash the same

hash_block
  c3dj
  170
hash_block
  c3dp
  170
hash_block
  c3dv
  170
hash_block
  c3dB
  170

but are not commoned:

==================== Post switch plan ====================
{offset
  c3dj: // global
      R1 = R1 + 7;
      call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
}



==================== Post switch plan ====================
{offset
  c3dp: // global
      R1 = R1 + 7;
      call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
}



==================== Post switch plan ====================
{offset
  c3dv: // global
      R1 = R1 + 7;
      call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
}



==================== Post switch plan ====================
{offset
  c3dB: // global
      R1 = R1 + 7;
      call (P64[(old + 8)])(R1) args: 8, res: 0, upd: 8;
}

Maybe the reason is the "result register" updates are not considered in CmmCommonBlockElim.hs.

comment:8 in reply to:  7 ; Changed 2 years ago by heisenbug

Replying to heisenbug:

-fcmm-elim-common-blocks helps a bit, but many equal ones are not caught:

Turns out c-b-e is a local optimisation, i.e. per procedure. We need a global one to crack this.

comment:9 in reply to:  8 Changed 2 years ago by heisenbug

Replying to heisenbug:

Replying to heisenbug:

-fcmm-elim-common-blocks helps a bit, but many equal ones are not caught:

Turns out c-b-e is a local optimisation, i.e. per procedure. We need a global one to crack this.

Okay I now have the beginnings of a global CMM C-B-E here: https://github.com/ggreif/ghc/tree/wip/global-cmm-cbe

Feedback welcome!

Some comments (for self)

  • better use ST(Refs) to not mess with GHC's -j mode (multi-module compilation)
  • CmmProcs should be transformed. Not yet done. Graph label to branch transformation is bogus, but keeping it as an example for now
  • condPass needs some more love.
Last edited 2 years ago by heisenbug (previous) (diff)

comment:10 Changed 2 years ago by simonpj

It's hard to review code without a spec, or even some input/output examples.

And an overview Note to explain the goal and how the impl works would be useful.

Based on a very quick look, it seems that you hope to branch from one CmmProc to a label in another. That's a huge thing to do, because it messes up all dataflow analyses, which expect to be able to see all the places that jump to a block (e.g. constant propagation). I'm 99% sure that you can't translate it into LLVM.

Before you invest a lot more effort, can we debate the goal? There are others on ghc-devs who know about this back-end stuff.

comment:11 in reply to:  10 Changed 2 years ago by heisenbug

Replying to simonpj:

It's hard to review code without a spec, or even some input/output examples.

And an overview Note to explain the goal and how the impl works would be useful.

Will do. I am in a very early stage :-)

Based on a very quick look, it seems that you hope to branch from one CmmProc to a label in another. That's a huge thing to do, because it messes up all dataflow analyses, which expect to be able to see all the places that jump to a block (e.g. constant propagation). I'm 99% sure that you can't translate it into LLVM.

You are correct. We can only branch to labels that we emit. So this probably will be a NCG-only thing (C-backend is off-limits too), unless we find a way to invoke the "outliner" in LLVM specifically.

What I have observed that sometimes previously noted labels disappear due to optimisation. So I'll have to move the commoning to the end of the Cmm pipeline.

Before you invest a lot more effort, can we debate the goal? There are others on ghc-devs who know about this back-end stuff.

Sure. First and above all I am trying to learn and understand the inner workings of Cmm. Let me just get this working with bootstrap and we can discuss. I'll also need to gather some statistics.

Thanks for the review, Simon!

comment:12 Changed 22 months ago by nomeata

Cc: nomeata added

comment:13 Changed 22 months ago by jp.rider63

What's the status of this? I'd be interested in contributing to the implementation if it's helpful.

comment:14 in reply to:  13 Changed 22 months ago by heisenbug

Replying to jp.rider63:

What's the status of this? I'd be interested in contributing to the implementation if it's helpful.

There is a branch in my GHC fork on github: https://github.com/ggreif/ghc/tree/wip/global-cmm-cbe

It does not work with ghc --make because the assembly files contain local labels. Concatenating all assembly files might work, but I have not tried it yet.

comment:15 Changed 22 months ago by nomeata

“This” is ambiguous. The original ticket asks for commoning up “almost equal” locks (by introducing a lookup table). I believe @jp.rider63 refers to that.

Later in the discussion, global CBE was brought up, which @heisenbug seems to be working on.

I recommend to open a new ticket for global CBE.

comment:16 Changed 22 months ago by jp.rider63

Yes, I was slightly confused about the CBE discussion, but it seems somewhat related.

FYI, this is my motivating example. The speedup I thought was happening due to eliminating common tails actually turned out to be from (lack of) inlining on someone else's machine. Either way, I think it makes sense to benchmark performance differences and see if this change is worthwhile. Even if performance is similar, it might be worth implementing to reduce code size.

For my use case, one optimization that might help more is creating lookup tables when matching on Word8. Maybe I'll create another ticket for that in the future.

comment:17 in reply to:  3 Changed 21 months ago by AndreasK

Replying to bgamari:

Poorly predicted branches can indeed be expensive. However, I think here we are just taking about jumps which, as far as I know, are quite cheap assuming they don't boot you out of $I since they can be predicted perfectly.

I've run the code:

{-# NOINLINE f_test #-}
f_test :: Int ->Int
f_test a = case a of
    1  -> 11001
    2  -> 11002
    3  -> 11003
    4  -> 11004
    5  -> 11005
    6  -> 11006
    7  -> 11007
    8  -> 11008

main = print . sum . map (f_test) $ (concat . replicate 9000000) [1..45::Int]

I did the transformation manually into:

.Lc48I:
	cmpq $9,%r14
	jge .Lc48z
.Lu48L:
	cmpq $1,%r14
	jl .Lc48z
.Lu48M:
	movl .Ln48P-8(,%r14,8), %ebx
	jmp *(%rbp)
.Lc48z:
	movq $-1,%rbx
	jmp *(%rbp)
	

.section .rodata
.Ln48P:
	.quad	11001
	# 11002 .. 11007
	.quad	11008

Turning the jumps into a indirect mov instruction.

I wonder what prior art exists in this area; I'm sure other compilers have considered this in the past.

Gcc/clang do the same thing for switch statements.

It improved speed, but not by much and depending on the codelayout I did it was possible to get worse performance in specific cases. But thats true for everything that changes the code layout and isn't a major win I assume.

comment:18 Changed 21 months ago by AndreasK

I think this can be solved in two steps:


Change the Cmm generation for case statements:

Currently we generate cmm of the sort

 ...
 case 1: goto foo;
 ...

foo:
  Register = Value
  return(Register)

Instead we could inline this when generating the switch.

#include "Cmm.h"

test_entry() 
{
switch [1 .. 8] R2 {
   case 1 : {R1 = 0; return();}
   ...
   case 8 : {R1 = 8; return();}
   }
}

From what I've seen this hopefully won't change the generated assembly since GHC does this already when compiling the switch in the cases I looked at.


The second step would be to change the Assembly generated to something like the listing in the comment above.

I imagine a way to do that would be to:

  • Check if
    • Assignments are into the same register.
    • The rest of the code is the same
  • Collect all the constants, put them into a lookup table and generate assembly that uses the lookup table like above.

comment:19 Changed 20 months ago by michalt

Cc: michalt added

comment:20 Changed 20 months ago by simonpj

Description: modified (diff)
Keywords: CodeGen added

comment:21 Changed 19 months ago by AndreasK

Cc: AndreasK added

comment:22 Changed 19 months ago by sjakobi

Cc: sjakobi added
Note: See TracTickets for help on using tickets.