Opened 8 months ago

Last modified 7 months ago

#16243 new task

Improve fregs-graph.

Reported by: AndreasK
Owned by:
Priority: normal
Milestone:
Component: Compiler (CodeGen)
Version: 8.6.3
Keywords: fregs-graph, CodeGen
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets: #7679
Differential Rev(s):
Wiki Page:

Description

I'm creating this as a way to keep track of shortcomings of the current implementation.

Some of these are taken directly from TODOs, others are my own opinions.

Things that could be looked at.

Compress spill slots.

Currently we assign each vreg a fixed spill slot. This is terrible for cache performance.

It's possible to end up with sequences like this:

block_ca01_info:
_ca01:
        movq 16(%rbp),%r14
        movq 24(%rbp),%rax
        movq %rax,144(%rsp)
        movq 32(%rbp),%rsi
        movq 40(%rbp),%rdi
        movq 64(%rbp),%rbx
        movq 56(%rbp),%rax
        movq %rax,208(%rsp)
        movq 48(%rbp),%rax
        movq %rax,168(%rsp)
        movq 8(%rbp),%rax
        movq %rax,216(%rsp)
        addq $8,%rbp

This scatters the spill slots across multiple cache lines, increasing cache pressure.

Things possibly worth trying:

  • Assign spill slots in the order the vregs first appear, hopefully reducing the average spread within code blocks.
  • Dynamically assign vregs to spill slots.

For the latter, if we have large independent subgraphs we can sometimes get away with far fewer slots, simply because non-overlapping vregs can share spill slots.

Doing this efficiently, however, isn't easy.
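
A rough sketch of the latter idea, with hypothetical types (GHC's real live ranges are per-block liveness sets, not simple intervals): walk the ranges in order of start point and recycle any slot whose occupant has already died, linear-scan style.

    import Data.List (sortOn)

    type VReg  = String     -- stand-in for GHC's virtual register type
    type Range = (Int, Int) -- first/last instruction index a vreg is live at

    -- Non-overlapping vregs share spill slots: a slot is recycled as
    -- soon as the range of its current occupant has ended.
    assignSlots :: [(VReg, Range)] -> [(VReg, Int)]
    assignSlots vregs = go (sortOn (fst . snd) vregs) [] [] 0 []
      where
        -- active: (slot, end of occupying range); free: recycled slots
        go [] _active _free _next acc = reverse acc
        go ((v, (s, e)) : rest) active free next acc =
          let (expired, live) = span ((< s) . snd) (sortOn snd active)
              free'           = map fst expired ++ free
          in case free' of
               slot : more -> go rest ((slot, e) : live) more next ((v, slot) : acc)
               []          -> go rest ((next, e) : live) [] (next + 1) ((v, next) : acc)

For example, assignSlots [("a",(0,5)),("b",(2,3)),("c",(6,9))] gives "c" the slot recycled from "b", so two slots suffice where the fixed scheme would use three.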

We sometimes load + spill vregs when we could use memory addressing.

One example I've seen a few times in nofib is this:

	movq 208(%rsp),%rax
	incq %rax
	movq %rax,208(%rsp)

We should replace this with a single incq operating directly on memory, i.e. incq 208(%rsp).
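
In the backend this amounts to a peephole rewrite. A minimal sketch, with a toy instruction type standing in for GHC's real x86 Instr (all names here are illustrative only):

    type Reg  = String
    type Slot = Int

    data Instr
      = Load Slot Reg   -- movq off(%rsp), %reg
      | Inc Reg         -- incq %reg
      | Store Reg Slot  -- movq %reg, off(%rsp)
      | IncMem Slot     -- incq off(%rsp)
      deriving (Eq, Show)

    -- Rewrite load/inc/store on the same slot and register into a
    -- single memory-operand inc. A real pass would also have to check
    -- that the register is dead after the store.
    peephole :: [Instr] -> [Instr]
    peephole (Load s r : Inc r' : Store r'' s' : rest)
      | r == r' && r' == r'' && s == s' = IncMem s : peephole rest
    peephole (i : rest) = i : peephole rest
    peephole []         = []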

Lift the stack size limit.

There is already a patch in flight.

Revisit spill costs.

Currently the allocator always spills the longest live range, which often works but is certainly not a great metric.

Ideas:

Consider loop membership.

At least for x86/64 this would be easy to achieve at basic-block granularity, using the CFG that is also used for block layout.
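
For instance, loop membership can be read straight off the CFG with an SCC pass; a sketch using Data.Graph from containers (BlockId stands in for GHC's real block labels):

    import Data.Graph (SCC (..), stronglyConnComp)

    type BlockId = Int  -- stand-in for GHC's block labels

    -- A block is part of some loop iff it lies in a cyclic SCC of the
    -- CFG (Data.Graph counts self-edges as cyclic). This gives only
    -- membership, not nesting depth, but that already lets us make
    -- spilling inside loops more expensive.
    blocksInLoops :: [(BlockId, [BlockId])] -> [BlockId]
    blocksInLoops cfg = concat [ bs | CyclicSCC bs <- sccs ]
      where
        sccs = stronglyConnComp [ (b, b, succs) | (b, succs) <- cfg ]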

Revisit Chaitin's cost model.

There have been major changes to CPUs (amd64 becoming the norm) and to the backend, so at a minimum this cost model should be re-evaluated.
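
For reference, Chaitin's metric boils down to something like the sketch below (simplified to a single loop depth per vreg rather than a sum over occurrences; the 10x-per-loop-level weight and the division by interference degree are the classic constants, and exactly the numbers worth re-measuring on modern hardware):

    -- Estimated cost of spilling a vreg: the memory operations spilling
    -- would insert, weighted by an assumed 10x execution frequency per
    -- loop-nesting level, divided by the vreg's degree in the
    -- interference graph (spilling a high-degree node frees up more
    -- colouring slack).
    spillCost :: Int -> Int -> Int -> Double
    spillCost defsUses loopDepth degree =
      fromIntegral defsUses * 10 ^^ loopDepth / fromIntegral (max 1 degree)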

Try splitting long live ranges instead of simply spilling the complete live range.

After all, long live ranges were the original reason to switch to a cost heuristic based only on live-range length.

Use a different spill algorithm altogether.

Maybe Chow's live-range-splitting based approach would be an option. See "The Priority-Based Coloring Approach to Register Allocation" (Chow & Hennessy) for details.

Change History (3)

comment:1 Changed 7 months ago by AndreasK

comment:2 Changed 7 months ago by AndreasK

The case for live range splitting.

I've looked a bit at loops recently.

Below is an inner loop from nofib/fannkuch-redux. It's not some random code either: the loop makes up ~30% of the program's execution time.

First off, the code we generate isn't great: we spill/reload variables across the call that are not used in the loop. Even so, the final result is code that uses fewer vregs than we have registers available:

      ca12: // global
          _s9D0::I64 = %MO_UU_Conv_W8_W64(I8[_s9zN::I64]);
          if (_s9D0::I64 != 0) goto ca1f; else goto ca1i;
      ca1f: // global
          I64[Sp - 16] = block_ca1d_info;
          R3 = _s9zN::I64 + _s9D0::I64;
          R2 = _s9zN::I64;
          I64[Sp - 8] = _s9CV::I64;
          I64[Sp] = _s9Cd::I64;
          I64[Sp + 40] = _s9Cc::I64;
          I64[Sp + 48] = _s9Cb::I64;
          I64[Sp + 56] = _s9zN::I64;
          Sp = Sp - 16;
          call $wflopp_r9x0_info(R3,
                                 R2) returns to ca1d, args: 8, res: 8, upd: 8;
      ca1d: // global
          _s9z5::I64 = I64[Sp + 24];
          _s9zv::I64 = I64[Sp + 32];
          _s9zx::I64 = I64[Sp + 40];
          _s9zz::I64 = I64[Sp + 48];
          _s9zN::I64 = I64[Sp + 72];
          _s9Cb::I64 = I64[Sp + 64];
          _s9Cc::I64 = I64[Sp + 56];
          _s9Cd::I64 = I64[Sp + 16];
          _s9CV::I64 = I64[Sp + 8] + 1;
          Sp = Sp + 16;
          goto ca12;

We get this liveness:

        ca12:
                movzbl (%vI_s9zN),%vI_s9D0
                    # born:    %vI_s9D0
                testq %vI_s9D0,%vI_s9D0
                jne _ca1f
                    # r_dying: %vI_s9D0
                jmp _ca1i
                    # r_dying: %vI_s9z5 %vI_s9zv %vI_s9zx %vI_s9zz %vI_s9zN %vI_s9Cb %vI_s9Cc %vI_s9Cd %vI_s9CV
     NONREC
        ca1f:
                movq $block_ca1d_info,-16(%rbp)
                movq %vI_s9zN,%rsi
                    # born:    %r4
                addq %vI_s9D0,%rsi
                    # r_dying: %vI_s9D0
                movq %vI_s9zN,%r14
                    # born:    %r14
                movq %vI_s9CV,-8(%rbp)
                    # r_dying: %vI_s9CV
                movq %vI_s9Cd,(%rbp)
                    # r_dying: %vI_s9Cd
                movq %vI_s9Cc,40(%rbp)
                    # r_dying: %vI_s9Cc
                movq %vI_s9Cb,48(%rbp)
                    # r_dying: %vI_s9Cb
                movq %vI_s9zN,56(%rbp)
                    # r_dying: %vI_s9zN
                addq $-16,%rbp
                jmp $wflopp_r9x0_info
                    # r_dying: %r4 %r14
     NONREC
        ca1d:
                movq 24(%rbp),%vI_s9z5
                    # born:    %vI_s9z5
                movq 32(%rbp),%vI_s9zv
                    # born:    %vI_s9zv
                movq 40(%rbp),%vI_s9zx
                    # born:    %vI_s9zx
                movq 48(%rbp),%vI_s9zz
                    # born:    %vI_s9zz
                movq 72(%rbp),%vI_s9zN
                    # born:    %vI_s9zN
                movq 64(%rbp),%vI_s9Cb
                    # born:    %vI_s9Cb
                movq 56(%rbp),%vI_s9Cc
                    # born:    %vI_s9Cc
                movq 16(%rbp),%vI_s9Cd
                    # born:    %vI_s9Cd
                movq 8(%rbp),%vI_nakO
                    # born:    %vI_nakO
                leaq 1(%vI_nakO),%vI_s9CV
                    # born:    %vI_s9CV
                    # r_dying: %vI_nakO
                addq $16,%rbp
                jmp _ca12
                    # r_dying: %vI_s9z5 %vI_s9zv %vI_s9zx %vI_s9zz %vI_s9zN %vI_s9Cb %vI_s9Cc %vI_s9Cd %vI_s9CV

However the graph register allocator spills many of these:

  • vI_s9zN - 5 uses
  • vI_s9Cd - 2 uses
  • and a few more, but the details don't matter much.

With live range splitting we would ideally split each range that overlaps the loop into three parts, so that spills and reloads happen on loop entry/exit without affecting the performance of the loop itself.
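
As a sketch of the idea, with ranges and the loop simplified to instruction-index intervals (real live ranges are per-block liveness sets, and loops can have several exits):

    type Range = (Int, Int)  -- interval of instruction indices

    -- Split a range spanning a loop into before/inside/after parts.
    -- Each part can then be coloured or spilled independently, so any
    -- spill code for a segment lands at the loop boundary instead of
    -- inside the loop body.
    splitAroundLoop :: Range  -- the loop, as its (entry, exit) interval
                    -> Range  -- the live range to split
                    -> [Range]
    splitAroundLoop (entry, exit) (s, e)
      | s < entry && e > exit = [(s, entry - 1), (entry, exit), (exit + 1, e)]
      | otherwise             = [(s, e)]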

comment:3 Changed 7 months ago by simonpj

Keywords: CodeGen added