wiki:Commentary/Compiler/CodeLayout

Version 3 (modified by AndreasK, 19 months ago) (diff)

--

Block Layout

As expected GHC's code generator eventually produces a list of basic blocks which are then layout out sequentially before finally being translated to a binary.

GHCs current block layout algorithm.

GHC's current way of doing so is very basic.

  • Build a graph where basic blocks are nodes.
  • Add a edge for each jump at the end of a basic block.
  • Find sets of strongly connected components.
  • For each set flatten it resulting in a list of basic blocks.
  • Finally place all of these lists after each other.

This works well when all important jumps are at the end of a block. However it comes with two problems:

  • There is no way to incorporate edge weight information directly.
  • Important edges might be represented by conditional jumps not at the end of an block or calls. Both of which are not considered at all.

Impact on performance

Modern CPUs are complicated so predicting performance is difficult at the best of times. After all we want to optimize for:

  • Efficiently using L1 cache (64byte lines)
  • Efficiently using the fetch window (16Byte)
  • Efficiently using the DSB.
  • Match alignment with instruction boundries.

We can't really do much about alignments past the first block. GHC doesn't know about the length of instructions and only emits assembly. So we can't calculate where the alignment of blocks/instructions ends up. Changing this would be a major change.

For the other factors there a lot of ways to improve things and even more to make things worse. But the simplest thing we CAN do is to try and place blocks which are executed consecutively in execution order.

Factorial example

For highlighting the basic issues we don't have to look far. The factorial function already shows some of the problems with ghcs algorithm. Even if they are NOT an issue for this function,

Consider the following Cmm code for factorial:

Cmm Code

       c4Ba: stack check
           if ((Sp + -16) < SpLim) (likely: False) goto c4Bb; else goto c4Bc;
       c4Bb: run gc if stack exhausted
           R2 = R2;
           R1 = $wfactorial_closure;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       c4Bc: compare against zero
           if (R2 != 0) goto c4B8; else goto c4B9;
       c4B8: call factorial
           I64[Sp - 16] = c4Bh;
           _s4A8::I64 = R2;
           R2 = R2 - 1;
           I64[Sp - 8] = _s4A8::I64;
           Sp = Sp - 16;
           call $wfactorial_info(R2) returns to c4Bh, args: 8, res: 8, upd: 8;
       c4Bh: x * factorial (x-1)
           R1 = I64[Sp + 8] * R1;
           Sp = Sp + 16;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c4B9: x = 1
           R1 = 1;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;

Control Flow

digraph {
	check_stack -> run_gc[label="exhausted",weight="10"];
	check_stack -> "x/=0"[label="ok",weight="90"];
	".. factorial(x-1)" -> "x*call_result"[label="10",weight="10"];
	"x/=0" -> ".. factorial(x-1)"[label="True",weight="49"];
	"x/=0" -> "ret 1"[label="False",weight="51"];
}

Actual Control Flow

  • Repeat:
    • check_stack
    • x/=0
    • call factorial
  • ret 1
  • repeat
    • ret (x*call_result)
  • done

Generated Layout

  • check_stack
  • x/=0
  • ret 1
  • ret (x*call_result)
  • run_gc
  • call factorial

Possible improvements:

Cmm does not know that the argument is usually not null so placing ret 1 after the comparison is bad but the best we can do.

However the current algorithm also can't be easily expanded even if we had this information. After all, only one jump per block is ever considered. And even after that hurdle we still have no way to assign weights to edges.

Ideally we would also want to place call factorial next to ret (x*call_result). After all there is clearly an edge between these blocks.

It's not an big issue for factorial in particular. But if we call a small function from within a loop this would be advantageous.

What I looked into:

If we want to use the information present in Cmm we have to make it available when it matters. Which is a few steps down where cmm has been turned into platform dependent instructions already.

So I started out with building a CFG which is essentially the CmmGraph stripped of the actual code.

Building a CFG from Cmm

We simply start at the entry block and recursivly add all edges. But further we also weigh them by the type of jump.

Example Heuristic:

Construct Weight(s)
goto 100
If/Else 49/51
If/Else with likelyhood 10/90
Call with return label 10

The main issue with this is that we have to update the cfg with any changes to the control flow the asm passes introduce after we have created the initial cfg. While it was quiet annoying to track down all the places where this happens in the end with some persistence I found all of these.<sup>At least for x64 ...<sup>I hope ...</sup> </sup>

For x64 these passes change the cfg:

Linear register allocation

Here we add blocks when we join two control flow paths. I solved this by tracking inserted nodes in the register allocator state and updating the cfg afterwards.

Generating instructions from Cmm

Initially surprisingly when we have code like below operating on floats.

  if (a < b) then { goto L1; } else { goto L2; }

We insert a new block L3: goto L2 and the generated code jumps to L3 if either (a < b) or if the floats are unordered. Floating point is truely a headache.

The reason is that if (a,b) is unordered we consider a < b to be false. So we have to check if they are unordered BEFORE we check if a < b.

To make it even more complicated we also can't generate a jump directly to the false branch for the parity check. In the place where we generate the floating point comparison we don't know the label of the false branch.

Long story short. It's a mess.

Shortcutting

Shortcutting is removing blocks which only transfer controlflow to another block. So we reduce:

A: ... ; goto B;
B: goto C;
C: ...

to A: goto C; C: ...

This was obvious and the easiest place to adjust.

Place blocks based on this graph

We can then construct placements based on this graph.

The basic idea is to find many long sequences of blocks which should be placed in series. Then place these next to each other. Very much alike the current algorithm does but based on a weighted digraph instead.

The best sequences intuitively have heavy edges. So what I ended up doing is visiting each block in reverse postorder and either append it to an existing sequence if it's a heavy edge. Or create a new sequence consisting of just this block.

Since this is a very greedy algorithm it misses some cases. So there is some more edge case handling involved. But it doesn't change the fundamental approach.

Preliminary results:

Which is disappointing. I've combined a earlier version of my layout branch with my patch to detect error branches as unlikely and recursion as likely which gave a improvment of -0.8% which is nice.

However as a standalone patch it currently makes performance worse:

         lambda           0.0%      0.0%    +22.4%    +20.8%      0.0%
             FS           0.0%      0.0%     +9.8%     +9.7%      0.0%
--------------------------------------------------------------------------------
            Min          -0.1%      0.0%     -1.7%     -3.2%      0.0%
            Max          +0.0%      0.0%    +22.4%    +20.8%      0.0%
 Geometric Mean          -0.0%     -0.0%     +1.1%     +0.8%     -0.0%

FS seems like it is just a edge case for the new layout algorithm. With lambda I have no idea what is up. The benchmark basically runs two functions: main = do { mainSimple ; mainMonad }.

It seems like some edge case I didn't consider. But so far I haven't really been able to pinpoint why it get's faster by so much.

Removing either one the new layout code is faster. Using -XStrict[Data] the new one is faster. However when padding the whole executable to change alignment the difference stays about the same. So it doesn't seem to be an alignment issue.

Performance counters indicate the DSB buffer being the culprit. But I haven't yet been able to find out how so.

Things left to do:

It's not clear why the lambda case is so slow.

Besides that another question is how calls would be best handled. If we have a small function f we really want to keep a sequence like

A: ...;
B: ...;
   call f returns to C;
C: ...;

a sequence when laying out the code since there is a good chance blocks B and C will overlap in the cache.

On the other hand if f is a large function then by the time we return any benefit we can gain by putting C right after B is gone. (Cache lines have been evicted, buffers invalidated, ...).

Conclusion

I've dug pretty deep into this. But with the results as they are this does not seem ready for production. Given the time constraints of GSOC I will for now shelf my patch and probably revisit it when I work on likelyhood information again.

I will put the patch up on phabricator once I have cleaned it up a bit. If you have ideas or feedback hit me on IRC or comment on trac #15124.

Update

After some tweaking this patch https://phabricator.haskell.org/D4726 lowered runtime by 0.5% on Haswell and Skylake.

The primary change made compared to the above was to ignore edges based on call returns for code layout. It also special cases the typical "check tag and evaluate if required" control flow by putting the evaluation code right after the check.