Changes between Initial Version and Version 1 of Commentary/Compiler/CodeLayout

May 19, 2018 8:00:48 PM (20 months ago)



= Block Layout
As expected, GHC's code generator eventually produces a list of basic blocks,
which are then laid out sequentially before finally being translated into
a binary.
== GHC's current block layout algorithm
GHC's current way of doing so is very basic.
* Build a graph where basic blocks are nodes.
* Add an edge for each jump at the end of a basic block.
* Find the sets of strongly connected components.
* Flatten each set, resulting in a list of basic blocks.
* Finally, place all of these lists one after another.
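The steps above can be sketched with `Data.Graph` from the containers package. This is a hypothetical standalone sketch, not GHC's real implementation: `BlockId`, the edge-list input, and `layoutBlocks` are illustrative stand-ins for the NCG's own types.

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)
import Data.List (sort)

-- Illustrative stand-in for GHC's block identifiers.
type BlockId = Int

-- Sketch of the current scheme: blocks are nodes, one edge per jump
-- at the end of a block; flatten each SCC and concatenate the results.
layoutBlocks :: [(BlockId, [BlockId])] -> [BlockId]
layoutBlocks blocks = concatMap flattenSCC (stronglyConnComp nodes)
  where
    nodes = [ (b, b, succs) | (b, succs) <- blocks ]
```

Note that the sketch only uses the successor lists; nothing about how likely each jump is survives into the placement, which is exactly the limitation discussed below.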
This works well when all important jumps are at the end of a block.
However, it comes with two problems:

* There is no way to incorporate edge weight information directly.
* Important edges might be represented by conditional jumps not at the end of a block,
  or by calls. Neither of these is considered at all.
== Impact on performance
Modern CPUs are complicated, so predicting performance is difficult
at the best of times. After all, we want to optimize for:

* Efficiently using the L1 instruction cache (64-byte lines)
* Efficiently using the fetch window (16 bytes)
* Efficiently using the DSB (decoded stream buffer)
* Matching alignment with instruction boundaries
We can't really do much about alignment past the first block. GHC doesn't know the length of instructions and only emits assembly, so we can't calculate where the alignment of blocks/instructions ends up. Changing this would be a major change.
For the other factors there are a lot of ways to improve things, and even more ways to make them worse. But the simplest thing we CAN do is to try to place blocks which are executed consecutively in execution order.
== Factorial example
For highlighting the basic issues we don't have to look far: the factorial function already shows some of the problems with GHC's algorithm, even if they are NOT a real issue for this particular function.
Consider the following Cmm code for factorial:
=== Cmm Code
{{{
c4Ba: // stack check
    if ((Sp + -16) < SpLim) (likely: False) goto c4Bb; else goto c4Bc;
c4Bb: // run gc if stack exhausted
    R2 = R2;
    R1 = $wfactorial_closure;
    call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
c4Bc: // compare against zero
    if (R2 != 0) goto c4B8; else goto c4B9;
c4B8: // call factorial
    I64[Sp - 16] = c4Bh;
    _s4A8::I64 = R2;
    R2 = R2 - 1;
    I64[Sp - 8] = _s4A8::I64;
    Sp = Sp - 16;
    call $wfactorial_info(R2) returns to c4Bh, args: 8, res: 8, upd: 8;
c4Bh: // x * factorial (x-1)
    R1 = I64[Sp + 8] * R1;
    Sp = Sp + 16;
    call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
c4B9: // x = 1
    R1 = 1;
    call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}}}
=== Control Flow
{{{
digraph {
        check_stack -> run_gc[label="exhausted",weight="10"];
        check_stack -> "x/=0"[label="ok",weight="90"];
        ".. factorial(x-1)" -> "x*call_result"[label="10",weight="10"];
        "x/=0" -> ".. factorial(x-1)"[label="True",weight="49"];
        "x/=0" -> "ret 1"[label="False",weight="51"];
}
}}}
=== Actual Control Flow
* Repeat:
  * check_stack
  * x/=0
  * call factorial
* ret 1
* Repeat:
  * ret (x*call_result)
* done
=== Generated Layout
* check_stack
* x/=0
* ret 1
* ret (x*call_result)
* run_gc
* call factorial
== Possible improvements:
Cmm does not know that the argument is usually not zero, so placing `ret 1` directly after the comparison is bad, but it is the best we can do.
However, the current algorithm also can't easily be extended, even if we had this information. After all, only one jump per block is ever considered. And even past that hurdle we still have no way to assign weights to edges.
Ideally we would also want to place `call factorial` next to `ret (x*call_result)`. After all, there is clearly an edge between these blocks.
It's not a big issue for factorial in particular. But if we call a small function from
within a loop this would be advantageous.
== What I looked into:
If we want to use the information present in Cmm, we have to make it available when it matters. That is a few steps down the pipeline, where
Cmm has already been turned into platform-dependent instructions.
So I started out by building a CFG, which is essentially the CmmGraph stripped of the actual code.
=== Building a CFG from Cmm
We simply start at the entry block and recursively add all edges.
Furthermore, we also weigh them by the type of jump.
Example heuristic:
|| Construct || Weight(s) ||
|| goto || 100 ||
|| If/Else || 49/51 ||
|| If/Else with likelihood || 10/90 ||
|| Call with return label || 10 ||
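A hedged sketch of this weighting, mirroring the table above. The data type, constructor names, and the choice of which If/Else branch gets 49 versus 51 are all illustrative assumptions, not GHC's actual edge representation:

```haskell
-- Illustrative edge classification. In `Cond`, the first Bool says
-- whether this edge is the True branch; the Maybe carries an optional
-- likelihood annotation (the branch annotated as likely).
data JumpKind
  = Goto                     -- unconditional jump
  | Cond Bool (Maybe Bool)   -- conditional branch, maybe with likelihood
  | CallReturn               -- edge from a call to its return label

-- Weights as in the table: goto 100, If/Else 49/51,
-- If/Else with likelihood 10/90, call-return 10.
-- Which plain If/Else branch gets 49 is an assumption here.
edgeWeight :: JumpKind -> Int
edgeWeight Goto                        = 100
edgeWeight (Cond isTrue Nothing)       = if isTrue then 49 else 51
edgeWeight (Cond isTrue (Just likely)) = if isTrue == likely then 90 else 10
edgeWeight CallReturn                  = 10
```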
The main issue with this is that we have to update the CFG with any changes to
the control flow that the asm passes introduce after we have created the initial CFG.
While it was quite annoying to track down all the places where this happens, in the
end, with some persistence, I found all of them.<sup>At least for x64 ...<sup>I hope ...</sup></sup>
For x64 these passes change the CFG:
==== Linear register allocation
Here we add blocks when we join two control flow paths. I solved this by tracking inserted nodes in the register allocator state and updating the CFG afterwards.
==== Generating instructions from Cmm
Surprisingly at first, this happens when we have code like the following operating on floats.
{{{
if (a < b) then { goto L1; } else { goto L2; }
}}}
We insert a new block `L3: goto L2`, and the generated code jumps to L3 when the floats are unordered, since an unordered comparison counts as false. Floating point is truly a headache.
The reason is that if (a,b) is unordered, we consider a < b to be false.
So we have to check whether they are unordered BEFORE we check if a < b.
To make it even more complicated, we also can't generate a jump directly to the false branch for the parity check:
in the place where we generate the floating point comparison, we don't know the label of the false branch.
Long story short: it's a mess.
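The semantics the backend must preserve can be demonstrated in plain Haskell: any comparison involving NaN (an unordered pair) must come out False, which is why the unordered case has to branch to the false label. The `ltChecked` name is made up for illustration:

```haskell
-- NaN compares False with everything, including itself.
nan :: Double
nan = 0 / 0

-- What the generated branch sequence must implement: first the
-- unordered (parity) check, which forces False, then the ordinary
-- comparison.
ltChecked :: Double -> Double -> Bool
ltChecked a b
  | isNaN a || isNaN b = False   -- unordered case: a < b is false
  | otherwise          = a < b
```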
==== Shortcutting
Shortcutting is removing blocks which only transfer control flow to another block.
So we reduce:

{{{
A: ... ; goto B;
B: goto C;
C: ...
}}}

to `A: goto C; C: ...`
This was obvious and the easiest place to adjust.
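The core of shortcutting can be sketched as a transitive map lookup. The map type and the `shortcut` name are made up here; GHC's real pass works on its native block and jump types:

```haskell
import qualified Data.Map as Map

-- Illustrative stand-in for GHC's block identifiers.
type BlockId = Int

-- Map each trivial block (one that only does `goto target`) to its
-- target. Following the map transitively collapses chains like
-- B -> C -> D into a single jump. Assumes there are no cycles of
-- trivial gotos (which would be an infinite loop anyway).
shortcut :: Map.Map BlockId BlockId -> BlockId -> BlockId
shortcut m b = maybe b (shortcut m) (Map.lookup b m)
```

Every jump target in the program is then rewritten through `shortcut`, and the trivial blocks themselves are dropped.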
=== Place blocks based on this graph
We can then construct placements based on this graph.

The basic idea is to find many long sequences of blocks which should be placed
in series, then place these next to each other, much like the current
algorithm does, but based on a weighted digraph instead.
Intuitively, the best sequences have heavy edges.
So what I ended up doing is visiting each block in reverse postorder and either
appending it to an existing sequence, if it is reached via a heavy edge, or creating a new sequence consisting of
just this block.
Since this is a very greedy algorithm it misses some cases, so there is some more edge case handling
involved. But that doesn't change the fundamental approach.
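The greedy pass might look roughly like this. It is a simplified sketch: the types, the weight threshold of 90, and the treatment of ties are all assumptions, and the real code handles more edge cases:

```haskell
-- Illustrative stand-ins for GHC's types.
type BlockId = Int
type Weight  = Int

-- Greedy chain building: visit blocks in (assumed) reverse postorder;
-- if some open chain ends in a block with a heavy edge to the current
-- block, append to that chain, otherwise start a new singleton chain.
buildChains :: [((BlockId, BlockId), Weight)]  -- weighted edges
            -> [BlockId]                       -- blocks in reverse postorder
            -> [[BlockId]]
buildChains edges = foldl place []
  where
    heavy a b = any (\((x, y), w) -> x == a && y == b && w >= 90) edges
    place chains b =
      case break (\chain -> heavy (last chain) b) chains of
        (pre, chain : post) -> pre ++ [chain ++ [b]] ++ post
        (_, [])             -> chains ++ [[b]]
```

The resulting chains are then concatenated to give the final layout, heaviest-connected blocks staying adjacent.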
=== Preliminary results:
They are disappointing. I've combined an earlier version of my layout branch with my patch to detect error branches as unlikely and recursion as likely, which gave an improvement of -0.8%, which is nice.
However, as a standalone patch it currently makes performance worse:
{{{
         lambda           0.0%      0.0%    +22.4%    +20.8%      0.0%
             FS           0.0%      0.0%     +9.8%     +9.7%      0.0%

            Min          -0.1%      0.0%     -1.7%     -3.2%      0.0%
            Max          +0.0%      0.0%    +22.4%    +20.8%      0.0%
 Geometric Mean          -0.0%     -0.0%     +1.1%     +0.8%     -0.0%
}}}
FS seems like it is just an edge case for the new layout algorithm.
With lambda I have no idea what is going on. The benchmark basically runs two functions: `main = do { mainSimple ; mainMonad }`.

It seems like some edge case I didn't consider. But so far I haven't really been able to pinpoint why it gets slower by so much.
Removing either of the two, the new layout code is faster. Using -XStrict[Data], the new one is faster.
However, when padding the whole executable to change alignment, the difference stays about the same. So it doesn't seem to be an alignment issue.
Performance counters indicate the DSB buffer being the culprit. But I haven't yet been able to find out how exactly.
=== Things left to do:
It's not clear why the lambda case is so slow.
Besides that, another question is how calls would be best handled.
If we have a small function f, we really want to keep a sequence like

{{{
A: ...;
B: ...;
   call f returns to C;
C: ...;
}}}

a sequence when laying out the code, since there is a good chance blocks B and C will overlap in the cache.

On the other hand, if f is a large function, then by the time we return, any benefit we could gain by putting C
right after B is gone (cache lines have been evicted, buffers invalidated, ...).
=== Conclusion
I've dug pretty deep into this, but with the results as they are, it does not seem ready for production.
Given the time constraints of GSOC, I will shelve my patch for now and probably revisit it when I work on
likelihood information again.
I will put the patch up on Phabricator once I have cleaned it up a bit. If you have ideas or feedback, hit me up on IRC
or comment on trac #15124.