Commentary/Compiler/CodeLayout (Version 1)

Timestamp: May 19, 2018 8:00:48 PM
Author: AndreasK

= Block Layout

As expected, GHC's code generator eventually produces a list of basic blocks
which are then laid out sequentially before finally being translated to
a binary.

== GHC's current block layout algorithm

GHC's current way of doing so is very basic:

* Build a graph where basic blocks are nodes.
* Add an edge for each jump at the end of a basic block.
* Find the sets of [strongly connected components](https://en.wikipedia.org/wiki/Strongly_connected_component).
* Flatten each set, resulting in a list of basic blocks.
* Finally, place all of these lists one after another.

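The steps above can be sketched with `Data.Graph` from the containers package; the `Block` type here is a made-up stand-in for GHC's real basic blocks, not the actual compiler types:

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)

-- Hypothetical stand-in for a basic block: a label plus the labels
-- of the blocks its final jump can target.
data Block = Block { blockLabel :: Int, blockTargets :: [Int] }

-- Build the graph, find the strongly connected components, flatten
-- each component, and concatenate the results: the current scheme.
layout :: [Block] -> [Int]
layout blocks = concatMap flattenSCC (stronglyConnComp nodes)
  where
    nodes = [ (blockLabel b, blockLabel b, blockTargets b) | b <- blocks ]
```

Note that `stronglyConnComp` returns components in reverse topological order, and nothing in this scheme looks at how hot an edge is.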
This works well when all important jumps are at the end of a block.
However, it comes with two problems:

* There is no way to incorporate edge weight information directly.
* Important edges might be represented by conditional jumps not at the end of a block,
  or by calls. Neither of these is considered at all.

== Impact on performance

Modern CPUs are complicated, so predicting performance is difficult
at the best of times. After all, we want to optimize for:

* Efficiently using the L1 cache (64-byte lines)
* Efficiently using the fetch window (16 bytes)
* Efficiently using the DSB (the decoded µop cache)
* Matching alignment with instruction boundaries

We can't really do much about alignment past the first block. GHC doesn't know the length of instructions and only emits assembly, so we can't calculate where blocks/instructions end up relative to alignment boundaries. Changing this would be a major undertaking.

For the other factors there are a lot of ways to improve things, and even more ways to make things worse. But the simplest thing we CAN do is to try to place blocks which are executed consecutively in execution order.

== Factorial example

To highlight the basic issues we don't have to look far. The factorial function already shows some of the problems with GHC's algorithm, even if they are NOT a serious issue for this particular function.

Consider the following Cmm code for factorial:

=== Cmm Code

{{{
       c4Ba: stack check
           if ((Sp + -16) < SpLim) (likely: False) goto c4Bb; else goto c4Bc;
       c4Bb: run gc if stack exhausted
           R2 = R2;
           R1 = $wfactorial_closure;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       c4Bc: compare against zero
           if (R2 != 0) goto c4B8; else goto c4B9;
       c4B8: call factorial
           I64[Sp - 16] = c4Bh;
           _s4A8::I64 = R2;
           R2 = R2 - 1;
           I64[Sp - 8] = _s4A8::I64;
           Sp = Sp - 16;
           call $wfactorial_info(R2) returns to c4Bh, args: 8, res: 8, upd: 8;
       c4Bh: x * factorial (x-1)
           R1 = I64[Sp + 8] * R1;
           Sp = Sp + 16;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       c4B9: x = 1
           R1 = 1;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}}}

=== Control Flow

{{{
digraph {
        check_stack -> run_gc[label="exhausted",weight="10"];
        check_stack -> "x/=0"[label="ok",weight="90"];
        ".. factorial(x-1)" -> "x*call_result"[label="10",weight="10"];
        "x/=0" -> ".. factorial(x-1)"[label="True",weight="49"];
        "x/=0" -> "ret 1"[label="False",weight="51"];
}
}}}


=== Actual Control Flow

* Repeat:
  * check_stack
  * x/=0
  * call factorial
* ret 1
* Repeat:
  * ret (x*call_result)
* done

=== Generated Layout

* check_stack
* x/=0
* ret 1
* ret (x*call_result)
* run_gc
* call factorial

== Possible improvements:

Cmm does not know that the argument is usually not zero, so placing `ret 1` directly after the comparison is bad, but it's the best we can do with the information available.

However, the current algorithm also can't easily be extended even if we had this information. After all, only one jump per block is ever considered. And even past that hurdle, we still have no way to assign weights to edges.

Ideally we would also want to place `call factorial` next to `ret (x*call_result)`. After all, there is clearly an edge between these blocks.

It's not a big issue for factorial in particular, but if we call a small function from
within a loop this would be advantageous.

== What I looked into:

If we want to use the information present in Cmm, we have to make it available where it matters: a few steps further down the pipeline, where
Cmm has already been turned into platform-dependent instructions.

So I started out with building a CFG which is essentially the CmmGraph stripped of the actual code.

=== Building a CFG from Cmm

We simply start at the entry block and recursively add all edges.
Furthermore, we weigh each edge by the type of jump it represents.

Example heuristic:

|| Construct                || Weight(s) ||
|| goto                     || 100       ||
|| If/Else                  || 49/51     ||
|| If/Else with likelihood  || 10/90     ||
|| Call with return label   || 10        ||

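The table above can be expressed as a simple weighting function. The `EdgeKind` type here is a made-up stand-in for whatever GHC tracks internally, not the compiler's actual representation:

```haskell
-- Hypothetical classification of the jump that creates a CFG edge.
-- The Bool marks whether this edge is the likely side of the branch.
data EdgeKind
  = Goto            -- unconditional jump
  | IfTrue          -- "then" side, no likelihood info
  | IfFalse         -- "else" side, no likelihood info
  | IfLikely Bool   -- branch side with likelihood info
  | CallReturn      -- edge from a call to its return label

-- Weights mirroring the heuristic table above.
edgeWeight :: EdgeKind -> Int
edgeWeight Goto             = 100
edgeWeight IfTrue           = 49
edgeWeight IfFalse          = 51
edgeWeight (IfLikely True)  = 90
edgeWeight (IfLikely False) = 10
edgeWeight CallReturn       = 10
```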
The main issue with this is that we have to update the CFG with any changes to
the control flow that the asm passes introduce after we have created the initial CFG.
While it was quite annoying to track down all the places where this happens, in the
end, with some persistence, I found all of them.<sup>At least for x64 ... I hope ...</sup>

For x64 these passes change the CFG:

==== Linear register allocation

Here we add blocks when we join two control flow paths. I solved this by tracking inserted nodes in the register allocator state and updating the CFG afterwards.

==== Generating instructions from Cmm

Somewhat surprisingly, this happens when we have code like the following operating on floats:
```
  if (a < b) then { goto L1; } else { goto L2; }
```

We insert a new block `L3: goto L2`, and the generated code jumps to L3 if the floats are unordered. Floating point is truly a headache.

The reason is that if (a,b) is unordered we consider a < b to be false.
So we have to check whether they are unordered BEFORE we check if a < b.

To make it even more complicated, we also can't generate a jump directly to the false branch for the parity check:
in the place where we generate the floating point comparison we don't know the label of the false branch.

Long story short: it's a mess.

==== Shortcutting

Shortcutting means removing blocks which do nothing but transfer control flow to another block.
So we reduce:
```
A: ... ; goto B;
B: goto C;
C: ...
```
to `A: goto C; C: ...`

This was the most obvious case and the easiest place to adjust.

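As a sketch, shortcutting amounts to following chains of trivial `goto` blocks to their final target. The map representation and the assumption that trivial blocks never form a cycle are mine, not GHC's:

```haskell
import qualified Data.Map.Strict as M

type Label = Int

-- Map from each trivial block (one containing only `goto target`)
-- to its jump target. Assumes trivial blocks never form a cycle.
type Shortcuts = M.Map Label Label

-- Follow chains like B -> C -> D until reaching a non-trivial block.
shortcut :: Shortcuts -> Label -> Label
shortcut trivial l =
  case M.lookup l trivial of
    Just l' -> shortcut trivial l'
    Nothing -> l
```

Every jump in the program can then be rewritten to `shortcut trivial target`, after which the trivial blocks are dead and can be dropped.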
=== Place blocks based on this graph

We can then construct placements based on this graph.

The basic idea is to find many long sequences of blocks which should be placed
in series, and then place these sequences next to each other. This is much like what the current
algorithm does, but based on a weighted digraph instead.

Intuitively, the best sequences follow heavy edges.
So what I ended up doing is visiting each block in reverse postorder and either
appending it to an existing sequence if it is reached by a heavy edge, or creating a new sequence consisting of
just this block.

Since this is a very greedy algorithm it misses some cases, so there is some additional edge case handling
involved. But that doesn't change the fundamental approach.

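A stripped-down sketch of that greedy chaining step, without the extra edge case handling. The edge representation and the "heavy" threshold are assumptions of mine:

```haskell
import qualified Data.Map.Strict as M

type Label = Int

-- Threshold above which an edge counts as heavy; purely illustrative.
heavy :: Int
heavy = 50

-- Given weighted edges (from, to, weight) and the traversal order of
-- the blocks, either append each block to a sequence ending in one of
-- its heavy predecessors, or start a new sequence with just that block.
chain :: [(Label, Label, Int)] -> [Label] -> [[Label]]
chain edges order = M.elems (foldl step M.empty order)
  where
    -- The map is keyed by the last block of each open sequence.
    step seqs b =
      case [ f | (f, t, w) <- edges, t == b, w >= heavy, M.member f seqs ] of
        (f:_) -> M.insert b (seqs M.! f ++ [b]) (M.delete f seqs)
        []    -> M.insert b [b] seqs
```

The resulting sequences are then concatenated to give the final layout, heaviest chains kept intact.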
=== Preliminary results:

They are disappointing. I've combined an earlier version of my layout branch with my patch to [detect error branches as unlikely and recursion as likely](https://phabricator.haskell.org/D4327), which gave an improvement of -0.8%, which is nice.

However, as a standalone patch it currently makes performance worse:

```
         lambda           0.0%      0.0%    +22.4%    +20.8%      0.0%
             FS           0.0%      0.0%     +9.8%     +9.7%      0.0%
--------------------------------------------------------------------------------
            Min          -0.1%      0.0%     -1.7%     -3.2%      0.0%
            Max          +0.0%      0.0%    +22.4%    +20.8%      0.0%
 Geometric Mean          -0.0%     -0.0%     +1.1%     +0.8%     -0.0%
```

FS seems like it is just an edge case for the new layout algorithm.
With lambda I have no idea what is going on. The benchmark basically runs two functions: `main = do { mainSimple ; mainMonad }`.

It seems like some edge case I didn't consider, but so far I haven't been able to pinpoint why it changes by so much.

Removing either of the two functions, the new layout code is faster. Using -XStrict[Data] the new one is faster as well.
However, when padding the whole executable to change alignment the difference stays about the same, so it doesn't seem to be an alignment issue.

Performance counters indicate the DSB buffer being the culprit, but I haven't yet been able to find out why.

=== Things left to do:

It's not clear why the lambda case is so slow.

Besides that, another question is how calls would best be handled.
If we have a small function f we really want to keep a sequence like

{{{
A: ...;
B: ...;
   call f returns to C;
C: ...;
}}}

intact when laying out the code, since there is a good chance blocks B and C will overlap in the cache.

On the other hand, if f is a large function, then by the time we return any benefit we could gain by putting C
right after B is gone (cache lines have been evicted, buffers invalidated, ...).

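One conceivable way to express that trade-off would be to make the weight of a call's return edge depend on the callee's size. Both the threshold and the shape of this function are pure guesses, not anything GHC does:

```haskell
-- Hypothetical weight for the edge from a call site to its return
-- label: keep it heavy only when the callee is small enough that the
-- caller's cache lines plausibly survive the call.
returnEdgeWeight :: Int -> Int
returnEdgeWeight calleeInstructions
  | calleeInstructions <= smallFunction = 10
  | otherwise                           = 0
  where
    smallFunction = 20  -- instruction-count threshold; purely illustrative
```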
=== Conclusion

I've dug pretty deep into this, but with the results as they are this does not seem ready for production.
Given the time constraints of GSOC I will shelve my patch for now and probably revisit it when I work on
likelihood information again.

I will put the patch up on Phabricator once I have cleaned it up a bit. If you have ideas or feedback, ping me on IRC
or comment on [trac #15124](https://ghc.haskell.org/trac/ghc/ticket/15124).