Opened 16 months ago

Last modified 8 months ago

#15124 new task

Improve block layout for the NCG

Reported by: AndreasK Owned by: AndreasK
Priority: normal Milestone: 8.10.1
Component: Compiler (NCG) Version: 8.2.2
Keywords: CodeGen Cc: jmct
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #8082 #14672 Differential Rev(s): Phab:D4726
Wiki Page:

Description

The current codegen tries to place two blocks A and B next to each other if they are "close".

Drawbacks of the current algorithm

Blocks are only considered close if one jumps to the other via a unconditional jump instructions. This is a fast and reasonable approximation. But we might be able to do better.

Calls are not considered even if they always return to the same block.

For example it's clear that if foo == bar we jump to C here:

A:
	movq $block_C_info,-16(%rbp)
        cmp foo,bar
	jne D
B:
	jmp *(%rbx) #Returns to C

        ...

.align 8
	.infotable
block_C_info:
C:
        doThings

Conditional jumps are never considered.

The following block either jumps directly to c3mQ or returns there from a call. Yet that is completely ignored by the current algorithm.

_c3mG:
	movq $block_c3mQ_info,-8(%rbp)
        ...
	jne _c3mQ
	jmp *(%rbx) #returns to c3mQ

Likelyhood of branches is not considered.

We track information about branches which are rarely taken at the cmm level. However this information is currently lost during the Cmm -> Asm transformation which happens before we do code layout.

This means for Cmm code where one branch is never executed the branch that's never executed might be the only one considered relevant.

This happens for example for this stack check:

    if ((Sp + -40) >= SpLim) (likely: True) goto c5PS; else goto c5Q3;

Potential gains

If nothing else this would allow some Cmm optimizations which are made useless by the current code layout algorithm. I have hit cases where the block layout algorithm turned optimizations into pessimizations of 2-5% by pulling obvious loops apart.

Given that the scenario that leads to loops being pulled apart doesn't seem that unlikely I assume some loops are also pulled apart like this in HEAD.

Cleaning this up ideally would also make it possible to move blocks which are not part of loops out of them.

Change History (13)

comment:1 Changed 16 months ago by AndreasK

Besides the usual (Aigner Guides, Intel Optimization Manuals) here are a few more references which should make a good starting point to explore this space:

[1] Thomas Ball, James R. Larus. Branch Prediction for Free (https://do i.org/10.1145/173262.155119)

[2] Hans Wennborg. The recent switch lowering improvements. (http://llv m.org/devmtg/2015-10/slides/Wennborg-SwitchLowering.pdf) See also: http s://www.youtube.com/watch?v=gMqSinyL8uk

[3] James E. Smith. A study of branch prediction strategies (https://dl .acm.org/citation.cfm?id=801871)

[4] http://llvm.org/doxygen/MachineBlockPlacement_8cpp_source.html

[5] Karl Pettis, Robert C. Hansen. Profile guided code positioning. (ht tps://doi.org/10.1145/93542.93550)

[6] Hashemi et al. Efficient procedure mapping using cache line coloring (https://doi.org/10.1145/258915.258931)

comment:2 Changed 16 months ago by jstolarek

comment:3 Changed 16 months ago by jmct

Cc: jmct added

comment:4 Changed 16 months ago by AndreasK

Owner: set to AndreasK

comment:5 Changed 15 months ago by AndreasK

Differential Rev(s): Phab:D4726

I've wrote up a patch implementing an experimental code layout algorithm, which hopefully is easy to adjust to take advantage of additional information about control flow.

Code is at Phab:D4726. As it stands:

  • It works (only) on x64.
  • Performance is around the same. (Better but within the margin of error).
  • Compiler performance is slightly worse.

There are a lot of knobs to tune this approach further but verifying any results is tedious so I stopped eventually.

For now I hope to use this eventually as a substrate for the work on #14672.

I've also created a wiki page here https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CodeLayout with more details.

Last edited 15 months ago by AndreasK (previous) (diff)

comment:6 Changed 15 months ago by AndreasK

Ater a bit of tweaking runtime difference is at -0.5%.

The next step is to verify if this holds up on a different machines as well.

comment:7 in reply to:  5 Changed 14 months ago by AndreasK

Replying to AndreasK:

I've wrote up a patch implementing an experimental code layout algorithm, which hopefully is easy to adjust to take advantage of additional information about control flow.

Code is at Phab:D4726. As it stands:

  • It works (only) on x64.
  • Performance is around the same. (Better but within the margin of error).
  • Compiler performance is slightly worse.

There are a lot of knobs to tune this approach further but verifying any results is tedious so I stopped eventually.

For now I hope to use this eventually as a substrate for the work on #14672.

I've also created a wiki page here https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CodeLayout with more details.

I've came over a comment here: https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/Backends/NCG#Redundantreconstructionofthecontrolflowgraph

The allocator goes to considerable computational expense to construct all the flow edges in the group of instructions it's allocating for, by using the insnFuture function in the Instr pseudo-abstract type.

Since my patch already constructs and maintains a CFG maybe we can reuse this. Although currently it's on the basic block not instruction level and I haven't checked if the comment is still relevant either.

comment:8 Changed 14 months ago by bgamari

Milestone: 8.6.18.8.1

These won't be addressed for GHC 8.6.

comment:9 Changed 12 months ago by AndreasK

Snapshot of the performance gains by the linked patch. (Speedup - so higher is better). Last I checked nofib performance was neither here no there in terms of runtime. (Slightly slower on sandy bridge, slightly faster on haswell). Compile time with the patch went down on both though.

Library Sandy Bridge (Linux) Haswell (Linux) Skylake (Win)
aeson +2.6%/+2.8% +2.3%/-0% +1.2%/+1.0%
containers +1.4%/+1.2% +1.1%/+1.7% +1.7%/+1.0%
megaparsec +3.2%/+3.6% +13.6%/+13.7% 1) +8.0%/+6.6%
perf-xml +0.2%/+0.0% +1.1%/+0.8%
text +3.0%/+3.0% NA NA
Vector +2.5%/+3.6% +2.5%/+2.9% +1.3%/3.8%

1 ) Inflated by background noise.

comment:10 Changed 11 months ago by kavon

Dropping a recent paper on this topic that uses dynamic profiling, though there might be some adaptable ideas / further reading in here: https://arxiv.org/abs/1810.00905

comment:11 in reply to:  10 Changed 11 months ago by AndreasK

Replying to kavon:

Dropping a recent paper on this topic that uses dynamic profiling, though there might be some adaptable ideas / further reading in here: https://arxiv.org/abs/1810.00905

Cool idea!

I don't know how much Haskell could would benefit comperatively here as I would assume info tables would get in the way of fall throughs between functions. But might still be worth doing eventually.

comment:12 Changed 9 months ago by Andreas Klebinger <klebinger.andreas@…>

In 912fd2b6/ghc:

NCG: New code layout algorithm.

Summary:
This patch implements a new code layout algorithm.
It has been tested for x86 and is disabled on other platforms.

Performance varies slightly be CPU/Machine but in general seems to be better
by around 2%.
Nofib shows only small differences of about +/- ~0.5% overall depending on
flags/machine performance in other benchmarks improved significantly.

Other benchmarks includes at least the benchmarks of: aeson, vector, megaparsec, attoparsec,
containers, text and xeno.

While the magnitude of gains differed three different CPUs where tested with
all getting faster although to differing degrees. I tested: Sandy Bridge(Xeon), Haswell,
Skylake

* Library benchmark results summarized:
  * containers: ~1.5% faster
  * aeson: ~2% faster
  * megaparsec: ~2-5% faster
  * xml library benchmarks: 0.2%-1.1% faster
  * vector-benchmarks: 1-4% faster
  * text: 5.5% faster

On average GHC compile times go down, as GHC compiled with the new layout
is faster than the overhead introduced by using the new layout algorithm,

Things this patch does:

* Move code responsilbe for block layout in it's own module.
* Move the NcgImpl Class into the NCGMonad module.
* Extract a control flow graph from the input cmm.
* Update this cfg to keep it in sync with changes during
  asm codegen. This has been tested on x64 but should work on x86.
  Other platforms still use the old codelayout.
* Assign weights to the edges in the CFG based on type and limited static
  analysis which are then used for block layout.
* Once we have the final code layout eliminate some redundant jumps.

  In particular turn a sequences of:
      jne .foo
      jmp .bar
    foo:
  into
      je bar
    foo:
      ..

Test Plan: ci

Reviewers: bgamari, jmct, jrtc27, simonmar, simonpj, RyanGlScott

Reviewed By: RyanGlScott

Subscribers: RyanGlScott, trommler, jmct, carter, thomie, rwbarton

GHC Trac Issues: #15124

Differential Revision: https://phabricator.haskell.org/D4726

comment:13 Changed 8 months ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of low-priority tickets.

Note: See TracTickets for help on using tickets.