Opened 22 months ago

Last modified 11 months ago

#14672 new task

Make likelyhood of branches/conditions available throughout the compiler.

Reported by: AndreasK Owned by:
Priority: normal Milestone: 8.10.1
Component: Compiler Version: 8.2.2
Keywords: CodeGen Cc: michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #849 #15124 #8326 Differential Rev(s): Phab:D4316 Phab:D4324 Phab:D4327
Wiki Page:

Description

For simple code such as if a > 10 then foo else bar it can make a big difference if we jump on a > 10 or a <= 10.

We already tracked the likelyhood of such conditionals for stack/heap checks in Cmm. I recently updated the native codegen on X86 to take this information into account where available and in one edge case this improved execution time from 1.7s to 1.5s!

Having this info available for other cases like bound checking (assume valid) or partial pattern matches (assume no match failure) should be a small win for most programs and hopefully a big one for a few of them.

Change History (22)

comment:1 Changed 22 months ago by AndreasK

Differential Rev(s): Phab:D4316 Phab:D4324

comment:2 Changed 22 months ago by AndreasK

Differential Rev(s): Phab:D4316 Phab:D4324Phab:D4316 Phab:D4324 Phab:D4327

I've created Phab:D4327 for now as a proof of concept.

  • The diff alone improves runtime by about 0.8% for the parts of nofib I looked at.
  • It only covers the case where we have a branch that is recognized as being bottom.
  • It doesn't change asm codegen outside of simple cases.

comment:3 Changed 22 months ago by Ben Gamari <ben@…>

In e7dcc708/ghc:

Add ability to parse likely flags for ifs in Cmm.

Adding the ability to parse likely flags in Cmm allows better codegen
for cmm files.

Test Plan: ci

Reviewers: bgamari, simonmar

Reviewed By: bgamari

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #14672

Differential Revision: https://phabricator.haskell.org/D4316

comment:4 Changed 22 months ago by simonpj

Do we have any documentation of our Cmm-source syntax? It would be Jolly Good to have one. It seems poor to have Foo.cmm files with zero documentation of what's allowed in them.

comment:5 Changed 22 months ago by bgamari

Sadly not really; the closest thing we have is some notes at the top of CmmParse.y. I agree that this is a terrible situation, especially given that some performance-critical user packages actually use C--.

comment:6 Changed 22 months ago by AndreasK

Ideally we would add human readable examples to the Parser file and update https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType as well.

But that's not a small task and outside of the scope of this.

I might at least add examples for the if syntax though.

comment:7 Changed 22 months ago by AndreasK

D4327 will add likeliness Information at the STG stage.

Some of the next tasks to improve on this:

  • Take more advantage of this in the backend.
    • Allow the cmm parser to process likely information in switch statements.
    • Build switches in a manner that the most likely path is the fastest one.
    • Use this information for code layout optimization.
    • Add hints to more cmm code used within GHC.
    • Investigate other use cases.
  • Find a good way for users to provide this information.
    • Bikeshedding: Syntax, Semantics etc. for HsSyn
    • How to represent und use this in the core Pipeline
      • Simon M recommended using ticks for this.

comment:8 Changed 22 months ago by svenpanne

Just a general remark: D4327 talks about annotations as a source of likelihood values, which is fine, but having actual data from previous runs, i.e. using profile-guided optimization, would probably have much more potential. Humans are notoriously bad at guessing what actually eats performace, so manual annotations can only get you so far.

As an example: The performance of the CPython interpreter itself increases by roughly 10% if it is compiled with PGO and LTO (link-time optimization), see e.g.

Of course the effect of PGO heavily depends on the actual program and the programming language, but I think it is something which should be considered in the long run. Perhaps our LLVM backend already has some (relatively) easy way to make profile data available to the LLVM pipeline, but I might be wrong here.

https://msdn.microsoft.com/en-us/library/e7k32f4k.aspx lists a few PGOs, and some are definitely worthwhile for GHC, e.g. using profile data in the register allocator to decide which registers to spill etc.

comment:9 Changed 22 months ago by Ben Gamari <ben@…>

In 12056292/ghc:

Add likely annotation to cmm files in a few obvious places.

Provide information about paths more likely to be taken in the cmm files
used by the rts.

This leads to slightly better assembly being generated.

Reviewers: bgamari, erikd, simonmar

Subscribers: alexbiehl, rwbarton, thomie, carter

GHC Trac Issues: #14672

Differential Revision: https://phabricator.haskell.org/D4324

comment:10 in reply to:  8 ; Changed 22 months ago by AndreasK

Replying to svenpanne:

Just a general remark: D4327 talks about annotations as a source of likelihood values, which is fine, but having actual data from previous runs, i.e. using profile-guided optimization, would probably have much more potential.

I was talking about annotations primarily because they seem like the low hanging fruit.

As far as the native codegen is concerned it makes no difference if the data comes from user suggestions or actual measurements. Having good data on all code paths would certainly be better then just an estimate for some paths the user annotated.

Once we have a way to pass/use the information throughout the passes adding PGO is the next logical step and should be possible without touching anything past core.

comment:11 in reply to:  10 Changed 22 months ago by svenpanne

Replying to AndreasK:

I was talking about annotations primarily because they seem like the low hanging fruit.

I assumed that, and it's definitely the right way to start. And annotations can sometimes be really useful, e.g. GCC has _builtin_expect for a reason. :-) In general, it's just a bit hard to find the places for such annotations: You have to profile your code, identify the hot spots, and figure out if a branch hint helps. PGO automates that process.

[...] Once we have a way to pass/use the information throughout the passes adding PGO is the next logical step and should be possible without touching anything past core.

PGO is not about branch hints alone, it is e.g. tremendously helpful to have an estimate how often a looping construct actually loops, know the usual target of an indirect branch etc. (Well, in a sense these are all some kind of branch hint, but in a much broader sense.)

So it might be a good idea to plan for something more general than a relative branch frequency. But I guess even threading that simple information through the compiler and use it wisely is not a simple task, so let's not overengineer in the beginning...

comment:12 Changed 21 months ago by michalt

Cc: michalt added

comment:13 Changed 20 months ago by AndreasK

Just document it for a future time:

Coming across the Ticky-ticky profiling page this would seem like a good way collect branch weights for the generated STG code.

The plan would then be to

  • Compile the executable with ticky-ticky enabled.
  • Collect information from a representative set of use cases.
  • Compile the executable without tick-ticky, using the collected information to inform GHC about relative branch weights.

comment:14 Changed 19 months ago by AndreasK

I did some work on this for the Cmm/STG stages. But so far the results still don't justify the complexity. Although it's pretty clear which things should be tried before further judgment can be made.

I did implement a rudimentary version which detected recursion as likely and bottoming branches as unlikely. Overall performance was underwhelming. While most nofib programs got slightly faster (0-1%) very few got a lot (order of ~5%) slower.

What I did not do yet is propagte the likelyhood info into the asm passes which is probably required. Without this we get some awkward interactions. For example we might have code of the sort:

if checkError() then {goto panic} else {goto foo};

Assuming another block also jumps to foo there are two ways to compile this block:

check:
    cmp R1, 0
    jnz panic
    jmp foo
    
check2:
    cmp R1, 0
    jz foo
    jmp panic

Performance of jumps to panic is essentially meaningless so we want variant two which saves an instruction in each loop.

Blocklayout however turns this into a pessimization.

We often get something like this:

check:
    cmp R1, 0
    jnz panic
    jmp foo
bar: #<other block jumping to foo>
    few ins here
    #<shortcut>jmp foo
foo:
    ins do things
panic:
    some
    more
    instructions

This seems reasonable. If we are lucky bar is small enough that when we jump to foo we won't even miss the cache.

But if we optimize the check block we suddenly get:

check2:
    cmp R1, 0
    jz foo
    #<shortcut> jmp panic #block layout assumes we will take this path
panic:
    some
    more
    instructions
bar: #<other block jumping to foo>
    few ins here
    #<shortcut>jmp foo
foo:
    ins do things

Clearly we don't want this because it pulls check2 and foo far apart. But block layout assumes the last jump is the likely one so tries to place panic right after check2. If this is a loop there is a good chance that this causes a cache miss on each jump to foo.

I did play around with the block layout code but without actually having likelyhood info I couldn't make it work better than what we have know. I do want to give lowering likelyhood info into the asm stage a try sooner or (probably) later since I would expect that to lead to much better code..

However for now I ran out of steam before I came up with a good way to tackle this.

Things that need to be still done.

  • Find a good layout algorithm which can make use of partial information. There are descriptions for algorithms which work with full information about block entry counts. But Implementing one which works well on partial information is something I couldn't find anything on yet. Rolling my own is not out of the question but seems like something for a few weeks and not a few days.
  • Chose a design for lowering likelyhood information to the asm level. How isn't yet clear. Annotate the instructions? Add information about blocks in a sidechannel? It's also not static since things like the register allocator also generate new blocks on the fly. Shortcutting might remove blocks and so on.
  • Besides block layout if we lower that information it should also be used by the register allocators.

comment:15 Changed 19 months ago by simonpj

Is there a wip branch with your work, in case you or someone else wants to pick it up later?

comment:16 in reply to:  15 Changed 19 months ago by AndreasK

Replying to simonpj:

Is there a wip branch with your work, in case you or someone else wants to pick it up later?

Most of it is at wip/D4327 already. I plan to rebase my latest changes to master, make sure it compiles, is somewhat sane and then ask for someone to update that branch.

comment:17 in reply to:  14 Changed 18 months ago by AndreasK

  • Find a good layout algorithm which can make use of partial information. There are descriptions for algorithms which work with full information about block entry counts. But Implementing one which works well on partial information is something I couldn't find anything on yet. Rolling my own is not out of the question but seems like something for a few weeks and not a few days.
  • Chose a design for lowering likelyhood information to the asm level. How isn't yet clear. Annotate the instructions? Add information about blocks in a sidechannel? It's also not static since things like the register allocator also generate new blocks on the fly. Shortcutting might remove blocks and so on.
  • Besides block layout if we lower that information it should also be used by the register allocators.

It wrote up a patch for #15124 (Phab:D4726) which implements a new code layout algorithm.

Putting the implementation for likelyhood detection (Phab:D4327) on top of the new layout code lead to runtime improvements of about -0.7%.

However until I have verified these results on another machine I wouldn't consider them actionable.

comment:18 Changed 18 months ago by AndreasK

Keywords: CodeGen added
Milestone: 8.8.1

Turns out this is an old idea: #849

comment:19 Changed 17 months ago by AndreasK

Doing this right might also help with performance being dependant on the syntactic order of Constructors in the end.

For example containers has this note in Data.IntSet.Internal

-- [Note: Order of constructors]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- The order of constructors of IntMap matters when considering performance.
-- Currently in GHC 7.0, when type has 3 constructors, they are matched from
-- the first to the last -- the best performance is achieved when the
-- constructors are ordered by frequency.
-- On GHC 7.0, reordering constructors from Nil | Tip | Bin to Bin | Tip | Nil
-- improves the benchmark by circa 10%.

comment:20 Changed 15 months ago by AndreasK

comment:21 Changed 15 months ago by AndreasK

Sidenote: During work on #15124 I've seen that changes with nice improvements in library benchmarks showed almost no effect.

So I want to reevaluate this patch based not only on nofib where it made not much difference in the last round of benchmarks I ran. But also look at other library benchmarks.

comment:22 Changed 11 months ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of low-priority tickets.

Note: See TracTickets for help on using tickets.