Opened 6 years ago

Closed 2 years ago

#8082 closed bug (wontfix)

Ordering of assembly blocks affects performance

Reported by: jstolarek Owned by:
Priority: normal Milestone:
Component: Compiler (NCG) Version: 7.6.3
Keywords: Cc: gidyn, simonmar
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

During my work on #6135 I noticed that performance of reverse-complem benchmark in nofib depends highly on the order in which assembly block are laid out. With my patches I am consistently getting a 18-20% speed-up. In theory my patches should not impact performance of existing programs, but for some reason they affect the ordering of generated assembly blocks. On of the earlier versions of my patch I noticed that kahan benchmark suffered a 16% performance hit and again the only difference I noticed in the generated assembly was ordering of blocks. I did a more in-depth investigation in case of kahan and it turned out that this difference results from the way Core is generated: the difference between HEAD and my patches was that a worker function had its three parameters passed in different order. I did not investigate this for reverse-complem because Core is considerably larger, but I could spend some time on it if it might be relevant.

Attachments (2)

reverse-complem-HEAD.asm (36.7 KB) - added by jstolarek 6 years ago.
Assembly generated by HEAD
reverse-complem-bool-primops.asm (36.7 KB) - added by jstolarek 6 years ago.
Assembly generated with my patches

Download all attachments as: .zip

Change History (10)

Changed 6 years ago by jstolarek

Attachment: reverse-complem-HEAD.asm added

Assembly generated by HEAD

Changed 6 years ago by jstolarek

Assembly generated with my patches

comment:1 Changed 6 years ago by jstolarek

It is best to view attached assembly dumps with:

diff -y --suppress-common-lines reverse-complem-HEAD.asm​ reverse-complem-bool-primops.asm​ | less

Or some other visual diff program.

comment:2 Changed 5 years ago by gidyn

Cc: gidyn simonmar added

comment:3 Changed 5 years ago by thomie

Type of failure: None/UnknownRuntime performance bug

comment:4 Changed 5 years ago by nomeata

I can confirm that this can cause serious wobbles. When working on #10137, I changed the use of < to >= when generating if-then-else trees, and some numbers went up and some went down, without any guidance as to which one is better:

           Min          -0.1%     -0.0%     -3.2%     -3.2%      0.0%
           Max          +0.0%      0.0%     +4.4%     +4.3%     +3.3%
Geometric Mean          -0.0%     -0.0%     +0.3%     +0.3%     +0.1%

Looks like without dynamic tracing, this problem is not easily solved.

comment:5 Changed 5 years ago by carter

could that be a branch prediction issue? (plus perhaps a matter of memory locality in the instruction cache?)

comment:6 Changed 4 years ago by nomeata

Yes, quite likely.

comment:7 Changed 4 years ago by Joachim Breitner <mail@…>

In de1160be047790afde4ec76de0a81ba3be0c73fa/ghc:

Refactor the story around switches (#10137)

This re-implements the code generation for case expressions at the Stg →
Cmm level, both for data type cases as well as for integral literal
cases. (Cases on float are still treated as before).

The goal is to allow for fancier strategies in implementing them, for a
cleaner separation of the strategy from the gritty details of Cmm, and
to run this later than the Common Block Optimization, allowing for one
way to attack #10124. The new module CmmSwitch contains a number of
notes explaining this changes. For example, it creates larger
consecutive jump tables than the previous code, if possible.

nofib shows little significant overall improvement of runtime. The
rather large wobbling comes from changes in the code block order
(see #8082, not much we can do about it). But the decrease in code size
alone makes this worthwhile.

```
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
            Min          -1.8%      0.0%     -6.1%     -6.1%     -2.9%
            Max          -0.7%     +0.0%     +5.6%     +5.7%     +7.8%
 Geometric Mean          -1.4%     -0.0%     -0.3%     -0.3%     +0.0%
```

Compilation time increases slightly:
```
        -1 s.d.                -----            -2.0%
        +1 s.d.                -----            +2.5%
        Average                -----            +0.3%
```

The test case T783 regresses a lot, but it is the only one exhibiting
any regression. The cause is the changed order of branches in an
if-then-else tree, which makes the hoople data flow analysis traverse
the blocks in a suboptimal order. Reverting that gets rid of this
regression, but has a consistent, if only very small (+0.2%), negative
effect on runtime. So I conclude that this test is an extreme outlier
and no reason to change the code.

Differential Revision: https://phabricator.haskell.org/D720

comment:8 Changed 2 years ago by bgamari

Resolution: wontfix
Status: newclosed

I don't really see what we can do about this. There are numerous reasons why code layout affects performance (e.g., see this talk from the LLVM developers meeting) and tackling most of them is Very Hard.

We can continue to chat on this ticket, but I'm going to close it as its not really actionable as-is.

Note: See TracTickets for help on using tickets.