Opened 6 years ago

Last modified 14 months ago

#8279 new bug

bad alignment in code gen yields substantial perf issue

Reported by: carter Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.7
Keywords: Cc: bgamari@…, schyler, hvr, george.colpitts@…, archblob, gidyn, sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #8082 Differential Rev(s):
Wiki Page:

Description (last modified by jstolarek)

independently, a number of folks have noticed that in various ways, GHC currently has quite a few different memory alignment related performance problems that can have >= 10% perf impact!

Nicolas Frisby notes

On my laptop, a program showed a consistent slowdown with -fdicts-strict

I didn't find any obvious causes in the Core differences, so I turned to Intel's 
Performance Counter Monitor for measurements. After trying a few counters, I eventually 
saw that there are about an order of magnitude more misaligned memory loads with 
-fdicts-strict than without, so I think that may be a significant part of the slowdown.
I'm not sure if these are code or data reads.

Can anyone suggest how to validate this hypothesis about misaligned reads?

A subsequent commit has changed the behavior I was seeing, so I'm not interested 
in alternatives means to determine if -fdicts-strict is somehow at fault — I'm just 
asking specifically about data/code memory alignment in GHC and how to 
diagnose/experiment with it.

Reid Barton has independently noted

so I did a nofib run with llvm libraries, ghc quickbuild

so there's this really simple benchmark tak,
it doesn't use any libraries at all in the main loop because the Ints all get unboxed
but it's still 8% slower with quick-llvm (vs -fasm)
weird right?

[14:36:30] <carter>	 could you post the asm it generates for that function?
[14:36:49] <rwbarton>	 well it's identical between the two versions
<rwbarton>	 but they get linked at different offsets because some llvm sections are different sizes
<rwbarton>	 if I add a 128-byte symbol to the .text section to move it to the same address... then the llvm libs version is just as fast
<rwbarton>	 well, apparently 404000 is good and 403f70 is bad
 <rwbarton>	 I guess I can test other alignments easily enough
<rwbarton>	 I imagine it wants to start on a cache line
 <rwbarton>	 but I don't know if it's just a coincidence that it worked with the ncg libraries
 <rwbarton>	 that it got a good location

<rwbarton>	 for this program every 32-byte aligned address is 10+% faster than any merely 16-byte aligned address

 <rwbarton>	 and by alignment I mean alignment of the start of the Haskell code section
 <carter>	 haswell, sandybridge, ivy bridge, other?
 <rwbarton>	 dunno
 <rwbarton>	 I have similar results on Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz
 <rwbarton>	 and on Quad-Core AMD Opteron(tm) Processor 2374 HE
 <carter>	 ok
 <rwbarton>	 trying a patch now that aligns all *_entry symbols to 32 bytes

the key point in there is that on the tak benchmark, better alignment for the code made a 10% perf differnce on TAk on Core2 and opteron cpus!

benjamin scarlet and Luite are speculating that this may be further induced by Tables next to code (TNC) accidentally creating bad alignment so theres cache line pollution / conflicts between the L1 Instruction-cache and data-caches. So one experiment would be to have the TNC transform pad after the table so the function entry point starts on the next cacheline?

Change History (29)

comment:1 Changed 6 years ago by ezyang

The bad cache behavior of tables next to code has been observed in (some of the discussion there is out-of-date, as it predates dynamic pointer tagging).

comment:2 Changed 6 years ago by carter

Nathan howell notes that its ok for L1 I and data caches to have the same cache lines, so the issue's probably just having good alignment, and that the naive padding TNC idea will likely decrease locality quality

Last edited 6 years ago by carter (previous) (diff)

comment:3 Changed 6 years ago by carter

anyways: any exploration of this will require thorough benchmarking / nofib exercise, ideally on a diversity of CPU variants / microarchitectures

comment:4 Changed 6 years ago by rwbarton

The "align *_entry to 32 bytes" patch helped tak a lot (~10%), but made no noticeable difference on average over nofib (lots of random changes roughly in the range -3% to 3%).

I don't really understand how shifting the address of code by 16 bytes can have such a drastic effect on performance. I guess it must have to do with cache lines, but is using one more cache line really so awful?

comment:5 Changed 6 years ago by carter

@rwbarton, good question!

I think we'll just have to figure out some more systematic experiments, and try to get good measurements on a variety of recent cpu micro architectures, to understand this better. Hopefully something we can explore once 7.9 dev gets afoot!

comment:6 Changed 6 years ago by simonmar

I've noticed things like this in the past. One hypothesis is that when the code is tightly packed together the branch predictor doesn't work so well, but that's just a guess.

The way forward is to do systematic measurements with all of nofib, measuring performance counters and code size with various alignment strategies. Also someone could thoroughly read the latest Intel optimization guides and see if we're doing the right things.

comment:7 Changed 6 years ago by jstolarek

comment:8 Changed 6 years ago by bgamari

Cc: bgamari@… added

comment:9 Changed 6 years ago by schyler

Little bit of research uncovers that Intel recommends aligning code and branch targets on 16-byte boundaries: ` - Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.`

The reasons for this are as follows;

  • Aligning to a 16-byte boundary means that it's more likely loops will fall inside a single cacheline rather than inside 2. Falling on a boundary in a loop has a really noticeable negative performance impact.
  • Forward small relative jumps that lie inside the same cache line are handled more efficiently by the pipeline (and alignment is more likely to bump both over into the same line).

Note that we have to be really careful not to turn a small relative jump into a larger jump by altering alignment too much, because these tend to be slower.

Tail calls may benefit from aligning both the top of tables (for aligned reading) and the top of actual function entry points (for aligned iteration) to 16 bytes. This requires experimentation and benchmarking.

It's probably safe to say that the binary size implications of alignment are probably irrelevant today. This is easy performance fruit and binary size can be won elsewhere (imo).

Last edited 6 years ago by schyler (previous) (diff)

comment:10 Changed 6 years ago by schyler

Cc: schyler added

comment:11 Changed 6 years ago by simonmar

So the problem with 16-byte aligning branch targets is that many of our code blocks have a 3-word info table. We would have to pad these info tables by one word in addition to aligning to 16 bytes. That might not be too bad, but someone needs to do the measurements to see what the code size / speed tradeoff is.

Also we don't necessarily want to align all our labels, because many of them are just heap-check failure targets and wouldn't benefit from aligning at all.

I tend to optimise for small binary sizes because I was brought up on computers with 32K of memory and I think you should never waste a byte :-) If you think binary size can be won elsewhere, please do it :-P

comment:12 Changed 6 years ago by thoughtpolice

Milestone: 7.10.1
Priority: highesthigh

I'm de-escalating this and punting it to 7.10. There's no reason it should be highest priority IMO.

comment:13 Changed 5 years ago by hvr

Cc: hvr added

comment:14 Changed 5 years ago by George

Is this true for both the new code generator and llvm?

comment:15 Changed 5 years ago by George

Cc: george.colpitts@… added

comment:16 Changed 5 years ago by jstolarek

Description: modified (diff)

comment:17 Changed 5 years ago by simonpj

If we knew that better alignment would improve speed at the cost of binary size (something that might differ across different architectures), that would be more motivation for a flag -fchoose-speed-over-binary-size. Another project for someone!


comment:18 Changed 5 years ago by archblob

Cc: archblob added

comment:19 Changed 5 years ago by carter

Priority: highnormal

comment:20 Changed 5 years ago by gidyn

Cc: gideon@… added

comment:21 Changed 4 years ago by gidyn

Cc: gideon@… removed

comment:22 Changed 4 years ago by gidyn

Cc: gidyn added

comment:23 Changed 4 years ago by thoughtpolice


Milestone renamed

comment:24 Changed 4 years ago by rwbarton

Some related reading material:

comment:25 Changed 4 years ago by thomie

Milestone: 8.0.1

comment:26 Changed 3 years ago by George

re Simon PJ's comment 17, would it be relatively easy to do this with the llvm compiler? AFAIK opt and llc don't seem to have the appropriate options but clang has -Os and -Oz (see so I would hope it wouldn't be too hard to do in the Haskell llvm compiler.

Also wrt llvm there is mention of alignment here I'm sure most people interested in the llvm compiler are aware of this doc but I thought I may as well mention it to be complete.

comment:27 Changed 14 months ago by sgraf

I hit this today in CSD (kernel is basically a counting loop), with a delta of 12%. Supplying -fllvm swapped symptoms for me, still a delta of ~-8%. This was 1a0a971b76c0b717794af9af4e27dcb488924800 vs

comment:28 Changed 14 months ago by sgraf

Cc: sgraf added

comment:29 Changed 14 months ago by George

re Simon PJ's and Simon Mar's comment , we don't know that that better alignment would improve speed at the cost of binary size , but the Intel documentation is a pretty strong hint that it probably will and is thus worth checking out, right? Couldn't we break this into two tasks: First to provide a patch with instructions on how to apply it e.g. to 8.6.1 or head or whatever and a task to benchmark the resulting compiler. The benchmarking task could be done by people with much less, or at least different, expertise than it takes to produce such a patch right?

Last edited 14 months ago by George (previous) (diff)
Note: See TracTickets for help on using tickets.