Opened 3 years ago

Last modified 3 years ago

#13570 new bug

CoreFVs patch makes n-body slower

Reported by: simonpj Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.0.1
Keywords: Cc: kavon@…, dfeuer, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #13629 Differential Rev(s):
Wiki Page:

Description (last modified by dfeuer)

The commit 751996e90a964026a3f86853338f10c82db6b610

commit 751996e90a964026a3f86853338f10c82db6b610
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Apr 7 17:10:07 2017 +0100

    Kill off complications in CoreFVs

makes n-body in nofib go 25% slower. See perf results page.

Investigate.

Attachments (5)

coreFV-cmm.diff (5.7 KB) - added by dfeuer 3 years ago.
cmm diff before and after
coreFV-stg.diff (52.6 KB) - added by dfeuer 3 years ago.
STG diff before and after
coreFV-simpl.diff (3.0 KB) - added by dfeuer 3 years ago.
Tidy core diff before and after
latest-asm.2 (64.4 KB) - added by dfeuer 3 years ago.
-ddump-asm actually after the recent change
latest-asm (64.4 KB) - added by dfeuer 3 years ago.
Overwrite garbage

Download all attachments as: .zip

Change History (18)

comment:1 Changed 3 years ago by dfeuer

Description: modified (diff)

Changed 3 years ago by dfeuer

Attachment: coreFV-cmm.diff added

cmm diff before and after

Changed 3 years ago by dfeuer

Attachment: coreFV-stg.diff added

STG diff before and after

Changed 3 years ago by dfeuer

Attachment: coreFV-simpl.diff added

Tidy core diff before and after

comment:2 Changed 3 years ago by dfeuer

Whatever's going on here looks really subtle. The simplifier dumps appear to be exactly the same, except for doing things in slighly different orders. The STG dumps are a bit harder to read, but the difference in CMM dumps is sufficiently small and confined that someone familiar with CMM might well be able to see what's going on there directly. I'll have to learn something about that pretty soon, I imagine, but I don't know it yet.

comment:3 Changed 3 years ago by hsyl20

With the floating-in of the array-indexing primop, instead of having the following order:

1) A <- load from address @XXX
2) compute several things (sqrt,mul,div,sub,add) independent of A
3) store (A - ..) at address @XXX

We have this order: 2, 1, 3.

As this is n-body, could it be a prefetching issue?

comment:4 Changed 3 years ago by simonpj

Cc: kavon@… added

I'm a bit surprised that the Sinking pass in Cmm didn't convert 1,2,3 into 2,1,3. In which case both programs would have the same end product.

So you've at least confirmed that the runtime change was indeed (as I thought) something v subtle to do with prefetching or whatnot. It looks unreasonable for the FloatIn pass to choose the "right" thing, even if we knew what was "right".

So it's still interesting, but it removes the urgency of saying "do we want this particular patch".

comment:5 Changed 3 years ago by kavon

It seems the Cmm Sinker won't move the loads in (1) closer to their use in (3) because there is an intervening foreign call to the sqrt function in (2).

Right now, the Sinker's analysis conservatively says that it's not safe to commute a memory load with a foreign call due to the possibility of that call writing to the heap. Ideally, there would be a marker for foreign calls that we know are pure, i.e., math functions like sqrt, so that the loads can move past it. I'm not sure if this marker already exists or not.

In this case, the importance of turning it into 2,1,3 via sinking is that there are fewer values live across the call to sqrt, because that call causes any floating-point values that are in register to be saved to the stack, negating any benefit of loading it so early. Here's the output of the 1,2,3 version I'm seeing with the NCG:

    movsd (%rax), %xmm5     ; load A into xmm5 very early
    ; ...
    movsd %xmm5, 128(%rsp)  ; save A to stack
    ; ...
    call _sqrt
    ; ...
    movsd 128(%rsp), %xmm4  ; restore A from stack
    subsd %xmm3, %xmm4      ; actually use A

I think 2,1,3 is always desirable in Cmm, as the instruction scheduler should be hiding the load latency.

Last edited 3 years ago by kavon (previous) (diff)

comment:6 Changed 3 years ago by bgamari

I've opened #13629 to track the sqrt code generation issue.

comment:7 Changed 3 years ago by dfeuer

Unfortunately, 9ac22183/ghc (the fix to #13629) does *not* seem to fix this. -ddump-asm indicates that we're now using sqrtsd, but the numbers remain wretched. We'll have to see if Gipeda agrees with me, of course.

comment:8 Changed 3 years ago by kavon

I'm surprised n-body doesn't improve after that patch. Could you attach the assembly file? I'm wondering if the stack save/restores are still being emitted around the sqrtsd instruction.

comment:9 Changed 3 years ago by dfeuer

Cc: dfeuer added

comment:10 Changed 3 years ago by dfeuer

Ignore that attachment just now till I fix it. Attached the wrong one.

Changed 3 years ago by dfeuer

Attachment: latest-asm.2 added

-ddump-asm actually after the recent change

Changed 3 years ago by dfeuer

Attachment: latest-asm added

Overwrite garbage

comment:11 Changed 3 years ago by dfeuer

Sorry for all the wrong clicks. The attachment should be right now.

comment:12 Changed 3 years ago by kavon

Well, at least the assembly code after the patch for #13629 is much more efficient!

The sqrt inefficiency in #13629 was a somewhat separate observation: I expected nbody to improve a bit because of that fix, counteracting whatever weird thing is going on in this regression.

I guess once the Gipeda report is up we'll see if that patch helped any of the benchmarks.

comment:13 Changed 3 years ago by michalt

Cc: michalt added
Note: See TracTickets for help on using tickets.