Opened 8 years ago

Closed 8 years ago

#5598 closed task (fixed)

Function quotRem is inefficient

Reported by: boris Owned by: igloo
Priority: normal Milestone: 7.6.1
Component: Compiler Version: 7.0.3
Keywords: division, performance, primop Cc: kazu@…, pumpkingod@…, dterei
Operating System: Unknown/Multiple Architecture: x86
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Function quotRem gets compiled into two div instructions although div computes both quotient and remainder. This inefficiency exists both with NCG and LLVM backends. Thus, quotRem is at least twice as slow as it could be.

As far as I understand, quotRem is decomposed into two primops quotInt# and remInt# (or quotWord# and remWord#) and each of them is compiled independently into code which has div instruction. I propose to add new primops quotRemInt# and quotRemWord# to address this flaw.

Please see the sample code and corresponding assembly.

Attachments (4)

division.hs (132 bytes) - added by boris 8 years ago.
division-asm.txt (7.6 KB) - added by boris 8 years ago.
division-llvm.txt (81.1 KB) - added by boris 8 years ago.
0001-Implement-quotRemInt-primop.patch (15.3 KB) - added by igloo 8 years ago.

Download all attachments as: .zip

Change History (20)

Changed 8 years ago by boris

Attachment: division.hs added

Changed 8 years ago by boris

Attachment: division-asm.txt added

Changed 8 years ago by boris

Attachment: division-llvm.txt added

comment:1 Changed 8 years ago by boris

Correction: two divs exist only for NCG. LLVM optimiser replaces div by smarter code based on shifts, but this code is still duplicated for quot and rem.

Division is duplicated because in the LLVM assembly the dividend is loaded twice and optimiser fails to merge divisions.

134	    %ln1ot = load i64* %ls1ln
135	    %ln1ou = srem i64 %ln1ot, 123
136	    store i64 %ln1ou, i64* %ls1lW
137	    %ln1ov = load i64* %ls1ln
138	    %ln1ow = sdiv i64 %ln1ov, 123

When I replaced %ln1ov with %ln1ot, the native assembly code performed division only once.

comment:2 Changed 8 years ago by kazu-yamamoto

Cc: kazu@… added

comment:3 Changed 8 years ago by igloo

I've run into this before, but it's currently tricky to fix as CmmMachOp can only return a single result. The nicest way to fix this would be to relax that restriction, so that operations like quotRem and addWithCarry can be implemented as CmmMachOps.

comment:4 Changed 8 years ago by igloo

This would also let us define a multiply op that returns the high word.

comment:5 Changed 8 years ago by pumpkin

Cc: pumpkingod@… added

comment:6 Changed 8 years ago by simonpj

It's true that a CmmMachOp can only return one result, but a CallishMachOp could in principle return more than one.

Background: some PrimOps may be implemented by foreign calls (sin is an example). In the translation from STG to Cmm, these primops are separated out by CgPrimOp.callishOp, and we generate a Cmm statement like

r = op(a,b)

Becuase this is a statement, it can have several results. There'd be no difficulty with

r,s = op(a,b)

In contrast, ordinary primops can occur in Cmm expressions and so can have only one result.

So I think all we need is a PrimOp returning an unboxed pair, with a type like

quotRem# :: Int# -> Int# -> (# Int#, Int# #)

or something like that. It's just barely possible that it'd Just Work if we added such a primop (provided it was "callish"), and suitably extended callishOp, but I bet we've never tried it with two results, so it'd probably fail.

Anyway unless I'm wildly off-beam, the point is that we're all set to do it right now; so someone might like to have a go.

Simon

comment:7 Changed 8 years ago by igloo

Milestone: 7.6.1
Owner: set to igloo

comment:8 Changed 8 years ago by dterei

Cc: dterei added

Changed 8 years ago by igloo

comment:9 Changed 8 years ago by igloo

difficulty: Unknown

I had a look at this. It actually seemed simpler to me to just use a MachOp after all, and immediately assign the result. I also had to implement an Assign2 constructor, and add a constructor to the CmmType type; is that what you intended?

I've attached a patch with my attempt, although it only implements the new primop for the X86 NCG.

comment:10 Changed 8 years ago by simonpj

Thanks for looking at this. And thank you for checking about the design. I have some concerns:

  • I'm against having a MachOp that returns two results. As you point out we’d then need a CmmType for a multiple value, and that has knock-on consequences that are far reaching and hard to forsee.
  • Likewise I’d really like to avoid adding CmmAssign2.
  • Instead, I suggest we use CmmNode.CmmUnsafeForeignCall. That is already set up to be a “fat machine instruction”, and already has multiple results. I see no difficulty with this path; it just means adding a CallishMachOp.

I'm not sure exactly what the equivalent in the old codegen path is, but it'd be ok for the new story to work well only in the new path.

comment:11 Changed 8 years ago by igloo@…

commit 7bfb7bfc6da981ef827b1a166c8cbfb5b29a25a4

Author: Ian Lynagh <igloo@earth.li>
Date:   Tue Feb 14 21:26:18 2012 +0000

    Define a quotRem CallishMachOp; fixes #5598
    
    This means we no longer do a division twice when we are using quotRem
    (on platforms on which the op is supported; currently only amd64).

 compiler/cmm/CmmMachOp.hs                 |    5 +-
 compiler/cmm/OldCmmUtils.hs               |   11 +
 compiler/cmm/PprC.hs                      |   10 +-
 compiler/codeGen/CgPrimOp.hs              |    9 +
 compiler/ghc.cabal.in                     |    1 -
 compiler/llvmGen/LlvmCodeGen/CodeGen.hs   |   15 +-
 compiler/nativeGen/PPC/CodeGen.hs         |   14 +-
 compiler/nativeGen/SPARC/CodeGen.hs       |  499 +++++++++++++++++++++++-----
 compiler/nativeGen/SPARC/CodeGen/CCall.hs |  343 --------------------
 compiler/nativeGen/X86/CodeGen.hs         |   31 ++-
 compiler/prelude/primops.txt.pp           |    5 +
 11 files changed, 495 insertions(+), 448 deletions(-)

comment:12 Changed 8 years ago by boris

Thank you for fixing this. Despite LLVM does not have a div-like command, we can help it to optimize the second division away by replacing two loads with one.

Actual:

%ln1ot = load i64* %ls1ln
%ln1ou = srem i64 %ln1ot, 123
store i64 %ln1ou, i64* %ls1lW
%ln1ov = load i64* %ls1ln
%ln1ow = sdiv i64 %ln1ov, 123
store i64 %ln1ow, i64* %ls1lV

Expected:

%ln1ot = load i64* %ls1ln
%ln1ou = srem i64 %ln1ot, 123
store i64 %ln1ou, i64* %ls1lW
%ln1ow = sdiv i64 %ln1ot, 123
store i64 %ln1ow, i64* %ls1lV

Perhaps this optimization can be generalized further as sharing load command for multiple adjacent primops. Some time ago I saw several loads from one memory location in LLVM assembly not related to division.

comment:13 Changed 8 years ago by rl

Have you looked at the output of LLVM's optimiser and verified that the load is still there? If LLVM can't remove it, it's because it thinks the store might alias the load. That should be fixed by David's TBAA patches.

comment:14 Changed 8 years ago by igloo@…

commit 7a7f6d703a99045cb9f590c819b795409a090022

Author: Ian Lynagh <igloo@earth.li>
Date:   Fri Feb 17 22:46:27 2012 +0000

    Add a primop for unsigned quotRem; part of #5598
    
    Only amd64 has an efficient implementation currently.

 compiler/cmm/CmmMachOp.hs               |    1 +
 compiler/cmm/OldCmmUtils.hs             |    4 +++
 compiler/cmm/PprC.hs                    |    1 +
 compiler/codeGen/CgPrimOp.hs            |    8 ++++++
 compiler/llvmGen/LlvmCodeGen/CodeGen.hs |    1 +
 compiler/nativeGen/PPC/CodeGen.hs       |    1 +
 compiler/nativeGen/SPARC/CodeGen.hs     |    1 +
 compiler/nativeGen/X86/CodeGen.hs       |   38 ++++++++++++++++++------------
 compiler/prelude/primops.txt.pp         |    4 +++
 9 files changed, 44 insertions(+), 15 deletions(-)

comment:15 Changed 8 years ago by boris

Checked the output of LLVM backend on HEAD. It seems that the issue is fixed.

comment:16 Changed 8 years ago by igloo

Resolution: fixed
Status: newclosed

Now implemented, with support for the new callishMachOps in the x86 and x86_64 NCGs.

Note: See TracTickets for help on using tickets.