Opened 5 years ago

Last modified 5 years ago

#9660 new bug

unnecessary indirect jump when returning a case scrutinee

Reported by: rwbarton Owned by:
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.8.3
Keywords: Cc: simonmar, dfeuer
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I happened to be looking at the Cmm for this code (ghc 7.8.3, -O2)

f :: Int -> Int
f x = if x < 0 then x else x+1

and I noticed something a bit funny about it:

       c12e:
           if ((Sp + -8) < SpLim) goto c12z; else goto c12A;
       c12z:
           R2 = R2;
           R1 = Test.f_closure;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       c12A:
           I64[Sp - 8] = c12b;
           R1 = R2;
           Sp = Sp - 8;
           if (R1 & 7 != 0) goto c12b; else goto c12c;
       c12c:
           call (I64[R1])(R1) returns to c12b, args: 8, res: 8, upd: 8;
       c12b:
           Hp = Hp + 16;
           if (Hp > HpLim) goto c12y; else goto c12x;
       c12y:
           HpAlloc = 16;
           R1 = R1;
           call stg_gc_unpt_r1(R1) returns to c12b, args: 8, res: 8, upd: 8;
       c12x:
           _s11Q::I64 = I64[R1 + 7];
           if (%MO_S_Lt_W64(_s11Q::I64, 0)) goto c12u; else goto c12v;
       c12u:
           Hp = Hp - 16;
           R1 = R1 & (-8);                              /* <--- */
           Sp = Sp + 8;
           call (I64[R1])(R1) args: 8, res: 0, upd: 8;  /* <--- */
       c12v:
           I64[Hp - 8] = GHC.Types.I#_con_info;
           I64[Hp] = _s11Q::I64 + 1;
           R1 = Hp - 7;
           Sp = Sp + 8;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;

On the two marked lines, we untag R1 (which is x) and enter it. However, we know at this point that x is already in WHNF so we could simply return it by replacing the two lines with call (P64[Sp])(R1), if I'm not mistaken. That will save a load and an indirect jump (which we actually know is to I#_con_info, which would just retag R1 and return to the address on the stack anyways).

I think the same optimization should be available any time we do an algebraic case and in a branch simply return the scrutinee.

I looked at what it would take to fix this. It looks almost easy: if we add a new LambdaFormInfo constructor LFUnknownCon meaning that we know the identifier is bound to a saturated application of an unknown constructor, then we could set the cg_lf of the case binder variable of an algebraic case statement to LFUnknownCon, and return ReturnIt for LFUnknownCon variables in getCallMethod. I think that would do it. Does that sound right? Is there a better way?

(In my original example we actually know the constructor has to be I#. But if the case was on a type with more than one constructor we wouldn't know statically which one we got, just that it has to be one of them.)

Change History (7)

comment:1 Changed 5 years ago by dfeuer

Cc: dfeuer added

comment:2 Changed 5 years ago by rwbarton

Cc: dfeuer removed

In fact it doesn't seem strictly necessary to add a new constructor to LambdaFormInfo... this proof-of-concept patch validates and produces the expected improved Cmm; nofib coming next...

diff --git a/compiler/codeGen/StgCmmClosure.hs b/compiler/codeGen/StgCmmClosure.hs
index b65d56b..df0f6be 100644
--- a/compiler/codeGen/StgCmmClosure.hs
+++ b/compiler/codeGen/StgCmmClosure.hs
@@ -22,7 +22,7 @@ module StgCmmClosure (
         LambdaFormInfo,         -- Abstract
         StandardFormInfo,        -- ...ditto...
         mkLFThunk, mkLFReEntrant, mkConLFInfo, mkSelectorLFInfo,
-        mkApLFInfo, mkLFImported, mkLFArgument, mkLFLetNoEscape,
+        mkApLFInfo, mkLFImported, mkLFArgument, mkLFLetNoEscape, mkLFUnknownCon,
         lfDynTag,
         maybeIsLFCon, isLFThunk, isLFReEntrant, lfUpdatable,
 
@@ -241,6 +241,9 @@ mkLFThunk thunk_ty top fvs upd_flag
             NonStandardThunk
             (might_be_a_function thunk_ty)
 
+mkLFUnknownCon :: LambdaFormInfo
+mkLFUnknownCon = LFCon $ error "tried to access constructor of case binder at compile time"
+
 --------------
 might_be_a_function :: Type -> Bool
 -- Return False only if we are *sure* it's a data type
diff --git a/compiler/codeGen/StgCmmExpr.hs b/compiler/codeGen/StgCmmExpr.hs
index b2b64f8..6b6ffef 100644
--- a/compiler/codeGen/StgCmmExpr.hs
+++ b/compiler/codeGen/StgCmmExpr.hs
@@ -441,7 +441,11 @@ cgCase scrut bndr alt_type alts
        ; let sequel = AssignTo alt_regs do_gc{- Note [scrut sequel] -}
        ; ret_kind <- withSequel sequel (cgExpr scrut)
        ; restoreCurrentCostCentre mb_cc
-       ; _ <- bindArgsToRegs ret_bndrs
+       ; case alt_type of
+           AlgAlt _ -> case ret_bndrs of
+             [ret_bndr] -> void $ bindToReg ret_bndr mkLFUnknownCon
+             _ -> error "wat"
+           _ -> void $ bindArgsToRegs ret_bndrs
        ; cgAlts (gc_plan,ret_kind) (NonVoid bndr) alt_type alts
        }

comment:3 Changed 5 years ago by rwbarton

Cc: dfeuer added

(Oops, edit conflict...)

comment:4 Changed 5 years ago by rwbarton

Well, initial results are somewhat inconclusive:

            Min          -0.1%     -0.1%     -6.3%     -6.2%     -7.4%
            Max          +0.1%     +0.0%     +7.5%     +7.5%     +3.6%
 Geometric Mean          -0.0%     -0.0%     -0.2%     -0.2%     -0.0%

I believe the large variations are due to the same kind of issues I encountered in #8279. Got sucked into investigating that for a while, with no real progress.

Anyways it seems like it ought to be a clear win; what's the best way to implement this for real?

comment:5 Changed 5 years ago by simonmar

It ought to be a clear win. To be extra sure I'd compare the sizes of object files in nofib built both ways, find some object files that differ, and then compare the assembly output for those modules, just to be sure that nothing unexpected is happening.

I'd also double-check any nofib results that appear to vary by more than a few percent. You can use perf to count instructions, which should go down.

comment:6 Changed 5 years ago by schyler

Could you show the core difference (for us who don't really understand the original ticket that well)? Also, if you dump and attach the ASM I'd be happy to sift through and find the problem.

Last edited 5 years ago by schyler (previous) (diff)

comment:7 in reply to:  6 Changed 5 years ago by rwbarton

Replying to schyler:

Could you show the core difference (for us who don't really understand the original ticket that well)?

There is no difference at the Core level. In either case the core is

Test.f =
  \ (x_arU :: GHC.Types.Int) ->
    case x_arU of wild_a10j { GHC.Types.I# x1_a10l ->
    case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.<# x1_a10l 0)
    of _ [Occ=Dead] {
      GHC.Types.False -> GHC.Types.I# (GHC.Prim.+# x1_a10l 1);
      GHC.Types.True -> wild_a10j
    }
    }

The difference is that in the generated Cmm, we don't re-enter wild_a10j in the True case, since we know it has already been reduced to WHNF by the first case.

Also, if you dump and attach the ASM I'd be happy to sift through and find the problem.

I don't have a copy of the instances I was looking at any more, but in one case the change seemed to affect the unique names chosen, which apparently led to laying out functions in a different order... annoying for trying to diff.

My main question here is whether it's actually a good idea to use an error/panic in the LambdaFormInfo, or is it better to add a new constructor. I sort of like the error in that if something goes wrong, it's more likely to do so at compile time than at run time; but it's also a bit ugly.

Note: See TracTickets for help on using tickets.