Opened 6 years ago

Last modified 11 months ago

#8326 new task

Place heap checks common in case alternatives before the case

Reported by: jstolarek Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.7
Keywords: CodeGen Cc: bgamari, rwbarton, osa1, tjakway, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking: #8317
Related Tickets: #1498 Differential Rev(s): Phab:D343
Wiki Page:

Description (last modified by simonpj)

We would like to have functions that check whether an Int# is a valid tag to represent Bool (see Note [Optimizing isTrue#] in ghc-prim):

isTrue# :: Int# -> Bool
isTrue# 1# = True
isTrue# _  = False

isFalse# :: Int# -> Bool
isFalse# 0# = True
isFalse# _  = False

We could use them with comparison primops like this:

f :: Int# -> Int
f x | isTrue# (x ># 0#) = I# x
    | otherwise         = -(I# x)

isTrue# is optimized away at the Core level:

A.f =
  \ (x_aqM :: GHC.Prim.Int#) ->
    case GHC.Prim.># x_aqM 0 of _ {
      __DEFAULT -> GHC.Types.I# (GHC.Prim.negateInt# x_aqM);
      1 -> GHC.Types.I# x_aqM
    }

but the code generator produces very bad Cmm code, because it pushes heap checks into the case alternatives:

     {offset
       cFd: // stack check
           if ((Sp + -16) < SpLim) goto cFr; else goto cFs;
       cFr: // not enough place on the stack, call GC
           R2 = R2;
           R1 = A.f_closure;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       cFs: // scrutinize (x ># 0#)
           _sEU::I64 = R2;
           _sEV::I64 = %MO_S_Gt_W64(R2, 0);
           if (_sEV::I64 != 1) goto cFg; else goto cFo;
       cFg: // False branch
           Hp = Hp + 16;
           if (Hp > HpLim) goto cFy; else goto cFx;
       cFy: // not enough heap, call GC
           HpAlloc = 16;
           I64[Sp - 16] = cFf;
           R1 = _sEV::I64;
           I64[Sp - 8] = _sEU::I64;
           Sp = Sp - 16;
           call stg_gc_unbx_r1(R1) returns to cFf, args: 8, res: 8, upd: 8;
       cFf: // re-do the False branch
           _sEU::I64 = I64[Sp + 8];
           Sp = Sp + 16;
           _sEV::I64 = R1;
           goto cFg;
       cFx: // RHS of False branch
           I64[Hp - 8] = GHC.Types.I#_con_info;
           I64[Hp] = -_sEU::I64;
           R1 = Hp - 7;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
       cFo: // True branch
           Hp = Hp + 16;
           if (Hp > HpLim) goto cFv; else goto cFu;
       cFv: // not enough heap, call GC
           HpAlloc = 16;
           I64[Sp - 16] = cFn;
           R1 = _sEV::I64;
           I64[Sp - 8] = _sEU::I64;
           Sp = Sp - 16;
           call stg_gc_unbx_r1(R1) returns to cFn, args: 8, res: 8, upd: 8;
       cFn: // re-do the True branch
           _sEU::I64 = I64[Sp + 8];
           Sp = Sp + 16;
           _sEV::I64 = R1;
           goto cFo;
       cFu: // RHS of True branch
           I64[Hp - 8] = GHC.Types.I#_con_info;
           I64[Hp] = _sEU::I64;
           R1 = Hp - 7;
           call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
     }

This results in a 2.5% increase in binary size on average. By contrast, if we use tagToEnum# instead of isTrue#, the heap check is placed before the case expression and the code is significantly shorter (this is done by a special case-on-bool optimization in the code generator - see #8317). What we would like to do here is:

  1. compile case alternatives without placing heap checks inside them
  2. each compiled alternative should return the amount of heap it needs to allocate
  3. the code generator inspects the amount of heap needed by each alternative and either adds heap checks in the alternatives or puts a single check before the case expression.

Getting this right might be a bit tricky.

  1. if all branches allocate some heap then we can just put a common heap check before the case. Note that we must allocate the highest amount required by any of the alternatives, and then alternatives that use less heap must retract the heap pointer accordingly.
  2. if we have two alternatives, one of which allocates heap and the other does not, we should place the heap check only in the alternative that allocates heap. This will solve #1498.
  3. it is not clear to me what to do if we have a combination of the above (more than one branch that allocates heap and at least one branch that does not). If we place the heap check before the case expression we lose the optimization of recursive functions and face the problem described in #1498. If we push heap checks into the branches that allocate heap then we get code duplication, i.e. the problem that we're addressing in this ticket. I guess the only way to make the correct decision here is to try the different approaches and measure their performance. (A sketch of the simple rule from points 1 and 2 follows below.)
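
A minimal sketch of the rule from points 1 and 2, using made-up types rather than the code generator's real GcPlan - purely illustrative, not GHC code:

data GcPlan
  = GcBeforeCase          -- a single heap check before the case
  | GcInAllocatingAlts    -- heap checks only in the branches that allocate

decideGcPlan :: [Int]     -- bytes allocated by each compiled alternative
             -> GcPlan
decideGcPlan altAllocs
  | all (> 0) altAllocs = GcBeforeCase        -- check for the maximum; branches that use less wind Hp back
  | otherwise           = GcInAllocatingAlts  -- avoid a check on the non-allocating paths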

This ticket is mentioned

  • on this wiki page
  • in the source code, in Note [Optimizing isTrue#] in ghc-prim
  • in Simplify.hs, in Note [Optimising tagToEnum#]

Once this ticket is resolved we need to update these places accordingly.

Attachments (2)

hc-nofib.txt (189.6 KB) - added by rwbarton 4 years ago.
knucleotide-master-and-patch-dumps.tar.bz2 (132.5 KB) - added by michalt 23 months ago.
STG/cmm/asm dumps of k-nucleotide from the current master and the patched GHC


Change History (48)

comment:1 Changed 6 years ago by Jan Stolarek <jan.stolarek@…>

In 53948f915140396acd1b80c6a7a252b2d1e12635/ghc:

Restore old names of comparison primops

In 6579a6c we removed existing comparison primops and introduced new ones
returning Int# instead of Bool. This commit (and associated commits in
array, base, dph, ghc-prim, integer-gmp, integer-simple, primitive, testsuite and
template-haskell) restores old names of primops. This allows us to keep
our API cleaner at the price of not having backwards compatibility.

This patch also temporarily disables the fix for #8317 (optimization of
tagToEnum# at Core level). We need to fix #8326 first, otherwise
our primops code will be very slow.

comment:2 Changed 6 years ago by jstolarek

Description: modified (diff)

comment:3 Changed 5 years ago by dfeuer

I do not understand the theory here—what is the (not-yet-implemented version of) isTrue# supposed to accomplish? If I hand it 12#, it will dutifully tell me False. What exactly have I learned from that? Either I meant for it to return True and I just got a wrong answer or I just used a very confusingly-named function to see if something equals 1#. I could fake C-style booleans if I wanted using isFalse, but it would be much clearer to just explicitly compare something to zero. If you wanted to actually get some kind of safety, you'd need something more invasive, with more potential to slow things down, like maybe

isTrue# x | tagToEnum# ((x `orI#` 1#) ==# 1#)  = tagToEnum# x
          | otherwise                          = error "Oops"

I just don't see how the proposed isTrue# offers any real advantage over tagToEnum#. Using primops is always playing with fire, and something that looks like a safety net but really isn't just invites careless errors.

comment:4 Changed 5 years ago by carter

@dfeuer, (my interpretation is) this ticket is articulating how we'd *like* to write isTrue# and noting some of the optimization engineering that's needed to support writing things in the desired high-level fashion.

Your safety concern is at least in some small part spurious, because some of this optimization/engineering is about how to transform high-level type-safe code into that branch-free form via optimizations in Core, STG and Cmm. That is, end users should not have to deal with this tomfoolery ever, but we'd still like to give them branch-free code when it's safe!

comment:5 Changed 5 years ago by carter

In the meantime, things like isTrue# are not meant to be safe, wrapped things, but tools that make performance engineering manageable and tractable.

comment:6 Changed 5 years ago by carter

More concretely, that change in definition has an impact on other pieces of code generation that need to be fixed in order for that nicer/safer definition to work out.

comment:7 in reply to:  3 ; Changed 5 years ago by jstolarek

Replying to dfeuer:

If I hand it 12#, it will dutifully tell me False. What exactly have I learned from that?

That the argument you passed is not a valid tag for True.

This ticket is not really about isTrue# or isFalse# - which are just tools a programmer might want or not want to use - but about fixing the heap checks and thus fixing #8317. If you feel that isTrue# and isFalse# don't offer you any benefit you can still use tagToEnum#.

comment:8 in reply to:  7 ; Changed 5 years ago by dfeuer

Replying to jstolarek:

Replying to dfeuer:

If I hand it 12#, it will dutifully tell me False. What exactly have I learned from that?

That the argument you passed is not a valid tag for True.

This ticket is not really about isTrue# or isFalse# - which are just tools a programmer might want or not want to use - but about fixing the heap checks and thus fixing #8317. If you feel that isTrue# and isFalse# don't offer you any benefit you can still use tagToEnum#.

I'm not sure where the appropriate place is for this line of discussion, but it seems that in the wild (all over the library source), isTrue# is typically used as a function for converting from Int#, produced by a comparison operator, to Bool, rather than as a validity test for True. I have yet to see any explanation of why that is appropriate. Certainly, optimizing heap checks is an entirely different matter, and presumably a good idea regardless.

comment:9 in reply to:  8 Changed 5 years ago by jstolarek

Replying to dfeuer:

it seems that in the wild (all over the library source), isTrue# is typically used as a function for converting from Int#, produced by a comparison operator, to Bool, rather than as a validity test for True. I have yet to see any explanation of why that is appropriate.

Right, I see what you mean. So the only way the isTrue# and isFalse# functions are "safe" is because they promise to do what their names imply. You've focused on isTrue#, which indeed is identical to tagToEnum# but this is definitely not the case with isFalse#.
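
To make the difference concrete, worked directly from the definitions in the Description (illustration only):

-- isFalse# 12#       = False   -- 12# is not a valid tag for False
-- not (isTrue# 12#)  = True    -- because isTrue# 12# = False
-- So isFalse# is not simply the negation of isTrue# (or of a tagToEnum# conversion).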

comment:10 Changed 5 years ago by jstolarek

Owner: set to jstolarek

comment:11 Changed 5 years ago by jstolarek

Differential Rev(s): D343

comment:12 Changed 5 years ago by jstolarek

Differential Rev(s): D343 → Phab:D343

comment:13 Changed 5 years ago by simonpj

Responding to the writeup on Phab:D343.

Before running off to make special cases for comparisons, look at the relevant code for cgCase:

cgCase scrut bndr alt_type alts
  = -- the general case
    do { dflags <- getDynFlags
       ; up_hp_usg <- getVirtHp        -- Upstream heap usage
       ; let ret_bndrs = chooseReturnBndrs bndr alt_type alts
             alt_regs  = map (idToReg dflags) ret_bndrs
       ; simple_scrut <- isSimpleScrut scrut alt_type
       ; let do_gc  | not simple_scrut = True
                    | isSingleton alts = False
                    | up_hp_usg > 0    = False
                    | otherwise        = True
               -- cf Note [Compiling case expressions]
             gc_plan = if do_gc then GcInAlts alt_regs else NoGcInAlts

       ; mb_cc <- maybeSaveCostCentre simple_scrut

       ; let sequel = AssignTo alt_regs do_gc{- Note [scrut sequel] -}
       ; ret_kind <- withSequel sequel (cgExpr scrut)
       ; restoreCurrentCostCentre mb_cc
       ; _ <- bindArgsToRegs ret_bndrs
       ; cgAlts (gc_plan,ret_kind) (NonVoid bndr) alt_type alts
       }

If do_gc is true, we put heap checks at the start of each branch. If do_gc is false, we take the max of the branches, and do the heap check before the case.

I'll use a running example like this:

  f = \x -> let y = blah
            in case <scrut> of
                  0#      -> <rhs1>
                  DEFAULT -> <rhs2>

Things that affect the do_gc decision:

  • If the scrutinee <scrut> requires any non-trivial work, we MUST have do_gc = True. For example if <scrut> was (g x), then calling g might result in lots of allocation, so any heap check done at the start of f is irrelevant to the branches. They must do their own checks. This is the simple_scrut check. It succeeds on simple finite computations like x +# 1 or x (if x is unboxed).

The other cases all apply to the simple-scrut situation:

  • If there is just one alternative, then it's always good to amalgamate
  • If there is heap allocation in the code before the case (up_hp_usg > 0), then we are going to do a heap check upstream anyway. In that case, don't do one in the alternatives too. (The single check might allocate too much space, but the alternatives that use less space simply move Hp back down again, which only costs one instruction.)
  • Otherwise, if there is no heap allocation upstream, put heap checks in each alternative. The reasoning here was that if one alternative needs heap and the other doesn't, we don't want to pay the runtime cost of the heap check when the heap-free alternative is taken.

Now, what is happening in your example is that

  • There is no upstream heap usage
  • Both alternatives allocate

Result: you get two heap checks instead of one. But if only one branch allocated, you'd probably want to have the heap check in that branch!

So I think the criterion should be that (assuming no upstream allocation)

  • If all the branches allocate, do the heap check before the case
  • Otherwise pay the price of a heap check in each branch

Or alternatively (less code size, slightly slower):

  • If more than one branch allocates, do the heap check before the case
  • If only one allocates, do it in that branch

The difficulty here is that it's hard to find out whether the branches allocate without running the code generator on them, and that's not how the current setup is structured. (When you run the code generator on some code, the monad keeps track of how much allocation is done; see StgCmmMonad.getHeapUsage.) It might well be possible to move things around a bit, but it would need a little care.

But before doing that, the first thing is to decide what the criteria should be.

Simon

comment:14 Changed 5 years ago by jstolarek

Thanks for the detailed explanation. I was a bit concerned about all that extra stuff, like saving cost centres, in the general case. Don't we need to worry that any of this will impact the performance of the compiled code?

The difficulty here is that it's hard to find out whether the branches allocate without running the code generator on them

Yes, I believe this was our conclusion when we discussed this during my internship. I don't see how to use StgCmmMonad.getHeapUsage to implement the solution. We compile all the alternatives in one go by calling cgAlts (gc_plan,ret_kind) (NonVoid bndr) alt_type alts, whereas to apply any of the criteria that you proposed we would need to compile the alternatives one by one and analyse their heap usage. My understanding is that if we simply call cgAlts in a lambda passed to getHeapUsage we will only know how much heap in total is used by the compiled alternatives, but we won't have any detailed knowledge about each individual alternative. Is that correct?

But before doing that, the first thing is to decide what the criteria should be.

It's a bit hard to tell without doing the actual implementation and measuring the results. I guess there will be some programs that work better with the first criterion and some that work better with the second criterion. That being said, I'd go with the first one. Slightly larger binaries are a small price to pay for the possibility of better performance.

comment:15 Changed 5 years ago by simonpj

Well, of course the plumbing would need to change a bit. cgAlts would have to return something saying which branches allocated. The difficulty is that at the moment the gc_plan flag is passed into cgAlts whereas now we are proposing that the plan will depend on something returned by cgAlts. That might be ok, if we tied a recursive knot, provided cgAlts was sufficiently lazy in its "plan" parameter.

More than that I cannot say without looking a lot harder at the code, something you can do just as well as I, perhaps better.

Simon

comment:16 Changed 5 years ago by jstolarek

Well, of course the plumbing would need to change a bit. (...)

Right. I just wanted to make sure that I understood things correctly. I'll try to figure out how to make that change.

comment:17 Changed 5 years ago by jstolarek

Status update:

I tried knot-tying but it doesn't work - cgAlts is strict in gc_plan and changing that doesn't look trivial. The only alternative I see at the moment is to:

a) compile the alternatives without heap checks;
b) examine the heap usage of the compiled alternatives;
c) create a GC plan;
d) add heap checks to the compiled alternatives, if necessary.

That sounds simple but I have no idea how to leverage the FCode monadery to add heap checks to an already-compiled CmmAGraph. Am I right to think that there is currently no plumbing for compiling more code on top of an existing CmmAGraph? So far I have only been able to come up with a prototype that implements points a)-c) above, but instead of d) it re-compiles the alternatives from scratch.

comment:18 Changed 5 years ago by jstolarek

BTW, I think it would be a Good Thing to make the FCode monad an instance of MonadFix and replace StgCmmMonad.fixC with mfix. I think this would make the code easier to read. People know what mfix is, while the implementation of fixC might not be immediately obvious.
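
A minimal sketch of the suggested instance, assuming fixC keeps its current knot-tying signature (fixC :: (a -> FCode a) -> FCode a); a sketch only, not existing GHC code:

import Control.Monad.Fix (MonadFix (..))

instance MonadFix FCode where
  mfix = fixC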

comment:19 Changed 5 years ago by simonpj

I'm sure this is do-able by knot-tying, but it'll need a bit of care.

  • Each alternative can start by emitting a pure blob of code that is a function of the GC plan
  • The GC plan is computed from the heap-allocation info from all the alternatives (this is the knot-tied bit)
  • So it must be possible to run the alternative FCode blobs without yet knowing the GC plan.
  • Currently cgAlts is strict in the plan, but I don't think it needs to be. After all, the code we generate for the alternatives does not depend on whether there's a heap check at the beginning. You probably need to allocate a label regardless since that is a monadic operation.

So not trivial, but ought to be quite possible. Semantically the data dependencies are just fine!
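
A hedged sketch of what that knot might look like; fixC is the existing StgCmmMonad combinator, while cgAltsLazily and altHeapUsages are hypothetical stand-ins for a suitably lazy cgAlts that also reports per-alternative allocation:

-- Sketch only, not working GHC code.
(gc_plan, alt_results) <- fixC $ \ ~(gc_plan, _) -> do    -- lazy pattern: don't force the plan
  alt_results <- cgAltsLazily gc_plan (NonVoid bndr) alt_type alts
  let usages   = altHeapUsages alt_results                 -- allocation per alternative
      gc_plan' | all (> 0) usages = NoGcInAlts             -- single check before the case
               | otherwise        = GcInAlts alt_regs      -- checks inside the branches
  return (gc_plan', alt_results)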

Simon

comment:20 Changed 5 years ago by jstolarek

Yes, data dependencies are fine. My current patch actually does the things outlined above but without knot-tying. And it's buggy at the moment. I'll try to debug my patch and if that fails I'll try the approach you just described above but I admit I'm not too keen on it. The problem is that making cgAlts non-strict in gc_plan seems very non-trivial and seems to require *a lot* of changes in the structure of the code. By contrast, my patch is quite well localized and confined to a small area of code.

You probably need to allocate a label regardless since that is a monadic operation.

I don't understand that bit. Could you elaborate?

Last edited 5 years ago by jstolarek

comment:21 Changed 5 years ago by jstolarek

Owner: jstolarek deleted

I'm unable to make further progress on this ticket. Sorry. I'm unassigning this ticket - perhaps someone else can take over.

comment:22 Changed 5 years ago by bgamari

Cc: bgamari added

comment:23 Changed 4 years ago by rwbarton

Cc: rwbarton added

comment:24 Changed 4 years ago by rwbarton

Even if it were right to do the heap checks in the alternatives, there's another problem with this Cmm: _sEV is being kept live longer than it should be. See #10676 and #8327.

Last edited 4 years ago by rwbarton

comment:25 Changed 4 years ago by rwbarton

Type of failure: None/Unknown → Runtime performance bug

comment:26 Changed 4 years ago by simonpj

Why is this ticket (#8326) blocking #8317? I think it is precisely this: implementing #8317 on its own causes a 2.5% code size increase.

Simon

comment:27 Changed 4 years ago by rwbarton

I'm not convinced that we shouldn't just always do the heap check outside the case, even when there is only one branch that allocates. In the worst case we do an extra memory read (HpLim, probably in L1 cache), a comparison and a branch (which is highly predictable). However

  • In simple functions (like the one in #10676) we don't need to allocate stack if we do the heap check up front. However, a heap check in an alternative in this case requires allocating a stack frame (not sure whether this occurs in all cases, or whether it is avoidable), so then we do have to do a stack check up front instead, which is almost as expensive. (The difference is that SpLim is in a register while HpLim must be stored in memory.) In addition the code becomes substantially larger because we have two separate entries to the GC.
  • In functions which need to allocate stack anyways, putting the heap check outside the case does mean we have to do an extra check when we take a branch that does not allocate. On the other hand, we can then use the same GC entry code for the stack check and the heap check, so the code is again much smaller. This may pay for the extra instructions of the heap check, especially if the non-allocating branch(es) are infrequent anyways.

At any rate it's simpler than the knot-tying discussed in this ticket, so I will try it.

comment:28 Changed 4 years ago by simonpj

Good point. Trying that is an excellent idea. Note that in a tree of (primop) conditionals, if even one branch allocates, then we'll do a heap check in all cases.

It might be interesting to gather the following statistic in ticky-ticky profiling

  • Total number of heap checks
  • Total number of times that Hp is wound back to zero. This winding back is done by adjustHpBackwards. You can tell if you are winding back to zero because vHp is zero.

If we wind back to zero, that means that we allocated nothing, so the original heap check was wasted. Then we can see what proportion of heap checks are wasted.

I guess the worry is that this might happen a lot in some hot inner loop, but let's see.

Thanks for doing this. If it looks reasonable it'd be a much simpler cleaner solution.

comment:29 Changed 4 years ago by bgamari

Simon, indeed tying the knot would at the very least require that two labels be unconditionally allocated and carried rather deeply into StgCmmHeap.

Another issue is heapCheck, which is itself monadic as it ends up calling getHeapUsage. My initial thought was that maybeAltHeapCheck could be made lazy in gc_plan with something like,

maybeAltHeapCheck :: (GcPlan,ReturnKind) -> FCode a -> FCode a
maybeAltHeapCheck gc_plan code = do
  label1 <- allocLabelC
  label2 <- allocLabelC
  info_down <- getInfoDown
  getHeapUsage $ \hpHw -> do
    emit $ case gc_plan of
      (NoGcInAlts, _)                      -> emptyOL
      (GcInAlts regs, AssignedDirectly)    -> ...
      (GcInAlts regs, ReturnedTo lref off) -> ...
    code

Unfortunately this becomes quite invasive as you must pass those labels and the CgInfoDownwards pretty deeply into StgCmmHeap. It may be that a refactoring will help a bit here but it is still likely to be quite messy as you essentially have to reimplement any helper you might need in the heap check body itself to avoid FCode (e.g. mkCall). Perhaps I am missing an obvious solution here?

In summary, as far as I can see there are several paths through the case alternative heap-check code (the code path column below shows the number of newLabelCs appearing in each function in parentheses),

gc_plan     ret_kind          canned entry-pt.  labels needed  code path
NoGcInAlts  *                 *                 0              just run the code
GcInAlts    AssignedDirectly  no                1              altOrNoEscapeHeapCheck -> genericGC (1) -> heapCheck
GcInAlts    AssignedDirectly  yes               2              altOrNoEscapeHeapCheck (2) -> cannedGCReturnsTo -> heapCheck
GcInAlts    ReturnedTo        no                1              altHeapCheckReturnsTo -> genericGC (1) -> heapCheck
GcInAlts    ReturnedTo        yes               0              altHeapCheckReturnsTo -> cannedGCReturnsTo -> heapCheck
Last edited 4 years ago by bgamari

comment:30 Changed 4 years ago by simonpj

Before delving into this, let's try the simpler approach you suggested!

simon

comment:31 Changed 4 years ago by bgamari

Actually, now that I look a bit more carefully StgCmmMonad.codeOnly may be exactly what is needed here. I'll need to think a bit more to make sure this breaks all of the potential cycles, but I am optimistic.

comment:32 Changed 4 years ago by bgamari

I think the issue here is that while on paper the code we emit is independent of our GcPlan, in practice we emit the code in most cases in heapCheck and how we end up in heapCheck is very much dependent on the GcPlan.

I believe this almost does what we want,

maybeAltHeapCheck :: (GcPlan,ReturnKind) -> FCode a -> FCode a
maybeAltHeapCheck gc_plan code = do
    codeOnly $ case gc_plan of
      (NoGcInAlts,_)                       -> code
      (GcInAlts regs, AssignedDirectly)    -> altHeapCheck regs code
      (GcInAlts regs, ReturnedTo lret off) -> altHeapCheckReturnsTo regs lret off code

The trouble is you have now thrown away the heap usage information from the code you emitted in the NoGcInAlts case.

Alternatively, you can lift the code emission out of heapCheck, but this breaks its nice interface,

maybeAltHeapCheck :: (GcPlan,ReturnKind) -> FCode a -> FCode a
maybeAltHeapCheck gc_plan code = do
    codeOnly $ case gc_plan of
      (NoGcInAlts,_)                       -> return ()
      -- These now only use 'code' to compute the heap usage and do not emit the statements it produces
      (GcInAlts regs, AssignedDirectly)    -> altHeapCheck regs code
      (GcInAlts regs, ReturnedTo lret off) -> altHeapCheckReturnsTo regs lret off code
    code

Anyways, let's see how Reid's approach performs; perhaps this is all irrelevant.

Last edited 4 years ago by bgamari

comment:33 Changed 4 years ago by simonpj

Anyways, let's see how Reid's approach performs; perhaps this is all irrelevant

Exactly!

Changed 4 years ago by rwbarton

Attachment: hc-nofib.txt added

comment:34 Changed 4 years ago by rwbarton

Attached nofib results are for the patch

        ; simple_scrut <- isSimpleScrut scrut alt_type
        ; let do_gc  | not simple_scrut = True
-                    | isSingleton alts = False
-                    | up_hp_usg > 0    = False
-                    | otherwise        = True
+                    | otherwise        = False -- ticket:8326#comment:27
                -- cf Note [Compiling case expressions]
              gc_plan = if do_gc then GcInAlts alt_regs else NoGcInAlts

As can be seen from the fact that very few Module Sizes changed, this actually rarely makes a difference. The reason is that there are two special cases of cgCase that always use NoGcInAlts. (I don't know why they do so, perhaps an oversight.)

  • If the case has algebraic alternatives, then either the scrutinee is not simple and we must use GcInAlts, or the scrutinee is an application of tagToEnum# and the first special case applies and always uses NoGcInAlts.
  • If the case has primitive alternatives, then when the scrutinee is simply a variable, the second special case applies and always uses NoGcInAlts.

So this patch only makes a difference when all of the following hold:

  • the case has primitive alternatives
  • the scrutinee is an application of a primop that does not allocate, so that it counts as simple (most primops do not allocate)
  • there is more than one alternative
  • there is no upstream heap check already
  • at least one alternative actually allocates (often CPR analysis has moved an allocation outside of the case)

The combination of the second and third items is fairly rare: it means you are comparing the result of a primop against a constant. A typical example would be testing whether an Int is even or odd.
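
A hypothetical example of code meeting all of these conditions at once (not taken from nofib; purely illustrative):

{-# LANGUAGE MagicHash #-}
module Parity where

import GHC.Exts (Int (I#), remInt#)

-- The scrutinee is a simple, non-allocating primop, there are two primitive
-- alternatives, both of which allocate a constructor, and there is no
-- upstream heap check - exactly the shape whose heap-check placement this
-- ticket is about.
parity :: Int -> Either Int Int
parity (I# x) = case remInt# x 2# of
                  0# -> Left  (I# x)   -- even
                  _  -> Right (I# x)   -- odd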

Basically my conclusions are that

  • it's acceptable to apply this patch and always do the heap check outside the case here, since nofib did not find any significant regressions
  • it may still be worthwhile to try to make better decisions about whether to do heap checks in the alternatives, but then we should also do so in the special cases of cgCase

comment:35 Changed 4 years ago by simonpj

Terrific. Could you try the effect of re-enabling the fix from 8317#comment:2? The code is still in Simplify.hs, just commented out. Look for "Disabled until we fix #8326".

If that works (no perf cost, no binary size increase), then the next step is to remove the special case Note [case on bool] in StgCmmExpr; it will never be used because the Simplify change will catch the case first.

Might you try those two steps? Thanks!

Simon

comment:36 Changed 4 years ago by bgamari

Reid, any progress here?

comment:37 Changed 4 years ago by rwbarton

I ran a nofib benchmark with these changes and there was mostly no effect, aside from a few changes in the plus-or-minus 5-10% range that I could not explain. When I find a larger block of free time I'll return to this subject (should be more pleasant now that ghcspeed will benchmark wip branches).

comment:38 in reply to:  37 Changed 4 years ago by jstolarek

Replying to rwbarton:

a few changes in the plus-or-minus 5-10% range that I could not explain

Random idea: check the ordering of assembly blocks (a.k.a. #8082)

comment:39 Changed 4 years ago by osa1

Cc: osa1 added

comment:40 Changed 3 years ago by tjakway

Cc: tjakway added

comment:41 Changed 2 years ago by michalt

Cc: michalt added

comment:42 Changed 23 months ago by michalt

What is the status of this ticket?

I've tried the patch suggested in comment:34, but my nofib results were quite different, with some clear regressions (below I've removed anything where the difference was < 2%):

NoFib Results

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            CSD          -0.4%      0.0%     +7.4%     +7.4%      0.0%
              S          -0.5%      0.0%     +3.2%     +3.3%      0.0%
             VS          -0.5%      0.0%     -4.4%     -4.4%      0.0%
            VSM          -0.5%      0.0%     +8.7%     +8.7%      0.0%
           bspt          -0.5%      0.0%     0.003     0.003    +50.0%
    constraints          -0.4%      0.0%     +2.5%     +2.6%      0.0%
   cryptarithm1          -0.4%      0.0%     +6.0%     +6.0%      0.0%
    exact-reals          -1.2%      0.0%     -2.7%     -2.7%      0.0%
 fannkuch-redux          -0.4%      0.0%     -2.5%     -2.5%      0.0%
          fasta          -0.4%      0.0%     +7.7%     +7.6%      0.0%
   k-nucleotide          -0.8%     +0.0%    +17.9%    +18.0%      0.0%
         lambda          -0.4%      0.0%     -2.2%     -2.1%      0.0%
         linear          -1.2%      0.0%     -5.9%     -5.9%      0.0%
           mate          -0.4%      0.0%     -2.2%     -2.2%      0.0%
         n-body          -0.9%      0.0%     +3.2%     +3.2%      0.0%
--------------------------------------------------------------------------------
            Min          -1.4%     -0.2%     -5.9%     -5.9%      0.0%
            Max          -0.2%     +0.0%    +17.9%    +18.0%    +50.0%
 Geometric Mean          -0.7%     -0.0%     +1.7%     +1.6%     +0.4%

I've tried to have a look into k-nucleotide, which slowed down the most. My current understanding is that there's a tight loop within a single function that goes like this:

  • A: some computation, eventually goes to B
  • B: case on (variable `andI#` 127#) (one alternative is the slow path that allocates, the other is the fast path that doesn't)
  • C: alternative that does a bit of computation (but no allocation) and jumps back to A

Now, with the change, we get a heap check in front of the case, which is executed on every iteration and slows everything down.
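
A hypothetical Haskell sketch of that loop shape (not the actual k-nucleotide code): the DEFAULT alternative is the hot path that allocates nothing and jumps straight back, while the 0# alternative is the cold path that allocates.

{-# LANGUAGE MagicHash #-}
module LoopShape where

import GHC.Exts

loop :: Int# -> Int# -> Int
loop 0# acc = I# acc                      -- loop exit
loop i  acc =
  case i `andI#` 127# of
    0# -> case slowPath (I# acc) of       -- cold path: allocates boxed Ints
            I# acc' -> loop (i -# 1#) acc'
    _  -> loop (i -# 1#) (acc +# 1#)      -- hot path: no allocation

slowPath :: Int -> Int
slowPath n = n * 2                        -- stand-in for the allocating work

With a heap check hoisted in front of the case, the hot path pays for it on every iteration, which matches the slowdown described above.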

NOTE: I don't have much experience with investigations like this, so the whole analysis might be quite wrong. ;) Please let me know if something seems off. I'll attach the dump of STG/cmm/asm from both versions of k-nucleotide (with and without the patch).

Changed 23 months ago by michalt

STG/cmm/asm dumps of k-nucleotide from the current master and the patched GHC

comment:43 Changed 23 months ago by simonpj

What Michael says in comment:42 seems to be exactly what item (3) in the Description is all about. If the hot path does not allocate, then adding an allocation check into the hot path will cost time, even if it reduces binary size.

comment:44 in reply to:  42 Changed 16 months ago by AndreasK

Replying to michalt:

What is the status of this ticket?

I've tried the patch suggested in comment:34, but my results of nofib were quite different, with some clear regressions (below I've removed anything where the difference was <2%)

NOTE: I don't have much experience with investigations like this, so the whole analysis might be quite wrong. ;) Please let me know if something seems off. I'll attach the dump of STG/cmm/asm from both versions of k-nucleotide (with and without the patch).

When I experimented with the order in which we generate uniques, I also got a regression of ~18% for one of the shootout benchmarks; I think it was k-nucleotide, but it could have been another one.

So while I don't doubt that there is a regression for k-nucleotide with this patch, it doesn't have to be because the code we generate is worse in the general case. One really has to look at the asm/Cmm for that.

it is not clear to me what to do if we have combination of the above (more than one branch that allocates heap and at least one branch that does not).

In the long run we should do some static analysis to help us determine the hot code path, e.g. distinguish between:

  • Alternatives leading to recursion
  • Alternatives being called once.
  • Bottoming alternatives.

There are some ideas and work on that in #14672.

comment:45 Changed 12 months ago by simonpj

Keywords: CodeGen added

comment:46 Changed 11 months ago by simonpj

Description: modified (diff)