Opened 6 years ago
Last modified 11 months ago
#8326 new task
Place heap checks common in case alternatives before the case
| Reported by: | jstolarek | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | Compiler | Version: | 7.7 |
| Keywords: | CodeGen | Cc: | bgamari, rwbarton, osa1, tjakway, michalt |
| Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
| Type of failure: | Runtime performance bug | Test Case: | |
| Blocked By: | | Blocking: | #8317 |
| Related Tickets: | #1498 | Differential Rev(s): | Phab:D343 |
| Wiki Page: | | | |
Description
We would like to have functions that check whether an Int# is a valid tag to represent Bool (see Note [Optimizing isTrue#] in ghc-prim):

```haskell
isTrue# :: Int# -> Bool
isTrue# 1# = True
isTrue# _  = False

isFalse# :: Int# -> Bool
isFalse# 0# = True
isFalse# _  = False
```
We could use them with comparison primops like this:
```haskell
f :: Int# -> Int
f x | isTrue# (x ># 0#) = I# x
    | otherwise         = -(I# x)
```
isTrue# is optimized away at the Core level:
```
A.f = \ (x_aqM :: GHC.Prim.Int#) ->
        case GHC.Prim.># x_aqM 0 of _ {
          __DEFAULT -> GHC.Types.I# (GHC.Prim.negateInt# x_aqM);
          1         -> GHC.Types.I# x_aqM
        }
```
but the code generator produces very bad Cmm code, because it pushes heap checks into the case alternatives:
```
{offset
  cFd: // stack check
      if ((Sp + -16) < SpLim) goto cFr; else goto cFs;
  cFr: // not enough place on the stack, call GC
      R2 = R2;
      R1 = A.f_closure;
      call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
  cFs: // scrutinize (x ># 0#)
      _sEU::I64 = R2;
      _sEV::I64 = %MO_S_Gt_W64(R2, 0);
      if (_sEV::I64 != 1) goto cFg; else goto cFo;
  cFg: // False branch
      Hp = Hp + 16;
      if (Hp > HpLim) goto cFy; else goto cFx;
  cFy: // not enough heap, call GC
      HpAlloc = 16;
      I64[Sp - 16] = cFf;
      R1 = _sEV::I64;
      I64[Sp - 8] = _sEU::I64;
      Sp = Sp - 16;
      call stg_gc_unbx_r1(R1) returns to cFf, args: 8, res: 8, upd: 8;
  cFf: // re-do the False branch
      _sEU::I64 = I64[Sp + 8];
      Sp = Sp + 16;
      _sEV::I64 = R1;
      goto cFg;
  cFx: // RHS of False branch
      I64[Hp - 8] = GHC.Types.I#_con_info;
      I64[Hp] = -_sEU::I64;
      R1 = Hp - 7;
      call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
  cFo: // True branch
      Hp = Hp + 16;
      if (Hp > HpLim) goto cFv; else goto cFu;
  cFv: // not enough heap, call GC
      HpAlloc = 16;
      I64[Sp - 16] = cFn;
      R1 = _sEV::I64;
      I64[Sp - 8] = _sEU::I64;
      Sp = Sp - 16;
      call stg_gc_unbx_r1(R1) returns to cFn, args: 8, res: 8, upd: 8;
  cFn: // re-do the True branch
      _sEU::I64 = I64[Sp + 8];
      Sp = Sp + 16;
      _sEV::I64 = R1;
      goto cFo;
  cFu: // RHS of True branch
      I64[Hp - 8] = GHC.Types.I#_con_info;
      I64[Hp] = _sEU::I64;
      R1 = Hp - 7;
      call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
}
```
This results in an average 2.5% increase in binary size. By contrast, if we use tagToEnum# instead of isTrue#, the heap check will be placed before the case expression and the code will be significantly shorter (this is done by a special case-on-bool optimization in the code generator - see #8317). What we would like to do here is:
- compile case alternatives without placing heap checks inside them
- have each compiled alternative report the amount of heap it needs to allocate
- let the code generator inspect the amounts of heap needed by the alternatives and either add heap checks in the alternatives or place a single check before the case expression.
Getting this right might be a bit tricky.
- if all branches allocate some heap then we can just put a common heap check before the case. Note that we must allocate the highest amount required by any of the alternatives, and the alternatives that use less heap must then retract the heap pointer accordingly.
- if we have two alternatives, one of which allocates heap and the other does not, we should place the heap check only in the alternative that allocates heap. This will solve #1498.
- it is not clear to me what to do if we have a combination of the above (more than one branch that allocates heap and at least one branch that does not). If we place the heap check before the case expression we lose the optimization of recursive functions and face the problem described in #1498. If we push heap checks into the branches that allocate heap then we get code duplication, i.e. the problem that we're addressing in this ticket. I guess the only way to make a correct decision here is to try different approaches and measure their performance. (A sketch of the decision logic follows below.)
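For illustration only, here is a minimal Haskell sketch of the decision being proposed. Everything in it (GcPlan', chooseGcPlan, the usage-in-words representation) is hypothetical and not part of GHC's code generator:

```haskell
-- Hypothetical sketch: given the heap (in words) that each compiled
-- alternative reports, pick where the heap check should go.
data GcPlan' = GcBeforeCase        -- one common check before the case
             | GcInAllocatingAlts  -- checks only in branches that allocate

chooseGcPlan :: [Int] -> GcPlan'
chooseGcPlan usages
  | all (> 0) usages = GcBeforeCase        -- every branch allocates
  | otherwise        = GcInAllocatingAlts  -- some branch is heap-free
```

The mixed case (several allocating branches plus a heap-free one) is exactly the open question above; the sketch simply applies the second criterion to it.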
This ticket is mentioned
- on this wiki page
- in the source code in Note [Optimizing isTrue#] in ghc-prim.
- in Simplify.hs, Note [Optimising tagToEnum#]
Once this ticket is resolved we need to update these places accordingly.
Attachments (2)
Change History (48)
comment:1 Changed 6 years ago by
comment:2 Changed 6 years ago by
Description: | modified (diff) |
---|
comment:3 follow-up: 7 Changed 5 years ago by
I do not understand the theory here—what is the (not-yet-implemented version of) isTrue# supposed to accomplish? If I hand it 12#, it will dutifully tell me False. What exactly have I learned from that? Either I meant for it to return True and I just got a wrong answer, or I just used a very confusingly-named function to see if something equals 1#. I could fake C-style booleans if I wanted using isFalse#, but it would be much clearer to just explicitly compare something to zero. If you wanted to actually get some kind of safety, you'd need something more invasive, with more potential to slow things down, like maybe
```haskell
isTrue# x
  | tagToEnum# ((x `orI#` 1#) ==# 1#) = tagToEnum# x
  | otherwise                         = error "Oops"
```
I just don't see how the proposed isTrue# offers any real advantage over tagToEnum#. Using primops is always playing with fire, and something that looks like a safety net but really isn't just invites careless errors.
comment:4 Changed 5 years ago by
@dfeuer, (my interpretation is) this ticket is articulating how we'd *like* to write isTrue#, and noting some of the optimization engineering that's needed to support writing things in the desired high-level fashion.
Your safety concern is at least in some small part spurious, because some of this optimization/engineering is about how to transform high-level type-safe code into that branch-free form as optimizations in Core, STG and Cmm. That is, end users should not have to deal with this tomfoolery ever, but we'd still like to give them branch-free code when it's safe!
comment:5 Changed 5 years ago by
In the meantime, things like isTrue# are not meant to be safe wrapped things, but tools that make performance engineering manageable and tractable.
comment:6 Changed 5 years ago by
More concretely, that change in definition has an impact on other pieces of code generation that need to be fixed in order for that nicer/safer definition to work out.
comment:7 follow-up: 8 Changed 5 years ago by
Replying to dfeuer:
> If I hand it 12#, it will dutifully tell me False. What exactly have I learned from that?
That the argument you passed is not a valid tag for True.
This ticket is not really about isTrue# or isFalse# - which are just tools a programmer might want or not want to use - but about fixing the heap checks and thus fixing #8317. If you feel that isTrue# and isFalse# don't offer you any benefit you can still use tagToEnum#.
comment:8 follow-up: 9 Changed 5 years ago by
Replying to jstolarek:

> Replying to dfeuer:
>> If I hand it 12#, it will dutifully tell me False. What exactly have I learned from that?
>
> That the argument you passed is not a valid tag for True.
>
> This ticket is not really about isTrue# or isFalse# - which are just tools a programmer might want or not want to use - but about fixing the heap checks and thus fixing #8317. If you feel that isTrue# and isFalse# don't offer you any benefit you can still use tagToEnum#.
I'm not sure where the appropriate place is for this line of discussion, but it seems that in the wild (all over the library source), isTrue# is typically used as a function for converting from Int#, produced by a comparison operator, to Bool, rather than as a validity test for True. I have yet to see any explanation of why that is appropriate. Certainly, optimizing heap checks is an entirely different matter, and presumably a good idea regardless.
comment:9 Changed 5 years ago by
Replying to dfeuer:

> it seems that in the wild (all over the library source), isTrue# is typically used as a function for converting from Int#, produced by a comparison operator, to Bool, rather than as a validity test for True. I have yet to see any explanation of why that is appropriate.
Right, I see what you mean. So the only way the isTrue# and isFalse# functions are "safe" is that they promise to do what their names imply. You've focused on isTrue#, which indeed is identical to tagToEnum#, but this is definitely not the case with isFalse#.
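For concreteness, a small self-contained example of that difference, with isFalse# defined as in the Description (it is not exported by ghc-prim):

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int#, tagToEnum#)

-- isFalse# as proposed in the Description (hypothetical, not in ghc-prim):
isFalse# :: Int# -> Bool
isFalse# 0# = True
isFalse# _  = False

-- For the tag 0#, tagToEnum# yields False while isFalse# yields True,
-- so the two are not interchangeable:
demo :: (Bool, Bool)
demo = (tagToEnum# 0# :: Bool, isFalse# 0#)   -- (False, True)
```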
comment:10 Changed 5 years ago by
Owner: | set to jstolarek |
---|
comment:11 Changed 5 years ago by
Differential Rev(s): | → D343 |
---|
comment:12 Changed 5 years ago by
Differential Rev(s): | D343 → Phab:D343 |
---|
comment:13 Changed 5 years ago by
Responding to the writeup on Phab:D343.
Before running off to make special cases for comparisons, look at the relevant code for cgCase:
```haskell
cgCase scrut bndr alt_type alts
  = -- the general case
    do { dflags <- getDynFlags
       ; up_hp_usg <- getVirtHp        -- Upstream heap usage
       ; let ret_bndrs = chooseReturnBndrs bndr alt_type alts
             alt_regs  = map (idToReg dflags) ret_bndrs
       ; simple_scrut <- isSimpleScrut scrut alt_type
       ; let do_gc | not simple_scrut = True
                   | isSingleton alts = False
                   | up_hp_usg > 0    = False
                   | otherwise        = True
                   -- cf Note [Compiling case expressions]
             gc_plan = if do_gc then GcInAlts alt_regs else NoGcInAlts

       ; mb_cc <- maybeSaveCostCentre simple_scrut
       ; let sequel = AssignTo alt_regs do_gc{- Note [scrut sequel] -}
       ; ret_kind <- withSequel sequel (cgExpr scrut)
       ; restoreCurrentCostCentre mb_cc
       ; _ <- bindArgsToRegs ret_bndrs
       ; cgAlts (gc_plan,ret_kind) (NonVoid bndr) alt_type alts }
```
If do_gc is true, we put heap checks at the start of each branch. If do_gc is false, we take the max of the branches, and do the heap check before the case.
I'll use a running example like this:
```haskell
f = \x -> let y = blah in
          case <scrut> of
            0#      -> <rhs1>
            DEFAULT -> <rhs2>
```
Things that affect the do_gc decision:
- If the scrutinee <scrut> requires any non-trivial work, we MUST have do_gc = True. For example, if <scrut> was (g x), then calling g might result in lots of allocation, so any heap check done at the start of f is irrelevant to the branches. They must do their own checks. This is the simple_scrut check. It succeeds on simple finite computations like x +# 1 or x (if x is unboxed).
The other cases are all for the simple-scrut situation:
- If there is just one alternative, then it's always good to amalgamate
- If there is heap allocation in the code before the case (up_hp_usg > 0), then we are going to do a heap check upstream anyway. In that case, don't do one in the alternatives too. (The single check might allocate too much space, but the alternatives that use less space simply move Hp back down again, which only costs one instruction.)
- Otherwise, if there is no heap allocation upstream, put heap checks in each alternative. The reasoning here was that if one alternative needs heap and the other one doesn't, we don't want to pay the runtime cost of the heap check in the case where the heap-free alternative is taken.
Now, what is happening in your example is that
- There is no upstream heap usage
- Both alternatives allocate
Result: you get two heap checks instead of one. But if only one branch allocated, you'd probably want to have the heap check in that branch!
So I think the criterion should be that (assuming no upstream allocation)
- If all the branches allocate, do the heap check before the case
- Otherwise pay the price of a heap check in each branch
Or alternatively (less code size, slightly slower)
- If more than one branch allocates, do the heap check before the case
- If only one allocates, do it in the branch
The difficulty here is that it's hard to find out whether the branches allocate without running the code generator on them, and that's not how the current setup is structured. (When you run the code generator on some code, the monad keeps track of how much allocation is done; see StgCmmMonad.getHeapUsage.) It might well be possible to move things around a bit, but it would need a little care.
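For reference, the combinator mentioned above has (roughly) this shape; the wrapper below is a hypothetical illustration, not real GHC code:

```haskell
-- StgCmmMonad.getHeapUsage, roughly: run an FCode computation and
-- feed it (via a lazy knot) the total virtual heap it allocates.
--   getHeapUsage :: (VirtualHpOffset -> FCode a) -> FCode a

-- Hypothetical use: compile one alternative and learn its allocation.
-- 'cgOneAlt' is an illustrative argument, not a real GHC function.
usageOfAlt :: FCode () -> FCode VirtualHpOffset
usageOfAlt cgOneAlt =
  getHeapUsage $ \hpHw -> do   -- hpHw: words allocated by the body
    cgOneAlt
    return hpHw
```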
But before doing that, the first thing is to decide what the criteria should be.
Simon
comment:14 Changed 5 years ago by
Thanks for the detailed explanation. I was a bit concerned about all that extra stuff, like saving cost centres, in the general case. Don't we need to worry that any of these will impact the performance of the compiled code?
> The difficulty here is that it's hard to find out whether the branches allocate without running the code generator on them
Yes, I believe this was our conclusion when we discussed this during my internship. I don't see how to use StgCmmMonad.getHeapUsage to implement the solution. We compile all the alternatives in one go by calling cgAlts (gc_plan,ret_kind) (NonVoid bndr) alt_type alts, whereas to apply any of the criteria that you proposed we would need to compile the alternatives one by one and analyse their heap usage. My understanding is that if we simply call cgAlts in a lambda passed to getHeapUsage we will only know how much heap in total is used by the compiled alternatives, but we won't have any detailed knowledge about each individual alternative. Is that correct?
> But before doing that, the first thing is to decide what the criteria should be.
It's a bit hard to tell without doing the actual implementation and measuring the results. I guess there will be some programs that work better with the first criterion and some that work better with the second. That being said, I'd go with the first one. Slightly larger binaries are a small price to pay for the possibility of better performance.
comment:15 Changed 5 years ago by
Well, of course the plumbing would need to change a bit. cgAlts would have to return something saying which branches allocated. The difficulty is that at the moment the gc_plan flag is passed into cgAlts, whereas now we are proposing that the plan will depend on something returned by cgAlts. That might be ok, if we tied a recursive knot, provided cgAlts was sufficiently lazy in its "plan" parameter.
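The general shape of such a knot can be shown in a self-contained toy, using mfix from Control.Monad.Fix in place of FCode's fixC; the Plan type and compileAlts are made up for the example:

```haskell
import Control.Monad.Fix (mfix)

data Plan = GcBeforeCase | GcInAlts deriving Show

-- Toy "code generator": returns generated code plus per-branch heap
-- usage. Crucially, it is lazy in the plan: the plan is only forced
-- inside the generated code, never when producing the usage figures.
compileAlts :: Plan -> IO (String, [Int])
compileAlts plan = return ("code with " ++ show plan, [16, 0])

main :: IO ()
main = do
  -- The plan is computed from the usage that compileAlts itself
  -- returns; the lazy pattern (~) keeps the knot from tightening.
  (code, usage) <- mfix $ \ ~(_, usage') ->
    compileAlts (if all (> 0) usage' then GcBeforeCase else GcInAlts)
  putStrLn code   -- "code with GcInAlts", since one branch is heap-free
  print usage
```

In FCode the same trick would go through StgCmmMonad.fixC and hinges, exactly as described above, on cgAlts being sufficiently lazy in its plan argument.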
More than that I cannot say without looking a lot harder at the code, something you can do just as well as I, perhaps better.
Simon
comment:16 Changed 5 years ago by
> Well, of course the plumbing would need to change a bit. (...)
Right. I just wanted to make sure that I understood things correctly. I'll try to figure out how to make that change.
comment:17 Changed 5 years ago by
Status update:
I tried knot-tying but it doesn't work - cgAlts is strict in gc_plan and changing that doesn't look trivial. The only alternative I see at the moment is to:
a) compile the alternatives without heap checks;
b) examine the heap usage of the compiled alternatives;
c) create a GC plan;
d) add heap checks to the compiled alternatives, if necessary.
That sounds simple, but I have no idea how to leverage FCode monadery to add heap checks to an already compiled CmmAGraph. Am I right to think that currently there is no plumbing for compiling more code on top of an already existing CmmAGraph? So far I was only able to come up with a prototype that implements points a-c above but, instead of d), re-compiles the alternatives from scratch.
comment:18 Changed 5 years ago by
BTW, I think it would be a Good Thing to make the FCode monad an instance of MonadFix and replace StgCmmMonad.fixC with mfix. I think this would make the code easier to read. People know what mfix is, while the implementation of fixC might not be immediately obvious.
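A minimal sketch of that instance, assuming fixC keeps its current type fixC :: (a -> FCode a) -> FCode a:

```haskell
import Control.Monad.Fix (MonadFix (..))

-- Sketch only; FCode and fixC live in StgCmmMonad.
instance MonadFix FCode where
  mfix = fixC
```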
comment:19 Changed 5 years ago by
I'm sure this is do-able by knot-tying, but it'll need a bit of care.
- Each alternative can start by emitting a pure blob of code that is a function of the GC plan
- The GC plan is computed from the heap-allocation info from all the alternatives (this is the knot-tied bit)
- So it must be possible to run the alternative FCode blobs without yet knowing the GC plan.
- Currently cgAlts is strict in the plan, but I don't think it needs to be. After all, the code we generate for the alternatives does not depend on whether there's a heap check at the beginning. You probably need to allocate a label regardless, since that is a monadic operation.
So not trivial, but ought to be quite possible. Semantically the data dependencies are just fine!
Simon
comment:20 Changed 5 years ago by
Yes, data dependencies are fine. My current patch actually does the things outlined above, but without knot-tying. And it's buggy at the moment. I'll try to debug my patch and, if that fails, I'll try the approach you just described, but I admit I'm not too keen on it. The problem is that making cgAlts non-strict in gc_plan seems very non-trivial and seems to require *a lot* of changes to the structure of the code. By contrast, my patch is quite well localized and confined to a small area of code.
> You probably need to allocate a label regardless since that is a monadic operation.
I don't understand that bit. Could you elaborate?
comment:21 Changed 5 years ago by
Owner: | jstolarek deleted |
---|
I'm unable to make further progress on this ticket, sorry. I'm unassigning it - perhaps someone else can take over.
comment:22 Changed 5 years ago by
Cc: | bgamari added |
---|
comment:23 Changed 4 years ago by
Cc: | rwbarton added |
---|
comment:24 Changed 4 years ago by
comment:25 Changed 4 years ago by
Type of failure: | None/Unknown → Runtime performance bug |
---|
comment:26 Changed 4 years ago by
comment:27 Changed 4 years ago by
I'm not convinced that we shouldn't just always do the heap check outside the case, even when there is only one branch that allocates. In the worst case we do an extra memory read (HpLim, probably in L1 cache), a comparison and a branch (which is highly predictable). However
- In simple functions (like the one in #10676) we don't need to allocate stack if we do the heap check up front. However, a heap check in an alternative in this case requires allocating a stack frame (not sure whether this occurs in all cases, or whether it is avoidable), so then we do have to do a stack check up front instead, which is almost as expensive. (The difference is that SpLim is in a register while HpLim must be stored in memory.) In addition, the code becomes substantially larger because we have two separate entries to the GC.
- In functions which need to allocate stack anyway, putting the heap check outside the case does mean we have to do an extra check when we take a branch that does not allocate. On the other hand, we can then use the same GC entry code for the stack check and the heap check, so the code is again much smaller. This may pay for the extra instructions of the heap check, especially if the non-allocating branch(es) are infrequent anyway.
At any rate it's simpler than the knot-tying discussed in this ticket, so I will try it.
comment:28 Changed 4 years ago by
Good point. Trying that is an excellent idea. Note that in a tree of (primop) conditionals, if even one branch allocates, then we'll do a heap check in all cases.
It might be interesting to gather the following statistic in ticky-ticky profiling
- Total number of heap checks
- Total number of times that Hp is wound back to zero. This winding back is done by adjustHpBackwards. You can tell if you are winding back to zero because vHp is zero.
If we wind back to zero, that means that we allocated nothing, so the original heap check was wasted. Then we can see what proportion of heap checks are wasted.
I guess the worry is that this might happen a lot in some hot inner loop, but let's see.
Thanks for doing this. If it looks reasonable it'd be a much simpler cleaner solution.
comment:29 Changed 4 years ago by
Simon, indeed tying the knot would at the very least require that two labels be unconditionally allocated and carried rather deeply into StgCmmHeap.
Another issue is heapCheck, which is itself monadic as it ends up calling getHeapUsage. My initial thought was that maybeAltHeapCheck could be made lazy in gc_plan with something like,
```haskell
maybeAltHeapCheck :: (GcPlan,ReturnKind) -> FCode a -> FCode a
maybeAltHeapCheck gc_plan code = do
    label1 <- allocLabelC
    label2 <- allocLabelC
    info_down <- getInfoDown
    getHeapUsage $ \hpHw -> do
        emit $ case gc_plan of
            (NoGcInAlts, _)                      -> emptyOL
            (GcInAlts regs, AssignedDirectly)    -> ...
            (GcInAlts regs, ReturnedTo lref off) -> ...
        code
```
Unfortunately this becomes quite invasive, as you must pass those labels and the CgInfoDownwards pretty deeply into StgCmmHeap. It may be that a refactoring will help a bit here, but it is still likely to be quite messy, as you essentially have to reimplement any helper you might need in the heap-check body itself to avoid FCode (e.g. mkCall). Perhaps I am missing an obvious solution here?
In summary, as far as I can see there are several paths through the case-alternative heap-check code (the code path column below shows the number of newLabelCs appearing in each function in parentheses):
| gc_plan | ret_kind | canned entry-pt. | labels needed | code path |
|---|---|---|---|---|
| NoGcInAlts | * | * | 0 | just run the code |
| GcInAlts | AssignedDirectly | no | 1 | altOrNoEscapeHeapCheck -> genericGC (1) -> heapCheck |
| GcInAlts | AssignedDirectly | yes | 2 | altOrNoEscapeHeapCheck (2) -> cannedGCReturnsTo -> heapCheck |
| GcInAlts | ReturnedTo | no | 1 | altHeapCheckReturnsTo -> genericGC (1) -> heapCheck |
| GcInAlts | ReturnedTo | yes | 0 | altHeapCheckReturnsTo -> cannedGCReturnsTo -> heapCheck |
comment:30 Changed 4 years ago by
Before delving into this, let's try the simpler approach you suggested!
simon
comment:31 Changed 4 years ago by
Actually, now that I look a bit more carefully, StgCmmMonad.codeOnly may be exactly what is needed here. I'll need to think a bit more to make sure this breaks all of the potential cycles, but I am optimistic.
comment:32 Changed 4 years ago by
I think the issue here is that while on paper the code we emit is independent of our GcPlan, in practice we emit the code in most cases in heapCheck, and how we end up in heapCheck is very much dependent on the GcPlan.
I believe this almost does what we want,
```haskell
maybeAltHeapCheck :: (GcPlan,ReturnKind) -> FCode a -> FCode a
maybeAltHeapCheck gc_plan code = do
    codeOnly $ case gc_plan of
        (NoGcInAlts, _)                      -> code
        (GcInAlts regs, AssignedDirectly)    -> altHeapCheck regs code
        (GcInAlts regs, ReturnedTo lret off) -> altHeapCheckReturnsTo regs regs lret off code
```
The trouble is you have now thrown away the heap usage information from the code you emitted in the NoGcInAlts case.
Alternatively, you can lift the code emission out of heapCheck, but this breaks its nice interface,
```haskell
maybeAltHeapCheck :: (GcPlan,ReturnKind) -> FCode a -> FCode a
maybeAltHeapCheck gc_plan code = do
    codeOnly $ case gc_plan of
        (NoGcInAlts, _) -> return ()
        -- These now only use 'code' to compute the heap usage
        -- and do not emit the statements it produces:
        (GcInAlts regs, AssignedDirectly)    -> altHeapCheck regs code
        (GcInAlts regs, ReturnedTo lret off) -> altHeapCheckReturnsTo regs regs lret off code
    code
```
Anyways, let's see how Reid's approach performs; perhaps this is all irrelevant.
comment:33 Changed 4 years ago by
> Anyways, let's see how Reid's approach performs; perhaps this is all irrelevant
Exactly!
Changed 4 years ago by
Attachment: | hc-nofib.txt added |
---|
comment:34 Changed 4 years ago by
Attached nofib results are for the patch
```diff
         ; simple_scrut <- isSimpleScrut scrut alt_type
         ; let do_gc | not simple_scrut = True
-                    | isSingleton alts = False
-                    | up_hp_usg > 0    = False
-                    | otherwise        = True
+                    | otherwise        = False -- ticket:8326#comment:27
                     -- cf Note [Compiling case expressions]
               gc_plan = if do_gc then GcInAlts alt_regs else NoGcInAlts
```
As can be seen from the fact that very few Module Sizes changed, this actually rarely makes a difference. The reason is that there are two special cases of cgCase that always use NoGcInAlts. (I don't know why they do so, perhaps an oversight.)
- If the case has algebraic alternatives, then either the scrutinee is not simple and we must GcInAlts, or the scrutinee is an application of tagToEnum# and the first special case applies and always uses NoGcInAlts.
- If the case has primitive alternatives, then when the scrutinee is simply a variable, the second special case applies and always uses NoGcInAlts.
So this patch only makes a difference when all of the following hold:
- the case has primitive alternatives
- the scrutinee is an application of a primop that does not allocate, so that it is simple (most primops do not allocate anyway)
- there is more than one alternative
- there is no upstream heap check already
- at least one alternative actually allocates (often CPR analysis has moved an allocation outside of the case)
The combination of the second and third items is fairly rare; it means you are comparing the result of a primop against a constant. A typical example would be testing whether an Int is even or odd (a sketch follows below).
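For concreteness, a hand-written sketch of that shape (ordinary code would just use even; the unboxed form below is roughly what the compiler sees after unpacking the Int):

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (..), Int#, remInt#)

-- A case with primitive alternatives whose scrutinee is a primop
-- application compared against a constant:
isEven :: Int -> Bool
isEven (I# x) = case remInt# x 2# of
                  0# -> True
                  _  -> False
```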
Basically my conclusions are that
- it's acceptable to apply this patch and always allocate outside the case here, since nofib did not find any significant regressions
- it may still be worthwhile to try to make better decisions about whether to do heap checks in the alternatives, but then we should also do so in the special cases of cgCase
comment:35 Changed 4 years ago by
Terrific. Could you try the effect of re-enabling the fix in 8317#comment:2? The code is still in Simplify.hs, just commented out. Look for "Disabled until we fix #8326".
If that works (no perf cost, no binary size increase), then the next step is to remove the special case Note [case on bool] in StgCmmExpr; it will never be used because the Simplify change will catch the case first.
Might you try those two steps? Thanks!
Simon
comment:37 follow-up: 38 Changed 4 years ago by
I ran a nofib benchmark with these changes and there was mostly no effect, aside from a few changes in the plus-or-minus 5-10% range that I could not explain. When I find a larger block of free time I'll return to this subject (should be more pleasant now that ghcspeed will benchmark wip branches).
comment:38 Changed 4 years ago by
comment:39 Changed 4 years ago by
Cc: | osa1 added |
---|
comment:40 Changed 3 years ago by
Cc: | tjakway added |
---|
comment:41 Changed 2 years ago by
Cc: | michalt added |
---|
comment:42 follow-up: 44 Changed 23 months ago by
What is the status of this ticket?
I've tried the patch suggested in comment:34, but my nofib results were quite different, with some clear regressions (below I've removed anything where the difference was <2%):
```
NoFib Results
--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            CSD           -0.4%     0.0%     +7.4%     +7.4%      0.0%
              S           -0.5%     0.0%     +3.2%     +3.3%      0.0%
             VS           -0.5%     0.0%     -4.4%     -4.4%      0.0%
            VSM           -0.5%     0.0%     +8.7%     +8.7%      0.0%
           bspt           -0.5%     0.0%     0.003     0.003    +50.0%
    constraints           -0.4%     0.0%     +2.5%     +2.6%      0.0%
   cryptarithm1           -0.4%     0.0%     +6.0%     +6.0%      0.0%
    exact-reals           -1.2%     0.0%     -2.7%     -2.7%      0.0%
 fannkuch-redux           -0.4%     0.0%     -2.5%     -2.5%      0.0%
          fasta           -0.4%     0.0%     +7.7%     +7.6%      0.0%
   k-nucleotide           -0.8%    +0.0%    +17.9%    +18.0%      0.0%
         lambda           -0.4%     0.0%     -2.2%     -2.1%      0.0%
         linear           -1.2%     0.0%     -5.9%     -5.9%      0.0%
           mate           -0.4%     0.0%     -2.2%     -2.2%      0.0%
         n-body           -0.9%     0.0%     +3.2%     +3.2%      0.0%
--------------------------------------------------------------------------------
            Min           -1.4%    -0.2%     -5.9%     -5.9%      0.0%
            Max           -0.2%    +0.0%    +17.9%    +18.0%    +50.0%
 Geometric Mean           -0.7%    -0.0%     +1.7%     +1.6%     +0.4%
```
I've tried to have a look into k-nucleotide, which slowed down the most. My current understanding is that there's a tight loop within a single function that goes like this:
- A: some computation, eventually goes to B
- B: case on andI# variable 127# (one alternative is the slow path that allocates, the other is the fast path that doesn't); a toy sketch of this loop is given below
- C: alternative that does a bit of computation (but no allocation) and jumps back to A
Now with the change we get a heap check in front of the case, which will now be executed on every iteration and slow everything down.
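For illustration, a self-contained toy loop with the shape just described (hypothetical code, not the actual k-nucleotide source):

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (..), Int#, andI#, (-#))

-- The hot alternative allocates nothing; only the rare
-- (i `andI#` 127# == 0#) path builds boxed data. A heap check
-- hoisted before the case runs on every iteration of the hot path.
loop :: Int# -> [Int] -> [Int]
loop 0# acc = acc
loop i  acc = case andI# i 127# of
                0# -> loop (i -# 1#) (I# i : acc)  -- cold: allocates
                _  -> loop (i -# 1#) acc           -- hot: no allocation
```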
NOTE: I don't have much experience with investigations like this, so the whole analysis might be quite wrong. ;) Please let me know if something seems off. I'll attach the dump of STG/cmm/asm from both versions of k-nucleotide (with and without the patch).
Changed 23 months ago by
Attachment: | knucleotide-master-and-patch-dumps.tar.bz2 added |
---|
STG/cmm/asm dumps of k-nucleotide from the current master and the patched GHC
comment:43 Changed 23 months ago by
What Michael says in comment:42 seems to be exactly what item (3) in the Description is all about. If the hot path does not allocate, then adding an allocation check into the hot path will cost time, even if it reduces binary size.
comment:44 Changed 16 months ago by
Replying to michalt:
> What is the status of this ticket?
>
> I've tried the patch suggested in comment:34, but my results of nofib were quite different, with some clear regressions (below I've removed anything where the difference was <2%)
>
> NOTE: I don't have much experience with investigations like this, so the whole analysis might be quite wrong. ;) Please let me know if something seems off. I'll attach the dump of STG/cmm/asm from both versions of k-nucleotide (with and without the patch).
When I experimented with the order in which we generate uniques, I also got a regression of ~18% for one of the shootout benchmarks; I think it was k-nucleotide but it could have been another one.
So while I don't doubt that there is a regression for k-nucleotide with this patch, it doesn't have to be because the code we generate is worse in the general case. One really has to look at the Asm/Cmm for that.
> it is not clear to me what to do if we have a combination of the above (more than one branch that allocates heap and at least one branch that does not).
In the long run we should do some static analysis to help us determine the hot code path, e.g. distinguishing between:
- Alternatives leading to recursion
- Alternatives being called once.
- Bottoming alternatives.
There are some ideas and work on that in #14672.
comment:45 Changed 12 months ago by
Keywords: | CodeGen added |
---|
comment:46 Changed 11 months ago by
Description: | modified (diff) |
---|
In 53948f915140396acd1b80c6a7a252b2d1e12635/ghc: