Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#8275 closed bug (fixed)

Loopification breaks profiling

Reported by: jstolarek Owned by:
Priority: high Milestone: 7.8.1
Component: Profiling Version: 7.7
Keywords: Cc: leroux@…, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Building GHC failed Test Case:
Blocked By: #8456 Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by ezyang)

Profiling is currently broken in HEAD. Setting BuildFlavour = prof in build.mk leads to a segfault during build. After reverting d61c3ac186c94021c851f7a2a6d20631e35fc1ba (loopification performed in the code generator) everything builds without problems.

I am going to lower the priority since loopification is now no longer by default, and it looks like loopification by default will not make it to GHC 7.8.

Attachments (7)

working-profiling_cmm.txt (194.4 KB) - added by jstolarek 6 years ago.
Cmm dump with loopification patch reverted (profiling works)
broken-profiling_cmm.txt (195.9 KB) - added by jstolarek 6 years ago.
Cmm dump with loopification enabled (profiling broken)
T1735-loopification-input.cmm (4.6 KB) - added by jstolarek 6 years ago.
T1735-loopification-output.cmm (5.9 KB) - added by jstolarek 6 years ago.
T1735-no-loopification-input.cmm (4.7 KB) - added by jstolarek 6 years ago.
T1735-no-loopification-output.cmm (5.7 KB) - added by jstolarek 6 years ago.
0001-Fix-loopification-with-profiling-and-enable-it-by-de.patch (2.2 KB) - added by parcs 6 years ago.

Download all attachments as: .zip

Change History (50)

comment:1 Changed 6 years ago by leroux

Cc: leroux@… added

comment:2 Changed 6 years ago by ezyang

Priority: highhighest

comment:3 Changed 6 years ago by ezyang

Blocking: 8298 added

comment:4 Changed 6 years ago by ezyang

Program received signal SIGSEGV, Segmentation fault.
0x000000000509dced in enterFunCCS (reg=0x64540d8, ccsfn=0x644496000) at rts/Profiling.c:367
367         if (ccsfn->prevStack == CCS_MAIN) {
(gdb) bt
#0  0x000000000509dced in enterFunCCS (reg=0x64540d8, ccsfn=0x644496000) at rts/Profiling.c:367
#1  0x000000000095ad66 in s47v_info ()
#2  0x0000000000000000 in ?? ()

where 0x644496000 is a nonsense address. My guess is that the nodeReg (R1) isn't being restored properly upon the loop back. Some sample code which the loopification optimization triggers would be helpful for debugging; you will probably be able to figure it out by eyeballing the C--.

Last edited 6 years ago by ezyang (previous) (diff)

comment:5 Changed 6 years ago by jstolarek

Loopification triggers for fully-saturated (but not over-saturated!) tail calls. So in this code:

f 0 = 4
f 1 = 5
f n = case f (n - 2) of
        4 -> 4
        5 -> f (n - 1)

If will trigger for call to f (n - 1) in second branch of case, but will not trigger for f (n - 2) in case scrutinee. I checked and this code actually causes segfault when compiled with -prof -fprof-auto -rtsopts (assuming you add main = print (f 5) in the file), but it only happens when f :: Integer -> Integer and not when f :: Int -> Int, so I suspect that the bug might actually be hidden somewhere in the libraries and I might be looking at wrong code.

The idea behind the loopification is that it should put parameters in the local variables (instead of global registers) and make a jump (instead of call). f function begins like this in Cmm:

       cYp:
           _sUP::P64 = R2;
           _sUO::P64 = R1;
           if (%MO_UU_Conv_W32_W64(I32[era]) <= 0) goto cW1; else goto cVZ;
       cVZ:
           I64[R1 + 15] = I64[R1 + 15] & 1152921503533105152 | %MO_UU_Conv_W32_W64(I32[era]) | 1152921504606846976;
           goto cW1;
       cW1:
           if (Sp - 104 < SpLim) goto cYq; else goto cYr;
       cYr:
           Hp = Hp + 40;
           if (Hp > HpLim) goto cYt; else goto cYs

Without loopification tail call will be a normal call:

       cYM:
           I64[CCCS + 72] = I64[CCCS + 72] + %MO_UU_Conv_W64_W64(6 - 2);
           I64[Hp - 40] = sat_sVa_info;
           I64[Hp - 32] = CCCS;
           I64[Hp - 24] = (%MO_UU_Conv_W32_W64(I32[era]) << 30) | 0;
           P64[Hp - 8] = _sUN::P64;
           P64[Hp] = _sUP::P64;
           _cXT::P64 = Hp - 40;
           R2 = _cXT::P64;
           R1 = _sUO::P64;
           Sp = Sp + 40;
           call f1_sUQ_info(R2, R1) args: 8, res: 0, upd: 8;

With loopification we get:

       cYM:
           I64[CCCS + 72] = I64[CCCS + 72] + %MO_UU_Conv_W64_W64(6 - 2);
           I64[Hp - 40] = sat_sV8_info;
           I64[Hp - 32] = CCCS;
           I64[Hp - 24] = (%MO_UU_Conv_W32_W64(I32[era]) << 30) | 0;
           P64[Hp - 8] = _sUL::P64;
           P64[Hp] = _sUP::P64;
           _cXT::P64 = Hp - 40;
           _sUP::P64 = _cXT::P64;
           goto cW2;
       cW2:
           if (Sp - 104 < SpLim) goto uZq; else goto uZp;
       uZq:
           Sp = Sp + 40;
           goto cYq;
       cYq:
           R2 = _sUP::P64;
           R1 = _sUO::P64;
           call (stg_gc_fun)(R2, R1) args: 8, res: 0, upd: 8;
       uZp:
           Sp = Sp + 40;
           goto cYr;

What might be surprising is that value of _sUO is not set before making tail call, but that *seems* to be OK - it is only shuffled between the stack and local variable. Note also that loopified call doesn't jump directly to second label cVZ, but instead it jumps to cYr. In principle this is OK (we want to skip stack check but not heap check), but TBH I can't tell whether in this case that is correct - I don't know what the magical numbers in cVZ do.

Changed 6 years ago by jstolarek

Attachment: working-profiling_cmm.txt added

Cmm dump with loopification patch reverted (profiling works)

Changed 6 years ago by jstolarek

Attachment: broken-profiling_cmm.txt added

Cmm dump with loopification enabled (profiling broken)

comment:6 Changed 6 years ago by ezyang

The magic numbers are for LDV (lag-drag-void) accounting; they are probably irrelevant. No comment for the rest of the code fragments yet.

comment:7 Changed 6 years ago by jstolarek

Another thing I observed is that my example is very sensitive. If you modify it by removing any call to f (for example by changing scrutinee to case n of... or second branch to 5 -> 5) or replace second pattern match in case alternatives with _ (i.e second branch becomes _ -> f (n - 1)) the segfault disappears. I haven't yet examined the differences in Cmm between these versions, but if the problem is caused by code generated for f this should allow us to track the cause. I don't know yet how to debug this issue if the problem is hidden in the libraries :-/

comment:8 Changed 6 years ago by Jan Stolarek <jan.stolarek@…>

In 7a4c15a0ee045ac5d2b48865997c7f5e8697afa3/ghc:

Add flag to control loopification

It is off by default, which is meant to be a workaround for #8275.
Once #8275 is fixed we will enable this option by default.

comment:9 Changed 6 years ago by jstolarek

Edward, I disabled loopification by default - this should allow you to use profiling. I noticed that enabling loopification for my test program does not trigger segfault. So the problem was in the libraries, not in code snippet that I uploaded here.

comment:10 Changed 6 years ago by ezyang

OK, makes sense; I couldn't find anything obviously wrong in your example. My suggestion is to refactor it so the library dep goes away, and then we'll know what's up.

Changed 6 years ago by jstolarek

Changed 6 years ago by jstolarek

Changed 6 years ago by jstolarek

Changed 6 years ago by jstolarek

comment:11 Changed 6 years ago by jstolarek

Edward, there is some progress. There is a single test in the testsuite that fails when loopification is turned on, but works if it is off. That test is T1735. Assuming that you built GHC with profiling libraries you can run this test from testsuite/tests with:

make WAY=normal EXTRA_HC_OPTS="-prof -fprof-auto -rtsopts" TEST=T1735

and it should pass. Including -floopification on the list of extra options should make the test fail. Examining the Cmm dumps is a pain - they are over 200k lines - but I found a single place where loopification occurs. Attached are the dumps of this single function with and without loopification. -input suffix denotes code produced by the code generator, -output suffix denotes code after it has gone through all optimisations in Cmm pipeline.

If you look at the output code you will notice one incorrect transformation in the loopified version: stack check gets duplicated and inserted at the end of function, right before making a tail call. There are two issues here:

1) Correctness. Cmm pipeline performs an incorrect transformation on a valid Cmm input program. It duplicates stack check at the end of a function, which violates assumption that stack checks are made only at the entry to a function (see CmmLayoutStack, line 208). Putting stack check at the end makes it invalid, because we set required amount of stack to be a fixed value. As a result at the end of a function we are checking for more stack that we actually need (becuase we have moved the stack pointer). Simon proposes to fix that by referring to <old + 0> instead of Sp (before stack layout). To me it is not clear whether this stack check is direct cause of a segfault, though it certainly is not correct.

2) Performance. Putting stack check right before making a call makes no sense. This is probably the direct result of control flow optimisations pass, but there is a deeper problem here. When making a loopified tail call we want to avoid re-doing the stack check because it is not necessary to do it again, but we want to perform heap check. So loopified tail call should make a jump in between those two. Currently generation of stack and heap checks is closely coupled with each other in the code generator and it will probably take some work to separate them. Surprisingly, sometimes we get good code (loopification jumps between stack and heap checks) and sometimes we don't. Neither I nor Simon know why this happens - I need to look at the code and solve that mystery.

I wasn't able to figure out one thing. The loopified tail call omits one block of code related to profiling (the one with magic numbers). I don't know whether this is correct or not. Edward, are you able to tell whether this can be done or not?

comment:12 Changed 6 years ago by ezyang

If the C-- dump is too large, minimize the test case! Then you can look at each of the phases and see which one is inserting the extra stack check. It sounds like you have almost figured it out though.

As for whether or not the LDV code needs to be done again, it depends on whether node (R1) changes when looping. If it does, yes, you need to run the LDV code, since the closure needs to be marked as used. Otherwise it will be idempotent, and you only need to run it once.

This also reminds me, please make sure loopification interacts properly with -falways-yields (which guarantees that Haskell code will eventually yields); it sounds like either loopification must be disabled if always-yields is enabled, or the loop-back needs to insert a heap check for preemption purposes. And if a loop can be preempted, then the era can change, and yes, you need to run the LDV check after the preemption.

comment:13 in reply to:  12 Changed 6 years ago by jstolarek

Replying to ezyang:

As for whether or not the LDV code needs to be done again, it depends on whether node (R1) changes when looping.

Yes, R1 does change inside the loop.

the loop-back needs to insert a heap check for preemption purposes. And if a loop can be preempted, then the era can change, and yes, you need to run the LDV check after the preemption.

I don't fully understand this. What is preemption? If upon entry to the loop I will do the LDV code and heap check, will this ensure correctnes of profiling and -falways-yields? Is this described on the wiki somewhere (preemption and LDV)?

comment:14 Changed 6 years ago by jstolarek

Owner: jstolarek deleted

OK, I know what preemption is. There are three steps that need to be done:

1) Fix generation of stack checks, i.e. fix correctness (this should be simple, I know how to do this).

2) Make sure that stack check doesn't get duplicated, i.e. fix performance

3) Make sure that loopification jumps before LDV code and hope that this solves the problem.

There is one thing that worries me. We want the loop to begin after the stack check but before the heap check. Now if LDV code is placed before the stack check then of course this requirement cannot be fulfilled. I'm not sure how to handle such situation.

Anyway, I'm leaving MSR today and my work on this ticket will be put on hold for a couple of weeks. If anyone wants to take over feel free. I'm unassigning myself for the time being.

comment:15 Changed 6 years ago by jstolarek

I actually already implemented 1), but the implementation is only a proof-of-concept. It's available here, but it need testing, cleanup (we can just drop sp_high and use final_hwn instead and perhaps entry_args are not used anymore?) and improving constantFoldMachOpM in CmmOpt, so that if ((Sp + 56) - 96 < SpLim) goto u4yi; else goto u4yh; gets constant folded properly.

comment:16 Changed 6 years ago by ezyang

Description: modified (diff)
Milestone: 7.10.1
Priority: highesthigh

We want the loop to begin after the stack check but before the heap check. Now if LDV code is placed before the stack check then of course this requirement cannot be fulfilled.

One way to deal with this, at the cost of increased code size, is to place a second copy of the LDV code in the function epilogue right before the loopification jump back.

However, it might be OK to move the LDV check after the stack/heap checks. Simon Marlow might be able to better say.

comment:17 Changed 6 years ago by simonmar

Cc: simonmar added

comment:18 Changed 6 years ago by jstolarek

Milestone: 7.10.17.8.1

I'm setting this to milestone 7.8.1 - we certainly don't want that breakage released. I will be fairly busy for the next 2-3 weeks, but I should have time in November to look into this.

comment:19 Changed 6 years ago by Jan Stolarek <jan.stolarek@…>

In 94125c97e49987e91fa54da6c86bc6d17417f5cf/ghc:

Generate (old + 0) instead of Sp in stack checks

When compiling a function we can determine how much stack space it will
use. We therefore need to perform only a single stack check at the beginning
of a function to see if we have enough stack space. Instead of referring
directly to Sp - as we used to do in the past - the code generator uses
(old + 0) in the stack check. Stack layout phase turns (old + 0) into Sp.

The idea here is that, while we need to perform only one stack check for
each function, we could in theory place more stack checks later in the
function. They would be redundant, but not incorrect (in a sense that they
should not change program behaviour). We need to make sure however that a
stack check inserted after incrementing the stack pointer checks for a
respectively smaller stack space. This would not be the case if the code
generator produced direct references to Sp. By referencing (old + 0) we make
sure that we always check for a correct amount of stack: when converting
(old + 0) to Sp the stack layout phase takes into account changes already
made to stack pointer. The idea for this change came from observations made
while debugging #8275.

comment:20 Changed 6 years ago by jstolarek

Owner: set to jstolarek

Resuming work on this.

comment:21 Changed 6 years ago by jstolarek

(Comment redirected to #8456.) I believe I know the reason for duplicating the stack check. There is at least one bug in CmmContFlowOpt module. Consider this:

L1: goto L2
L2: whatever
L3: goto L1

We are processing blocks from the end. When we reach L3 first guard in maybe_concat function (CmmContFlowOpt.hs, line 123) will turn that blocks into:

L1: goto L2
L2: whatever
L3: goto L2

However, the number of predecessors of L2 block is not updated because backEdges is computed once before we run maybe_concat and is not updated when we make changes to the block structure.

Another issue, which I have not yet encountered but I believe may arise as well, comes from the fact that we may map one label to a different one, but again we don't that take into account when determining the number of predecessors. I wrote a fix for that sometime in July, but I did not merged into HEAD - I guess I'll do that now.

Last edited 6 years ago by simonpj (previous) (diff)

comment:22 Changed 6 years ago by simonmar

(Comment redirected to #8456.) Yes, good catch. I wonder if the right way to fix these issues is to not shortcut or concat a jump when the jump destination has not already been processed. This might be easier than keeping track of the predecessor counts accurately, and should give results that are just as good.

Last edited 6 years ago by simonpj (previous) (diff)

comment:23 Changed 6 years ago by jstolarek

Blocked By: 8456 added

comment:24 Changed 6 years ago by simonpj

It's absolutely right to refer to (old+0).

But I don't understand the patch (in comment 19), to CmmLayoutStack. The change made there is simply to pass a different (larger) value to manifestSp. And that's the only change to that module. So we'll generate different, wrong code. I don't see how this can possibly work. And yet apparently it does. What am I missing?

(The change to StgCmmHeap is different, and is fine.)

Simon

comment:25 Changed 6 years ago by simonpj

Stuff about CmmContFlowOpt has moved to #8456

Last edited 6 years ago by simonpj (previous) (diff)

comment:26 in reply to:  24 ; Changed 6 years ago by jstolarek

Replying to simonpj:

But I don't understand the patch (in comment 19), to CmmLayoutStack. The change made there is simply to pass a different (larger) value to manifestSp. And that's the only change to that module. So we'll generate different, wrong code. I don't see how this can possibly work. And yet apparently it does. What am I missing?

I'll try to explain this with an example of function that uses no stack space. Previously we generated stack checks that looked like this:

if (Sp - <highSp> < SpLim) ...

If the function used no stack, then stack layout (areaToSp function, second guard) transformed this into:

if (Sp - 0 < SpLim) ...

This check is always false and it used to be eliminated by the third guard in areaToSp. The value of <highSp> is computed to be 0 by the line of code that I removed in CmmLayoutStack.

Now consider new version. Code generator gives us:

if ((old + 0) - <highSp> < SpLim) ...

As you pointed out, <highSp> will now be larger. But (old + 0) will not be Sp! This check will be turned into something this:

if ((Sp + 8) - 8 < SpLim) ...

That is the reason why the elimination of always false stack checks was killed by my change.

BTW. Simon, you wrote the code for that patch when I was in Cambridge :-) Do you remember how we (well, mostly you) updated this wiki page and how we were confused with addressing offsets into an Old area?

comment:27 Changed 6 years ago by jstolarek

Simon, does my explanation make sense?

I fixed the duplicate stack check, but loopification still causes segfault with T1735.

comment:28 Changed 6 years ago by simonpj

Yes, that makes sense thank you. I was forgetting about the late-bound constant CmmHighStackMark, which manifestSp manifests along with the areas etc. I've added yet more comments!

Disappointing that you still get the seg fault. I hope you make progress in tracking it down.

Simon

comment:29 Changed 6 years ago by jstolarek

Owner: jstolarek deleted

Sadly, it quite possible that I won't have time to work on this for the next week (or even two, if things go bad). I'm unassigning myself from this ticket for the moment. If anyone wishes to take over the next thing to do is ensuring that loopified call jumps to the correct cmm block: after the stack check, but before LDV code and heap check.

Janek

comment:30 in reply to:  26 Changed 6 years ago by simonmar

Replying to jstolarek:

BTW. Simon, you wrote the code for that patch when I was in Cambridge :-) Do you remember how we (well, mostly you) updated this wiki page and how we were confused with addressing offsets into an Old area?

I am afraid that wiki page was not updated when I rewrote the stack layout algorithm, and it's woefully out of date. In particular there are no RegSlots any more.

comment:31 Changed 6 years ago by simonmar

Blocking: 8298 removed

comment:32 in reply to:  9 Changed 6 years ago by parcs

Status: newpatch

I have attached a patch that fixes loopification with profiling. The problem is that nodeReg aka R1 may get clobbered inside the body of a closure. With profiling, upon entering a closure, enterFunCCS is called with R1 passed to it. But since R1 may get clobbered inside the body of a closure, and since a self-recursive tail call does not restore R1, a subsequent call to enterFunCCS may be passed a bogus value from R1.

The solution is to not pass nodeReg aka R1 to enterFunCCS. Instead, we pass node, the callee-saved temporary that stores the original value of R1. This way R1 may get clobbered and loopification will not care.

I don't know whether it's necessary to re-run the LDV code after looping. Not doing so certainly does not cause a segfault, anyway.

comment:33 Changed 6 years ago by jstolarek

Thanks Patrick! Currently I'm without my main laptop so I can't verify your patch - I will do it in Wednesday, unless someone else can do it before.

comment:34 in reply to:  33 Changed 6 years ago by parcs

Replying to jstolarek:

Thanks Patrick! Currently I'm without my main laptop so I can't verify your patch - I will do it in Wednesday, unless someone else can do it before.

Okay, great.

And here's an ugly PoC patch that moves the self-loop label below the stack check:

  • compiler/codeGen/StgCmmBind.hs

    diff --git a/compiler/codeGen/StgCmmBind.hs b/compiler/codeGen/StgCmmBind.hs
    index 16477c8..029b8af 100644
    a b closureCodeBody top_lvl bndr cl_info cc args arity body fv_details 
    481481                -- of a self-recursive tail call. See Note
    482482                -- [Self-recursive tail calls] in StgCmmExpr
    483483                ; loop_header_id <- newLabelC
    484                 ; emitLabel loop_header_id
    485484                -- Extend reader monad with information that
    486485                -- self-recursive tail calls can be optimized into local
    487486                -- jumps
  • compiler/codeGen/StgCmmHeap.hs

    diff --git a/compiler/codeGen/StgCmmHeap.hs b/compiler/codeGen/StgCmmHeap.hs
    index 55ddfd4..f01d042 100644
    a b do_checks mb_stk_hwm checkYield mb_alloc_lit do_gc = do 
    615615    Nothing -> return ()
    616616    Just stk_hwm -> tickyStackCheck >> (emit =<< mkCmmIfGoto (sp_oflo stk_hwm) gc_id)
    617617
     618  self_loop_info <- getSelfLoop
     619  case self_loop_info of
     620    Just (_,loop_header_id,_)
     621        | checkYield && isJust mb_stk_hwm -> emitLabel loop_header_id
     622    _otherwise -> return ()
     623
    618624  if (isJust mb_alloc_lit)
    619625    then do
    620626     tickyHeapCheck

comment:35 Changed 6 years ago by jstolarek

Owner: set to jstolarek

OK, I got my laptop back :-) Patch validates, but before I push it I want to make a few more tests to see that it does the right thing.

And here's an ugly PoC patch that moves the self-loop label below the stack check:

Could you explain how and why this works? I am mostly interested in knowing how does it guarantee to place self-loop label after stack check but before heap check. I don't have more time to investigate this at the moment but if the patch is ready and working it would be good to have it pushed. Once I push your patch for loopification I'll close this ticket and start a new one for moving self-loop label after the stack check.

Last edited 6 years ago by jstolarek (previous) (diff)

comment:36 Changed 6 years ago by Jan Stolarek <jan.stolarek@…>

In adb9964e2f97338501411282c0bb6a9f47a56b1b/ghc:

Fix loopification with profiling and enable it by default (#8275)

comment:37 Changed 6 years ago by jstolarek

Resolution: fixed
Status: patchclosed

OK, I see how the loopification patch works. Previously we had:

c4AC:
    call "ccall" arg hints:  [PtrHint,
                              PtrHint]  result hints:  [] enterFunCCS(BaseReg, I64[R1 - 1 + 8]);

Now we have:

c4AC:
    call "ccall" arg hints:  [PtrHint,
                              PtrHint]  result hints:  [] enterFunCCS(BaseReg, I64[_s2UW::P64 - 1 + 8]);

The _s2UW::P64 is declared anyway so no extra work required :-)

comment:38 Changed 6 years ago by hvr

btw, shall we mention the new -floopification flag in the 7.8.1 release notes?

comment:39 Changed 6 years ago by Jan Stolarek <jan.stolarek@…>

In 6d24076be9aff562cd449aa1d39b8eb14638bcd6/ghc:

Document solution to #8275

comment:40 Changed 6 years ago by simonmar

Owner: jstolarek deleted
Resolution: fixed
Status: closednew

The fix looks right to me, thanks. (we should never be referring to global regs explicity except during copy-in or copy-out, so it was just a bug before.)

I don't know whether it's necessary to re-run the LDV code after looping. Not doing so certainly does not cause a segfault, anyway.

Yes, it is necessary to do this. Remember that looping should behave in exactly the same way as making a recursive call to the function. Could someone look into this please?

comment:41 Changed 6 years ago by Patrick Palka <patrick@…>

In ac31b79924e9330c577ea63af9b0f10f4fc6c8f6/ghc:

Move the LDV code below the self-loop label (#8275)

comment:42 in reply to:  40 Changed 6 years ago by parcs

Replying to simonmar:

The fix looks right to me, thanks. (we should never be referring to global regs explicity except during copy-in or copy-out, so it was just a bug before.)

I don't know whether it's necessary to re-run the LDV code after looping. Not doing so certainly does not cause a segfault, anyway.

Yes, it is necessary to do this. Remember that looping should behave in exactly the same way as making a recursive call to the function. Could someone look into this please?

Okay, I optimistically pushed a pair of commits to make the LDV coderun after looping. It involved eliminating another explicit use of nodeReg and then moving the LDV code below the self-loop label. Is this fine?

comment:43 Changed 6 years ago by simonmar

Resolution: fixed
Status: newclosed

Ok, looks good, thanks. I'll re-open if I find any problems.

Note: See TracTickets for help on using tickets.