Opened 13 months ago

Last modified 6 months ago

#15508 new bug

concprog001 fails with various errors

Reported by: osa1 Owned by: osa1
Priority: highest Milestone: 8.6.4
Component: Compiler Version: 8.5
Keywords: Cc: alpmestan, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: #15571 Differential Rev(s): Phab:D5051 (reverted), Phab:D5165, Phab:D5178, MR:103, MR:146
Wiki Page:

Description (last modified by osa1)

I'm observing a few different runtime errors. I'm not sure if they're because of different bugs so I'm filing one ticket for now.

The reproduce with GHC HEAD:

prog001 git:(master) $ ghc-stage2 Mult.hs -prof -fprof-auto -fforce-recomp -threaded
[1 of 7] Compiling Stream           ( Stream.hs, Stream.o )
[2 of 7] Compiling Converter        ( Converter.hs, Converter.o )
[3 of 7] Compiling Thread           ( Thread.hs, Thread.o )
[4 of 7] Compiling Utilities        ( Utilities.hs, Utilities.o )
[5 of 7] Compiling Trit             ( Trit.hs, Trit.o )
[6 of 7] Compiling Arithmetic       ( Arithmetic.hs, Arithmetic.o )
[7 of 7] Compiling Main             ( Mult.hs, Mult.o )
Linking Mult ...

prog001 git:(master) $ ./Mult +RTS -N2
Mult: internal error: scavenge_one: strange object 624722688
    (GHC version 8.7.20180809 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
zsh: abort (core dumped)  ./Mult +RTS -N2

prog001 git:(master) $ ./Mult +RTS -N2
Mult: internal error: scavenge_stack: weird activation record found on stack: 0
    (GHC version 8.7.20180821 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
zsh: abort (core dumped)  ./Mult +RTS -N2

prog001 git:(master) $ ./Mult +RTS -N2
zsh: segmentation fault (core dumped)  ./Mult +RTS -N2

It's very easy to trigger other kind of panics in concprog001, just try different compile and runtime flag combinations. Note that for the examples below you'll need debug + profiling or debug + profiling + threaded runtimes, which are not built by default. To build those apply this patch:

diff --git a/mk/config.mk.in b/mk/config.mk.in
index 11050120d4..f083abad22 100644
--- a/mk/config.mk.in
+++ b/mk/config.mk.in
@@ -297,6 +297,7 @@ GhcRTSWays=l
 
 # Usually want the debug version
 GhcRTSWays += debug
+GhcRTSWays += thr_debug_p debug_p
 
 # We always have the threaded versions, but note that SMP support may be disabled
 # (see GhcWithSMP).
diff --git a/rts/ghc.mk b/rts/ghc.mk
index 532c9aa175..ff3f18f30c 100644
--- a/rts/ghc.mk
+++ b/rts/ghc.mk
@@ -329,7 +329,6 @@ WARNING_OPTS += -Wstrict-prototypes
 WARNING_OPTS += -Wmissing-prototypes 
 WARNING_OPTS += -Wmissing-declarations
 WARNING_OPTS += -Winline
-WARNING_OPTS += -Waggregate-return
 WARNING_OPTS += -Wpointer-arith
 WARNING_OPTS += -Wmissing-noreturn
 WARNING_OPTS += -Wnested-externs

Examples:

prog001 git:(master) $ ghc-stage2 Mult.hs -prof -fprof-auto -fforce-recomp -debug
[1 of 7] Compiling Stream           ( Stream.hs, Stream.o )
[2 of 7] Compiling Converter        ( Converter.hs, Converter.o )
[3 of 7] Compiling Thread           ( Thread.hs, Thread.o )
[4 of 7] Compiling Utilities        ( Utilities.hs, Utilities.o )
[5 of 7] Compiling Trit             ( Trit.hs, Trit.o )
[6 of 7] Compiling Arithmetic       ( Arithmetic.hs, Arithmetic.o )
[7 of 7] Compiling Main             ( Mult.hs, Mult.o )
Linking Mult ...

prog001 git:(master) $ ./Mult +RTS -DS
Mult: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 210

    (GHC version 8.7.20180809 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
zsh: abort (core dumped)  ./Mult +RTS -DS

(I tried to fix this in Phab:D5051 but it's causing a segfault in test T11108 when run with profiling. Not sure what the problem is yet.)

Another way:

prog001 git:(master) $ ghc-stage2 Mult.hs -prof -fprof-auto -fforce-recomp -threaded
[1 of 7] Compiling Stream           ( Stream.hs, Stream.o )
[2 of 7] Compiling Converter        ( Converter.hs, Converter.o )
[3 of 7] Compiling Thread           ( Thread.hs, Thread.o )
[4 of 7] Compiling Utilities        ( Utilities.hs, Utilities.o )
[5 of 7] Compiling Trit             ( Trit.hs, Trit.o )
[6 of 7] Compiling Arithmetic       ( Arithmetic.hs, Arithmetic.o )
[7 of 7] Compiling Main             ( Mult.hs, Mult.o )
Linking Mult ...

prog001 git:(master) $ ./Mult +RTS -N2
zsh: segmentation fault (core dumped)  ./Mult +RTS -N2

Yet another way:

prog001 git:(master) $ ghc-stage2 Mult.hs -prof -fprof-auto -fforce-recomp -threaded -debug
[1 of 7] Compiling Stream           ( Stream.hs, Stream.o )
[2 of 7] Compiling Converter        ( Converter.hs, Converter.o )
[3 of 7] Compiling Thread           ( Thread.hs, Thread.o )
[4 of 7] Compiling Utilities        ( Utilities.hs, Utilities.o )
[5 of 7] Compiling Trit             ( Trit.hs, Trit.o )
[6 of 7] Compiling Arithmetic       ( Arithmetic.hs, Arithmetic.o )
[7 of 7] Compiling Main             ( Mult.hs, Mult.o )
Linking Mult ...

prog001 git:(master) $ ./Mult +RTS -N2
Mult: internal error: invalid closure, info=0x947edc
    (GHC version 8.7.20180809 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
zsh: abort (core dumped)  ./Mult +RTS -N2

It seems to work fine when not compiled for profiling so marking this bug as a profiler bug.

This program fails sanity checks with non-threaded and non-profiling runtime too:

prog001 git:(master) $ ghc-stage2 Mult.hs -fforce-recomp -debug -rtsopts
[1 of 7] Compiling Stream           ( Stream.hs, Stream.o )
[2 of 7] Compiling Converter        ( Converter.hs, Converter.o )
[3 of 7] Compiling Thread           ( Thread.hs, Thread.o )
[4 of 7] Compiling Utilities        ( Utilities.hs, Utilities.o )
[5 of 7] Compiling Trit             ( Trit.hs, Trit.o )
[6 of 7] Compiling Arithmetic       ( Arithmetic.hs, Arithmetic.o )
[7 of 7] Compiling Main             ( Mult.hs, Mult.o )
Linking Mult ...

prog001 git:(master) $ ./Mult +RTS -DS
Mult: internal error: checkClosure: stack frame
    (GHC version 8.7.20180821 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
zsh: abort (core dumped)  ./Mult +RTS -DS

Change History (49)

comment:1 Changed 13 months ago by osa1

Description: modified (diff)

comment:2 Changed 12 months ago by bgamari

Priority: normalhigh
Type of failure: None/UnknownRuntime crash

comment:3 Changed 12 months ago by Ben Gamari <ben@…>

In 2693eb1/ghc:

Properly tag fun field of PAPs generated by ap_0_fast

Currently ap_0_fast doesn't maintain the invariant for PAP fun fields
which says if the closure can be tagged, it should be. This is checked
by `Sanity.c:checkPAP` and correctly implemented by `genautoapply`.

This causes sanity check failures when we have a profiling code like

    f = {-# SCC scc #-} g

where g is a PAP or a FUN, and `scc` is different than the current cost
centre.

Test Plan: Slow validate (not done yet)

Reviewers: simonmar, bgamari, erikd

Reviewed By: simonmar

Subscribers: rwbarton, carter

GHC Trac Issues: #15508

Differential Revision: https://phabricator.haskell.org/D5051

comment:4 Changed 12 months ago by bgamari

Status: newmerge

comment:5 Changed 12 months ago by osa1

bgamari, the commit above should be reverted as it breaks another program (not caught by the test suite, see my comments in the diff).

comment:6 Changed 12 months ago by osa1

Status: mergenew

Reverted the commit with ab55b4ddb7.

Last edited 12 months ago by osa1 (previous) (diff)

comment:7 Changed 12 months ago by osa1

Description: modified (diff)

comment:8 Changed 12 months ago by osa1

So, as shown in the description this program crashes in many ways. I started debugging with one of those errors, and found out that at some point cap->r.rCCCS (a capability's "current cost centre" stack) actually points to a heap closure instead of a cost centre stack:

>>> print cap->r.rCCCS
$13 = (struct CostCentreStack_ *) 0x4200213000

>>> call printClosure(cap->r.rCCCS)
integer-gmp:GHC.Integer.Type.S#((nil)#)

Then when this capability does allocation accountAllocation overwrites a info table pointer in this line

CCS_ALLOC(cap->r.rCCCS,n); // Storage.c:802

CCS_ALLOC defined as:

#define CCS_ALLOC(ccs, size) (ccs)->mem_alloc += ((size)-sizeofW(StgProfHeader))

mem_alloc is at this address:

>>> print &cap->r.rCCCS->mem_alloc
$15 = (StgWord64 *) 0x4200213048

Which is also the info ptr of another closure:

>>> print &((StgClosure*)0x4200213048)->header.info
$20 = (const StgInfoTable **) 0x4200213048

Originally this closure is;

>>> call printClosure(0x4200213048)
BLACKHOLE(0x42000dfa40)

So this assignment causes a memory corruption.

Note that I'm using +RTS -DS during all this so any heap location that is not filled with 0 are actually in use. (collected space is filled with 0s with -DS)

comment:9 Changed 12 months ago by osa1

I wonder what would be a good sanity check for cost centre fields, e.g. a function void checkCCS(CostCentreStack *ccs) that can be used in checkClosure and to check cap->r.rCCCS maybe before every GC.

comment:10 Changed 12 months ago by osa1

Component: ProfilingCompiler
Description: modified (diff)

While messing around with different flags I realized that we don't need profiling or threaded flags, it fails some sanity checks in non-threaded non-profiled runtime. Updated the description and reverted the component from profiler to compiler (not sure if this is a codegen or RTS issue).

comment:11 Changed 12 months ago by osa1

Summary: concprog001 fails with various errors when compiled with -profconcprog001 fails with various errors

comment:12 Changed 12 months ago by osa1

comment:13 Changed 12 months ago by osa1

Owner: set to osa1

comment:14 Changed 11 months ago by simonmar

Just came across this ticket. It looks like the problems fall into two categories:

  • There are segfaults and various barfs when the test is compiled with -prof
  • There is a sanity check failure when the test is compiled without -prof, but no segfaults/barfs

Is that right?

comment:15 Changed 11 months ago by osa1

That's right. In my testing this program does not fail if I don't use debug or prof runtimes.

I think the debug runtime issue is #15571. I don't know if there are other problems though.

comment:16 Changed 11 months ago by osa1

Differential Rev(s): Phab:D5051Phab:D5051 (reverted), Phab:D5165

Fixing one bug reveals others.

I'm currently getting this

Mult: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 296
    (GHC version 8.7.20180919 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

The stack trace:

>>> bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f7bf5709801 in __GI_abort () at abort.c:79
#2  0x00000000007313b5 in rtsFatalInternalErrorFn (s=0x79fd90 "ASSERTION FAILED: file %s, line %u\n", ap=0x7ffe92feb8b8) at rts/RtsMessages.c:186
#3  0x0000000000730fd9 in barf (s=0x79fd90 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:48
#4  0x000000000073103e in _assertFail (filename=0x7a55f0 "rts/sm/Sanity.c", linenum=296) at rts/RtsMessages.c:63
#5  0x0000000000751be3 in checkClosure (p=0x42008e45e0) at rts/sm/Sanity.c:295
#6  0x00000000007521b0 in checkHeapChain (bd=0x4200803900) at rts/sm/Sanity.c:450
#7  0x0000000000752bf1 in checkGeneration (gen=0x10ec408, after_major_gc=false) at rts/sm/Sanity.c:745
#8  0x0000000000752cc1 in checkFullHeap (after_major_gc=false) at rts/sm/Sanity.c:764
#9  0x0000000000752d3b in checkSanity (after_gc=false, major_gc=false) at rts/sm/Sanity.c:773
#10 0x000000000074e99b in GarbageCollect (collect_gen=0, do_heap_census=false, gc_type=0, cap=0x90e1c0 <MainCapability>, idle_cap=0x0) at rts/sm/GC.c:320
#11 0x0000000000734141 in scheduleDoGC (pcap=0x7ffe92febcb0, task=0x10fdd50, force_major=false) at rts/Schedule.c:1800
#12 0x0000000000733619 in schedule (initialCapability=0x90e1c0 <MainCapability>, task=0x10fdd50) at rts/Schedule.c:546
#13 0x0000000000734b13 in scheduleWaitThread (tso=0x4200105388, ret=0x0, pcap=0x7ffe92febdb0) at rts/Schedule.c:2537
#14 0x0000000000743205 in rts_evalLazyIO (cap=0x7ffe92febdb0, p=0x7d5710, ret=0x0) at rts/RtsAPI.c:530
#15 0x000000000074394c in hs_main (argc=1, argv=0x7ffe92febfa8, main_closure=0x7d5710, rts_config=...) at rts/RtsMain.c:72
#16 0x000000000042fd6e in main ()

The assertion that fails:

ASSERT(bq->queue == (MessageBlackHole*)END_TSO_QUEUE
       || bq->queue->header.info == &stg_MSG_BLACKHOLE_info);

bq->queue is actually an IND:

>>> print bq->queue
$1 = (struct MessageBlackHole_ *) 0x42001df098
>>> call printClosure(bq->queue)
IND(0x42001de350)

There's a note about this in Evac.c:

// Note [BLACKHOLE pointing to IND]
//
// BLOCKING_QUEUE can be overwritten by IND (see
// wakeBlockingQueue()). However, when this happens we must
// be updating the BLACKHOLE, so the BLACKHOLE's indirectee
// should now point to the value.
//
// The mutator might observe an inconsistent state, because
// the writes are happening in another thread, so it's
// possible for the mutator to follow an indirectee and find
// an IND. But this should never happen in the GC, because
// the mutators are all stopped and the writes have
// completed.

It seems like this currently does not hold. We're in GC and observe that a BLOCKING_QUEUE is actually an IND. Adding watchpoint to see when it became an IND:

>>> watch ((StgBlockingQueue *) 0x42008e45e0)->queue->header.info
Hardware watchpoint 1: ((StgBlockingQueue *) 0x42008e45e0)->queue->header.info
>>> reverse-continue
>>> bt
#0  0x00000000007406a0 in SET_INFO (c=0x42001df098, info=0x763e08 <stg_IND_info>) at includes/rts/storage/ClosureMacros.h:49
#1  0x0000000000740dfe in throwToMsg (cap=0x90e1c0 <MainCapability>, msg=0x42000cc180) at rts/RaiseAsync.c:412
#2  0x0000000000740905 in throwTo (cap=0x90e1c0 <MainCapability>, source=0x420040e438, target=0x42008bcc40, exception=0x7df158) at rts/RaiseAsync.c:213
#3  0x00000000007613c2 in stg_killThreadzh ()
#4  0x0000000000000000 in ?? ()

I'm not sure when this object is supposed to become a BLACKHOLE (or END_TSO_QUEUE) again. Simon, any ideas? I'm also confused why a note about blackholes talk about BLOCKING_QUEUE...

comment:17 Changed 11 months ago by simonmar

Yes, the ASSERT is wrong here.

Note [BLACKHOLE pointing to IND] is talking about when a BLOCKING_QUEUE gets overwritten by IND, but this case is different: the ASSERT is looking at bq->queue which is normally a MSG_BLACKHOLE. A MSG_BLACKHOLE can be revoked by overwriting with IND, as per the code in throwToMsg that you found. This is all fine, so the ASSERT needs to account for bq->queue being an IND instead.

comment:18 Changed 11 months ago by osa1

This is easy to fix, but I'm trying to understand the code:

// Revoke the message by replacing it with IND. We're not
// locking anything here, so we might still get a TRY_WAKEUP
// message from the owner of the blackhole some time in the
// future, but that doesn't matter.
ASSERT(target->block_info.bh->header.info == &stg_MSG_BLACKHOLE_info);
OVERWRITE_INFO(target->block_info.bh, &stg_IND_info);
raiseAsync(cap, target, msg->exception, false, NULL);
return THROWTO_SUCCESS;

(in throwToMsg())

Why we can make a MessageBlackHole an IND without setting the indirectee? Is this used to remove (by skipping) a MessageBlackHole from the queue? (because actually removing would be impossible as we don't have a prev field in MessageBlackHole?)

I'm quite confused about how these two types are used:

typedef struct StgBlockingQueue_ {
    StgHeader   header;
    struct StgBlockingQueue_ *link; // here so it looks like an IND
    StgClosure *bh;  // the BLACKHOLE
    StgTSO     *owner;
    struct MessageBlackHole_ *queue;
} StgBlockingQueue;

typedef struct MessageBlackHole_ {
    StgHeader   header;
    struct MessageBlackHole_ *link;
    StgTSO     *tso;
    StgClosure *bh;
} MessageBlackHole;

As far as I understand, a BLACKHOLE can become a BLOCKING_QUEUE (in messageBlackHole()), and queue of a BLOCKING_QUEUE is for threads that are blocked on this BLACKHOLE. But then

  • Why do we have a bh field in messageBlackHole? We could pass the bh as a parameter to messageBlackHole() and bh would the blackhole that we just looked at to find the MessageBlackHole.
  • Why do we need a list of BLOCKING_QUEUEs? Is this only to be able to implement checkBlockingQueues()? The comments around checkBlockingQueues() say
// If we update a closure that we know we BLACKHOLE'd, and the closure
// no longer points to the current TSO as its owner, then there may be
// an orphaned BLOCKING_QUEUE closure with blocked threads attached to
// it.  We therefore traverse the BLOCKING_QUEUEs attached to the
// current TSO to see if any can now be woken up.

but I don't understand how can owner of a BLACKHOLE not point to our capability if we BLACKHOLEd it.

Could you say a few workds about this?

I'll submit a patch but I don't know what to say in the commit message as I don't understand this code ...

comment:19 Changed 11 months ago by simonmar

Why we can make a MessageBlackHole an IND without setting the indirectee? Is this used to remove (by skipping) a MessageBlackHole from the queue? (because actually removing would be impossible as we don't have a prev field in MessageBlackHole?)

Yes, exactly.

As far as I understand, a BLACKHOLE can become a BLOCKING_QUEUE (in messageBlackHole()), and queue of a BLOCKING_QUEUE is for threads that are blocked on this BLACKHOLE.

Just to be clear, A BLACKHOLE doesn't "become" a BLOCKING_QUEUE, the BLACKHOLE's indirectee field points to the BLOCKING_QUEUE.

Why do we have a bh field in messageBlackHole? We could pass the bh as a parameter to messageBlackHole() and bh would the blackhole that we just looked at to find the MessageBlackHole.

A MessageBlackHole is sent to another capability, which needs to know which BLACKHOL to add it to.

Why do we need a list of BLOCKING_QUEUEs? Is this only to be able to implement checkBlockingQueues()? The comments around checkBlockingQueues() say

See this comment in messageBlackHole():

        // All BLOCKING_QUEUES are linked in a list on owner->bq, so
        // that we can search through them in the event that there is
        // a collision to update a BLACKHOLE and a BLOCKING_QUEUE
        // becomes orphaned (see updateThunk()).

but I don't understand how can owner of a BLACKHOLE not point to our capability if we BLACKHOLEd it.

A BLACKHOLE can be updated by another thread at any time. The fact that it is a BLACKHOLE is not an exclusive ownership. We use compare-and-swap to install the BLACKHOLE in threadPaused(), but another thread might already be evaluating the thunk and will update it with the value later, and we don't use a compare-and-swap when updating because that would be too expensive.

comment:20 Changed 11 months ago by osa1

Differential Rev(s): Phab:D5051 (reverted), Phab:D5165Phab:D5051 (reverted), Phab:D5165, Phab:D5178

Thanks Simon. Submitted a patch with some comments and an update to the sanity check.

Current status:

  • debug runtime seems to work (the program takes too long to run and I killed it after 30 mins or so)
  • debug + threaded runtime seems to work (did 10 runs with +RTS -DS -N2)
  • debug + prof fails with
    Mult: internal error: ASSERTION FAILED: file rts/sm/Sanity.c, line 210
    
      (GHC version 8.7.20180923 for x86_64_unknown_linux)
      Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
    
    (the PAP invariant bug)
  • debug + prof + threaded fails with different errors like
    Mult: internal error: invalid closure, info=0x53b130
      (GHC version 8.7.20180923 for x86_64_unknown_linux)
      Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
    
    or
    Mult: internal error: ASSERTION FAILED: file rts/sm/Evac.c, line 890
      (GHC version 8.7.20180923 for x86_64_unknown_linux)
      Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
    

comment:21 Changed 11 months ago by Ömer Sinan Ağacan <omeragacan@…>

In d90946c/ghc:

Fix a MSG_BLACKHOLE sanity check, add some comments

Reviewers: simonmar, bgamari, erikd

Reviewed By: simonmar

Subscribers: rwbarton, carter

GHC Trac Issues: #15508

Differential Revision: https://phabricator.haskell.org/D5178

comment:22 Changed 11 months ago by bgamari

Does comment:21 fix this issue, Omer?

comment:23 Changed 11 months ago by osa1

No, comment:20 is still valid. I used commit in comment:21 when writing comment:20.

comment:24 Changed 9 months ago by bgamari

Milestone: 8.6.18.8.1
Priority: highhighest

We should really try getting to the bottom of this; this causes spurious testsuite failures at least once a week.

comment:25 Changed 9 months ago by alpmestan

Cc: alpmestan added

comment:26 Changed 8 months ago by osa1

One of the remaining issues is this: sometimes with prof+threaded builds we increment info pointers. This either segfaults, or the GC thinks that the object is already evacuated (because info pointer looks like a forwarding pointer when we increment it) and chaos ensues. Here's an example, in this generated code for GHC.Conc.Sync.childHandler:

Dump of assembler code for function base_GHCziConcziSync_childHandler1_info:
   0x000000000051c5c8 <+0>:     add    $0x28,%r12
   0x000000000051c5cc <+4>:     cmp    0x358(%r13),%r12
   0x000000000051c5d3 <+11>:    ja     0x51c61a <base_GHCziConcziSync_childHandler1_info+82>
   0x000000000051c5d5 <+13>:    mov    0x360(%r13),%rax
=> 0x000000000051c5dc <+20>:    addq   $0x3,0x48(%rax)
   0x000000000051c5e1 <+25>:    movq   $0x51bd50,-0x20(%r12)
   0x000000000051c5ea <+34>:    mov    0x360(%r13),%rax
   0x000000000051c5f1 <+41>:    mov    %rax,-0x18(%r12)
   0x000000000051c5f6 <+46>:    mov    0xd34988,%eax
   0x000000000051c5fd <+53>:    shl    $0x1e,%rax
   0x000000000051c601 <+57>:    mov    %rax,-0x10(%r12)
   0x000000000051c606 <+62>:    mov    %r14,(%r12)
   0x000000000051c60a <+66>:    mov    $0xb57742,%r14d
   0x000000000051c610 <+72>:    lea    -0x20(%r12),%rbx
   0x000000000051c615 <+77>:    jmpq   0xa3c4d0 <stg_catchzh>
   0x000000000051c61a <+82>:    movq   $0x28,0x388(%r13)
   0x000000000051c625 <+93>:    mov    $0xb57770,%ebx
   0x000000000051c62a <+98>:    jmpq   *-0x8(%r13)

The highlighted line increments this object's info pointer:

>>> print *(StgClosure*)0x4207528048
$34 = {
  header = {
    info = 0x9f2200 <ghczmprim_GHCziTypes_ZC_con_info>,
    prof = {
      ccs = 0xb44390,
      hp = {
        rs = 0x0,
        ldvw = 0
      }
    }
  },
  payload = 0x4207528060
}
>>> call printClosure((StgClosure*)0x4207528048)
ghc-prim:GHC.Types.:(0x42075280b9, 0x4207528092)

After that line:

>>> print *(StgClosure*)0x4207528048
$40 = {
  header = {
    info = 0x9f2203 <ghczmprim_GHCziTypes_ZC_con_info+3>,
    prof = {
      ccs = 0xb44390,
      hp = {
        rs = 0x0,
        ldvw = 0
      }
    }
  },
  payload = 0x4207528060
}

This is not a valid closure anymore.

I don't know what that code is supposed to increment. I'll try to generate Cmm for GHC.Conc.Sync.childHandler and see.

comment:27 Changed 8 months ago by osa1

Looking at the assembly again, this line

   0x000000000051c5cc <+4>:     cmp    0x358(%r13),%r12

does heap limit check for allocation (%r12 is the reg table, 0x358 = 856, offset for rHpLim), then this line

   0x000000000051c5d5 <+13>:    mov    0x360(%r13),%rax

Moves rCCCS (current cost centre stack) to %rax, so this line

   0x000000000051c5dc <+20>:    addq   $0x3,0x48(%rax)

Tries to increment current cost centre stack's mem_alloc! Somehow rCCCS->mem_alloc is actually info pointer of another closure!

comment:28 Changed 8 months ago by osa1

Here's where we break the rCCCS register:

Thread 2 hit Hardware watchpoint 4: ((StgRegTable*)0xd35718)->rCCCS

Old value = (struct CostCentreStack_ *) 0x4207528000
New value = (struct CostCentreStack_ *) 0xd2df60 <CCS_SYSTEM>
0x0000000000a376f8 in stg_returnToStackTop ()

>>> bt
#0  0x0000000000a376f8 in stg_returnToStackTop ()
#1  0x0000000000000000 in ?? ()

>>> call printClosure(0x4207528000)
ghc-prim:GHC.Types.:(0x4207528071, 0x420752804a)

comment:29 Changed 8 months ago by osa1

Looking at the assembly for returnToStackTop, we're loading a TSO's prof field to rCCCS so the problem is with the prof field of a TSO. Debugging further, here's where we first update prof field of a TSO with the problematic value (0x4207528000):

Thread 2 hit Hardware watchpoint 6: ((StgTSO *) 0x420019d7e0)->prof.cccs

Old value = (CostCentreStack *) 0x4207528000
New value = (CostCentreStack *) 0xb44390
0x0000000000a37819 in stg_returnToSched ()

>>> bt
#0  0x0000000000a37819 in stg_returnToSched ()
#1  0x0000000000000000 in ?? ()

>>> print *(CostCentreStack *)0xb44390
$60 = {
  ccsID = 241,
  cc = 0xb44350,
  prevStack = 0xd2df00 <CCS_MAIN>,
  indexTable = 0x0,
  root = 0xb44390,
  depth = 1,
  scc_count = 0,
  selected = 1,
  time_ticks = 0,
  mem_alloc = 59522365,
  inherited_alloc = 0,
  inherited_ticks = 0
}

(Note that this is reverse executions so "old" is actually new)

Interestingly at this point 0x4207528000 actually looks like a valid cost centre stack:

>>> print *(CostCentreStack *)0x4207528000
$62 = {
  ccsID = 283607809913,
  cc = 0xb44390,
  prevStack = 0x5,
  indexTable = 0x4207528071,
  root = 0x420752804a,
  depth = 283607809953,
  scc_count = 11813776,
  selected = 4,
  time_ticks = 0,
  mem_alloc = 283607809985,
  inherited_alloc = 11813776,
  inherited_ticks = 5
}

>>> print *((CostCentreStack *)0x4207528000)->cc
$64 = {
  ccID = 241,
  label = 0xb44350 "\a",
  module = 0xd2df00 <CCS_MAIN> "|",
  srcloc = 0x0,
  mem_alloc = 11813776,
  time_ticks = 1,
  is_caf = 0,
  link = 0x1
}

So currently we don't have a problem. If I continue:

Thread 2 hit Hardware watchpoint 8: ((StgTSO *) 0x420019d7e0)->prof.cccs

Old value = (CostCentreStack *) 0xb44390
New value = (CostCentreStack *) 0x4207528000
0x0000000000a3d1c2 in stg_AP_STACK_info ()

>>> bt
#0  0x0000000000a3d1c2 in stg_AP_STACK_info ()
#1  0x0000000000000000 in ?? ()

>>> print *(StgClosure*)0x4207528000
$69 = {
  header = {
    info = 0x9f2200 <ghczmprim_GHCziTypes_ZC_con_info>,
    prof = {
      ccs = 0xb44390,
      hp = {
        rs = 0x0,
        ldvw = 0
      }
    }
  },
  payload = 0x4207528018
}

This is the second time we update the TSO->prof with the problematic value, and NOW things look bad.

I don't understand how we can use the same location for a cost centre stack and a heap closure. As far as I know cost centre stacks are allocated in an arena, and are never freed.

comment:30 Changed 8 months ago by osa1

Cc: simonmar added

I came across something interesting. While digging deeper I found that this is where we assign a location of heap (instead of location of the arena where we allocate cost centre stacks) to a closure's CCS field:

#0  0x0000000000a15794 in raiseAsync (cap=0xd35700 <MainCapability>, tso=0x420320f000, exception=0xb570c0, stop_at_atomically=false, stop_here=0x0) at rts/RaiseAsync.c:873
#1  0x0000000000a14aa6 in throwToMsg (cap=0xd35700 <MainCapability>, msg=0x42000be730) at rts/RaiseAsync.c:275
#2  0x0000000000a148f0 in throwTo (cap=0xd35700 <MainCapability>, source=0x4205b6f728, target=0x420320f000, exception=0xb570c0) at rts/RaiseAsync.c:213
#3  0x0000000000a3c323 in stg_killThreadzh ()
#4  0x0000000000000000 in ?? ()

The closure (an AP_STACK) previously has this cost centre stack:

>>> print ((StgClosure *) 0x4200345e70)->header.prof.ccs
$30 = (CostCentreStack *) 0xb44390
>>> print *((StgClosure *) 0x4200345e70)->header.prof.ccs
$31 = {
  ccsID = 241,
  cc = 0xb44350,
  prevStack = 0xd2df00 <CCS_MAIN>,
  indexTable = 0x0,
  root = 0xb44390,
  depth = 1,
  scc_count = 0,
  selected = 1,
  time_ticks = 0,
  mem_alloc = 59522362,
  inherited_alloc = 0,
  inherited_ticks = 0
}

The new value is 0x4207528000, a location in the heap.

The new location is coming from another closure so I should debug more and see how the value ended up in that closure, but I noticed a TODO comment around the code that makes this assignment:

            -- RaiseAsync.c:873
            SET_HDR(ap,&stg_AP_STACK_info,
                    ((StgClosure *)frame)->header.prof.ccs /* ToDo */);

I wonder if that ToDo has anything to do with this. @simonmar, any ideas? The commit that added that ToDo: b1953bbb1ed3cb16497e5447db7487f0c2d9e41a.

comment:31 Changed 8 months ago by simonmar

The code is getting the ccs for the AP_STACK closure from the update frame. I imagine the "ToDo" was to think about whether this is really the right thing. It shouldn't cause a crash - the CCS in the update frame should be valid - and I don't know where else we would get a useful CCS from for the AP_STACK, so it looks reasonable to me.

comment:32 Changed 8 months ago by osa1

Found another bug. In RaiseAsync.c:

            -- RaiseAsync.c:924
            SET_HDR(ap,&stg_AP_STACK_NOUPD_info,
                    ((StgClosure *)frame)->header.prof.ccs /* ToDo */);

We're reading CCS of an underflow frame, but underflow frames do not have profiling headers:

typedef struct {
    const StgInfoTable* info;
    struct StgStack_ *next_chunk;
} StgUnderflowFrame;

I also checked other stack frame definitions in Closures.h. Most of them have a proper header (StgHeader), except StgUnderflowFrame and StgRetFun. Simon, should we add proper headers to these frames or is the bug in somewhere else? I'm not sure when should a stack frame has a proper header (that includes profiling stuff in profiling builds) vs. just a info ptr.

comment:34 Changed 8 months ago by osa1

Differential Rev(s): Phab:D5051 (reverted), Phab:D5165, Phab:D5178Phab:D5051 (reverted), Phab:D5165, Phab:D5178, MR:103

comment:35 Changed 8 months ago by osa1

Now that MR:103 is merged things are pretty good. Currently I can only make this test fail with -DS in non-threaded runtime, and that's because of the new sanity check I added with 82d1a88dec216d761b17252ede760da5c566007f. Hopefully this should be easier to fix as assertions failures are usually easy to figure out.

The commit merged with MR:103: 74cd4ec5d2f9321aad5db3285cb60d78f2562996

Note that the commit in comment:3 was later reverted because it broke things. I'll try that commit again as it may have just revealed some of the bugs we fixed in the meantime. In the meantime if it was merged to 8.8 branch we should also merge the revert commit.

comment:36 Changed 8 months ago by osa1

Ah, that's a bug in the sanity check. It turns out we don't update free pointers of arena blocks, instead we have a cached value of the free pointer and update that. Patch coming.

comment:37 Changed 8 months ago by osa1

Submitted MR:118 for this. Now only panic I can reproduce is Sanity.c:210 (the PAP invariant bug). I'll try enabling the commit in comment:3 again and see if it works after fixing all those other bugs.

comment:38 Changed 8 months ago by osa1

The commit in comment:3 still causes problems. I think the problem is somewhere in stg_ap_0_fast in the profiling runtime.

comment:39 Changed 7 months ago by osa1

Status: newmerge

Summary for 8.8.1:

  • If d13b7d6065 is merged, 8a6aa87766 (revert commit) should be merged too.
  • d90946cea1 and 74cd4ec5d2 should be merged.

Updating the status to put this into the merge queue but we should revert this to "new" afterwards.

comment:40 Changed 7 months ago by osa1

Here are some notes. Compile and run:

ghc-stage2 Mult.hs -prof -fprof-auto -fforce-recomp -debug && ./Mult +RTS -DS

At one point we do a GC and continue:

>>> bt
#0  0x0000000000a0ed5d in StgRunIsImplementedInAssembler () at rts/StgCRun.c:370
#1  0x0000000000a0226a in schedule (initialCapability=0xd2c600 <MainCapability>, task=0x242dd70) at rts/Schedule.c:452
#2  0x0000000000a03b7b in scheduleWaitThread (tso=0x4200105370, ret=0x0, pcap=0x7ffdbfd88f10) at rts/Schedule.c:2547
#3  0x0000000000a15e4b in rts_evalLazyIO (cap=0x7ffdbfd88f10, p=0xb3bb98, ret=0x0) at rts/RtsAPI.c:530
#4  0x0000000000a175d3 in hs_main (argc=1, argv=0x7ffdbfd89108, main_closure=0xb3bb98, rts_config=...) at rts/RtsMain.c:72
#5  0x0000000000463e78 in main ()

If I step a few instructions I see these instructions being executed

    0x0000000000a35d60 stg_ap_0_fast+336 mov    0x8(%rbp),%rax
    0x0000000000a35d64 stg_ap_0_fast+340 mov    0x10(%rbp),%rbx
    0x0000000000a35d68 stg_ap_0_fast+344 mov    0x18(%rbp),%rcx
    0x0000000000a35d6c stg_ap_0_fast+348 add    $0x20,%rbp
    0x0000000000a35d70 stg_ap_0_fast+352 mov    %rcx,%rdx

This is in stg_ap_0_fast and %rbx is supposed to be a FUN for a PAP, but it's actually collected by the GC:

>>> print (StgClosure*)$rbx
$27 = (StgClosure *) 0x42000b2868
>>> print *(StgClosure*)$rbx
$28 = {
  header = {
    info = 0xaaaaaaaaaaaaaaaa,
    prof = {
      ccs = 0xaaaaaaaaaaaaaaaa,
      hp = {
        rs = 0xaaaaaaaaaaaaaaaa,
        ldvw = 12297829382473034410
      }
    }
  },
  payload = 0x42000b2880
}

Interestingly only some part of it is collected (the first 3 words).

comment:41 Changed 7 months ago by Ömer Sinan Ağacan <omeragacan@…>

In 82d1a88d/ghc:

Implement a sanity check for CCS fields in profiling builds

This helped me debug one of the bugs in #15508. I'm not sure if this is
a good idea, but it worked for me, so wanted to submit this as a MR.

comment:42 Changed 7 months ago by Ömer Sinan Ağacan <omeragacan@…>

In 74cd4ec5/ghc:

Fix raiseAsync() UNDERFLOW_FRAME handling in profiling runtime

UNDERFLOW_FRAMEs don't have profiling headers so we have to use the
AP_STACK's function's CCS as the new frame's CCS.

Fixes one of the many bugs caught by concprog001 (#15508).

comment:43 Changed 7 months ago by bgamari

Milestone: 8.8.18.6.4

I'm very happy to have this fixed.

comment:44 Changed 7 months ago by osa1

Differential Rev(s): Phab:D5051 (reverted), Phab:D5165, Phab:D5178, MR:103Phab:D5051 (reverted), Phab:D5165, Phab:D5178, MR:103, MR:146

Last bug is fixed in MR:146.

Unfortunately we won't be able to test this on CI as this program takes forever to run in non-threaded configurations. It'd be great to have some tests for the stuff we fixed in this ticket.

comment:45 Changed 7 months ago by bgamari

Resolution: fixed
Status: mergeclosed

comment:46 Changed 7 months ago by osa1

Owner: osa1 deleted
Resolution: fixed
Status: closednew

This is still not closed as https://gitlab.haskell.org/ghc/ghc/merge_requests/146 is still not merged.

comment:47 Changed 7 months ago by osa1

Owner: set to osa1

comment:48 Changed 6 months ago by osa1

bgamari, I think we want to merge !146 for the next release.

comment:49 Changed 6 months ago by Marge Bot <ben+marge-bot@…>

In 908b4b86/ghc:

Fix two bugs in stg_ap_0_fast in profiling runtime

This includes two bug fixes in profiling version of stg_ap_0_fast:

- PAPs allocated by stg_ap_0_fast are now correctly tagged. This
  invariant is checked in Sanity.c:checkPAP.

  (This was originally implemented in 2693eb11f5, later reverted with
  ab55b4ddb7 because it revealed the bug below, but it wasn't clear at
  the time whether the bug was the one below or something in the commit)

- The local variable `untaggedfun` is now marked as a pointer so it
  survives GC.

With this we finally fix all known bugs caught in #15508. `concprog001`
now works reliably with prof+threaded and prof runtimes (with and
without -debug).
Note: See TracTickets for help on using tickets.