Opened 9 years ago

Closed 9 years ago

#4303 closed bug (wontfix)

Segfault in RTS (apparently only MacOS) FFI related?

Reported by: patperry Owned by: igloo
Priority: highest Milestone: 7.2.1
Component: Runtime System Version: 6.12.3
Keywords: Cc: pho@…, jwlato@…
Operating System: MacOS X Architecture: x86
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

The RTS crashes in the middle of a long-running computation involving multiple FFI calls. I suspect this has something to do with the foreign call spawning multiple threads, but I'm not sure.

Specifically, the crash is in the middle of a long loop. Each iteration performs a lot of memory allocation, so GC collections happen fairly frequently. Also, each iteration involves calls to the (external) BLAS function DGER. On my machine (dual-processor Mac OS with vecLib), BLAS functions are often multi-threaded. When I comment out the call to DGER, the bug does not manifest.

Here is the site of the crash (as reported by gdb):

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xc000162d
0x003154ce in StgReturn () at rts/StgCRun.c:137
137	    __asm__ volatile (


Sometimes the crash is at:

rtsbug: internal error: ASSERTION FAILED: file rts/Schedule.c, line 536

I'm attaching source files for the buggy program. Unfortunately, they require BLAS and LAPACK. If there's anything I can do to help, I would be glad to assist.

Attachments (2)

RTSBug.tar.gz (151.3 KB) - added by patperry 9 years ago.
Source files
Fisher.hs (930 bytes) - added by jwlato 9 years ago.
Fisher.hs file, with type annotation required for ghc7

Download all attachments as: .zip

Change History (33)

Changed 9 years ago by patperry

Attachment: RTSBug.tar.gz added

Source files

comment:1 Changed 9 years ago by patperry

To run the program, you need sqlite3, BLAS, and LAPACK installed.

You also need the data file "enron.db" to be in the current directory. This file is too large to upload to track, but I am hosting it at http://people.seas.harvard.edu/~patperry/enron.db

Here is how to reproduce the bug:

download RTSBug.tar.gz (attached, above), and enron.db to the current directory

tar xzvf RTSBug.tar.gz
cd RTSBug
runhaskell Setup.lhs configure
runhaskell Setup.lhs build
cp ../enron.db .
./dist/build/rtsbug/rtsbug

You should see output like:

0
1
2
3
4
5
6
7
8
9
10
.
.
.

The numbers will increase until you get a segfault. It usually happens before 10000.

comment:2 Changed 9 years ago by patperry

A few notes:

Marking the FFI call as safe or unsafe doesn't seem to matter. Compiling with or without "-threaded" doesn't seem to matter, either.

There is a little more information as to the relevant source files in the README inside RTSBug.tar.gz

comment:3 Changed 9 years ago by simonmar

Milestone: 7.0.1
Owner: set to simonmar
Priority: normalhighest

comment:4 Changed 9 years ago by patperry

Bad (good?) news-- I haven't been able to reproduce the bug. The test program was segfaulting reliably a few days ago, but is running smoothly today. This may have to do with the programs I was running the the background. Today, the only thing fishy going on is that when I run the program in gdb, I get sporadic warnings:

warning: error on line 120 of
"/SourceCache/gdb/gdb-1461.2/src/gdb/macosx/macosx-nat-infthread.c" in
function "modify_trace_bit": (os/kern) invalid argument (0x4x)
warning: error on line 396 of
"/SourceCache/gdb/gdb-1461.2/src/gdb/macosx/macosx-nat-infthread.c" in
function "prepare_threads_after_stop": (os/kern) invalid argument
(0x4x)
warning: error on line 120 of
"/SourceCache/gdb/gdb-1461.2/src/gdb/macosx/macosx-nat-infthread.c" in
function "modify_trace_bit": (ipc/send) invalid destination port
(0x268435459x)
warning: error on line 396 of
"/SourceCache/gdb/gdb-1461.2/src/gdb/macosx/macosx-nat-infthread.c" in
function "prepare_threads_after_stop": (ipc/send) invalid destination
port (0x268435459x)

(the relevant gdb source is at http://www.opensource.apple.com/source/gdb/gdb-1461.2/src/gdb/macosx/macosx-nat-infthread.c ).

This happens reliably on my laptop, running Mac OS 10.6.4 on a 2.4 GHz Intel Core 2 Duo. I haven't observed any bad behavior on my desktop (Mac OS 10.5.8, 2 X 2.26 GHz Quad-Core Intel Xeon).

Sorry for potentially wasting your time. If none of this means anything to you, you should probably close the bug.

comment:5 Changed 9 years ago by patperry

DISREGARD the last comment. I forgot that I had implemented a work-around.

The only useful information in that last comment is that I was *not* able to duplicate the bug on a machine with 8 cores (Mac OS 10.5.8, 2 X 2.26 GHz Quad-Core Intel Xeon).

comment:6 Changed 9 years ago by simonmar

Many thanks for producing such a complete bug report.

Unfortunately I haven't been able to reproduce it so far, the program is up to 20000 on x86_64/Linux with GHC HEAD. I'll try some other compiler/architecture combinations.

comment:7 Changed 9 years ago by simonmar

Not reproducible on x86_64/Linux with GHC 6.12.3.

comment:8 Changed 9 years ago by simonmar

Not reproducible on x86/Linux with GHC 6.12.3.

Ian, could you try on your Mac?

comment:9 Changed 9 years ago by simonmar

Owner: changed from simonmar to igloo

comment:10 Changed 9 years ago by igloo

patperry. where did you get your BLAS and LAPACK libraries from?

comment:11 Changed 9 years ago by patperry

I'm using vecLib, which comes pre-installed with Mac OS.

comment:12 Changed 9 years ago by igloo

Owner: igloo deleted

Ah, great, that makes things easy.

It works for me. After ~17.5 mins it terminates. The end of stderr is

21632
21633
21634

and the end of stdout is

fisher: [NaN,-7.026716759083081e289,-3.1065971684421363e296,[...],1.5422275944534226e164,2.3284647584471463e290]

comment:13 Changed 9 years ago by patperry

igloo: how many cores are there on your machine?

Summary (on my end):

Mac OS 10.6.4, vecLib, 2 cores: segfault Mac OS 10.5.8, vecLib, 8 cores: no segfault

comment:14 Changed 9 years ago by patperry

Here's more output from gdb:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xc0001635
0x003144ce in StgReturn () at rts/StgCRun.c:137
137	    __asm__ volatile (
(gdb) info registers
eax            0x38a8cc	3713228
ecx            0xb4b000	11841536
edx            0xbffff780	-1073744000
ebx            0xa6d01a69	-1496311191
esp            0xbfffd780	0xbfffd780
ebp            0xc0001641	0xc0001641
esi            0xbf7a6d01	-1082495743
edi            0x1a6d01a	27709466
eip            0x3144ce	0x3144ce <StgReturn+21>
eflags         0x10282	66178
cs             0x17	23
ss             0x1f	31
ds             0x1f	31
es             0x1f	31
fs             0x0	0
gs             0x37	55
(gdb) p &r
$1 = (StgRegTable **) 0xc0001635
(gdb) 

comment:15 Changed 9 years ago by igloo

$ sysctl -n hw.ncpu
2

OS X version 10.5.8.

comment:16 Changed 9 years ago by simonmar

So perhaps this only happens on 10.6?

comment:17 Changed 9 years ago by PHO

Cc: pho@… added

comment:18 in reply to:  16 Changed 9 years ago by patperry

Replying to simonmar:

So perhaps this only happens on 10.6?

Looks like it. This doesn't mean that the OS is causing the problem, though. It's possible that Apple upgraded vecLib with the 10.6 release.

I'll be traveling for the next week, but when I get back I'll try upgrading my desktop to 10.6 to see if that changes anything.

comment:19 Changed 9 years ago by patperry

I've upgraded my desktop to 10.6.4 (the same OS as my laptop). I am unable to reproduce the bug on my desktop, but I am consistently able to reproduce it on my laptop.

Here are the architectures of my machines:

Laptop: 2.4 GHz Intel Core 2 Duo (4 GB RAM) Desktop: 2 x 2.26 Ghz Quad-Core Intel Xeon (6 GB RAM)

comment:20 Changed 9 years ago by patperry

CORRECTION!!!! I just got a segfault on my desktop.

comment:21 Changed 9 years ago by simonpj

Summary: Segfault in RTS (FFI related?)Segfault in RTS (apparently only MacOS) FFI related?

comment:22 Changed 9 years ago by simonpj

Owner: set to simonmar

comment:23 Changed 9 years ago by igloo

Milestone: 7.0.17.0.2

comment:24 Changed 9 years ago by igloo

Milestone: 7.0.27.0.3

comment:25 Changed 9 years ago by simonmar

This might possibly be the same as #4850

comment:26 Changed 9 years ago by jwlato

Cc: jwlato@… added

I was able to run tests with a few different ghc versions. My system: 2 GHz Core2Duo, OS 10.6.2 (Snow Leopard)

Results are:

ghc 7.0.1          32bit - crash
ghc 7.1.20110113   32bit - crash
ghc 7.0.1          64bit - could not reproduce
ghc 7.0.1.20101215 64bit - could not reproduce

The crash was the same as reported by patperry.

I did have to modify one of the files slightly (Fisher.hs, escaped skolem) to build with ghc-7. ieee also needed a minor tweak

Changed 9 years ago by jwlato

Attachment: Fisher.hs added

Fisher.hs file, with type annotation required for ghc7

comment:27 Changed 9 years ago by igloo

Owner: changed from simonmar to igloo

Having upgraded to 10.6.3 (from 10.5.8) I can reproduce this with 6.12.3.

comment:28 Changed 9 years ago by igloo

And after a little tweaking to get it to build, with 7.0.2 RC 2:

2288
2289
rtsbug: internal error: ASSERTION FAILED: file rts/Schedule.c, line 493

    (GHC version 7.0.1.20110217 for i386_apple_darwin)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
Abort trap

comment:29 Changed 9 years ago by simonmar

I've been looking at this, and have been able to reproduce it, but I don't know what the problem is. Here is what I've discovered:

The program does not seem to crash when run with -V0. It crashes slightly more often with -C0.001. Suggesting that the problem is somehow related to timing interrupts, which would explain why it is still non-deterministic even with the non-threaded RTS.

When StgReturn is called, the C stack is supposed to look something like this:

(gdb) p32 $edx-64
0xbffff99c:     0x120ab0b <schedule+948>
0xbffff998:     0xbffff9e8
0xbffff994:     0xbffffb20
0xbffff990:     0x24
0xbffff98c:     0x2cfe000
0xbffff988:     0xbbbbbbbb
0xbffff984:     0xbbbbbbbb
0xbffff980:     0xbbbbbbbb
0xbffff97c:     0xbbbbbbbb
0xbffff978:     0xbbbbbbbb
0xbffff974:     0xbbbbbbbb
0xbffff970:     0xbbbbbbbb
0xbffff96c:     0xbffff998
0xbffff968:     0x0
0xbffff964:     0x0
0xbffff960:     0xbffffb20
0xbffff95c:     0xbbbbbbbb
0xbffff958:     0xbbbbbbbb
0xbffff954:     0xbbbbbbbb
0xbffff950:     0xbbbbbbbb
0xbffff94c:     0xbbbbbbbb
0xbffff948:     0xbbbbbbbb
0xbffff944:     0xbbbbbbbb
0xbffff940:     0xbbbbbbbb
0xbffff93c:     0xbbbbbbbb
0xbffff938:     0xbbbbbbbb
0xbffff934:     0xbbbbbbbb
0xbffff930:     0xbbbbbbbb
0xbffff92c:     0xbbbbbbbb
0xbffff928:     0xbbbbbbbb
0xbffff924:     0xbbbbbbbb
0xbffff920:     0xbbbbbbbb

(I added a memset to fill in the unused space with 0xbbbbbbbb so we can easily see what has been written).

The 4 words around 0xbffff960 are the saved values of some registers, which are restored by StgReturn before returning to the RTS. Now, when the crash happens, the stack looks something like this:

(gdb) p32 $edx-64
0xbffffa0c:     0x120ab0b <schedule+948>
0xbffffa08:     0xbffffa58
0xbffffa04:     0xbffffb98
0xbffffa00:     0x0
0xbffff9fc:     0x1
0xbffff9f8:     0xbbbbbbbb
0xbffff9f4:     0xbbbbbbbb
0xbffff9f0:     0xbbbbbbbb
0xbffff9ec:     0xbf8a6d01
0xbffff9e8:     0xa6d01a6d
0xbffff9e4:     0xbbbbbbbb
0xbffff9e0:     0xbbbbbbbb
0xbffff9dc:     0xc0001771
0xbffff9d8:     0x1a6d01a
0xbffff9d4:     0xbf7a6d01
0xbffff9d0:     0xbffffb98
0xbffff9cc:     0xbf7a6d01
0xbffff9c8:     0xa6d01a64
0xbffff9c4:     0xbbbbbbbb
0xbffff9c0:     0xbbbbbbbb
0xbffff9bc:     0xbbbbbbbb
0xbffff9b8:     0xbbbbbbbb
0xbffff9b4:     0xbbbbbbbb
0xbffff9b0:     0xbbbbbbbb
0xbffff9ac:     0xbbbbbbbb
0xbffff9a8:     0xbbbbbbbb
0xbffff9a4:     0xbbbbbbbb
0xbffff9a0:     0xbbbbbbbb
0xbffff99c:     0xbbbbbbbb
0xbffff998:     0xbbbbbbbb
0xbffff994:     0xbf7a6d01
0xbffff990:     0xa6d01a67

i.e. those register values have been overwritten by junk. In fact, it looks a lot like floating point values to me.

The exact values are different each time, although I've seen similar values crop up. The exact addresses that have been overwritten also differ. Here's another sample:

(gdb) p32 $edx-64
0xbffffa0c:     0x120ab0b <schedule+948>
0xbffffa08:     0xbffffa58
0xbffffa04:     0xbffffba0
0xbffffa00:     0x0
0xbffff9fc:     0x4a39000
0xbffff9f8:     0xbbbbbbbb
0xbffff9f4:     0xbbbbbbbb
0xbffff9f0:     0xbbbbbbbb
0xbffff9ec:     0xbbbbbbbb
0xbffff9e8:     0xbbbbbbbb
0xbffff9e4:     0xbbbbbbbb
0xbffff9e0:     0xbbbbbbbb
0xbffff9dc:     0xc0001771
0xbffff9d8:     0x1a6d01a
0xbffff9d4:     0xbf7a6d01
0xbffff9d0:     0xbffffba0
0xbffff9cc:     0xbf7a6d01
0xbffff9c8:     0xa6d01a64
0xbffff9c4:     0xbbbbbbbb
0xbffff9c0:     0xbbbbbbbb
0xbffff9bc:     0xbbbbbbbb
0xbffff9b8:     0xbbbbbbbb
0xbffff9b4:     0xbbbbbbbb
0xbffff9b0:     0xbbbbbbbb
0xbffff9ac:     0xbbbbbbbb
0xbffff9a8:     0xbbbbbbbb
0xbffff9a4:     0xbbbbbbbb
0xbffff9a0:     0xbbbbbbbb
0xbffff99c:     0xbbbbbbbb
0xbffff998:     0xbbbbbbbb
0xbffff994:     0xbbbbbbbb
0xbffff990:     0xbbbbbbbb

So something is corrupting memory, in an unpredictable way.

Is there any way to get the size of the example down at all?

comment:30 in reply to:  29 Changed 9 years ago by patperry

Replying to simonmar:

Is there any way to get the size of the example down at all?

At some point I made an attempt, but didn't get very far. Unfortunately I don't currently have time to help with this.

comment:31 Changed 9 years ago by igloo

Resolution: wontfix
Status: newclosed

Unfortunately, it's not clear whether this is really a GHC bug or a bug in one of the libraries, and we don't have the resources to minimise the testcase, so I'll close this ticket. If you manage to make a small testcase, please open another ticket.

Note: See TracTickets for help on using tickets.