Opened 6 years ago

Closed 5 years ago

#7993 closed bug (worksforme)

ghc 7.6 (not 7.4) sometimes hangs at child process exit on s390x

Reported by: cjwatson
Owned by:
Priority: normal
Milestone:
Component: Runtime System
Version: 7.6.3
Keywords:
Cc: simonmar
Operating System: Linux
Architecture: Other
Type of failure: Other
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:


On Debian's s390x architecture (64-bit S/390, Linux kernel), builds of several packages hang with GHC 7.6 where they did not hang with GHC 7.4. In particular, ghc itself hangs during its own build when bootstrapping with 7.6. This is quite easy to reproduce on affected systems, although it doesn't hang in exactly the same place every time. It appears that the runtime sometimes deadlocks when a subprocess exits; the strace looks like this:

7523  exit_group(0)                     = ?
6680  <... futex resumed> )             = ? ERESTARTSYS (To be restarted)
6680  --- SIGCHLD (Child exited) @ 0 (0) ---
6680  futex(0x84fa86ac, FUTEX_WAIT_PRIVATE, 1143, NULL) = ? ERESTARTSYS (To be restarted)
6680  --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
6680  sigreturn()                       = ? (mask now [])
6680  futex(0x84fa86ac, FUTEX_WAIT_PRIVATE, 1143, NULL) = ? ERESTARTSYS (To be restarted)
6680  --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
6680  sigreturn()                       = ? (mask now [])
6680  futex(0x84fa86ac, FUTEX_WAIT_PRIVATE, 1143, NULL) = ? ERESTARTSYS (To be restarted)
[repeats forever]

ghc spawns enough subprocesses (gcc etc.) that it's essentially bound to hit this sooner or later. I suspect perhaps a lack of signal-safety somewhere - at an extremely wild guess, perhaps the type of an important variable written in a signal handler happens to exceed the size of sig_atomic_t on s390x and not elsewhere - but I haven't yet been able to track this down in the time available to me.

If you don't immediately recognise this as something obvious, then perhaps somebody more fluent in Haskell than I would be good enough to suggest test code that exercises this and is somewhat simpler than "build ghc"? If my analysis is at all close to the mark, then something that sits in a loop forking and reaping a trivial child process on each iteration should be enough to reproduce this. On the assumption that most non-Debian-developers don't have convenient access to S/390 machines (Debian developers can use, I'd be happy to try things out.

Change History (6)

comment:1 Changed 6 years ago by nomeata

difficulty: Unknown

I tried to find out if -V0 helps, but unfortunately it does not.

comment:2 Changed 6 years ago by nomeata

I tried to reproduce the problem by spawning lots of processes, but

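import System.Process
-- Run /bin/echo 10001 times in sequence, reading and discarding its output each time.
main :: IO ()
main = mapM_ (\_ -> readProcess "/bin/echo" ["hello", "world"] "") [0..10000]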

did not deadlock.
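
For anyone who wants to try a variant closer to the "fork and reap a trivial child on each iteration" idea from the description, here is a minimal sketch. It is untested on s390x and not known to reproduce the hang. The child program (/bin/true), the iteration count and the helper name spawnAndReap are arbitrary choices made for illustration. Linking with -threaded may or may not matter; that is a guess, not something established in this ticket.

```haskell
import Control.Monad (forM_, when)
import System.Exit (ExitCode (..))
import System.IO (hFlush, stdout)
import System.Process (createProcess, proc, waitForProcess)

-- Spawn one trivial child (no pipes attached) and block until it has been reaped.
-- spawnAndReap is a hypothetical helper name; /bin/true is an assumed path.
spawnAndReap :: IO ExitCode
spawnAndReap = do
  (_, _, _, ph) <- createProcess (proc "/bin/true" [])
  waitForProcess ph

main :: IO ()
main = forM_ [1 :: Int .. 10000] $ \i -> do
  code <- spawnAndReap
  -- Report any child that did not exit cleanly.
  when (code /= ExitSuccess) $
    putStrLn ("iteration " ++ show i ++ ": child exited with " ++ show code)
  -- Periodic progress output, so a hang shows roughly where the loop stopped.
  when (i `mod` 1000 == 0) $ do
    putStrLn ("reaped " ++ show i ++ " children")
    hFlush stdout
```

If the loop stops printing progress, running it under strace -f should show whether the process is stuck in the same futex/SIGVTALRM pattern as in the description.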

comment:3 Changed 6 years ago by pmylund

Cc: simonmar added

I am experiencing the same issue, but on x86_64, and with my own application which uses GHC (7.6.2) threads. On occasion a thread will loop forever, and give the same kind of output from strace. (Is it the same issue?)

Unfortunately I don't know how to begin troubleshooting. Whatever I do to try to reproduce it in a smaller test, the problem goes away if I don't use the combination of threads, STM and exception handling that I have in my larger application.

I will keep trying and report back, but any input is appreciated.

Update: Please ignore the above. It turns out I had become trapped in an infinite loop inside my own recursive function, and the strace output is what you would expect from an application that is actually doing something (such as looping forever). Sorry about that.

Last edited 6 years ago by pmylund

comment:4 Changed 6 years ago by simonmar

Getting a stack trace would probably help. You want to make sure that GHC itself is built with -debug: set GhcDebugged=YES in your (this will slow down the build, but you can remove it later). When the process hangs, attach to it with gdb and get a backtrace of all the threads.

comment:5 Changed 5 years ago by thomie

Status: new → infoneeded

Does this problem still occur with 7.8.3?

comment:6 Changed 5 years ago by nomeata

Resolution: worksforme
Status: infoneeded → closed

At least ghc itself seems to build fine:

I did not yet try to upload separate packages to be built with this.

I guess we can close/ignore this for now, and revisit if it occurs again with 7.8.
