Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#9423 closed bug (fixed)

shutdownCapability sometimes loops indefinitely on OSX after hs_exit()

Reported by: AndreasVoellmy
Owned by:
Priority: normal
Milestone: 7.10.1
Component: Runtime System
Version: 7.8.2
Keywords:
Cc: simonmar
Operating System: MacOS X
Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime
Test Case:
Blocked By:
Blocking:
Related Tickets: 9284
Differential Rev(s): Phab:D129
Wiki Page:

Description

Issue #9284 relates to forkProcess, which previously invoked the same code that hs_exit invokes, and which uncovered this problem. The resolution of #9284 is to not invoke the equivalent of hs_exit (for the reasons given in #9284). However, hs_exit can still be called by programs that explicitly create and tear down a Haskell runtime, so the problem displayed by #9284 can still occur for those programs.

The problem has only been observed on OS X, though it could probably occur on Linux as well.

Attachments (2)

Foo.hs (157 bytes) - added by AndreasVoellmy 5 years ago.
FooMain.c (258 bytes) - added by AndreasVoellmy 5 years ago.

Change History (12)

Changed 5 years ago by AndreasVoellmy

Attachment: Foo.hs added

Changed 5 years ago by AndreasVoellmy

Attachment: FooMain.c added

comment:1 Changed 5 years ago by AndreasVoellmy

The attached program illustrates the problem. Compile like this, where <your-ghc> should be a recent (7.8.x) GHC:

<your-ghc> -c Foo.hs
<your-ghc> -threaded -no-hs-main FooMain.c Foo.o

Then run a.out. You should see some printouts and then it should hang (i.e. fail to terminate). You may need to run it a few times to see the behavior.

comment:2 Changed 5 years ago by AndreasVoellmy

Differential Rev(s): Phab:D129

comment:3 Changed 5 years ago by Austin Seipp <austin@…>

In f9f89b7884ccc8ee5047cf4fffdf2b36df6832df/ghc:

rts/base: Fix #9423

Summary:
Fix #9423.

The problem in #9423 is caused when code invoked by `hs_exit()` waits
on all foreign calls to return, but some IO managers are in `safe` foreign
calls and do not return. The previous design signaled to the timer manager
(via its control pipe) that it should "die" and when the timer manager
returned to Haskell-land, the Haskell code in timer manager then signalled
to the IO manager threads that they should return from foreign calls and
`die`. Unfortunately, in the shutdown sequence the timer manager is unable
to return to Haskell-land fast enough and so the code that signals to the
IO manager threads (via their control pipes) is never executed and the IO
manager threads remain out in the foreign calls.

This patch solves this problem by having the RTS signal to all the IO
manager threads (via their control pipes; and in addition to signalling
to the timer manager thread) that they should shut down (in `ioManagerDie()`
in `rts/Signals.c`). To do this, we arrange for each IO manager thread to
register its control pipe with the RTS (in `GHC.Thread.startIOManagerThread`).
In addition, `GHC.Thread.startTimerManagerThread` registers its control pipe.
These are registered via C functions `setTimerManagerControlFd` (in
`rts/Signals.c`) and `setIOManagerControlFd` (in `rts/Capability.c`). The IO
manager control pipe file descriptors are stored in a new field of the
`Capability_` struct.

Test Plan: See the notes on #9423 to recreate the problem and to verify that it no longer occurs with the fix.

Auditors: simonmar

Reviewers: simonmar, edsko, ezyang, austin

Reviewed By: austin

Subscribers: phaskell, simonmar, ezyang, carter, relrod

Differential Revision: https://phabricator.haskell.org/D129

GHC Trac Issues: #9423, #9284
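
The shutdown path described in the commit message above can be sketched in plain C, outside the RTS. This is only an illustration under stated assumptions, not the code from the patch: `cap_t`, `register_control_fd`, `signal_all_to_die` and the `CONTROL_DIE` byte value are hypothetical names. Only the idea is taken from the commit message: each IO manager registers the write end of its control pipe, and the shutdown code writes a "die" message to every registered descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define CONTROL_DIE 0xFE   /* hypothetical "die" message byte */

/* Hypothetical stand-in for a capability: it only carries the write end
   of its IO manager's control pipe, or -1 if none is registered. */
typedef struct {
    int io_manager_control_wr_fd;
} cap_t;

/* Called by an IO manager thread at startup to register its control pipe. */
static void register_control_fd(cap_t *cap, int wr_fd)
{
    assert(cap != NULL);
    cap->io_manager_control_wr_fd = wr_fd;
}

/* Called during shutdown: tell every registered IO manager to die.
   Returns the number of managers that were successfully signalled. */
static int signal_all_to_die(cap_t *caps, size_t n_caps)
{
    const unsigned char msg = CONTROL_DIE;
    int signalled = 0;
    for (size_t i = 0; i < n_caps; i++) {
        int fd = caps[i].io_manager_control_wr_fd;
        if (fd < 0)
            continue;                      /* nothing registered here */
        if (write(fd, &msg, 1) == 1)
            signalled++;
    }
    return signalled;
}
```

Because the shutdown code writes to the pipes directly, it does not depend on any other thread getting scheduled first, which is the property the original design lacked.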

comment:4 Changed 5 years ago by thoughtpolice

Milestone: 7.10.1
Resolution: fixed
Status: new → closed

Merged, thanks Andreas!

comment:5 Changed 5 years ago by Austin Seipp <austin@…>

In 4748f5936fe72d96edfa17b153dbfd84f2c4c053/ghc:

Revert "rts/base: Fix #9423"

This should fix the Windows fallout, and hopefully this will be fixed
once that's sorted out.

This reverts commit f9f89b7884ccc8ee5047cf4fffdf2b36df6832df.

Signed-off-by: Austin Seipp <austin@well-typed.com>

comment:6 Changed 5 years ago by thoughtpolice

Owner: simonmar deleted
Resolution: fixed
Status: closed → new

comment:7 Changed 5 years ago by AndreasVoellmy

Are there ticket numbers for the "Windows fallout" stuff?

comment:8 Changed 5 years ago by Austin Seipp <austin@…>

In 7e658bc14e2dd6baf208deebbdab9e1285ce4c72/ghc:

Revert "Revert "rts/base: Fix #9423"" and resolve issue that caused the revert.

Summary:
This reverts commit 4748f5936fe72d96edfa17b153dbfd84f2c4c053. The fix for #9423
was reverted because this commit introduced a C function setIOManagerControlFd()
(defined in Schedule.c) defined for all OS types, while the prototype
(in includes/rts/IOManager.h) was only included when mingw32_HOST_OS is
not defined. This broke Windows builds.

This commit reverts the original commit and resolves the problem by only defining
setIOManagerControlFd() when mingw32_HOST_OS is not defined. Hence the missing prototype
error should not occur on Windows.

In addition, since the io_manager_control_wr_fd field of the Capability struct is only
used by setIOManagerControlFd(), this commit includes the io_manager_control_wr_fd
field in the Capability struct only when mingw32_HOST_OS is not defined.

Test Plan: Try to compile successfully on all platforms.

Reviewers: austin

Reviewed By: austin

Subscribers: simonmar, ezyang, carter

Differential Revision: https://phabricator.haskell.org/D174
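
The build break and its fix come down to keeping a prototype, its definition, and the field it touches under the same preprocessor guard. The single-file sketch below illustrates that pattern. `DemoCapability`, `demo_set_io_manager_control_fd` and `demo_has_control_fd` are hypothetical names chosen for the example; the real declarations live in the RTS headers and sources named in the commit message and may have different signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the RTS Capability struct. */
typedef struct {
    unsigned int no;                  /* capability number */
#if !defined(mingw32_HOST_OS)
    int io_manager_control_wr_fd;     /* only exists on non-Windows hosts */
#endif
} DemoCapability;

#if !defined(mingw32_HOST_OS)
/* Prototype and definition share one guard, so a platform sees either
   both of them or neither. */
void demo_set_io_manager_control_fd(DemoCapability *cap, int fd);

void demo_set_io_manager_control_fd(DemoCapability *cap, int fd)
{
    assert(cap != NULL);
    cap->io_manager_control_wr_fd = fd;
}
#endif

/* Reports whether this build carries the control-fd field at all. */
int demo_has_control_fd(void)
{
#if !defined(mingw32_HOST_OS)
    return 1;
#else
    return 0;
#endif
}
```

Compiling the same file with `-Dmingw32_HOST_OS` removes the field, the prototype and the definition together, so no definition is left without a matching prototype.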

comment:9 Changed 5 years ago by thoughtpolice

Resolution: fixed
Status: new → closed

OK, this should be fixed For Real this time.

comment:10 Changed 5 years ago by Andreas Voellmy <andreas.voellmy@…>

In 92c93544939199f6ef758e1658149a971d4437c9/ghc:

Fix #10017

Summary:
In the threaded RTS, a signal is delivered from the RTS to Haskell
user code by writing to a file that one of the IO managers watches (via
an instance of GHC.Event.Control.Control). When the IO manager
receives the signal, it calls GHC.Conc.Signal.runHandlers to invoke
the Haskell signal handlers. In the move from a single IO manager to one IO
manager per capability, the behavior was (wrongly) extended so that a
signal is delivered to every event manager (see #9423), each of which
invokes the Haskell signal handlers, leading to multiple invocations of
Haskell signal handlers for a single signal. This change fixes this
problem by having the RTS (in generic_handler()) notify only the
Control instance used by the TimerManager, rather than all the
per-capability IO managers.

Reviewers: austin, hvr, simonmar, Mikolaj

Reviewed By: simonmar, Mikolaj

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D641
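
The difference between the two delivery strategies in the commit message above can be sketched in plain C. This is a simplified illustration, not the RTS code: `notify_all` and `notify_timer_manager` are hypothetical helpers, and the message is reduced to a single byte holding the signal number, which is an assumption about the wire format made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Problematic variant: broadcast the signal to every IO manager's
   control pipe. Each manager that reads it would run the handlers,
   so one signal leads to several handler invocations.
   Returns the number of pipes written to. */
static int notify_all(const int *io_fds, size_t n, unsigned char signo)
{
    int notified = 0;
    for (size_t i = 0; i < n; i++) {
        if (io_fds[i] >= 0 && write(io_fds[i], &signo, 1) == 1)
            notified++;
    }
    return notified;
}

/* Fixed variant: write only to the timer manager's control pipe, so
   exactly one reader sees the signal. Returns 1 on success, else 0. */
static int notify_timer_manager(int timer_fd, unsigned char signo)
{
    assert(timer_fd >= 0);
    return write(timer_fd, &signo, 1) == 1 ? 1 : 0;
}
```

With N registered IO managers, the first variant produces N notifications per signal, while the second always produces one.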