Opened 4 years ago

Closed 4 years ago

#10545 closed bug (fixed)

Deadlock in the threaded RTS

Reported by: simonmar
Owned by: simonmar
Priority: highest
Milestone: 7.10.2
Component: Runtime System
Version: 7.10.1
Keywords:
Cc: niteria, simonmar
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

The following program deadlocks with high probability:

-- ghc -rtsopts -threaded -debug performGC.hs 
-- ./performGC 1000 +RTS -qg -N2

-- -qg turns off parallel GC, needed to trigger the bug
-- -N2 or greater is needed

module Main (main) where

import System.Environment
import Control.Concurrent
import Control.Exception
import Control.Monad
import System.Random
import System.Mem
import qualified Data.Set as Set

main = do
  [n] <- getArgs
  forkIO $ doSomeWork
  forM [1..read n] $ \n -> do print n; threadDelay 1000; performMinorGC

doSomeWork :: IO ()
doSomeWork = forever $ do
  ns <- replicateM 10000 randomIO :: IO [Int]
  ms <- replicateM 1000 randomIO
  let set = Set.fromList ns
      elems = filter (`Set.member` set) ms
  evaluate $ sum elems

There are a few ways that this bug can be triggered:

  • At shutdown, when there are other threads still running. This is how we first encountered it.
  • Using performGC, as above. I think it's necessary to call it from a bound thread (e.g. the main thread) to get bad things to happen.
  • I think forkProcess might also trigger it, but I haven't observed it.
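
One way to probe the bound-thread hypothesis in the second bullet is to run the same GC loop once from an unbound forkIO thread and once from the main thread. The following is a minimal sketch and not part of the original report: gcLoop and busyWork are hypothetical names, the file name in the build comment is an assumption, and whether either variant actually hangs on an affected RTS has not been verified here.

```haskell
-- Build like the original: ghc -rtsopts -threaded gcFromThread.hs
-- Run with:                ./gcFromThread 1000 +RTS -qg -N2
-- (the file name gcFromThread.hs is an assumption for this sketch)
module Main (main) where

import Control.Concurrent
import Control.Exception (evaluate)
import Control.Monad (forM_, forever)
import System.Environment (getArgs)
import System.Mem (performMinorGC)

-- Hypothetical helper: report whether the calling thread is bound,
-- then request n minor GCs with a short delay before each one.
gcLoop :: String -> Int -> IO ()
gcLoop label n = do
  bound <- isCurrentThreadBound
  putStrLn (label ++ ": bound = " ++ show bound)
  forM_ [1 .. n] $ \_ -> do
    threadDelay 1000
    performMinorGC

-- Hypothetical stand-in for doSomeWork: keeps another Haskell thread
-- runnable. The explicit yield guarantees a point where the scheduler
-- can stop this thread for a GC, so the loop itself cannot block a sync.
busyWork :: IO ()
busyWork = forever $ do
  _ <- evaluate (length (show (product [1 .. 200 :: Integer])))
  yield

main :: IO ()
main = do
  [arg] <- getArgs
  let n = read arg :: Int
  _ <- forkIO busyWork
  -- Variant A: request the GCs from an unbound forkIO thread.
  done <- newEmptyMVar
  _ <- forkIO (gcLoop "forkIO thread" n >> putMVar done ())
  takeMVar done
  -- Variant B: request the GCs from the main thread.
  gcLoop "main thread" n
```

If the hypothesis holds, variant A should run to completion and only variant B should be able to hang on an unfixed 7.10.1 RTS. The printed "bound" flag shows which case each thread falls into for the RTS in use.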

I'm working on a fix.

Change History (3)

comment:1 Changed 4 years ago by Simon Marlow <marlowsd@…>

In 111ba4beda4ffc48381723da12e5b237d7f9ac59/ghc:

Fix deadlock (#10545)

yieldCapability() was not prepared to be called by a Task that is not
either a worker or a bound Task.  This could happen if we ended up in
yieldCapability via this call stack:

performGC()
scheduleDoGC()
requestSync()
yieldCapability()

and there were a few other ways this could happen via requestSync.
The fix is to handle this case in yieldCapability(): when the Task is
not a worker or a bound Task, we put it on the returning_workers
queue, where it will be woken up again.
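
The three-way case split described above can be modelled in a few lines. This is an illustrative Haskell model and not part of the original commit or the RTS code (which is C): TaskKind, WaitPlan and planYield are hypothetical names, and only the third case reflects behaviour stated in this commit message. How workers and bound Tasks wait is deliberately left unmodelled.

```haskell
module Main (main) where

-- Hypothetical classification of the Task that calls yieldCapability().
data TaskKind
  = Worker  -- a worker Task
  | Bound   -- a bound Task
  | Other   -- neither, e.g. reached via the performGC() call stack above
  deriving (Eq, Show)

-- Hypothetical description of how the Task waits for a capability.
data WaitPlan
  = ExistingPath           -- handled as before the fix (not modelled here)
  | ReturningWorkersQueue  -- queued on returning_workers, woken up later
  deriving (Eq, Show)

-- The only behavioural change is the last equation.
planYield :: TaskKind -> WaitPlan
planYield Worker = ExistingPath
planYield Bound  = ExistingPath
planYield Other  = ReturningWorkersQueue

main :: IO ()
main =
  mapM_ (\k -> putStrLn (show k ++ " -> " ++ show (planYield k)))
        [Worker, Bound, Other]
```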

Summary of changes:

* `yieldCapability`: factored out subroutine `waitForWorkerCapability`
* `waitForReturnCapability` renamed to `waitForCapability`, and
  factored out subroutine `waitForReturnCapability`
* `releaseCapabilityAndQueueWorker` renamed to `enqueueWorker`, does
  not take a lock and no longer tests if `!isBoundTask()`
* `yieldCapability` adjusted for refactorings, only change in behavior
  is when it is not a worker or bound task.

Test Plan:
* new test concurrent/should_run/performGC
* validate

Reviewers: niteria, austin, ezyang, bgamari

Subscribers: thomie, bgamari

Differential Revision: https://phabricator.haskell.org/D997

GHC Trac Issues: #10545

comment:2 Changed 4 years ago by simonmar

Status: new → merge

comment:3 Changed 4 years ago by thoughtpolice

Resolution: fixed
Status: merge → closed

Merged to ghc-7.10 (I also pulled in be0ce8718ea40b091e69dd48fe6bc62b6b551154).
