Ticket #9 (closed defect: fixed)

Opened 8 years ago

Last modified 7 years ago

Race/Async Exception issue in Network.Socket.connect

Reported by: sclv
Owned by:
Priority: major
Milestone:
Component: network
Version:
Keywords:
Cc: mail@…


Submitted here as well: http://hackage.haskell.org/trac/ghc/ticket/3225

import Control.Concurrent
import Control.Monad
import Network.Socket
import Control.Exception as C
import System.Timeout
import Network.BSD(hostAddresses, getHostByName)
import System.IO.Error
import Data.Maybe

-- someHostName should be replaced by a real host that gives
-- "connection refused" errors on connections to ports in the range.
-- The second entry is a junk IP address that should cause
-- connections to hang indefinitely.
-- More hostnames with either characteristic can be added to taste
-- if that helps to reproduce the bug.
servers = [ "someHostName", ""]
ports = [9001..9099] :: [Int]

conns = [(h,p) | h <- servers, p <- ports]

connectSock :: Integral a => String -> a -> IO Socket
connectSock host port = do
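   -- Resolve the host; fail with an IOError if it has no addresses.
   hn <- maybe (ioError . mkIOError doesNotExistErrorType "No Host Address" Nothing $ Just host) return
         . listToMaybe . hostAddresses =<< getHostByName host
   sk <- socket AF_INET Stream 6  -- protocol number 6 is TCP
   -- Close the socket if connect throws, including the async exception raised by timeout.
   connect sk (SockAddrInet (fromIntegral port) hn) `C.onException` sClose sk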
   return sk

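-- Run f on every element in its own thread (parallel).
pMapM f xs = mapM (\x -> forkIO $ f x) xs

-- Run f on every element in sequence, capturing any exception.
mapM' f xs = mapM (\x -> (C.try :: IO a -> IO (Either C.SomeException a)) (f x)) xs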

main = do
  -- This is the canary thread in the bugmine
  forkIO $ forever $ putStrLn "chirp" >> threadDelay 100000

  -- This is the bug thread
  forever $ pMapM (\(h,p) -> timeout 1000000 (connectSock h p) >> return ()) conns >> threadDelay 2000000

The above code, compiled with the threaded runtime, triggers a race condition. After roughly one to two cycles of the bug thread, the canary thread stops running, which indicates that the program as a whole has ended up in a bad state. (The bug thread stops running as well.)

In experiments, this race condition is best triggered with at least two servers, one of which yields "connection refused" on connection, and the other of which simply hangs -- the nonsense ip address provided above works for the latter.

If the pMapM is replaced by mapM' (i.e. we switch from parallel to serial connection) then the bug does not appear to be triggered.

Wrapping the call to sClose in a mutex didn't help, which suggests the race condition is in the connect call itself.
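For readers who want to repeat the mutex experiment, here is a minimal sketch of one way to serialize the cleanup action with an MVar. It is not the reporter's actual code: serialized, fakeConnect and fakeClose are hypothetical names, and the two fake actions stand in for connect and sClose so that the sketch needs only base.

```haskell
module Main where

import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Control.Exception (IOException, onException, try)
import Data.IORef (newIORef, readIORef, writeIORef)

-- Run an action while holding a global lock, so that at most one
-- thread executes it at a time. (Hypothetical helper.)
serialized :: MVar () -> IO a -> IO a
serialized lock action = withMVar lock (\_ -> action)

main :: IO ()
main = do
  lock   <- newMVar ()
  closed <- newIORef False
  let fakeConnect = ioError (userError "connection refused") :: IO ()  -- stands in for connect
      fakeClose   = writeIORef closed True                             -- stands in for sClose sk
  -- Same shape as the connect line in connectSock, with the cleanup serialized.
  r <- try (fakeConnect `onException` serialized lock fakeClose)
         :: IO (Either IOException ())
  wasClosed <- readIORef closed
  print (either (const "failed") (const "ok") r, wasClosed)
```

In the real program the lock would be created once in main and shared by every forked thread.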

Change History

Changed 7 years ago by simonmar

  • status changed from new to closed
  • resolution set to fixed

I managed to reproduce the symptom with 6.12, but it seems to be gone with 7.0.1. I'm fairly sure I've found the cause: a bug in the IO manager, which explains why the bug doesn't exist in 7.0.1 (the IO manager was rewritten).

This line in GHC/Conc.lhs

                  _ | err == eBADF ->  return (True, delays)

should be

                  _ | err == eBADF ->  return (True, delays')

those missing primes can be a real pain :-)
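To illustrate this class of bug, here is a toy model. It is not the GHC/Conc.lhs code: expire, stepBuggy and stepFixed are hypothetical names, and the model only assumes that delays' is an updated version of delays computed earlier in the same loop step. Returning the unprimed name on the error path silently discards that update.

```haskell
module Main where

-- Hypothetical toy model of a loop step; not the GHC IO manager.
type Delay = Int  -- wake-up time in ticks

-- Split the delays into those that have expired at time `now`
-- and those that are still pending.
expire :: Int -> [Delay] -> ([Delay], [Delay])
expire now ds = (filter (<= now) ds, filter (> now) ds)

-- Buggy step: the error path returns the old, unpruned list.
stepBuggy :: Bool -> Int -> [Delay] -> (Bool, [Delay])
stepBuggy failed now delays =
  let (_fired, delays') = expire now delays
  in if failed
       then (True, delays)    -- stale binding: expired entries survive
       else (False, delays')

-- Fixed step: both paths return the updated list.
stepFixed :: Bool -> Int -> [Delay] -> (Bool, [Delay])
stepFixed failed now delays =
  let (_fired, delays') = expire now delays
  in (failed, delays')

main :: IO ()
main = do
  print (stepBuggy True 5 [1, 3, 8])  -- (True,[1,3,8])
  print (stepFixed True 5 [1, 3, 8])  -- (True,[8])
```

Both versions typecheck because delays and delays' have the same type, which is why a slip like this is easy to miss.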

Changed 7 years ago by StefanWehr

  • cc mail@… added