Opened 3 years ago

Last modified 14 months ago

#13194 new bug

Concurrent modifications of package.cache are not safe

Reported by: arybczak Owned by:
Priority: normal Milestone: 8.2.1
Component: ghc-pkg Version: 8.0.1
Keywords: Cc: domen@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #13354, #13375, #14017 Differential Rev(s): Phab:D3090
Wiki Page:

Description

There are a couple of different issues here.

  1. On Linux, issuing ghc-pkg register for multiple packages in parallel might result in lost updates to package database because of how registerPackage function works - it reads existing package databases, picks the one to modify, then checks that package info for the package to register is fine and replaces package database with what was read in the beginning + new package info.

Therefore, if updates interleave, it might happen that process1 reads the database, then process2 updates it while process1 still has the old version and uses it for its update later, so update made by process2 is lost.

  1. On Windows, update to package database might fail - the issue is that GHC attempts to update it using rename trick, which fails whenever any other process has file to be replaced open for reading. Combine that with the fact that GHC reads package database when compiling packages and you get problems in both Stack (https://github.com/commercialhaskell/stack/issues/2617) and Cabal (https://github.com/haskell/cabal/issues/4005).

BTW, rename trick (used for atomic database updates) not only doesn't work on Windows, it's also not atomic e.g. on NFS (https://stackoverflow.com/questions/41362016/rename-atomicity-and-nfs).

The solution to both problems is to use OS specific features to lock database file (in shared mode when reading and in exclusive mode when writing). This can be done on Windows using LockFileEx. Unfortunately for POSIX things are a bit more complicated.

There are two ways to lock a file on Linux:

  1. Using fcntl(F_SET_LK) (POSIX API)
  1. Using flock (BSD API)

However, fcntl locks have a serious limitation:

The record locks described above are associated with the process (unlike the open file description locks described below). This has some unfortunate consequences:

  • If a process closes any file descriptor referring to a file, then

all of the process's locks on that file are released, regardless of the file descriptor(s) on which the locks were obtained. This is bad: it means that a process can lose its locks on a file such as /etc/passwd or /etc/mtab when for some reason a library function decides to open, read, and close the same file.

  • The threads in a process share locks. In other words, a

multithreaded program can't use record locking to ensure that threads don't simultaneously access the same region of a file.

Whereas flock is not guaranteed to work with NFS, according to https://en.wikipedia.org/wiki/File_locking#Problems:

Whether and how flock locks work on network filesystems, such as NFS, is implementation dependent. On BSD systems, flock calls on a file descriptor open to a file on an NFS-mounted partition are successful no-ops. On Linux prior to 2.6.12, flock calls on NFS files would act only locally. Kernel 2.6.12 and above implement flock calls on NFS files using POSIX byte-range locks. These locks will be visible to other NFS clients that implement fcntl-style POSIX locks, but invisible to those that do not.[4]

Assuming that the solution would be to go with locking the database, we would need to:

  1. In registerPackage, lock all read databases in shared mode except for the database that will later be modified, which has to be locked in exclusive mode. The handle also would need to be kept open and passed to changeDB later and used for rewriting the database with updated version in GHC.PackageDb.writePackageDb instead of writeFileAtomic (which is not actually unconditionally atomic, as demonstrated above).
  1. GHC.PackageDb.decodeFromFile would lock a file in appropriate mode and return the handle to open file if appropriate.
  1. Add support for locking a file. This should be fairly easy to do in GHC.IO.Handle.FD by extending function openFile' with appropriate parameters and then adding wrapper function openLockedFile or something. We can add both blocking and non-blocking locking to make ghc-pkg show information about waiting for locked package database if appropriate.

Alternatively we could add a function similar to the following: hLock :: Handle -> LockMode -> Bool{-block-} -> IO Bool, but that requires extracting file descriptor from Handle, which as far as I see is problematic.

Is going with locking an acceptable solution here?

Change History (15)

comment:1 Changed 3 years ago by arybczak

Summary: Concurrent modification of package.cache is not safeConcurrent modifications of package.cache are not safe

comment:2 Changed 3 years ago by duncan

The fnctl semantics is problematic in general but not for our use case for ghc/ghc-pkg with the package.conf file. Making a general reusable API is nice, but not essential. The flock API is probably the one to go with, despite NFS issues.

As for a suitable API, either internal or for more general consumption, the API presented by the filelock is a useful place to start (and it has good code we can borrow for the FFI bits). My first guess is that a reasonable route to try is to go with a Handle based API, like

hLockFile :: Handle -> LockMode -> IO FileLock
hTryLockFile :: Handle -> LockMode -> IO (Maybe FileLock)
hUnlockFile :: Handle -> FileLock -> IO ()
withFileLock :: Handle -> LockMode -> (FileLock -> IO a) -> IO a

I'm not yet totally clear what lives in the FileLock. Is this for the sake of the Win32 API because it returns a lock handle? And perhaps hUnlockFile doesn't need the Handle too.

but that requires extracting file descriptor from Handle, which as far as I see is problematic.

It's actually totally possible. It simply fails for Handles that are not based on FDs, but that's ok.

comment:3 Changed 3 years ago by domenkozar

Cc: domen@… added

comment:4 Changed 3 years ago by arybczak

Owner: set to arybczak

comment:5 Changed 3 years ago by arybczak

Differential Rev(s): D3090

comment:7 Changed 3 years ago by mpickering

Differential Rev(s): D3090Phab:D3090

comment:8 Changed 3 years ago by arybczak

While the patch is for 8.1, I adapted it for 8.0.2 (by basically adjusting MIN_VERSION_base to 4.9.1) and tested on Windows. It works fine, i.e. cabal using patched GHC was able to build dependencies for stack concurrently without failures (with -j8) a couple of times, whereas cabal using stock 8.0.1 failed every time with "permission denied" error during register phase of one of the packages.

comment:9 Changed 3 years ago by bgamari

Milestone: 8.2.1
Status: newpatch

comment:10 Changed 3 years ago by Ben Gamari <ben@…>

In 0d86aa59/ghc:

Add support for concurrent package db access and updates

Trac issues: #13194

Reviewers: austin, hvr, erikd, bgamari, dfeuer, duncan

Subscribers: DemiMarie, dfeuer, thomie

Differential Revision: https://phabricator.haskell.org/D3090

comment:11 Changed 3 years ago by bgamari

Resolution: fixed
Status: patchclosed

comment:12 Changed 2 years ago by bgamari

comment:13 Changed 15 months ago by bgamari

As noted in #15313, comment:10 is broken on Windows, failing to ensure mutual exclusion under high load.

comment:14 Changed 14 months ago by Tamar Christina <tamar@…>

In b290f15c/ghc:

testsuite: force plugin tests sequentially on Windows.

Summary:
Package registration does not seem to be thread-safe on
Windows.  Placing the system under heavily load seems to
trigger registration failures even though they are all
different package-dbs.   This makes the plugin tests
a bit flaky.

I think this is because on Windows we use pessimistic locks
while on Linux we use atomic file replacement.

On Windows ReplaceFile is atomic, just the metadata write
may not be.  Since the metadata is not of importance
we should either switch over to ReplaceFile or
fix the locking code to not error out but wait.

For now however I have to force these 25 tests to run
serially in order to guarantee their correctness.

Test Plan: ./validate

Reviewers: bgamari

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #15313, #13194

Differential Revision: https://phabricator.haskell.org/D4918

comment:15 Changed 14 months ago by Phyx-

Owner: arybczak deleted
Resolution: fixed
Status: closednew

This seems to still have some issues on Windows.

Note: See TracTickets for help on using tickets.