Opened 2 years ago

Last modified 9 months ago

#13944 new feature request

Introduce synchronized FFI

Reported by: winter
Owned by:
Priority: normal
Milestone: 8.10.1
Component: Compiler
Version: 8.0.1
Keywords:
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description (last modified by winter)

While hacking on the GHC I/O libraries, I came across an interesting situation. To summarize: there are FFI calls whose duration varies a lot, which makes unsafe FFI a bad fit because a long-running unsafe call makes it hard for the GC to synchronize all capabilities. On the other hand, the overhead of safe FFI is too large to justify its use.

After some thought, I think we should introduce a new FFI safety level; let's call it synchronized for now. It should:

  1. Have the same requirements as safe FFI on what kinds of arguments can be passed, e.g. pinned ByteArrays, primitive types passed by value, etc.
  2. Work like an unsafe FFI call, jumping directly to the C code on the current capability's stack.
  3. Mark the current capability as GC-ready, so that it does not participate in the next GC.
  4. When the FFI call finishes: if a GC is in progress, put the current capability to sleep (if we can find a way to let it join the parallel GC, that would be even better); otherwise it just continues working.

The point is that if an FFI call doesn't need any guarantees from the GC, then it shouldn't stop the GC from running. The next GC will not run with all threads, but that's not too terrible since parallel GC doesn't scale that well anyway. We can combine this with the new -qn RTS parameter: as long as more than n threads can join the GC, we do the GC.
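
To make the proposal concrete, here is a sketch of what a synchronized import might look like next to today's safety levels. This is purely illustrative: synchronized is not an existing GHC safety level, and the read wrappers shown are just example signatures.

{-# LANGUAGE ForeignFunctionInterface #-}
module SynchronizedSketch where

import Data.Word (Word8)
import Foreign.C.Types (CInt (..), CSize (..), CSsize (..))
import Foreign.Ptr (Ptr)

-- The two safety levels GHC has today:
foreign import ccall unsafe "unistd.h read"
    c_read_unsafe :: CInt -> Ptr Word8 -> CSize -> IO CSsize

foreign import ccall safe "unistd.h read"
    c_read_safe :: CInt -> Ptr Word8 -> CSize -> IO CSsize

-- The hypothetical new level proposed in this ticket (does not compile today):
-- foreign import ccall synchronized "unistd.h read"
--     c_read_sync :: CInt -> Ptr Word8 -> CSize -> IO CSsize

Such a program might then be run with, say, +RTS -N8 -qn6, so that GC could proceed with up to six GC threads even while the remaining capabilities sit inside synchronized calls.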

Change History (17)

comment:1 Changed 2 years ago by winter

Description: modified (diff)

comment:2 Changed 2 years ago by winter

Description: modified (diff)

comment:3 Changed 2 years ago by bgamari

I don't see how this could work (or I've misunderstood your proposal). Consider that you have a thread with a reference to some heap object (call it A) sitting in another capability's nursery. Let's say that it enters a synchronized call and, while the call is running, a GC is started. During this GC another capability evacuates A (producing a new object, say A') and the nursery in which it lived is freed. Now we have a situation where the thread that made the call has a reference to A, an object which existed in a no-longer-existing heap block, which it may attempt to dereference once the call completes. The distinguishing feature of the safe call mechanism is that we suspend the running thread, saving all of its state back to its TSO, where it can be handled by the garbage collector.

comment:4 Changed 2 years ago by winter

I'm not even sure that my proposal works :) but let's try to have a discussion about the idea.

Remember the new -qn RTS flag, which allows some inactive threads to not participate in GC. (The word "participate" seems too vague; I think it means the thread doesn't do GC work itself, but does still get GCed by the other GC worker threads.)

Now a thread which enters a synchronized call will behave exactly like an inactive thread that is not selected to do GC work; that means its roots will get GCed by some other thread, which is OK since synchronized FFI restricts what kinds of heap objects it can reference.

I'm not totally clear on how this idea would work because I'm not an expert on the RTS; all clarifications and suggestions are welcome!

comment:5 Changed 2 years ago by winter

OK, now I'm pretty sure what synchronized would do. It would behave just like a safe FFI call with one difference: a task which comes across a synchronized call will give up its capability, but will not hand it off to another task unless that task is a GC task. This means a synchronized call will still block the whole capability, but it will not spawn new OS threads, nor block GC.

I'm not sure if this makes sense; it seems the overhead of safe FFI calls comes from both the capability hand-off and creating new OS threads, but I'm not sure which is more costly.

comment:6 Changed 2 years ago by bgamari

Alright, I think I see what you want. So you are targeting applications where you have a potentially-blocking foreign call and therefore can't use unsafe calls but want to avoid the cost associated with suspending a Haskell thread. This cost comes in a variety of forms,

  1. a stack walk, due to the duplicate computation check in suspendThread
  2. a few lock operations to safely modify the capability
  3. the cost of saving and restoring the thread state to the TSO

Judging from this I'd guess this will cost anywhere from hundreds to thousands of cycles (largely depending upon stack depth, I suspect), and consequently it seems plausible that we'd want a way to avoid this. Have you done any characterisation of this?

The problem is that, other than perhaps the stack walk, I don't think any of these costs can be avoided. After all, we need to save the thread state to the TSO in order to safely GC (since TSOs are GC roots). I suspect it would also be necessary to lock the capability. Consequently, it's really not clear to me that this new call type would pull its weight. It would be helpful if we had actual measurements of how much the various contributions above cost.

comment:7 in reply to:  6 Changed 2 years ago by winter

Alright, I think I see what you want. So you are targeting applications where you have a potentially-blocking foreign call and therefore can't use unsafe calls but want to avoid the cost associated with suspending a Haskell thread.

Exactly! For example, the current I/O manager first performs an unsafe read and checks whether the result is EAGAIN, but for a regular file read will never return EAGAIN, which turns this unsafe read into a blocking read. If, unfortunately, the OS doesn't have the data in the page cache for us, we end up with a read that blocks the capability for a long time.

The problem is that, other than perhaps the stack walk, I don't think any of these costs can be avoided. After all, we need to save the thread state to the TSO in order to safely GC (since TSOs are GC roots). I suspect it would also be necessary to lock the capability.

Is there a way to do it lazily? By lazily I mean: can we just perform the FFI call and mark the capability with some kind of SYNC flag? If a GC thread asks for a sync during the FFI call, let it perform the stack saving for us; otherwise we just happily continue, saving all of the overhead.

I think this might work since, once we enter an FFI call, the Haskell thread's stack is not touched anymore and the whole capability is frozen.

comment:8 Changed 2 years ago by bgamari

The problem is that, other than perhaps the stack walk, I don't think any of these costs can be avoided. After all, we need to save the thread state to the TSO in order to safely GC (since TSOs are GC roots). I suspect it would also be necessary to lock the capability.

Is there a way to do it lazily? By lazily I mean: can we just perform the FFI call and mark the capability with some kind of SYNC flag? If a GC thread asks for a sync during the FFI call, let it perform the stack saving for us; otherwise we just happily continue, saving all of the overhead.

I'll assume by "it" you mean the thread-state save. I don't believe it is possible; after all, there may be thread state (which may contain references to heap objects) sitting in machine registers.

comment:9 Changed 2 years ago by winter

I'll assume by "it" you mean the thread-state save. I don't believe it is possible; after all, there may be thread state (which may contain references to heap objects) sitting in machine registers.

Yes, I mean all of the thread-pausing work. I want to understand what the minimum preparation is before we can make FFI calls that don't get in the GC's way. I believe this would help reduce GC pauses.

comment:10 Changed 2 years ago by bgamari

Yes, I mean all of the thread-pausing work. I want to understand what the minimum preparation is before we can make FFI calls that don't get in the GC's way. I believe this would help reduce GC pauses.

I don't believe there is anything that we do currently that can simply be omitted without careful consideration, at the very least.

Have you confirmed that safe is really as slow as you think it is? I think that we should better characterise the problem before we begin discussing possible solutions. I think it would be helpful if you had a benchmark and could use it to answer a few questions,

  1. What overhead is incurred by an unsafe call?
  2. What overhead is incurred by a safe call?
  3. Where specifically is the additional overhead of the safe call coming from?
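
For reference, a minimal micro-benchmark along these lines could compare the same trivial C call imported at both safety levels, for example using the criterion package and getpid (both assumed to be available here; the CInt return type is an approximation of pid_t):

{-# LANGUAGE ForeignFunctionInterface #-}
module Main (main) where

import Criterion.Main (bench, defaultMain, whnfIO)
import Foreign.C.Types (CInt (..))

-- The same trivial libc call at both safety levels, so any difference
-- in timing is pure call/suspension overhead.
foreign import ccall unsafe "unistd.h getpid" getpid_unsafe :: IO CInt
foreign import ccall safe   "unistd.h getpid" getpid_safe   :: IO CInt

main :: IO ()
main = defaultMain
  [ bench "unsafe getpid" (whnfIO getpid_unsafe)
  , bench "safe getpid"   (whnfIO getpid_safe)
  ]

This would answer questions 1 and 2; question 3 would additionally need something like perf on top of it.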

comment:11 Changed 2 years ago by winter

I have done some benchmarks here; my conclusion is that under concurrent disk I/O, safe FFI can cause a slowdown of up to 50%.

The problem is that if the disk is slow enough, or the file is not in the page cache, then safe FFI may be a good choice. OTOH, if a blocking read doesn't take too much time, we'd be better off simply issuing an unsafe read.

I can't answer all the questions you're asking right now, though. I'm thinking that for disk I/O a small thread pool may be a better answer, since I/O throughput will not increase, and may even degrade, once the number of concurrent requests exceeds a certain value (libuv does this, for example).
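
As a rough illustration of the thread-pool idea (entirely a sketch under made-up names, not how base or libuv actually implement it): a fixed number of worker threads drain a queue of blocking actions, so at most n blocking disk calls are outstanding at any time while other Haskell threads keep running.

module DiskPool (DiskPool, newDiskPool, submit) where

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, throwIO, try)
import Control.Monad (forever, join, replicateM_)

-- A queue of jobs shared by a fixed set of worker threads.
newtype DiskPool = DiskPool (Chan (IO ()))

-- Start a pool with n workers; each worker just runs queued jobs in a loop.
newDiskPool :: Int -> IO DiskPool
newDiskPool n = do
  q <- newChan
  replicateM_ n (forkIO (forever (join (readChan q))))
  return (DiskPool q)

-- Submit a blocking action (e.g. a safe FFI read) and wait for its result,
-- re-throwing in the caller any exception it raised.
submit :: DiskPool -> IO a -> IO a
submit (DiskPool q) act = do
  resVar <- newEmptyMVar
  writeChan q (try act >>= putMVar resVar)
  res <- takeMVar resVar
  case res of
    Left e  -> throwIO (e :: SomeException)
    Right x -> return x

The interesting knob is n: with, say, n = 4 the disk sees at most four concurrent requests, which matches the observation that throughput stops improving (or even degrades) beyond a small level of concurrency.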

comment:12 Changed 2 years ago by bgamari

I have done some benchmarks here; my conclusion is that under concurrent disk I/O, safe FFI can cause a slowdown of up to 50%.

I can't reproduce this result. I've checked out your benchmark and see the following (only the 10m runtimes were large enough to be meaningful),

unsafe

[1250 ben@ben-laptop diskIO(master)] $ time bin/unsafe-ffi 10m

real	0m0.657s
user	0m0.000s
sys	0m1.035s
[1251 ben@ben-laptop diskIO(master)] $ time bin/unsafe-ffi 10m

real	0m1.766s
user	0m0.000s
sys	0m1.912s
[1251 ben@ben-laptop diskIO(master)] $ time bin/unsafe-ffi 10m

real	0m1.460s
user	0m0.008s
sys	0m1.570s

safe

[1252 ben@ben-laptop diskIO(master)] $ time bin/safe-ffi 10m

real	0m0.867s
user	0m0.000s
sys	0m0.966s
[1252 ben@ben-laptop diskIO(master)] $ time bin/safe-ffi 10m

real	0m0.473s
user	0m0.001s
sys	0m0.636s
[1252 ben@ben-laptop diskIO(master)] $ time bin/safe-ffi 10m

real	0m0.794s
user	0m0.009s
sys	0m0.864s

If anything, it seems that safe FFI is dramatically faster than unsafe. Admittedly this is rather surprising, but in any case I wouldn't expect safe to be a factor of two slower for this particular test. In the safe case, perf confirmed that less than 3% of runtime was spent in suspendThread and resumeThread combined.

Last edited 2 years ago by bgamari (previous) (diff)

comment:13 Changed 2 years ago by winter

I'm not sure that only the 10m runtimes were large enough to be meaningful. The problem is that for large files we indeed spend more time on read and write than on FFI scheduling, and unsafe FFI's blocking behaviour is indeed bad in that case.

The different test sizes are really meant to show how the different strategies react to each kind of workload, but I may have failed to achieve that with this test.

FYI, currently in base we always do an unsafe read/write before calling threadWaitRead/Write on *nix systems, and since reads on regular files never return EAGAIN, this is effectively the same as issuing an unsafe call.
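
For readers who haven't seen it, that pattern looks roughly like the following (a simplified sketch, not the actual GHC.IO.FD source; the module and function names here are made up):

{-# LANGUAGE ForeignFunctionInterface #-}
module ReadSketch (readNonBlocking) where

import Control.Concurrent (threadWaitRead)
import Data.Word (Word8)
import Foreign.C.Error (eAGAIN, eWOULDBLOCK, getErrno, throwErrno)
import Foreign.C.Types (CInt (..), CSize (..), CSsize (..))
import Foreign.Ptr (Ptr)
import System.Posix.Types (Fd (..))

foreign import ccall unsafe "unistd.h read"
    c_read :: CInt -> Ptr Word8 -> CSize -> IO CSsize

-- Do an unsafe read first; only fall back to the I/O manager on EAGAIN.
readNonBlocking :: CInt -> Ptr Word8 -> CSize -> IO CSsize
readNonBlocking fd buf len = do
  n <- c_read fd buf len        -- blocks the capability for a regular file
  if n == -1
    then do
      err <- getErrno
      if err == eAGAIN || err == eWOULDBLOCK
        then threadWaitRead (Fd fd) >> readNonBlocking fd buf len
        else throwErrno "readNonBlocking"
    else return n

For a socket this loop is fine, but a read on a regular file never returns EAGAIN, so the threadWaitRead branch is never taken and the unsafe call simply blocks the capability for however long the read takes.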

comment:14 Changed 2 years ago by bgamari

I'm not sure that only the 10m runtimes were large enough to be meaningful

The benchmark, when run against the 1m input, took less than 50 milliseconds. I doubt that this is enough time for the benchmark to reach a steady state representative of actual performance.

comment:15 Changed 20 months ago by bgamari

Milestone: 8.4.1 → 8.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:16 Changed 15 months ago by bgamari

Milestone: 8.6.1 → 8.8.1

Won't happen for 8.6.

comment:17 Changed 9 months ago by osa1

Milestone: 8.8.1 → 8.10.1

Bumping milestones of low-priority tickets.
