Opened 6 years ago

Last modified 3 years ago

#8732 new bug

Global big object heap allocator lock causes contention

Reported by: tibbe Owned by: simonmar
Priority: normal Milestone:
Component: Runtime System Version: 7.6.3
Keywords: Cc: hvr, simonmar, idhameed@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by tibbe)

The lock that allocate takes when allocating big objects hurts the scalability of I/O-bound applications. Network.Socket.ByteString.recv is typically called with a buffer size of 4096, which causes a ByteString of that size to be allocated. The size of this ByteString causes it to be allocated from the big object space, which leads to contention on the global lock that guards that space.

See http://www.yesodweb.com/blog/2014/02/new-warp for a real world example.
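The contention point can be sketched in C (this is illustrative code, not the actual RTS allocate path; the lock name and counter are stand-ins): every capability that allocates a large object must pass through one global mutex, so an I/O-bound workload that allocates a fresh 4096-byte buffer per recv call serializes here.

```c
/* Minimal sketch of a large-object allocator guarded by a single
 * global lock, illustrating the serialization point described in
 * the ticket. Names and structure are illustrative, not GHC's. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t sm_mutex = PTHREAD_MUTEX_INITIALIZER;
static size_t large_bytes_allocated = 0;   /* stand-in for shared state */

/* Every thread allocating a large object funnels through this one
 * lock; with many capabilities doing recv-sized allocations, the
 * lock, not the allocation itself, becomes the bottleneck. */
void *alloc_large(size_t bytes) {
    pthread_mutex_lock(&sm_mutex);
    large_bytes_allocated += bytes;        /* touch shared state under lock */
    void *p = malloc(bytes);               /* placeholder for block-list search */
    pthread_mutex_unlock(&sm_mutex);
    return p;
}
```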

Change History (19)

comment:1 Changed 6 years ago by tibbe

Description: modified (diff)

comment:2 Changed 6 years ago by hvr

Cc: hvr added
Milestone: 7.10.1
Type of failure: None/Unknown → Runtime performance bug

comment:3 Changed 6 years ago by ezyang

It is a good thing that these blocks are considered big blocks, since we don't really want to be copying the buffers around. So one thought might be to make the large block list in generation-0 per-thread, and perform allocations from a thread-local block list. But you have to be careful: objects that are larger than a block need contiguous blocks, so unless you are only going to enable this for large objects that still fit in a single block, you'll have to maintain multiple lists with the sizes you want.
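The size-segregated per-thread lists ezyang describes might look like this sketch (hypothetical names; the real change would live in the RTS block allocator): single-block and small multi-block requests are served from lock-free thread-local lists, while anything needing more contiguous blocks than the largest size class falls through to a slow path.

```c
/* Sketch of per-thread, size-segregated free lists for large
 * objects. Illustrative only: BLOCK_SIZE and NUM_CLASSES are
 * assumptions, and malloc stands in for the locked global path. */
#include <assert.h>
#include <stdlib.h>

#define BLOCK_SIZE  4096u      /* assumed block granularity */
#define NUM_CLASSES 4          /* serve 1..4 contiguous-block requests */

typedef struct FreeBlock { struct FreeBlock *next; } FreeBlock;

/* One cache per thread, so the fast path needs no synchronization. */
typedef struct {
    FreeBlock *lists[NUM_CLASSES];   /* lists[i] holds (i+1)-block runs */
} ThreadCache;

void *cache_alloc(ThreadCache *tc, size_t nblocks) {
    if (nblocks >= 1 && nblocks <= NUM_CLASSES && tc->lists[nblocks - 1]) {
        FreeBlock *b = tc->lists[nblocks - 1];
        tc->lists[nblocks - 1] = b->next;    /* pop: thread-local, lock-free */
        return b;
    }
    return malloc(nblocks * BLOCK_SIZE);     /* slow path stand-in */
}

void cache_free(ThreadCache *tc, void *p, size_t nblocks) {
    if (nblocks >= 1 && nblocks <= NUM_CLASSES) {
        FreeBlock *b = p;
        b->next = tc->lists[nblocks - 1];
        tc->lists[nblocks - 1] = b;          /* keep for local reuse */
    } else {
        free(p);                             /* oversized: back to slow path */
    }
}
```

Note the caveat from the comment: requests larger than the biggest size class still need contiguous blocks from a shared pool, so the global lock is only avoided, not removed.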

comment:4 Changed 6 years ago by tibbe

But you have to be careful: objects that are larger than a block need contiguous blocks, so unless you are only going to enable this for large objects that still fit in a single block, you'll have to maintain multiple lists with the sizes you want.

I think malloc implementations already do that, so perhaps we could copy whatever they do.

comment:5 Changed 6 years ago by ezyang

It's pretty standard, yes (we implement it for handling the global block pool), but it does mean all of that code would have to be made thread-local.

comment:6 in reply to:  5 Changed 6 years ago by tibbe

Replying to ezyang:

It's pretty standard, yes (we implement for handling the global block pool), but it does mean all of that code would have to be made thread-local.

I guess that means even worse performance problems on OS X? Even if it does, it sounds like the right thing to do.

comment:7 Changed 6 years ago by carter

@tibbe, because TLS is slow on OS X currently? (Mind you, my understanding is that the other RTS issues go away when building GHC with a real GCC, right? I take it that's not the case for this discussion?)

comment:8 Changed 6 years ago by ezyang

In this case, slowness of TLS is not an issue, because we manually pass around pointers to structs which are known to be per-capability (and can be accessed in an unsynchronized way), so you don't actually need thread-local *state*.
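The pattern ezyang describes can be sketched as follows (illustrative names, not the RTS's actual Capability struct): each OS thread is handed an explicit pointer to its own per-capability struct and threads it through every call, so per-capability fields can be read and written without locks and without any TLS lookup.

```c
/* Sketch: per-capability state passed as an explicit argument
 * rather than looked up via thread-local storage. Field and
 * function names are illustrative. */
#include <assert.h>
#include <stddef.h>

typedef struct Capability_ {
    size_t alloc_count;   /* only this capability's thread touches it */
} Capability;

/* The capability pointer is plumbed through the call chain, so
 * access is unsynchronized and costs nothing beyond a pointer deref. */
void record_alloc(Capability *cap, size_t bytes) {
    cap->alloc_count += bytes;   /* no lock, no TLS */
}
```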

comment:9 Changed 6 years ago by simonmar

I don't really understand why, in mighty, he couldn't just re-use the same block.

I'm kind of surprised that this is a bottleneck, and I think it needs more investigation. We only take the lock for large objects, so typically there's going to be a lot of computation going on per allocation.

I suppose if it really is a problem then we could just have a per-thread block pool at the granularity of a megablock to avoid fragmentation issues. We just push the global lock back to the megablock free list. This has the danger that we might have a lot of free blocks owned by one thread that don't get used, though, so we might want to redistribute the free blocks at GC. Things start to get annoyingly complicated.

comment:10 Changed 6 years ago by simonmar

It's even harder than that, because a block can be allocated by one thread and freed by another thread, so we lose block coalescing, even if it can be made to work safely.

So I suggest if we want to do anything at all here, we just do the really simple thing: we allocate a chunk of contiguous memory, keep it in the capability, and use that to satisfy large block requests if it's large enough.
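The "really simple thing" might look like this sketch (hypothetical code, not the RTS): each capability owns a contiguous chunk and bump-allocates large requests from it without any lock; requests that don't fit fall back to the slow path, which is where the global lock would remain.

```c
/* Sketch of a per-capability chunk for large allocations.
 * CHUNK_BYTES and the struct layout are assumptions; malloc
 * stands in for the locked global allocator. */
#include <assert.h>
#include <stdlib.h>

#define CHUNK_BYTES (1u << 20)   /* assume one megablock-sized chunk */

typedef struct {
    char  *chunk;    /* contiguous memory owned by this capability */
    size_t used;     /* bump-pointer offset into the chunk */
} CapChunk;

void *cap_alloc_large(CapChunk *cc, size_t bytes) {
    if (cc->chunk && cc->used + bytes <= CHUNK_BYTES) {
        void *p = cc->chunk + cc->used;   /* fast path: no lock at all */
        cc->used += bytes;
        return p;
    }
    return malloc(bytes);                 /* slow path: would take the lock */
}
```

This trades fragmentation (a mostly-empty chunk is captive to one capability) for lock-freedom on the common path, which is exactly the danger the comment notes.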

comment:11 Changed 5 years ago by ihameed

Cc: idhameed@… added

comment:12 Changed 5 years ago by carter

with the new contiguous heap design for x86_64 systems that just got merged in, do some of the ideas here become easier?

comment:13 Changed 5 years ago by ezyang

carter: contiguous heap has not been merged in, and it doesn't really help for this problem.

comment:14 Changed 5 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:15 Changed 5 years ago by tibbe

What about the idea of just using malloc? Modern mallocs like TCMalloc are already multithreaded and seem to deal with all the annoying issues. Gregory Collins said that in Snap they just don't use the "built-in" ByteString construction functions and instead call malloc directly.
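The Snap approach tibbe describes can be sketched in C (the helper name is hypothetical, not bytestring's API): back each receive buffer with malloc instead of a GHC-heap large object, so the RTS big-object lock is never involved and thread-caching allocators like TCMalloc or jemalloc serve the calls from per-thread caches.

```c
/* Sketch: a recv buffer allocated outside the GHC heap. Freeing it
 * is a plain free(), which a thread-caching allocator returns to the
 * calling thread's cache without taking a shared lock. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { RECV_BUF_SIZE = 4096 };   /* the typical recv buffer size */

unsigned char *new_recv_buffer(void) {
    unsigned char *buf = malloc(RECV_BUF_SIZE);
    if (buf)
        memset(buf, 0, RECV_BUF_SIZE);   /* zero so contents are defined */
    return buf;
}
```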

comment:16 Changed 5 years ago by simonmar

Malloc is fine for ByteStrings, but we can't use it for heap-resident objects due to the way block descriptors work. Our memory is always MB-aligned, so that we can put the block descriptors at the beginning of the MB. Also the GC has to be able to distinguish heap memory from non-heap memory, and we currently take advantage of the fact that memory is allocated in MB chunks to reduce the granularity at which we have to map the address space. The contiguous-heap patch solves this in a different way (that is also incompatible with malloc).
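The alignment constraint simonmar describes can be sketched as address arithmetic (constants and the descriptor layout are illustrative, not GHC's actual values): because every megablock is MB-aligned, the descriptor for any heap address is computable from the address alone by masking, which malloc'd (arbitrarily aligned) memory cannot support.

```c
/* Sketch: finding a block descriptor from an object address via
 * megablock alignment. Shifts and the descriptor struct are
 * assumptions for illustration. */
#include <assert.h>
#include <stdint.h>

#define MBLOCK_SHIFT 20u                       /* assume 1 MB megablocks */
#define MBLOCK_MASK  ((uintptr_t)((1u << MBLOCK_SHIFT) - 1))
#define BLOCK_SHIFT  12u                       /* assume 4 KB blocks */

typedef struct { void *start; } bdescr_t;      /* stand-in descriptor */

/* Descriptors live at the start of the megablock, one per block,
 * so the descriptor address is a pure function of the pointer. */
bdescr_t *bdescr_of(void *p) {
    uintptr_t a       = (uintptr_t)p;
    uintptr_t mblock  = a & ~MBLOCK_MASK;      /* megablock base */
    uintptr_t blockno = (a & MBLOCK_MASK) >> BLOCK_SHIFT;
    return (bdescr_t *)mblock + blockno;       /* index the descriptor table */
}
```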

comment:17 Changed 4 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:18 Changed 4 years ago by thomie

Milestone: 8.0.1 (removed)

comment:19 Changed 3 years ago by dobenour

Ask the TCMalloc or jemalloc developers? They have solved this problem, and even if GHC can't use them directly, the algorithms they use could be adopted.

Also, I am wondering if the current large object limit is too small.
