Opened 12 months ago

Last modified 12 months ago

#15665 new feature request

Break up the stable pointer table

Reported by: dfeuer Owned by:
Priority: normal Milestone: 8.6.1
Component: Compiler Version: 8.4.3
Keywords: Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #7670 Differential Rev(s):
Wiki Page:

Description

I see no obvious reason for the stable pointer table to be a single array. Indeed, that leads to all sorts of complications. It smells to me like the simplest thing would be to make it a chain of blocks, and make a StablePtr# a real pointer to the relevant entry. No relocation ever!

Change History (12)

comment:1 Changed 12 months ago by dfeuer

Well, I guess that's used to compress the free list, but is it really worth it? We could either uncompress the free list or untag stored pointers, I imagine.

comment:2 Changed 12 months ago by osa1

I don't understand comment:1. What do you mean by "compress the free list"? What do you mean by "untag stored pointers"?

I think the only problem would be that currently free slots in stable ptr table hold offset of the next free slot. So allocating a new stable pointer (assuming we have space in the table) is as simple as stable_ptr_free = (spEntry*)(stable_ptr_free->addr);. If we switch to an array-list representation this operation will be more expensive, we'll have to traverse the chain of arrays until we find the array with the next empty slot.

If the array-list representation will fix the table copying on GC I think this may worth it though.

comment:3 Changed 12 months ago by dfeuer

Currently, a stable pointer table entry consists of a single pointer, either into the heap (if it's used) or to the next free entry (if it's not). Garbage collection traverses the whole table and distinguishes between used and free entries by whether they point into the table or not. This leads to a very compact table, but it tightly restricts the implementation. If we "uncompress" it, we use two words per entry: one for the payload and one for a next pointer.

The uncompressed representation offers a lot more flexibility. We can immediately drop the array doubling mess. But we get a lot more flexibility elsewhere. In particular, we can have non-free lists, one per generation, along with the free list. Now when we collect a generation, we traverse only its non-free list, marking objects and moving entries to the non-free list for the next generation.

I think we can probably even go to a nearly lock-free mechanism, using something like a Harris-style lock-free linked list. Here's a rough sketch:

Make a stable pointer

Pop an entry from the free list. If the free list is empty, take a lock and allocate a new block of entries. Populate the entry appropriately and add it to the Gen0 non-free list. Perform one step of maintenance.

Delete a stable pointer

Tag the entry deleted (Harris stage 1 of deletion). Possibly perform one step

Maintenance

Traverse all the non-free lists. Physically delete (Harris stage 2) each entry tagged as deleted and push it onto the free list.

comment:4 Changed 12 months ago by osa1

If we "uncompress" it, we use two words per entry: one for the payload and one for a next pointer

But you'll still allocate an array instead of allocating each (payload, next), right?

(Adding #7670 as related ticket as this proposes a fix for it)

comment:5 Changed 12 months ago by dfeuer

Right. Each allocation would be a full block, and would add all the entries in the block to the free list. Trying to go lock-free looks more complicated the more I think about it, but generational collection looks easy.

comment:6 Changed 12 months ago by dfeuer

Note: I think any approach using lists of active entries will need incremental deletion or similar to avoid needing a doubly linked list, but that should be pretty easy with a global lock.

comment:7 Changed 12 months ago by dfeuer

One challenge is that the active entry list for a generation could jump all over the heap, making major GC more expensive than it is now when the table is well populated. Is there a sufficiently cheap way to ameliorate this? I believe there is. The trick is to have one active entry list per generation in each block, and an active block list in each generation. So now GC for a generation will always finish one block before it moves to the next, which I imagine should be much easier on the caches. We just need to reserve room for as many pointers in each block as there are GC generations, which doesn't look like too high a price to me.

Can we improve concurrency, which was really what got me started thinking about this machine? I don't really know. The easiest thing would be to divide everything strictly among the capabilities. Each capability would have its own free list and its own active lists. Each block would be owned by just one capability. Only that capability would be permitted to allocate stable pointers in that block, though others could mark pointers for deletion. The trouble with that rigidity is that one capability could have loads of entries on its free list while another is flat broke and forced to allocate fresh blocks (for example, a heavy StablePtr allocator could migrate from one capability to another). Is there some sufficiently cheap way to tax the rich to feed the poor without too much bureaucracy? I don't see any, but I'm surely no expert.

comment:8 Changed 12 months ago by dfeuer

Ah, I think I do see one. Start by using one free list per block, and tracking the number of free list entries both per block and per capability. When adding an entry to the free list for a block, check whether both of the following are true: 1. The block has "sufficient" free entries (to be worth the synchronization overhead) and 2. The capability has "sufficient" other free entries (so it's not going to have to jump straight from the donor line to the recipient line). In that case, the capability relinquishes the block to a shared list. Any capability that runs out of free entries can try to get a (partially used) block from that list and only allocate a fresh one if that fails.

comment:9 Changed 12 months ago by dfeuer

How can we use entries as efficiently as possible?

When allocating, we are always better off allocating into a block with a non-empty free list if we have one. But which one should we pick? If we pick one with a long free list, then we'll be able to stick with it for a while before we have to pick another, which should make allocation efficient. On the flip side, we may want to slowly concentrate live entries to be able to offer blocks to other threads. I don't feel like I have a clue how to manage that balance as yet.

How should we focus our maintenance work? I'm not really sure. The simplest thing seems to be to just go through all the blocks in the order they were added and then go back to the beginning. We should probably keep "deletion pending" counts (updated with FAA) per block to avoid traversing blocks with very little maintenance work available--I believe setting the right threshold there should improve the worst-case block allocation behavior.

Speaking of worst-case block allocation behavior.... Doing maintenance work (deleting marked-deleted blocks) only on stable pointer allocation and GC) seems to be pretty important for controlling complexity if we want to avoid licking. But it can cause trouble for certain patterns of allocation and deallocation. A thread could use up all its capacity, so all its free lists are empty, then delete almost all its stable pointers, leaving its free lists still empty. Allocating a new stable pointer could naively allocate a fresh block, which would be most unfortunate. We might want to find a block with lots of pending deletions, perform them all, and then continue. But maintaining a lock-free priority queue of blocks with lots of pending deletions sounds too hard. Hrm... Perhaps we can detect this situation with a global counter and use the garbage collector to clean it up? Not so pretty, but maybe.

Last edited 12 months ago by dfeuer (previous) (diff)

comment:10 Changed 12 months ago by dfeuer

Ah.... Another option: We could make the generation lists doubly linked. This is much less expensive than I initially thought, because the links are intra-block, so two can fit in one word. We then add a many-pushers-one-popper maintenance list running through the target pointers of entries marked deleted. We'd keep a similar list of blocks that need more than a little maintenance. Now maintenance is much more effective! So effective that we don't need to do it on stable pointer allocation at all, unless the free list is empty.

comment:11 Changed 12 months ago by dfeuer

Deletion marking in the doubly linked regime:

  1. Set the key pointer to the head of the maintenance list for this block. The garbage collector can realize this as within-block and thus not a heap pointer.
  1. Use CAS to set the head of the maintenance list. On failure, go back to 1.

If the thread marking deletion pauses between these steps, then a concurrent maintenance pass may fail to delete the entry. That's okay; it'll be cleaned up on the next pass. Even if the thread fails altogether, that just means the entry can never be reused.

Note: the maintenance routine doesn't risk ABA problems because it's the only one popping.

comment:12 Changed 12 months ago by dfeuer

This stream of consciousness has gotten a bit long. So here's a bit of a consolidation of my proposed implementation:

Stable pointers are managed in blocks. Each block belongs (at any given time, anyway) to at most one capability. That capability is the only one that allocates stable pointers in the block.

Each block contains a table of entries, as many non-free lists (doubly linked stacks) as there are GC generations, a free list (singly linked), and a maintenance list (a read stack and a write stack, each singly linked).

An entry consists of a pointer and two half-word fields. An entry cycles through three states: live, marked deleted, and free. In any non-free entry, the half-word fields point to the previous and next non-free entries in the generation. In a live entry, the pointer points into the Haskell heap. In a marked-deleted entry, the pointer points to the next entry in the maintenance list. In a free entry, the pointer points to the next entry in the free list, and the half-word fields are unused.

Allocation into a block

If there is an entry on the free list, remove it from the free list, populate its pointer, and add it to the non-free list in its generation. Otherwise, perform a maintenance step to free an entry and use it.

Deletion

To delete a block, add it to the maintenance list. This can be done with a CAS loop by any thread.

Maintenance

If the read end of the maintenance list is empty, use an exchange operation to remove the write end and then install it as the read end (we don't care too much about order). Pop an entry off the read end, physically delete it from the non-free list, and (unless it is needed immediately) as it to the free list.

Garbage collection

Traverse the read end of the maintenance list performing maintenance. Exchange out the write end and do the same. The maintenance list may continue to grow during collection; that will be cleaned up later. The purpose of performing maintenance during GC is to avoid having to traverse the same deleted entries over and over when there isn't much stable pointer allocation.

Traverse the non-free list for the current generation. Do whatever's needed to mark and update the pointers in live entries.

Block management

Most of the open questions are here. I'll leave them out of this comment.

Note: See TracTickets for help on using tickets.