Opened 3 years ago

Last modified 10 months ago

#13362 (feature request, status: patch)

Make the first generation of GHC's GC as large as the largest cache by default

Reported by: varosi
Owned by:
Priority: normal
Milestone:
Component: Runtime System
Version: 8.0.2
Keywords: numa cache gc newcomer
Cc: sjakobi
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s): Phab:D4679
Wiki Page:

Description (last modified by sjakobi)

This should improve performance a lot by default.

Anyone who needs a different size can still use the "-A" RTS option. Machines differ widely, yet since 8.2 this size has defaulted to 1MB for all architectures and hardware, no matter what. In most cases machines with larger caches have more RAM as well, and vice versa, so this change would benefit both small and large machines. Keeping short-lived objects in cache is the most efficient approach in most cases. Most modern workstation and server machines also have an L3 cache, which is why I'm asking for the "largest cache size".

A second idea would be to use two short-lived generations on machines with both L2 and L3 caches, with sizes matching each.

For NUMA machines with non-unified caches (like some strange and uncommon ARM designs), a reasonable default could be to size the first generation to the largest cache of the smallest core. That would not be optimal, but close to it.

Related Reddit discussion

Change History (23)

comment:1 Changed 3 years ago by dfeuer

Milestone: 8.4.1
Type of failure: None/Unknown → Runtime performance bug

comment:2 Changed 20 months ago by bgamari

Milestone: 8.4.1 → 8.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:3 Changed 20 months ago by varosi

I vote for this! I hope it's not hard to implement, and it should affect performance a lot on much hardware.

comment:4 Changed 20 months ago by bgamari

Keywords: newcomers added

Auto-sizing the allocation area sounds like a reasonable idea to me. A portable solution is perhaps tricky, but I doubt a solution that works on the major operating systems is far from reach. Perhaps you want to have a look?

comment:5 Changed 20 months ago by carter

do we mean the Nursery or the Gen1 heap after the nursery?

I'd imagine we want the nursery to fit in the L1 or L2 caches (where applicable) and the Gen1 heap to fit in the rest of the cache left in L3 after we account for nurseries?

Perhaps

size of nursery = size of L2 cache per CPU core
size of gen1   >= max( #capabilities * size of nursery
                     , size of L3 cache in socket - #capabilities * size of nursery )

we definitely (at least in many-core systems) do *not* want nurseries on the same socket creating cache thrash with each other (i.e., under heavy allocation workloads)?

comment:6 Changed 20 months ago by klapaucius

Stephen M. Blackburn, Perry Cheng, Kathryn S. McKinley, "Myths and Realities: The Performance Impact of Garbage Collection", p. 10, section 5.4.5, "Sizing the nursery":

"Figure 4(a) shows a small improvement with larger nurseries in mutator performance due to fewer L2 (Figure 4(e)) and TLB misses (Figure 4(f)). However, the difference in GC time dominates: smaller nurseries demand more frequent collection and thus a substantially higher load. We measured the fixed overhead of each collection <...> The garbage collection cost tapers off between 4MB and 8MB as the fixed collection costs become insignificant. These results debunk the myth that the nursery size should be matched to the L2 cache size (512KB on all three architectures)."

comment:7 Changed 20 months ago by varosi

For the nursery maybe there is no such effect, but see a benchmark made with a small ray tracer (https://bitbucket.org/varosi/cgraytrace/overview) that does a lot of allocation, run on two different machines:

https://docs.google.com/spreadsheets/d/1dnhQTrm_EgKab3IJQAC4Rw1IJOfOcryJ56W0z9aeo0k/edit?usp=sharing

There is a clear difference between the different allocation-area sizes. As the size grows to several times the memory actually needed, the GC hardly runs at all, and the benefit of a larger allocation area shows up again.

@klapaucius, this paper is really interesting!

Last edited 20 months ago by varosi (previous) (diff)

comment:8 Changed 19 months ago by sjakobi

Cc: sjakobi added

comment:9 Changed 19 months ago by sjakobi

Description: modified (diff)

comment:10 Changed 19 months ago by sjakobi

Owner: set to sjakobi

comment:11 Changed 19 months ago by sjakobi

I have a branch that works for me on Windows 10 and Linux using an i7-4790K CPU.

It would be great if y'all could test this on:

  • macOS / OS X
  • FreeBSD
  • ARM and other non-x86 architectures
  • Intel Haswell and Broadwell CPUs with L4 cache

To ensure that the code works as intended, run a "Hello world" program with +RTS -s and check that the report shows "(N+1) MB total memory in use", where N MB is the size of your largest cache.

PRs to support other operating systems are also very welcome! :)

Open design questions as of now:

  1. If we only find an L1 cache, should we really go with an allocation area of typically just 32 or 64 kB?

     IMHO it might be better to ignore any L1 caches and simply default to the old 1 MB in these cases.

  2. What if we find an L4 cache with 64 or 128 MB? This would be easier to decide if we had some benchmark results, for example in the style of varosi's.

comment:12 Changed 19 months ago by varosi

Great! Is it possible to share your Windows executable so I could experiment on a few machines, from a few cores up to close to a hundred?

comment:13 in reply to:  12 ; Changed 19 months ago by sjakobi

Replying to varosi:

Great! Is it possible to share your Windows executable so I could experiment on a few machines, from a few cores up to close to a hundred?

You can download a binary distribution here. It's not an optimized build though, so building with it will be slower than with an official release, at least.

Regarding running on Windows machines with close to a hundred cores, the current implementation will only detect caches within its current processor group of at most 64 logical processors (see "Remarks" here). As long as there aren't any larger caches outside of the processor group it will still set the allocation area to the correct size.

comment:14 Changed 18 months ago by varosi

@klapaucius, "Multicore Garbage Collection with Local Heaps" by Simon Marlow and Simon Peyton Jones states in section 6.1.1: "Nevertheless, we do find that on average there is a local minimum around 1MB on this hardware. ... staying within the cache becomes more beneficial as contention for main memory increases."

I could experiment more on different processors.

comment:15 in reply to:  13 ; Changed 18 months ago by varosi

How can I experiment with a non-optimized version when I don't have a reference for comparison? I could try to build an optimized version.

Replying to sjakobi:

Replying to varosi:

Great! Is it possible to share your Windows executable so I could experiment on a few machines, from a few cores up to close to a hundred?

You can download a binary distribution here. It's not an optimized build though, so building with it will be slower than with an official release, at least.

Regarding running on Windows machines with close to a hundred cores, the current implementation will only detect caches within its current processor group of at most 64 logical processors (see "Remarks" here). As long as there aren't any larger caches outside of the processor group it will still set the allocation area to the correct size.

comment:16 in reply to:  15 Changed 18 months ago by sjakobi

Replying to varosi:

How can I experiment with a non-optimized version when I don't have a reference for comparison? I could try to build an optimized version.

Building your own binary should be pretty straightforward. To investigate the effects of my patch you don't really need a different build anyway: you can simply find out how large your L3 cache is and pass that size to the -A RTS flag.

comment:17 Changed 18 months ago by varosi

I cannot build my ray tracer with the new GHC version, but I'll try to fix the problems and get back here.

comment:18 Changed 17 months ago by sjakobi

Differential Rev(s): Phab:D4679

In order to move things forward, I have uploaded a patch.

Regarding my questions from comment:11, I propose only looking at L3 and L2 caches for now.

comment:19 Changed 17 months ago by sjakobi

Owner: sjakobi deleted

There are several interesting comments on Phab:D4679 pointing out that a more intricate method is required for auto-sizing the allocation area.

As I need to focus on my GSoC project (and due to my lack of understanding of the GC) I'm unlikely to be of much further help with this ticket.

I'm also unsure whether this is still a good newcomer ticket.

comment:20 Changed 15 months ago by bgamari

Status: new → patch

comment:21 Changed 15 months ago by bgamari

Milestone: 8.6.1 → 8.8.1

We won't be doing this for 8.6. Bumping to 8.8.

comment:22 Changed 12 months ago by monoidal

Keywords: newcomer added; newcomers removed

comment:23 Changed 10 months ago by bgamari

Milestone: 8.8.1

This will certainly need more measurement before we do it.
