Opened 6 years ago

Last modified 4 years ago

#8199 new feature request

Get rid of HEAP_ALLOCED on Windows (and other non-Linux platforms)

Reported by: ezyang Owned by: ezyang
Priority: low Milestone:
Component: Compiler Version: 7.7
Keywords: Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: #5435 Blocking:
Related Tickets: Differential Rev(s): D207
Wiki Page:

Description (last modified by ezyang)

Update for readers: we fixed this for Linux with #9706. But many systems (including Windows) can't handle reserving this much memory. Some other technique is necessary in these cases; this ticket tracks one possibility.


This bug tracks progress on removing HEAP_ALLOCED from GHC, which promises faster GC (especially for large or scattered heaps), as long as we can keep the cost of the indirections down.

The relevant wiki article: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapAlloced ; we are implementing method 2. Version 2 of the patchset is probably correct.
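
For readers unfamiliar with method 2, here is a minimal C sketch of the shape of the idea. All names and types are invented for illustration; this is not the actual RTS code:

    #include <string.h>

    /* Stand-in for a closure; the real StgClosure is more elaborate. */
    typedef struct Closure_ { struct Closure_ *payload[2]; } Closure;

    /* Template emitted into the object file's data section. */
    static Closure foo_closure_template;

    /* The one word of indirection that compiled code loads instead of
     * referencing foo's closure directly. */
    Closure *foo_static_closure_ind;

    /* Run once at RTS initialization: copy the template into a slot in a
     * block-allocated heap area and record its new address.  After this,
     * every live closure is in the block heap, so HEAP_ALLOCED is always
     * true and the check can be deleted. */
    void initStaticClosure(Closure **ind, const Closure *template_, Closure *slot)
    {
        memcpy(slot, template_, sizeof(Closure));
        *ind = slot;
    }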

Blocking problems:

  • Properly handle the Windows DLL case (e.g. SRTs). We will probably have to reorganize how the indirections are laid out.
  • Make it work for GHCi linking of static objects. Blocked on #2841; I have it working for ELF, and can make it work for other platforms as soon as I get access to the relevant machines.
  • Bikeshed hs_main API changes (because closures are not valid prior to RTS initialization, so you have to pass in an indirection instead); see the sketch after this list
  • Does not work with unloadObj (see comments below)
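
On the hs_main point, a sketch of the API change (the current signature is abridged from GHC's RtsAPI.h; the indirection-taking variant and its name are hypothetical):

    #include "Rts.h"   /* GHC's RTS header, for StgClosure and RtsConfig */

    /* Today, the generated main stub passes the address of Main's
     * closure directly: */
    extern StgClosure ZCMain_main_closure;
    int hs_main(int argc, char *argv[],
                StgClosure *main_closure, RtsConfig rts_config);

    /* With indirections, the closure only exists after RTS
     * initialization, so a revised entry point would have to take the
     * indirection instead (hypothetical): */
    int hs_main_ind(int argc, char *argv[],
                    StgClosure **main_closure_ind, RtsConfig rts_config);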

Performance improvements possible:

  • This patch introduces a lot of new symbols; ensure we are not unduly polluting the symbol tables. (In particular, I think _static_closure symbols can be made hidden). I've eliminated all of these except for the init symbols, which cross the stub object and assembly file boundary, and so would need to be made invisible by the linker. I needed to make local info tables public.
  • Don't pay for a double indirection when -fPIC is turned on. Probably the easiest way to do this is to *not* bake in the indirections into compiled code when it is -fPIC'd, and instead scribble over the GOT. However, I don't know how to go backwards from a symbol to a GOT entry, so we might need some heinous assembly hacks to get this working.
  • The old HEAP_ALLOCED is supposed to be pessimal on very large heaps. Do some performance tests under those workloads.
  • Make sure the extra indirection is not causing any C-- optimizations to stop firing (it might be, because I put it in as a literal CmmLoad)
  • Once a static thunk is updated, we can tag the indirection to let other code segments know the good news. One way to do this is to have the update frame for a static indirection hold a reference to the *indirection*, not the closure itself. However, this scheme will not help other static closures which hold references to the thunk.
  • Closure tables should have their indirections short-circuited by the initialization code. But maybe it is not worth the cost of telling the RTS about the closure tables (also, they would need to be made writeable).
  • We pay for an indirection when a GC occurs and the closure is not in R1. According to the wiki page, this is technically not needed, but I don't know how to eliminate references to the closure itself from stg_gc_fun.
  • Save tags inside the indirection tables, so that we don't spend instructions retagging after following the indirection. Done.
  • Improve static indirection and stable pointer registration, avoiding binary bloat from __attribute__((constructor)) stubs. After discussing this with some folks, it seems that there isn't really a portable way to do this when we are using the system dynamic linker to load libraries at startup. The problem is that we need to somehow get a list of all the GHC-compiled libraries which got loaded, and really the easiest way to get that is to just build it ourselves.
  • Need to implement a new megablock tracking structure so we can free/check for lost blocks. Now that efficient lookup is not necessary, perhaps we can write-optimize the megablock tracking structures.

Speculative improvements:

  • Now that static closures live in blocks, can we GC them like we GC normal data? This would be beneficial for static thunks, which could then have their indirections completely removed; reverting CAFs may be somewhat tricky, however.
  • Now that HEAP_ALLOCED is greatly simplified, can we further simplify some aspects of the GC? At the very least, we ought to be able to make megablock allocation cheaper, by figuring out how to remove the spinlocks, etc.
  • Another possibility is to adopt a hybrid approach, where we manually lay out closures when compiling statically, and indirect when compiling dynamically. In some sense, this gets the best of both worlds, since we expect to not pay any extra cost for indirection due to PIC.

Attachments (5)

heap-alloced-20130829.patch (147.9 KB) - added by ezyang 6 years ago.
Revision 1
heap-alloced-20130830.patch (143.5 KB) - added by ezyang 6 years ago.
Version 2: now with less deadlock
heap-alloced-20130831.patch (164.2 KB) - added by ezyang 6 years ago.
Version 3: improved generated code by caching tags
heap-alloced-20130901.patch (12.5 KB) - added by ezyang 6 years ago.
Version 4: Reduce number of exported symbols
heap-alloced-20140410.patch (126.9 KB) - added by ezyang 5 years ago.
Multifarious improvements


Change History (61)

Changed 6 years ago by ezyang

Attachment: heap-alloced-20130829.patch added

Revision 1

comment:1 Changed 6 years ago by ezyang

Preliminary performance results (not all of nofib was compiling when I did these):

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +5.1%     +0.0%      0.00      0.00     +0.0%
           atom          +5.1%     +0.0%     -3.5%     -3.1%     +0.0%
     bernouilli          +5.1%     +0.0%     +1.7%     +1.7%     +0.0%
         exp3_8          +5.1%     +0.0%     +0.0%     +0.0%     +0.0%
    gen_regexps          +5.0%     +0.0%      0.00      0.00     +0.0%
      integrate          +5.1%     +0.0%      0.19      0.20     +0.0%
          kahan          +4.9%     +0.0%     +0.0%     -0.7%     +0.0%
      paraffins          +5.0%     +0.0%      0.11      0.11     +0.0%
         primes          +5.1%     +0.0%      0.07      0.07     +0.0%
         queens          +5.1%     +0.0%      0.02      0.02     +0.0%
           rfib          +5.1%     +0.0%      0.02      0.02     +0.0%
            tak          +5.1%     +0.0%      0.02      0.02     +0.0%
   wheel-sieve1          +5.1%     +0.0%     +0.5%     +0.0%     +0.0%
   wheel-sieve2          +5.0%     +0.0%     -4.8%     -4.8%     +0.0%
           x2n1          +5.1%     +0.0%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          +4.9%     +0.0%     -4.8%     -4.8%     +0.0%
            Max          +5.1%     +0.0%     +1.7%     +1.7%     +0.0%
 Geometric Mean          +5.0%     +0.0%     -1.0%     -1.2%     +0.0%

Binary size bloat comes from all of the extra static indirection table nonsense; I'm sure we can squeeze it down a bit. Some pretty good improvements in some cases. What I didn't see was any massive slowdown like the one I'm seeing in ghc-stage2. So I ran nofib again with -dynamic:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +5.9%     +0.0%      0.00      0.00     +0.0%
           atom          +4.5%     +0.0%     -3.4%     -3.8%     +0.0%
     bernouilli          +5.3%     +0.0%     +0.0%     +0.0%     +0.0%
         exp3_8          +5.6%     +0.0%     +0.0%     +0.0%     +0.0%
    gen_regexps          +3.9%     +0.0%      0.00      0.00     +0.0%
      integrate          +4.6%     +0.0%     -2.7%     -2.7%     +0.0%
          kahan          +5.1%     +0.0%     +8.4%     +8.4%     +0.0%
      paraffins          +2.9%     +0.0%      0.11      0.11     +0.0%
         primes          +5.5%     +0.0%      0.07      0.07     +0.0%
         queens          +5.0%     +0.0%      0.02      0.02     +0.0%
           rfib          +6.0%     +0.0%      0.02      0.02     +0.0%
            tak          +5.5%     +0.0%      0.01      0.01     +0.0%
   wheel-sieve1          +4.4%     +0.0%     -2.4%     -2.4%     +0.0%
   wheel-sieve2          +3.7%     +0.0%     -7.1%     -6.3%     +2.1%
           x2n1          +4.1%     +0.0%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          +2.9%     +0.0%     -7.1%     -6.3%     +0.0%
            Max          +6.0%     +0.0%     +8.4%     +8.4%     +2.1%
 Geometric Mean          +4.8%     +0.0%     -1.1%     -1.1%     +0.1%

kahan slows down, but not by much; certainly not the order of magnitude I was seeing in ghc-stage2. I still don't really know what's up, so perhaps the massive ghc-stage2 slowdown is a correctness bug somewhere, or perhaps pessimal behavior when a lot of modules are loaded. Ugh! The worst kind of bug.

Changed 6 years ago by ezyang

Attachment: heap-alloced-20130830.patch added

Version 2: now with less deadlock

comment:2 Changed 6 years ago by ezyang

ghc-stage2 "slow down" was actually a spinlock deadlock due to me excising too much code from the old MBlock implementation. As a bonus, the last bullet point is fixed. Version 2: now with less deadlock!

comment:3 Changed 6 years ago by ezyang

Description: modified (diff)

comment:4 Changed 6 years ago by ezyang

Description: modified (diff)

comment:5 Changed 6 years ago by ezyang

Description: modified (diff)

comment:6 Changed 6 years ago by ezyang

Blocked By: 2841 added
Description: modified (diff)

comment:7 Changed 6 years ago by ezyang

Full perf numbers on latest head, including the outliers (in both directions).

   binary-trees          +5.2%     +0.0%     +6.4%     +6.8%     +0.0%
   k-nucleotide          +5.2%     +0.0%     +3.7%     +3.7%     +0.0%
        integer          +5.1%     +0.0%     -8.2%     -8.1%     +0.0%
-----------------------------------------------------------------------
            Min          +4.9%     +0.0%     -8.2%     -8.1%    -12.5%
            Max          +6.0%     +0.0%     +6.4%     +6.8%   +100.0%
 Geometric Mean          +5.1%     +0.0%     -1.2%     -1.1%     +6.2%

Overall, most programs saw a slight increase in runtime (in the 0-0.2 range), while a number of programs saw a 1-4% reduction in runtime. I wonder if the slight increase in runtime is primarily due to the cost of initializing the indirections.

comment:8 Changed 6 years ago by ezyang

Description: modified (diff)

comment:9 Changed 6 years ago by ezyang

Description: modified (diff)

Changed 6 years ago by ezyang

Attachment: heap-alloced-20130831.patch added

Version 3: improved generated code by caching tags

comment:10 Changed 6 years ago by ezyang

Caching tags improves binary sizes and performance improvement a little bit. Once again I've included some outliers.

   binary-trees          +4.9%     +0.0%     +2.4%     +3.1%     +0.0%
   cryptarithm1          +4.8%     +0.0%     -7.0%     -7.0%   +100.0%
        integer          +4.9%     +0.0%     -8.2%     -8.4%     +0.0%
   k-nucleotide          +4.9%     +0.0%     -9.5%     -9.4%     +0.0%
      typecheck          +4.9%     +0.0%     +8.0%     +8.0%   +100.0%
-----------------------------------------------------------------------
            Min          +4.6%     +0.0%     -9.5%     -9.4%     -1.1%
            Max          +5.8%     +0.0%     +8.0%     +8.0%   +100.0%
 Geometric Mean          +4.9%     -0.0%     -2.0%     -2.1%     +6.4%

Here are the changes for dynamically linked nofib:

   binary-trees          +4.1%     +0.0%     -3.0%     -3.2%     +0.0%
   cryptarithm1          +4.5%     +0.0%     -0.6%     -0.6%   +100.0%
   wheel-sieve2          +3.6%     +0.0%     -5.4%     -5.4%     +2.1%
        integer          +4.5%     +0.0%     +3.4%     +3.1%     +0.0%
   k-nucleotide          +1.8%     +0.0%     -8.3%     -8.3%     +0.0%
      typecheck          +5.3%     +0.0%     -0.8%     +0.0%   +100.0%
---------------------------------------------------------------------------
            Min          +1.5%     +0.0%    -10.0%     -9.6%     +0.0%
            Max         +12.1%     +0.0%     +3.4%     +3.2%   +100.0%
 Geometric Mean          +4.1%     -0.0%     -2.2%     -2.1%    +12.0%
Last edited 6 years ago by ezyang

comment:11 Changed 6 years ago by ezyang

Description: modified (diff)

Changed 6 years ago by ezyang

Attachment: heap-alloced-20130901.patch added

Version 4: Reduce number of exported symbols

comment:12 Changed 6 years ago by simonpj

Why is this blocked by #2841?

comment:13 Changed 6 years ago by ezyang

Just for support for init_array sections in the GHCi linker. This might not actually be related to the particular issue reported in #2841, but I'm fairly confident foreign export won't work properly without it.

comment:14 Changed 6 years ago by simonmar

Cc: simonmar added

No time to review the patch yet, but this looks like great work. Have you measured how long the initialisation takes on a large binary, e.g. GHC itself? And how much does a long compilation speed up?

comment:15 Changed 6 years ago by ezyang

Blocked By: 2841 removed

comment:16 Changed 6 years ago by ezyang

Blocked By: 5435 added

comment:17 Changed 6 years ago by ezyang

Haven't done the GHC tests yet, but over ICFP I realized I hadn't run the GC nofib tests. They look really good:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
        circsim          +5.5%     +0.0%     -1.3%     -1.4%     +0.0%
    constraints          +5.5%     +0.0%     -3.6%     -3.6%     +0.0%
       fibheaps          +5.5%     +0.0%     -1.0%     -1.0%     +0.0%
         fulsom          +5.7%     +0.0%     -1.1%     -1.0%     +3.0%
       gc_bench          +5.5%     +0.0%     -1.0%     -1.0%     +0.0%
          happy          +5.9%     +0.0%     -0.3%     -0.3%     -0.3%
           hash          +5.5%     +0.0%     -4.7%     -4.6%     +0.1%
           lcss          +5.5%     +0.0%     -7.2%     -7.2%     +0.0%
      mutstore1          +5.5%     +0.0%     -1.5%     -1.5%     +0.0%
      mutstore2          +5.5%     +0.0%     -3.3%     -3.4%     +0.0%
          power          +5.5%     +0.0%     -2.4%     -2.9%     -1.0%
     spellcheck          +7.4%     +0.0%     -6.5%     -6.5%     +0.0%
--------------------------------------------------------------------------------
            Min          +5.5%     +0.0%     -7.2%     -7.2%     -1.0%
            Max          +7.4%     +0.0%     -0.3%     -0.3%     +3.0%
 Geometric Mean          +5.7%     +0.0%     -2.9%     -2.9%     +0.1%

comment:18 Changed 6 years ago by ezyang

Simon and I chatted last night on IRC about how to move this patchset forward; I'm recording the main points here for posterity:

  • Static closures originally did not work on Windows with shared libs. Since these are now all copied to dynamic heap, it should be possible to reinstate the Int/Char static closure optimization, among other things.
  • This is a good step towards supporting multiple RTSes. However, we would need to store the indirection table at an offset from BaseReg. It's not obvious to me how to go about doing that, since we still need some cooperation from the linker to manage all the offsets.

I ran some quick tests on the impact of this patch on the startup time of GHC. It seems to add something like 0.02s to initialization time, which is quite reasonable. Simon's other primary concern is binary size. The primary things inflating binary size in this patch are as follows:

  • We are using movq (memory-to-register) instead of movl (constant-to-register).
  • If we are storing a static closure into memory, we now need two instructions, since there are no memory-to-memory movs:
-       movq $s_closure+1,-24(%rbp)
+       movq s_static_closure_ind,%rax
+       movq %rax,-24(%rbp)
  • The closures stored in the indirection tables are tagged, which means that if the code expects an untagged closure, we need an extra instruction to deal with it:
-       movl $Main_and2_closure,%ebx
+       movq Main_and2_static_closure_ind,%rbx
+       andq $-8,%rbx
  • The indirections table takes up one word per static closure
  • The initialization code imposes a per-module cost

I don't know which of these are the biggest offenders yet.

comment:19 Changed 6 years ago by simonmar

FYI, I think you uploaded the wrong patch for the latest patch (version 4).

Points 1 and 2 are unavoidable.

Point 3 is interesting - I think we could probably do better there, but we should look at the context a bit more. In many cases where we have a tagged pointer, we take the tag into account when operating on it (e.g. field index), but here it looks like we're just unconditionally zeroing the tag bits.

Point 4 might be improved by overlapping the indirection word with the closure itself.

comment:20 Changed 6 years ago by ezyang

I discovered an interesting flaw in the current design while debugging a segfault today. The problem goes like this: currently, on a per-module basis, we generate an extra C file responsible for registering staticclosureinds with the initialization loop that runs at program startup. The implicit assumption here, of course, is that a statically linked module is loaded atomically: all of its object files get brought in for the ride.

However, this is NOT TRUE when split objects are being used! In this case, it is quite possible for the object file containing the initialization function to *not* be loaded, while another split object which uses an indirection does get loaded. In this case, the static closure never gets initialized, and we'll get a segfault if we try to use it.

How can we fix this? One possibility is, when splitting objects, to ensure that if the staticclosureinds section is nonempty, then we keep a reference to the initialization function somewhere else in the resulting object file, so that it gets loaded in. However, it's not clear that the starts/ends trick will even keep working, unless the splitter is very careful.

Another possibility is to axe the initialization function, and try harder to detect the staticclosureinds section at runtime. Doing this properly in a cross-platform way is tricky (I originally tried to do it with linker scripts, but gave up when it (1) didn't work, and (2) wasn't going to work on other platforms.)

I'll upload an up-to-date patchset for GHC 7.8 at some point. There's been a recent change to integer-gmp which needs to be updated for HEAP_ALLOCED.
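
For concreteness, the per-module stub is roughly of the following shape (a hedged sketch; all symbol names are invented, and the real patch differs in detail). The hazard described above is precisely that the linker can drop this object file, so the constructor never runs:

    /* Section bounds for this module's indirections (invented names). */
    extern void *staticclosureinds_start[];
    extern void *staticclosureinds_end[];

    /* RTS entry point that queues the range for the init loop. */
    extern void registerStaticClosureInds(void **start, void **end);

    static void initModule(void) __attribute__((constructor));

    /* Runs when (and only when) this object file is linked in. */
    static void initModule(void)
    {
        registerStaticClosureInds(staticclosureinds_start,
                                  staticclosureinds_end);
    }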

comment:21 Changed 6 years ago by simonpj

Good catch Edward! You are such a star. I am not qualified to comment on solutions (at least not without doing quite a bit of homework), but I am very appreciative.

Simon

comment:22 Changed 6 years ago by simonmar

Simon - just to allay any confusion, this is a patchset that Edward is working on that isn't currently in GHC, so the bug that he found is not a GHC issue.

Edward: we must have had to solve this same problem for the initialisation sections that create StablePtrs for foreign exports, and that register cost centres for profiling. I don't remember exactly what we do there, but we probably ought to do it the same way here.

comment:23 Changed 6 years ago by ezyang

I think in both cases, the usage-pattern of the symbols makes things "just work". Maybe this should be documented.

In the case of foreign exports, you will usually end up linking the object file containing the init_array, because the exported symbols are in the same object file. This means we don't set up a stable pointer if the exported symbol is never referred to in the linking process, but that's desired behavior.

In the case of profiling, functions with an SCC emit code using 'emitSetCCC'. This creates a reference to a cost-centre object, which is contained in the object file that also has the init_array. If there are no SCCs in a module, the registration doesn't get run, but that's also desired behavior.

So, it would seem that the appropriate thing to do here is ensure that there is an appropriate dependence between object files which have static closures and the initialization code. Frankly, I'm surprised that this isn't the case already; perhaps I need to link together some object files.

comment:24 Changed 5 years ago by ezyang

Today, I took a closer look, and actually it looks like profiling does have a hack to ensure that the constructor is loaded:

        -- Note [pipeline-split-init]
        -- If we have a stub file, it may contain constructor
        -- functions for initialisation of this module.  We can't
        -- simply leave the stub as a separate object file, because it
        -- will never be linked in: nothing refers to it.  We need to
        -- ensure that if we ever refer to the data in this module
        -- that needs initialisation, then we also pull in the
        -- initialisation routine.
        --
        -- To that end, we make a DANGEROUS ASSUMPTION here: the data
        -- that needs to be initialised is all in the FIRST split
        -- object.  See Note [codegen-split-init].

(We should probably annotate the link step in the verbose output to this effect, as a reminder of what it is doing.) So the way to fix this problem here is to arrange for the first split object to contain the "data that needs to be initialized."

In particular, all static closures have to be collected in the first split section. But this might actually be quite bad for split objects: with profiling, it is not a big deal if profiling data for a module is always loaded, since it doesn't contain any pointers to further data. But static closures contain pointers to their relevant code segments, and thus unconditionally loading the static closures could greatly increase the size of split-obj'd executables. I should probably implement this version and see how bad the bloat is, before we try to do something more complicated. Collecting this data through the compiler will be annoying, though, since it involves threading another list of SCCs through the various phases.

If the static indirections table is placed in a single object file (a good idea, since otherwise the ordering of the various split object files being linked together is not well-defined), then it may be very difficult to overcome this problem: the indirections table must point to the closures, so if the indirections table is linked in, so are the rest of the objects.

comment:25 Changed 5 years ago by ezyang

OK, here is the proposed approach. Since we won't have collected all the static closures in time to place them in the first CmmGroup (due to streaming), we merely have to change the convention so the *last* block contains the data that needs to be initialized. This might not be ideal for linker efficiency but it prevents a space leak.

diff --git a/compiler/codeGen/StgCmm.hs b/compiler/codeGen/StgCmm.hs
index e2a5a07..e9f5668 100644
--- a/compiler/codeGen/StgCmm.hs
+++ b/compiler/codeGen/StgCmm.hs
@@ -79,12 +79,6 @@ codeGen dflags this_mod data_tycons
                          return a
                 yield cmm
 
-               -- Note [codegen-split-init] the cmm_init block must come
-               -- FIRST.  This is because when -split-objs is on we need to
-               -- combine this block with its initialisation routines; see
-               -- Note [pipeline-split-init].
-        ; cg (mkModuleInit cost_centre_info this_mod hpc_info)
-
         ; mapM_ (cg . cgTopBinding dflags) stg_binds
 
                 -- Put datatype_stuff after code_stuff, because the
@@ -99,6 +93,12 @@ codeGen dflags this_mod data_tycons
                  mapM_ (cg . cgDataCon) (tyConDataCons tycon)
 
         ; mapM_ do_tycon data_tycons
+
+               -- Note [codegen-split-init] the cmm_init block must come
+               -- FIRST.  This is because when -split-objs is on we need to
+               -- combine this block with its initialisation routines; see
+               -- Note [pipeline-split-init].
+        ; cg (mkModuleInit cost_centre_info this_mod hpc_info)
         }
 
 ---------------------------------------------------------------
diff --git a/compiler/main/DriverPipeline.hs b/compiler/main/DriverPipeline.hs
index 517ba6c..2f7b847 100644
--- a/compiler/main/DriverPipeline.hs
+++ b/compiler/main/DriverPipeline.hs
@@ -1313,18 +1313,18 @@ runPhase (RealPhase SplitAs) _input_fn dflags
         -- initialisation routine.
         --
         -- To that end, we make a DANGEROUS ASSUMPTION here: the data
-        -- that needs to be initialised is all in the FIRST split
+        -- that needs to be initialised is all in the LAST split
         -- object.  See Note [codegen-split-init].
 
         PipeState{maybe_stub_o} <- getPipeState
         case maybe_stub_o of
             Nothing     -> return ()
             Just stub_o -> liftIO $ do
-                     tmp_split_1 <- newTempName dflags osuf
-                     let split_1 = split_obj 1
-                     copyFile split_1 tmp_split_1
-                     removeFile split_1
-                     joinObjectFiles dflags [tmp_split_1, stub_o] split_1
+                     tmp_split_n <- newTempName dflags osuf
+                     let split_n = split_obj n
+                     copyFile split_n tmp_split_n
+                     removeFile split_n
+                     joinObjectFiles dflags [tmp_split_n, stub_o] split_n
 
         -- join them into a single .o file
         liftIO $ joinObjectFiles dflags (map split_obj [1..n]) output_fn
diff --git a/compiler/main/HscMain.hs b/compiler/main/HscMain.hs
index 26aca2a..97802b4 100644
--- a/compiler/main/HscMain.hs
+++ b/compiler/main/HscMain.hs
@@ -1670,13 +1670,36 @@ showModuleIndex (i,n) = "[" ++ padded ++ " of " ++ n_str ++ "] "
     i_str = show i
     padded = replicate (length n_str - length i_str) ' ' ++ i_str
 
+-- Slightly inefficient, in that it needs to look ahead in order to
+-- determine if it needs to cat the closures on immediately
+deferStaticClosures :: Monad m =>
+       Either CLabel Module
+    -> Stream m [GenCmmDecl CmmStatics info stmt] b
+    -> [GenCmmDecl CmmStatics info stmt]
+    -> [GenCmmDecl CmmStatics info stmt]
+    -> Stream m [GenCmmDecl CmmStatics info stmt] b
+deferStaticClosures lbl_or_mod str prev !closures = Stream.Stream $ do
+    r <- Stream.runStream str
+    case r of
+        Left x -> do
+            let addLabel starts = mkDataLits StaticClosureInds (mkStaticClosureIndsLabel lbl_or_mod starts) []
+            return (Right (addLabel True : closures ++ [addLabel False] ++ prev, Stream.Stream (return (Left x))))
+        Right (next, str') -> do
+            let isStaticClosure (CmmData StaticClosures _) = True
+                isStaticClosure (CmmData StaticClosureInds _) = True
+                isStaticClosure _ = False
+                newClosures = filter isStaticClosure next
+                next' = filter (not . isStaticClosure) next
+                closures' = newClosures `seq` (closures ++ newClosures)
+            if null prev
+                then Stream.runStream (deferStaticClosures lbl_or_mod str' next' closures')
+                else return (Right (prev, deferStaticClosures lbl_or_mod str' next' closures'))
+
 prepareStaticClosures :: Monad m => Either CLabel Module
      -> Stream m [GenCmmDecl CmmStatics info stmt] b -> ForeignStubs
      -> (Stream m [GenCmmDecl CmmStatics info stmt] b, ForeignStubs)
 prepareStaticClosures lbl_or_mod cmms0 foreign_stubs0 =
-    let cmms = addLabel True >> cmms0 >>= (\r -> addLabel False >> return r)
-        addLabel starts =
-            Stream.yield [mkDataLits StaticClosureInds (mkStaticClosureIndsLabel lbl_or_mod starts) []]
+    let cmms = deferStaticClosures lbl_or_mod cmms0 [] []
         foreign_stubs = foreign_stubs0 `appendStubC` static_closure_inds_init
         tag = case lbl_or_mod of
                 Left lbl -> text "cmm_" <> ppr lbl

comment:26 Changed 5 years ago by ezyang

As it turns out, removing HEAP_ALLOCED has a bad interaction with object unloading. The immediate failure is that the unload-object code thinks that all objects are unreferenced when checking static closures, because the static closure doesn't live in the object image but in the dynamic heap.

One way to fix this is to add some extra data to all static closures saying where their indirection lives, and use that to determine whether a static closure is associated with an unloaded object. Then, when we dynamically load an object, we place all of its static closures in a block allocated specifically for that object (this would only work for GHC's linker, but as it happens, at the moment we only support unloading static objects, not dynamic libraries); once we conclude the object is dead, we free the block back to the memory manager.
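
A rough sketch of the extra word this scheme would add (field and function names invented):

    /* Each static closure copied to the dynamic heap records where its
     * indirection lives, so the unloader can map a closure back to the
     * object that registered it. */
    typedef struct StaticClosure_ {
        void **indirection;   /* slot in the owning object's table */
        /* ... the ordinary closure header and payload follow ... */
    } StaticClosure;

    /* A closure belongs to an object iff its indirection slot lies in
     * that object's registered indirection table. */
    int closureOwnedBy(const StaticClosure *c,
                       void **table_start, void **table_end)
    {
        return c->indirection >= table_start && c->indirection < table_end;
    }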

Here's an alternative scheme. Since the static closures no longer live in the object image, we can treat them as normal closures and just check whether their info tables are keeping an object live. However, we still need to make sure that the space in the static blocks eventually gets freed; the most reliable way to do that is to actually GC static closures the same way we GC normal objects. (But of course, if we do that, we still need pointers to the static indirections, to update them when a static object moves somewhere else.)

Changed 5 years ago by ezyang

Attachment: heap-alloced-20140410.patch added

Multifarious improvements

comment:27 Changed 5 years ago by ezyang

I've been keeping an eye on the performance numbers, and this patch to the compiler results in a modest bump in some of the stats that we are testing on:

=====> T3294(normal) 1380 of 3913 [0, 3, 0] 
cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc-stage2' -fforce-recom
p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T3294.hs   +RTS -V0 -tT32
94.comp.stats --machine-readable -RTS  >T3294.comp.stderr 2>&1
max_bytes_used value is too high:
    Expected    max_bytes_used: 43224080 +/-15%
    Lower bound max_bytes_used: 36740468 
    Upper bound max_bytes_used: 49707692 
    Actual      max_bytes_used: 54141136 
*** unexpected failure for T3294(normal)
=====> T4801(normal) 1381 of 3913 [0, 4, 0] 
cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc-stage2' -fforce-recom
p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T4801.hs   +RTS -V0 -tT48
01.comp.stats --machine-readable -RTS -static >T4801.comp.stderr 2>&1
bytes allocated value is too high:
    Expected    bytes allocated: 392409984 +/-10%
    Lower bound bytes allocated: 353168985 
    Upper bound bytes allocated: 431650983 
    Actual      bytes allocated: 446748968 
*** unexpected failure for T4801(normal)
=====> T3064(normal) 1382 of 3913 [0, 5, 0] 
cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc-stage2' -fforce-recom
p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T3064.hs   +RTS -V0 -tT30
64.comp.stats --machine-readable -RTS  >T3064.comp.stderr 2>&1
bytes allocated value is too high:
    Expected    bytes allocated: 329795912 +/-5%
    Lower bound bytes allocated: 313306116 
    Upper bound bytes allocated: 346285708 
    Actual      bytes allocated: 350480296 
*** unexpected failure for T3064(normal)
=====> T5837(normal) 1391 of 3913 [0, 6, 0] 
cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc-stage2' -fforce-recom
p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T5837.hs  -ftype-function
-depth=50 +RTS -V0 -tT5837.comp.stats --machine-readable -RTS  >T5837.comp.stderr 2>&1
=====> T6048(optasm) 1392 of 3913 [0, 6, 0] 
cd ./perf/compiler && '/home/hs01/ezyang/ghc-heap-alloced/inplace/bin/ghc-stage2' -fforce-recom
p -dno-debug-output -no-user-package-db -rtsopts -fno-ghci-history -c T6048.hs -O -fasm  +RTS -
V0 -tT6048.comp.stats --machine-readable -RTS  >T6048.comp.stderr 2>&1
bytes allocated value is too high:
    Expected    bytes allocated: 108578664 +/-10%
    Lower bound bytes allocated:  97720797 
    Upper bound bytes allocated: 119436531 
    Actual      bytes allocated: 121958160 

The bump is understandable, because I need to do some post-processing on the C-- stream which induces some extra allocation. I think I might be able to optimize it some more but it would be nice to know if it's OK to bump these figures up.

comment:28 Changed 5 years ago by ezyang

Description: modified (diff)

comment:29 Changed 5 years ago by simonmar

So we have to solve the unloading problem first, right? Also did you find any way to reduce that 5-6% binary size overhead?

Before bumping the performance thresholds you should check whether they're reasonable. Compare against the unpatched compiler (sometimes it is right up against the upper limit anyway). Increasing allocation by 10% sounds like a *lot* for this change.

comment:30 Changed 5 years ago by ezyang

Yeah, I'm going to try to fix both.

As for the stats, on a perf build on GHC 7.8, I see:

T3294 max_bytes_used  37702264
T4801 bytes_allocated 429277160
T3064 bytes_allocated 346679176
T6048 bytes_allocated 120594240

In fact, some of these tests are presently failing validate on a clean branch for me. So it seems the only real problem here is T3294 (where we have almost doubled the max bytes used).

comment:31 Changed 5 years ago by ezyang

I looked at the context of "The closures stored in the indirection tables are tagged, which means that if the code expects an untagged closure, we need an extra instruction to deal with it", and noticed that the only reason we were untagging them was that the GC functions stg_gc_enter_1 and stg_gc_fun require 'node' to be untagged. So clearly, we should just have specialized versions of these functions which untag their argument. Actually, we can do even better: if we're making specialized versions, we can sidestep performing the load in code, giving us the invariant that for static closures, the node is a pointer to the static indirection. (Of course, this also means that we only get the savings for the heap checks in static closure entry code.) I'll implement this and see how much we save.
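
The shape of the idea, modelled in plain C for brevity (the real stg_gc_enter_1 is C-- code with a register-based calling convention, so take this as a sketch only; the _tagged name is invented):

    #include <stdint.h>

    typedef struct Closure_ Closure;

    #define TAG_MASK ((uintptr_t)7)   /* low three pointer bits on x86-64 */

    extern void stg_gc_enter_1(Closure *node);   /* requires untagged node */

    /* Specialized entry point: accepts the tagged pointer loaded from the
     * indirection table, so call sites no longer need their own untag
     * instruction. */
    void stg_gc_enter_1_tagged(Closure *tagged)
    {
        stg_gc_enter_1((Closure *)((uintptr_t)tagged & ~TAG_MASK));
    }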

As for overlapping the indirections with the data, there's something we can do even before that: since the closure that lives in memory doesn't have to be runnable, we can drop any fields that could be derived later. The counterbalance is that the initialization code gets a little more complex, but we're already case-splitting on the closure type, so it can't be too much worse. I'll also implement this and see how much better we do.

comment:32 Changed 5 years ago by ezyang

Description: modified (diff)

Great news: by removing the tagging and using the overlapping/compressed static closure representation, overall binary size is now smaller than it used to be! The downside is that I still need to collect static closures together, away from their definitions, which means that I now need to export symbols for local info tables (which I know we'd previously been keeping local to reduce the number of symbols).

Remaining tasks: unloadObj support and T3294 performance debugging.

Last edited 5 years ago by ezyang

comment:33 Changed 5 years ago by ezyang

One more progress report (as before, selection of outliers as well as overall figures), after having rebased onto HEAD:

     bernouilli          -0.5%     -0.0%     -9.1%     -9.1%      0.0%
        circsim          -0.5%     -0.0%     +4.5%     +4.5%    +21.7%
   cryptarithm1          -0.5%     +0.0%     +7.6%     +7.6%   +100.0%
           lcss          -0.5%     -0.0%     -8.3%     -8.3%     +1.9%
   wheel-sieve2          -0.5%     -0.0%     -9.1%     -9.1%     +2.1%
--------------------------------------------------------------------------------
            Min          -0.9%    -12.0%     -9.1%     -9.1%     -5.2%
            Max          -0.4%     +0.4%     +7.6%     +7.6%   +100.0%
 Geometric Mean          -0.5%     -0.5%     -1.8%     -1.8%     +9.0%

(Why has cryptarithm1 flipped from one end of the spectrum to the other? In real time terms, it went from 0.6s to 0.7s. Perhaps the initialization time is now taking a lot longer (since it is more complicated, and not just a series of memcpys). I need to look into that.)

Last edited 5 years ago by ezyang

comment:34 Changed 5 years ago by ezyang

And the GC results:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
        circsim          -0.5%     -0.0%     -0.9%     -0.9%     +4.4%
    constraints          -0.5%     -0.0%     -1.9%     -2.0%     +0.2%
       fibheaps          -0.5%     -0.0%     -2.1%     -2.1%      0.0%
         fulsom          -0.5%     -0.0%     -2.6%     -2.6%     -6.4%
       gc_bench          -0.5%     -0.0%     -6.8%     -6.8%    -11.6%
          happy          -0.4%     -0.0%     -4.0%     -4.0%      0.0%
           hash          -0.5%     -0.0%     -5.4%     -5.4%      0.0%
           lcss          -0.5%     -0.0%     -7.2%     -7.2%      0.0%
      mutstore1          -0.5%     -0.0%     -3.1%     -3.1%     +4.5%
      mutstore2          -0.5%     -0.0%     -1.9%     -1.9%     +0.8%
          power          -0.5%     -0.0%     -4.7%     -4.7%      0.0%
     spellcheck          -0.4%     -0.0%     -1.0%     -0.9%     +1.5%
--------------------------------------------------------------------------------
            Min          -0.5%     -0.0%     -7.2%     -7.2%    -11.6%
            Max          -0.4%     -0.0%     -0.9%     -0.9%     +4.5%
 Geometric Mean          -0.5%     -0.0%     -3.5%     -3.5%     -0.6%

comment:35 Changed 5 years ago by simonmar

Wow, I'm amazed you managed to get binary sizes to be smaller than before. So the only concerns now are those few strange looking results: cryptarithm, circsim, and I notice that spellcheck is not as good as before (runtime -6.4% before, -1.0% now). Also do you have any idea what is driving the increase in total memory (+100% for cryptarithm, +21% for circsim)?

If startup time has increased, it would be a good idea to check on the startup time of GHC itself again, to give us an idea of the impact for a large program.

comment:36 Changed 5 years ago by ezyang

Differential Rev(s): D207

comment:37 Changed 5 years ago by ezyang

So, here's the status: I rebased the patches onto HEAD and took a close look at the oscillating results. As it turns out, at least for cryptarithm1, the nofib test had bimodal output behavior, so depending on your luck you'd end up with no change, an improvement, or a regression. (I think this may be what's happening with the inconsistent GC results; more investigation necessary.) The increase in total memory for cryptarithm1 and circsim stems from the fact that these binaries don't use much memory to begin with, so the allocation of static closures causes us to allocate a new megablock, e.g., doubling total memory in the case of cryptarithm1.

The startup time for GHC is still 0.020s, the same as before.

comment:38 Changed 5 years ago by Edward Z. Yang <ezyang@…>

In 644c76a3574a623fa5d2a9a28d8e6fc971aca901/ghc:

Use LinkerInternals.h for exitLinker.

Part of remove HEAP_ALLOCED patch set (#8199)

Summary: Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>

Test Plan: validate

Reviewers: simonmar, austin

Subscribers: simonmar, ezyang, carter, thomie

Differential Revision: https://phabricator.haskell.org/D262

GHC Trac Issues: #8199

comment:39 Changed 5 years ago by Edward Z. Yang <ezyang@…>

In b23ba2a7d612c6b466521399b33fe9aacf5c4f75/ghc:

Place static closures in their own section.

Summary:
The primary reason for doing this is assisting debuggability:
if static closures are all in the same section, they are
guaranteed to be adjacent to one another.  This will help
later when we add some code that takes section start/end and
uses this to sanity-check the sections.

Part of remove HEAP_ALLOCED patch set (#8199)

Signed-off-by: Edward Z. Yang <ezyang@mit.edu>

Test Plan: validate

Reviewers: simonmar, austin

Subscribers: simonmar, ezyang, carter, thomie

Differential Revision: https://phabricator.haskell.org/D263

GHC Trac Issues: #8199

comment:40 Changed 5 years ago by Edward Z. Yang <ezyang@…>

In 3b5a840bba375c4c4c11ccfeb283f84c3a1ef22c/ghc:

BC-breaking changes to C-- CLOSURE syntax.

Summary:
Previously, there were two variants of CLOSURE in C--:

    - Top-level CLOSURE(foo_closure, foo, lits...), which defines a new
      static closure and gives it a name, and

    - Array CLOSURE(foo, lits...), which was used for the static char
      and integer arrays.

They used the same name, were confusing, and didn't even generate
the correct internal label representation!  So now, we have two
new forms:

    - Top-level CLOSURE(foo, lits...) which automatically generates
      foo_closure (along with foo_info, which we were doing already)

    - Array ANONYMOUS_CLOSURE(foo, lits...) which doesn't generate
      a foo_closure identifier.

Part of remove HEAP_ALLOCED patch set (#8199)

Signed-off-by: Edward Z. Yang <ezyang@mit.edu>

Test Plan: validate

Reviewers: simonmar, austin

Subscribers: simonmar, ezyang, carter, thomie

Differential Revision: https://phabricator.haskell.org/D264

GHC Trac Issues: #8199

comment:41 Changed 5 years ago by Edward Z. Yang <ezyang@…>

In 178eb9060f369b216f3f401196e28eab4af5624d/ghc:

Properly generate info tables for static closures in C--.

Summary:
Previously, we assumed all objects declared in C-- were not-static, even
ones which were CONSTR_NOCAF_STATIC.  This used to be harmless, but now
we need this information to be correct.

Part of remove HEAP_ALLOCED patch set (#8199)

Depends on D264

Signed-off-by: Edward Z. Yang <ezyang@mit.edu>

Test Plan: validate

Reviewers: simonmar, austin

Subscribers: simonmar, ezyang, carter, thomie

Differential Revision: https://phabricator.haskell.org/D265

GHC Trac Issues: #8199

comment:42 Changed 5 years ago by Edward Z. Yang <ezyang@…>

In 35672072b4091d6f0031417bc160c568f22d0469/ghc:

Rename _closure to _static_closure, apply naming consistently.

Summary:
In preparation for indirecting all references to closures,
we rename _closure to _static_closure to ensure any old code
will get an undefined symbol error.  In order to reference
a closure foobar_closure (which is now undefined), you should instead
use STATIC_CLOSURE(foobar).  For convenience, a number of these
old identifiers are macro'd.

Across C-- and C (Windows and otherwise), there were differing
conventions on whether or not foobar_closure or &foobar_closure
was the address of the closure.  Now, all foobar_closure references
are addresses, and no & is necessary.

CHARLIKE/INTLIKE were not changed, simply alpha-renamed.

Part of remove HEAP_ALLOCED patch set (#8199)

Depends on D265

Signed-off-by: Edward Z. Yang <ezyang@mit.edu>

Test Plan: validate

Reviewers: simonmar, austin

Subscribers: simonmar, ezyang, carter, thomie

Differential Revision: https://phabricator.haskell.org/D267

GHC Trac Issues: #8199

comment:43 Changed 5 years ago by ezyang

We backed out these patches, see #9706 for the new plan.


comment:45 Changed 5 years ago by ajp

Method 1 from the attached task may actually be made to work (at least on Linux). The key point is that we don't have to use the memory address that we happen to be given at startup: we can shift it to an aligned location using mremap, and simply include a megabyte of padding at the end to avoid trashing anything. Here's a modification of the code in the attached task that demonstrates shifting the memory address with zero copies.

┌─[anders gurney] [~/devel/alignment]
└─[139] cat test.S 
    .section foo,"aw"
    .p2align 20
    .global foo_start
    .global foo_end
foo_start:
    .ascii "A"
    .fill 1048576
foo_end:
    .ascii "B"
┌─[anders gurney] [~/devel/alignment]
└─[0] cat main.c 
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

extern char foo_start;
extern char foo_end;

size_t mb = 1024 * 1024;

size_t round_to_mb(size_t x) {
    return (x + mb - 1) & ~(mb - 1);
}

size_t main(void) {
    printf("%c %c\n", foo_start, foo_end); // force libtest.so to be loaded
    printf("%p foo_start\n", &foo_start);
    printf("%p foo_end\n", &foo_end);
    printf("%d delta\n", &foo_end - &foo_start);

    void *next_mb_aligned_address = (void *) round_to_mb((size_t)&foo_start);
    printf("%p next mb aligned address\n", next_mb_aligned_address);
    void* new_address = mremap(&foo_start, 4096, 4096, MREMAP_MAYMOVE | MREMAP_FIXED,  next_mb_aligned_address);
    printf("%p %p // should be the same\n", new_address, next_mb_aligned_address);
 
    printf("%c // should be A\n", * (char *) new_address);
    printf("%c // should be B\n", foo_end);
    printf("and this should segfault\n");
    printf("%c\n", foo_start);
}
┌─[anders gurney] [~/devel/alignment]
└─[0] gcc test.S -shared -o libtest.so -fPIC
┌─[anders gurney] [~/devel/alignment]
└─[0] gcc -Wl,-R`pwd` -L. main.c -ltest -g -O0 -fPIC && ./a.out
/nix/store/ycmsiznf2484vbjwmj57jdy2ncyrj7fj-binutils-2.23.1/bin/ld: warning: type and size of dynamic symbol `foo_start' are not defined
/nix/store/ycmsiznf2484vbjwmj57jdy2ncyrj7fj-binutils-2.23.1/bin/ld: warning: type and size of dynamic symbol `foo_end' are not defined
A B
0x7f8318c95000 foo_start
0x7f8318d95001 foo_end
1048577 delta
0x7f8318d00000 next mb aligned address
0x7f8318d00000 0x7f8318d00000 // should be the same
A // should be A
B // should be B
and this should segfault
zsh: segmentation fault  ./a.out

Is there something fundamental that would prohibit this technique from working? It seems neater than the idea in #9706, as it

  • doesn't rely on overcommit
  • doesn't screw up accounting of VMA size
  • doesn't "use up" the trick of putting objects in arbitrary memory locations with MAP_FIXED. This could for example be used by an optimized application at the app level to use 48 bit pointers for a large skiplist.

comment:46 Changed 5 years ago by simonmar

@ajp the problem is that this data is statically linked, not mmapped. If we move it somewhere else, all the references within it will be wrong. That's what method 2 was doing: relocating all the static closures elsewhere and fixing up all the pointers.

comment:47 Changed 5 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:48 Changed 4 years ago by rwbarton

Is this ticket now obsolete since #9706 is done?

comment:49 Changed 4 years ago by ezyang

Not quite, because the #9706 technique doesn't work for Windows. But it seems unlikely we'll want to deploy this big patch just to make Windows work. So...

comment:50 Changed 4 years ago by simonmar

I wonder whether we might try an alternative strategy on Windows: a bitmap covering the whole 48-bit address space (units of 1MB) is only 32MB, and most of it won't be touched in the common case. The remaining question is whether checking this bitmap can be made as cheap as the current cache.
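
To make the arithmetic concrete, here is a sketch of such a check (invented names; not actual RTS code). 2^48 bytes / 2^20 bytes per unit = 2^28 units, and 2^28 bits = 32MB of bitmap:

    #include <stdbool.h>
    #include <stdint.h>

    #define UNIT_SHIFT 20                 /* 1MB units */

    extern uint8_t *heap_bitmap;          /* 32MB, mostly never touched */

    /* One load, one shift, one mask: competitive with the current
     * MBlock cache if the hot bitmap pages stay resident. */
    static inline bool heapAlloced(const void *p)
    {
        uintptr_t unit = (uintptr_t)p >> UNIT_SHIFT;
        return (heap_bitmap[unit >> 3] >> (unit & 7)) & 1;
    }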

comment:51 Changed 4 years ago by ezyang

Reproducing some comments from the Phabricator diff for posterity.

#9706 didn't work on Windows because Windows counted page table memory as part of committed memory. Go has the same problem:

	// Number of bits in page to span calculations (4k pages).
	// On Windows 64-bit we limit the arena to 32GB or 35 bits.
	// Windows counts memory used by page table into committed memory
	// of the process, so we can't reserve too much memory.
	// See https://golang.org/issue/5402 and https://golang.org/issue/5236.
	// On other 64-bit platforms, we limit the arena to 512GB, or 39 bits.
	// On 32-bit, we don't bother limiting anything, so we use the full 32-bit address.
	// On Darwin/arm64, we cannot reserve more than ~5GB of virtual memory,
	// but as most devices have less than 4GB of physical memory anyway, we
	// try to be conservative here, and only ask for a 2GB heap.

The links are good reading: https://golang.org/issue/5402 and https://golang.org/issue/5236 (scroll down).

Note that there is a little bit about Go relying on overcommit causing problems for people on systems that don't overcommit; this is not relevant to us, since we're not actually committing memory and relying on the OS to lazily initialize it if it isn't actually used.

comment:52 Changed 4 years ago by ezyang

Simon, are you thinking of doing some sort of lazy committing for the 32MB bitmap, so that most of it doesn't need to be backed with real memory? I guess I don't see any reason this shouldn't work.

comment:53 Changed 4 years ago by simonmar

I believe lazy committing should be the default if you just allocate a chunk of memory with VirtualAlloc. The pages will be lazily faulted in as necessary, so the pages we never touch won't consume any memory. We just need to use the convention that a zero bit means "not heap".
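
A minimal sketch of that allocation (invented names), assuming we simply commit the whole 32MB up front; Windows gives us demand-zero pages, so physical memory is only faulted in for bitmap pages that are actually written:

    #include <windows.h>
    #include <stdint.h>

    #define BITMAP_BYTES ((SIZE_T)1 << 25)   /* 32MB: one bit per 1MB of 2^48 */

    static uint8_t *heap_bitmap;

    static void initHeapBitmap(void)
    {
        /* Pages are zero-initialized by the OS, so every bit starts out
         * as "not heap", matching the convention above. */
        heap_bitmap = VirtualAlloc(NULL, BITMAP_BYTES,
                                   MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (heap_bitmap == NULL) {
            /* handle allocation failure */
        }
    }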

comment:54 Changed 4 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:55 Changed 4 years ago by ezyang

Description: modified (diff)
Priority: normallow
Summary: Get rid of HEAP_ALLOCEDGet rid of HEAP_ALLOCED on Windows (and other non-Linux platforms)

comment:56 Changed 4 years ago by thomie

Milestone: 8.0.1