Opened 13 months ago

Last modified 8 months ago

#15503 new bug

interpreter: sequence_ (replicate 100000000 (return ())) gobbles up memory

Reported by: int-e Owned by: osa1
Priority: high Milestone: 8.8.1
Component: GHCi Version: 8.5
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #16011 Differential Rev(s):
Wiki Page:

Description

I consider this to be more of a curiosity than a bug, but it might still be worth investigating. The following line,

sequence_ (replicate 100000000 (return ()))

causes ghci to grow quite big (here it's 6GB on 64 bit Linux; it's roughly proportional to the 100000000 number).

This used to be better in ancient times--before version 8.0.1, ghci ran this code in what looks like constant memory.

(This came up indirectly on #haskell; somebody had adapted Hutton's game of life program, http://www.cs.nott.ac.uk/~gmh/life.lhs which uses busy waiting for a delay.)

Attachments (2)

ghc-stage2.hp (150.6 KB) - added by osa1 9 months ago.
Raw retainer profiler output
ghc-stage2.svg (8.8 KB) - added by osa1 9 months ago.
.svg of the retainer profiler output (generated with hp2pretty)

Download all attachments as: .zip

Change History (17)

comment:1 Changed 12 months ago by bgamari

Milestone: 8.6.18.8.1
Priority: lowhigh

Bumping priority since this shouldn't be so hard to nail.

comment:2 Changed 12 months ago by osa1

Owner: set to osa1

I'll look into this.

comment:3 Changed 12 months ago by osa1

Version: 8.4.38.5

Confirmed that GHCi 7.10.3 runs this in very little memory while GHC HEAD uses about 6G.

comment:4 Changed 12 months ago by osa1

This regression was introduced between 7.10 and 8.0, and reproducible with 8.4 and HEAD. I compared +RTS -s outputs of 7.10 and 8.4.

7.10:

$ ./ghci +RTS -s
GHCi, version 7.10.3: http://www.haskell.org/ghc/  :? for help
λ:2> sequence_ (replicate 100000000 (return ()))
λ:3>
Leaving GHCi.
  12,844,310,920 bytes allocated in the heap
      26,753,232 bytes copied during GC
       7,744,552 bytes maximum residency (6 sample(s))
         138,736 bytes maximum slop
              18 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1806 colls,     0 par    0.155s   0.158s     0.0001s    0.0055s
  Gen  1         6 colls,     0 par    0.096s   0.126s     0.0211s    0.0415s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    4.065s  (  6.690s elapsed)
  GC      time    0.251s  (  0.285s elapsed)
  EXIT    time    0.020s  (  0.023s elapsed)
  Total   time    4.365s  (  7.000s elapsed)

  Alloc rate    3,159,574,299 bytes per MUT second

  Productivity  94.3% of total user, 58.8% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

8.4:

$ ghci +RTS -s
GHCi, version 8.4.3: http://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /home/omer/rcbackup/.ghci
λ:1> sequence_ (replicate 100000000 (return ()))
λ:2>
Leaving GHCi.
  12,869,399,120 bytes allocated in the heap
  13,233,009,504 bytes copied during GC
   4,007,601,416 bytes maximum residency (14 sample(s))
      15,805,176 bytes maximum slop
            8652 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       209 colls,     0 par    3.455s   3.486s     0.0167s    0.4864s
  Gen  1        14 colls,     0 par    7.993s  10.629s     0.7592s    4.4344s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    4.252s  (  6.710s elapsed)
  GC      time   11.448s  ( 14.115s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time   15.701s  ( 20.831s elapsed)

  Alloc rate    3,026,874,614 bytes per MUT second

  Productivity  27.1% of total user, 32.2% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

Observations:

  • Productivity is 94% vs. 27%
  • Almost same amount of allocation in GHCs (very minor difference)
  • 500x max. residency in GHC 8.4 (7,744,552 vs. 4,007,601,416)

So I think for some reason GHC 8.4 keeps retains more things which causes more GC cycles and longer pauses.

I compared the byte codes generated by both versions and they're identical so I suspect this is either a problem with the interpreter code or some library function.

comment:5 Changed 12 months ago by osa1

Note to self: I think running the interpreter with profiled compiler should show us what's retaining all the extra objects in newer GHCs, but I currently can't build the compiler in prof way (the build seems to be broken). I'll try to see what's wrong with the build..

comment:6 Changed 12 months ago by osa1

One thing I tried was to run compiled version of this program with GHC 7.10 and GHC 8.4 and the numbers were identical so maybe this isn't related with the library code.

comment:7 Changed 12 months ago by osa1

My attempts to profile ghci revealed another bug: #15608

comment:8 Changed 11 months ago by osa1

Looking at +RTS -h output, we allocate and retain GBs of PAPs. Not sure what are those PAPs for and what's retaining them though.

comment:9 Changed 11 months ago by simonpj

I speculate that the giant PAP is

return () >> (return () >> (return () >> (... >> return ()...)))

comment:10 Changed 11 months ago by osa1

Simon, do you have any guesses on what may be retaining those PAPs?

Btw, here's the Core generated for this expression: (output of -ddump-simpl)

GHC.Base.bindIO
  @ ()
  @ [()]
  (GHC.GHCi.ghciStepIO
     @ GHC.Types.IO
     GHC.GHCi.$fGHCiSandboxIOIO
     @ ()
     (Data.Foldable.sequence_
        @ []
        @ GHC.Types.IO
        @ ()
        Data.Foldable.$fFoldable[]
        GHC.Base.$fMonadIO
        (GHC.List.replicate
           @ (GHC.Types.IO ())
           (GHC.Types.I# 100000000#)
           (GHC.Base.return
              @ GHC.Types.IO GHC.Base.$fMonadIO @ () GHC.Tuple.()))))
  (\ (it_a1UK :: ()) ->
     GHC.Base.returnIO
       @ [()] (GHC.Types.: @ () it_a1UK (GHC.Types.[] @ ())))

More readable version:

GHC.Base.bindIO
    (ghciStepIO <the expression>)
    (\i -> GHC.Base.returnIO [i])

The Core identical to the Core generated by GHC 7.10.

comment:11 Changed 11 months ago by osa1

One other thing I tried is -fno-it flag, which disables binding the it variable. It did not fix the leak.

comment:12 Changed 11 months ago by simonpj

Hmm. Thanks for showing the full code that GHCi compiles to bytecode. It looks like this:

let main_action = ...
in bindIO main_action (\it -> return [it])

So GHCi will build a thunk for main_action and will pass it to the compiled code for GHC.Base.bindIO.

But alas main_action is essentially this

main_action :: IO ()
main_action = sequence_ (replicate ...)

(I have omitted the ghcStepIO stuff; it is essentially the identity function, and I don't think it affects things.)

Now that main_action thunk will be updated, so if it remains live for any reason, we'll retain a huge PAP of the form described in an earlier comment return () >> (return () ...)).

Why is it being kept alive? I'm really not sure. The interpreter should be tail-calling bindIO...

I wonder if main_action could be floated to top level as a CAF before it gets to the bytecode generator?

Anyway, I hope that may help a bit.

comment:13 Changed 11 months ago by osa1

I still have no idea what retains all those PAPs, but I realized that 8.0.1 is when we introduced -fexternal-interpreter. Perhaps that has to do with this issue (reminder: the leak was also introduced with 8.0.1).

Changed 9 months ago by osa1

Attachment: ghc-stage2.hp added

Raw retainer profiler output

Changed 9 months ago by osa1

Attachment: ghc-stage2.svg added

.svg of the retainer profiler output (generated with hp2pretty)

comment:14 Changed 9 months ago by osa1

With Phab:D5351 we can run retainer profiler on ghci so I tried this with ghci +RTS -hb and used 50000000 as the input. I've attached the raw output and a .svg generated with hp2pretty. It's not too helpful: the SYSTEM retainer is retaining GBs of data but we don't have more details on what the retainer is.

comment:15 Changed 8 months ago by osa1

Note: See TracTickets for help on using tickets.