Opened 11 years ago

Closed 8 years ago

#2236 closed bug (fixed)

Deep stacks make execution time go through the roof

Reported by: simonpj
Owned by:
Priority: low
Milestone: 7.2.1
Component: Runtime System
Version: 6.8.2
Keywords:
Cc: midfield@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

In the code below, switching from mapM_ to mapM causes a 20x increase in execution time. The code being executed is very similar (try -ddump-simpl), and the number of bytes allocated is comparable. But the GC time is dramatically higher (presumably because we are repeatedly scanning a very deep stack) and, more puzzlingly, so is the MUT time.

It all seems to be to do with the stack depth. Not only that, but I think the effect is non-linear: doubling n from 200k to 400k nearly quadruples runtime. Doubling again to 800k makes it say 'Killed'.

module Main(main) where

import System.IO (openFile, IOMode(..), hPutStr)

n = 200000::Int

main = main_fast


main_slow = do	-- Use mapM
  h <- openFile "bardump2" WriteMode
  mapM (do_it h) testlst

main_fast = do	-- Use mapM_
  h <- openFile "bardump2" WriteMode
  mapM_ (do_it h) testlst

do_it h x = hPutStr h "yes"

testlst = [1..n]
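The asymmetry is visible in the (slightly simplified) standard definitions of the two combinators: mapM must rebuild the result list, so each element leaves a pending "cons the result" continuation on the stack until the whole list has been consumed, whereas mapM_ has nothing left to do after each action. A minimal sketch, with primed names to avoid clashing with the Prelude:

```haskell
module Main (main) where

-- mapM' rebuilds the result list, so each recursive call leaves a
-- frame on the stack until the recursion returns
mapM' :: Monad m => (a -> m b) -> [a] -> m [b]
mapM' _ []     = return []
mapM' f (x:xs) = do
  y  <- f x
  ys <- mapM' f xs   -- "return (y : ys)" is still pending here
  return (y : ys)

-- mapM_' discards results; nothing is pending after each action,
-- so the stack stays flat
mapM_' :: Monad m => (a -> m b) -> [a] -> m ()
mapM_' _ []     = return ()
mapM_' f (x:xs) = f x >> mapM_' f xs

main :: IO ()
main = do
  ys <- mapM' (return . (* 2)) [1, 2, 3 :: Int]
  print ys
  mapM_' (\_ -> return ()) [1 .. 1000 :: Int]
  putStrLn "done"
```

With main = main_slow above, those pending frames are what makes the stack n entries deep.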

Thanks to Ben for identifying this. Below are some figures

--------------------------------------
    main = main_slow, N = 300,000

251,371,168 bytes allocated in the heap
 14,041,824 bytes copied during GC (scavenged)
      1,504 bytes copied during GC (not scavenged)
  8,413,184 bytes maximum residency (5 sample(s))

        448 collections in generation 0 (  2.00s)
          5 collections in generation 1 (  0.01s)

         17 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.17s  (  1.18s elapsed)
  GC    time    2.01s  (  2.03s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.18s  (  3.22s elapsed)

  %GC time      63.3%  (63.1% elapsed)

  Alloc rate    215,201,590 bytes per MUT second

  Productivity  36.7% of total user, 36.2% of total elapsed

--------------------------------------
    main = main_fast, N = 300,000

227,615,792 bytes allocated in the heap
     49,224 bytes copied during GC (scavenged)
      1,440 bytes copied during GC (not scavenged)
     40,960 bytes maximum residency (1 sample(s))

        435 collections in generation 0 (  0.00s)
          1 collections in generation 1 (  0.00s)

          1 Mb total memory in use

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.15s  (  0.15s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.15s  (  0.16s elapsed)

  %GC time       0.0%  (1.8% elapsed)

  Alloc rate    1,537,851,022 bytes per MUT second

  Productivity 100.0% of total user, 95.4% of total elapsed

---------------------------------------------
          Non-linear effect

n       variant   MUT     GC      Total
400k    fast      0.19    0.00    0.20
        slow      2.14    3.64    5.78
200k    fast      0.09    0.00    0.09
        slow      0.48    0.96    1.44

Change History (11)

comment:1 Changed 11 years ago by igloo

Milestone: 6.8.3

comment:2 Changed 11 years ago by simonmar

Component: Compiler → Runtime System
Milestone: 6.8.3 → 6.10 branch

The analysis is spot on: the stack is deep and getting repeatedly scanned by the GC (to see how deep the stack is, try our handy built-in heap profiling with +RTS -h, no need to compile with -prof).

I'm not sure exactly what to do about this. There seem to be two options:

  • do GC less frequently
  • try to avoid scanning the whole stack

Neither is straightforward. With the default heap settings the allocation area is fixed at 512k; we don't currently do any dynamic tuning of it. You can get much better performance here with +RTS -A10m, for example. Perhaps it would be worth tuning the allocation area to be bigger when the stacks are large, but that kind of tuning is delicate and could easily make things worse; I prefer to keep things simple.
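As a concrete way to try the first option, one can request a larger allocation area at run time. This is only a measurement sketch (the file name is hypothetical; -rtsopts, -A, and -s are standard GHC flags):

```haskell
-- Compile:  ghc -O2 -rtsopts DeepStack.hs
-- Run with GC statistics and the default allocation area:
--   ./DeepStack +RTS -s
-- Run with a 10 MB allocation area, as suggested above:
--   ./DeepStack +RTS -s -A10m
module Main (main) where

main :: IO ()
main = do
  -- mapM (rather than mapM_) builds a deep stack, as in the
  -- ticket's main_slow
  rs <- mapM return [1 .. 100000 :: Int]
  print (length rs)
```

With a larger allocation area there are fewer minor GCs, so the deep stack is scanned less often; that is why -A10m helps here.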

Not scanning the whole stack is like having a finer-grained write barrier on the TSO. This is kind of hard with the monolithic stack objects we have right now; chunked stacks would make this much easier.

Anyway, I'm going to punt this and possibly look at it again in the context of the new GC in 6.10.

comment:3 Changed 11 years ago by simonmar

Architecture: Unknown → Unknown/Multiple

comment:4 Changed 11 years ago by simonmar

Operating System: Unknown → Unknown/Multiple

comment:5 Changed 10 years ago by igloo

Milestone: 6.10 branch → 6.12 branch

comment:6 Changed 10 years ago by simonmar

Type of failure: Runtime performance bug

comment:7 Changed 9 years ago by igloo

Milestone: 6.12 branch → 6.12.3

comment:8 Changed 9 years ago by igloo

Milestone: 6.12.3 → 6.14.1
Priority: normal → low

comment:9 Changed 9 years ago by igloo

Milestone: 7.0.1 → 7.0.2

comment:10 Changed 9 years ago by igloo

Milestone: 7.0.2 → 7.2.1

comment:11 Changed 8 years ago by simonmar

Resolution: fixed
Status: new → closed

This is fixed in 7.2.1.

N = 400k, main_fast:
  MUT     time    0.33s  (  0.33s elapsed)
  GC      time    0.01s  (  0.01s elapsed)

N = 400k, main_slow, GHC 7.2.1:
  MUT     time    0.33s  (  0.44s elapsed)
  GC      time    0.08s  (  0.08s elapsed)

N = 400k, main_slow, GHC 7.0.3:
  MUT   time    2.25s  (  2.26s elapsed)
  GC    time    2.95s  (  2.95s elapsed)

yay for stack chunks.
