Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#7850 closed bug (fixed)

Strangely high memory usage on optimized Ackermann function

Reported by: dolio Owned by:
Priority: normal Milestone: 7.6.3
Component: Compiler Version: 7.6.2
Keywords: Cc: Artyom.Kazak@…, the.dead.shall.rise@…, matvey.aksenov+ghctrac@…, simonmar, carter.schonwald@…, pho@…, anton.nik@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: T7850
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Greetings.

The following post on stack overflow demonstrates some strange behavior in, at least, recent GHC versions (7.4.2 on):

http://stackoverflow.com/questions/16115815/ackermann-very-inefficient-with-haskell-ghc

The program in question is simple:

{-# LANGUAGE BangPatterns #-}

main = print $ ack 4 1

ack :: Int -> Int -> Int
ack 0 !n = n+1
ack m  0 = ack (m-1) 1
ack m  n = ack (m-1) $ ack m (n-1)

Here are the properties I've been able to deduce so far:

1) When compiled without optimizations, the program uses almost no memory, but of course takes a while.

2) When compiled with optimizations (-O and above), the program eats memory at a staggering rate. It easily freezes my machine in a matter of seconds.

3) -ddump-simpl reveals that the loop is completely unboxed, operating only on Int# (a sketch of the worker's shape follows this list). -ddump-prep also shows no lets where allocation would presumably happen.

4) Setting a maximum heap size typically seems to have no effect. When I set it to the minimum heap size with -M4096, even the optimized program seems to run in constant space most of the time. However, using something like -M1000000 seems to result in the outrageous memory usage, and the RTS never notices that far more than 1 megabyte of memory is being used. I had to convince myself with another test program that the flag actually does something.

5) The standard stack size bound likewise has no effect.
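For point 3, the worker GHC produces has roughly the following shape (a hand-written sketch for illustration, assuming 64-bit Int#; it is not the actual -ddump-simpl output):

{-# LANGUAGE MagicHash #-}
module AckWorker where

import GHC.Exts (Int#, (+#), (-#))

-- Rough shape of the unboxed worker: everything stays in Int#,
-- and there is no let binding where heap allocation would occur.
wack :: Int# -> Int# -> Int#
wack 0# n  = n +# 1#
wack m  0# = wack (m -# 1#) 1#
wack m  n  = wack (m -# 1#) (wack m (n -# 1#))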

So, we seem to have a program that is allocating vast amounts of (ostensibly) non-heap, non-stack memory; but only when optimized.

I believe I once set the maximum heap to 40960B and killed the program before it killed my machine; on exiting, I got a heap overflow error. So my initial stab would be that the completely unboxed loop is somehow allocating in the heap, but not in a way that ever allows the garbage collector or the heap-overflow check to run (similar to how a non-allocating loop can block thread preemption). That's a wild guess, though.

I'm unable to easily investigate the behavior on older GHC versions, unfortunately. So I'm unsure how far back this behavior goes. If I've missed something obvious, I apologize, but this seems like very odd behavior to me.

Attachments (1)

0001-When-a-stack-overflow-occurs-check-whether-we-need-t.patch (3.0 KB) - added by simonmar 6 years ago.


Change History (34)

comment:1 Changed 6 years ago by Artyom.Kazak

Cc: Artyom.Kazak@… added

comment:2 Changed 6 years ago by Artyom.Kazak

Architecture: x86_64 (amd64) → Unknown/Multiple

This bug is reproducible on my laptop (Linux, x86).

I was not able to run the optimised program with less than 64 KB of heap (I was getting "Heap exhausted" errors); running it with -M64000 resulted in a fifteen-minute freeze and this error message:

internal error: getMBlock: mmap: Operation not permitted

comment:3 Changed 6 years ago by Khudyakov

I've tried GHC 7.0.4, 7.2.2, 7.4.2 and 7.6.2 on x64. All except 7.0.4 consume a lot of memory.

comment:4 Changed 6 years ago by monoidal

Looks fixed in HEAD. I can reproduce out-of-memory in 7.6.2 but not 7.7.

comment:5 Changed 6 years ago by nomeata

My observation is that with -kc1M, things work great (37 MB memory in use). With -kc786k I can already observe the bad memory behaviour, but it still finishes (15251 MB). -kc512k almost immediately fills up my memory.

My guess is that for some reason, unused stack chunks are not freed by the GC.

comment:6 Changed 6 years ago by refold

Cc: the.dead.shall.rise@… added

comment:7 Changed 6 years ago by nomeata

Maybe https://github.com/ghc/ghc/commit/03d360f289a1c7e93fedf8cfa274cbe5929cd32c fixed it? (Just a guess, I did not set up a bisect environment.)

comment:8 Changed 6 years ago by refold

This program should probably be added to tests or benchmarks.

comment:9 Changed 6 years ago by adinapoli

Architecture: Unknown/Multiple → x86_64 (amd64)
Operating System: Linux → MacOS X

If it might be of any help: I reproduced this on Mac OS X 10.8 with GHC 7.6.2 as well. Profiling reveals a humongous allocation of pinned memory:

http://heap.ezyang.com/view/6fb7441b501c00f428126f6b6e767d2412f43c2c#form

comment:10 Changed 6 years ago by adinapoli

Architecture: x86_64 (amd64) → Unknown/Multiple
Operating System: MacOS X → Unknown/Multiple

comment:11 Changed 6 years ago by supki

Cc: matvey.aksenov+ghctrac@… added

comment:12 Changed 6 years ago by dolio

The benchmarks game used to have an entry called 'recursive' that ran fib, ack and tak.[*] But it has since disappeared, and isn't one of the entries that was copied into nofib.

ack is the only one that displays the pathological behavior by default. I thought tak might, but it didn't. This seems to be related to a shallower stack usage. For instance, tak 60000 30 15 runs comfortably (for a very long time) with -K32K, so it will never overflow the default chunk size.

However, one can get tak to eat memory in a similar way by making the stack chunks small enough. For (an extreme) instance, -kb25 -kc256, setting the chunk size to 256 bytes, will cause the same memory-exhausting behavior.

This meshes with nomeata's observations above.

Also for reference, I just built 7.6.3, and whatever fixed this in HEAD has not been incorporated into it. It still displays the pathological behavior.


[*] tak is defined as follows:

tak x y z
  | y < x     = tak (tak (x-1) y z) (tak (y-1) z x) (tak (z-1) x y)
  | otherwise = z
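
For completeness, a driver along these lines reproduces the experiment (a sketch only; pair it with the tak definition above, and compile with -O2 -rtsopts so the -kb/-kc flags can be passed at run time):

-- With the default chunk size this stays in little memory (for a very long
-- time); running with +RTS -kb25 -kc256 reproduces the memory-exhausting
-- behaviour described above.
main :: IO ()
main = print (tak 60000 30 15 :: Int)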

comment:13 Changed 6 years ago by dolio

Oh, another note: this behavior still happens when using the llvm backend.

That probably casts some doubt on it being the commit nomeata mentioned, since that commit appears to contain only changes to the native code generator. I don't know the code's structure, though, so I can't say definitively.

comment:14 in reply to: 9 Changed 6 years ago by refold

Replying to adinapoli:

If it might be of any help: I reproduced this on Mac OS X 10.8 with GHC 7.6.2 as well. Profiling reveals a humongous allocation of pinned memory:

http://heap.ezyang.com/view/6fb7441b501c00f428126f6b6e767d2412f43c2c#form

35 KiB could hardly be called humongous.

comment:15 Changed 6 years ago by simonmar

Cc: simonmar added
difficulty: Unknown

comment:16 Changed 6 years ago by dolio

I've now built HEAD, and get the same results as monoidal. Both ack and tak run in constant memory even with very small stack chunk sizes.

When using -kc2M, HEAD-built ack actually runs slower than 7.6.x with -kc2M. And the default -kc is quite a bit slower than that, but still much better than freezing the machine.

comment:17 Changed 6 years ago by carter

Cc: carter.schonwald@… added

comment:18 in reply to: 14 Changed 6 years ago by adinapoli

Replying to refold:

35 KiB could hardly be called humongous.

Fair enough, but that was just for half a second of execution, and compared to the other allocations it seemed like quite a lot to me. I've tried to do more profiling, but apparently running it with +RTS -hc does not take GBs of memory very quickly; usage increases linearly but then comes back down to relatively small values. I'm a bit puzzled. If I run the same executable without flags, it literally eats up my memory in seconds.

comment:19 in reply to:  18 Changed 6 years ago by refold

Replying to adinapoli:

I'm a bit puzzled. If I run the same executable without flags, it literally eats up my memory in seconds.

Yes, I also noticed that. Running the program with heap profiling enabled makes the problem go away.

comment:20 Changed 6 years ago by PHO

Cc: pho@… added

comment:21 Changed 6 years ago by applicative

No one seems to be pointing out that this program does return and, more importantly, gives the correct answer on ghc-7.6.2. On this dubious MacBook Air (OS X 10.8, 4 GB, 2.4 GHz, 2 cores, etc.), the world does not end, this wiki does not freeze, my files are not threatened; it takes as much memory as the OS is willing to give it -- ~2 GB -- and looks like so:

$ ghc -O2 ackermanbug.hs -fforce-recomp  # ghc-7.6.2  64bit
$ time ./ackermanbug
65533   real    1m47.372s; user 0m26.577s; sys  0m17.641s

The distinction between real and user is of course suggestive of pathology -- but not a bug in the strict sense -- and is indeed handsomely repaired in

$ ghc -O2 ackermanbug.hs -fforce-recomp  # ghc HEAD
$ time ./ackermanbug
65533   real    0m24.081s; user 0m23.920s; sys  0m0.133s

If I add the fact that ack 1 n = n + 2, ghc does fine in either version; and of course it clunks along, never to return, with ack 4 2. With the same emendation (http://hpaste.org/86212) the vaunted gcc gives a segmentation fault instantly.

What I wonder is whether gcc's segfault is a 'bug' we should be hysterically reporting on the gcc tracker, or whether the gcc devs just put in an ack-detector because they got tired of people whining about such things, preferring to focus on what matters.

What is more interesting is that e.g. ocamlopt takes 5s for ack 4 1 where ghc-HEAD takes 24s. I cannot replicate the superiority the SO poster found with gcc; it seems about the same as ocamlopt. If I add any help at all to the program, e.g. that ack 1 = (+2), ghc becomes, for all actually calculable arguments, just as fast as gcc.
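
Concretely, the emendation is just one extra clause in the original program (a sketch; ack 1 n = n + 2 is the standard Ackermann identity for m = 1):

{-# LANGUAGE BangPatterns #-}

main :: IO ()
main = print $ ack 4 1

ack :: Int -> Int -> Int
ack 0 !n = n + 1
ack 1 !n = n + 2                 -- the added shortcut clause
ack m  0 = ack (m-1) 1
ack m  n = ack (m-1) $ ack m (n-1)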

comment:22 Changed 6 years ago by dolio

Here's the simplest test case I can think of to demonstrate the behavior:

-- main = print $ stk 4051 1000000
main = print $ stk 4052 1000000

stk :: Int -> Int -> Int
stk _     0 = 0
stk depth n = loop depth `seq` stk depth (n-1)

loop :: Int -> Int
loop 0 = 0
loop n = loop (n-1) `seq` 0

With -O0, ~0% memory usage. With -O2, memory usage grows about linearly with time. If the alternate main is used, -O2 also uses ~0% memory.

The boundary can be found using +RTS -K32K, which will stack overflow after one chunk is used. Choices of depth that do not overflow under such a setting will run in little memory, and choices that do will use large amounts of memory. The boundary number will probably be higher when running with 4-byte ints, for instance.
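
For convenience, a variant that takes the depth from the command line makes the boundary easy to probe (my sketch, not part of the test case proper; it replaces the main above and keeps stk and loop unchanged; compile with -O2 -rtsopts):

import System.Environment (getArgs)

-- Run e.g.  ./stk 4051 +RTS -s  and  ./stk 4052 +RTS -s  to compare the
-- two sides of the boundary without recompiling.
main :: IO ()
main = do
  [d] <- getArgs
  print (stk (read d) 1000000)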

Some +RTS -s stats look like:

4052 -O2: 32259 MB total memory in use (504 MB lost due to fragmentation)
4052 -O0:     1 MB total memory in use   (0 MB lost due to fragmentation)
4051 -O2:     1 MB total memory in use   (0 MB lost due to fragmentation)

comment:23 in reply to:  21 Changed 6 years ago by adinapoli

Replying to applicative:

No one seems to be pointing out that this program does return and, more importantly, gives the correct answer on ghc-7.6.2. On this dubious MacBook Air (OS X 10.8, 4 GB, 2.4 GHz, 2 cores, etc.), the world does not end, this wiki does not freeze, my files are not threatened; it takes as much memory as the OS is willing to give it -- ~2 GB -- and looks like so:

I don't know which version of the program you used, but on my MacBook (GHC 7.6.2, 64-bit, 16 GB of RAM), the program as originally posted and compiled with -O2 hogs my computer: it takes 10 GB of memory very quickly, making the machine nearly unusable.

comment:24 in reply to: 5 Changed 6 years ago by awson

Most interesting is that, while many people notice that -kc1M cures the problem, they still insist it runs much slower than the OCaml version. But it does *NOT*! It takes less than 6.5 seconds to deliver the correct answer for the original, unmodified version of ack with -kc1M on all my configurations:

Windows 8 64-bit on a Q9450 at 2.66 GHz:
ghc-7.6.2 32-bit - 6.49 secs
ghc-7.6.2 64-bit - 6.47 secs

Windows 7 on i7-2675QM
ghc-7.6.2 32-bit - 6.13 secs
ghc-7.6.2 64-bit - 3.80 (!!!) secs (I tried this several times with the same result)

Could it be GHC RTS on Windows which makes the difference?

comment:25 in reply to: 24 Changed 6 years ago by refold

Replying to awson:

Most interesting is that, while many people notice that -kc1M cures the problem, they still insist it runs much slower than the OCaml version. But it does *NOT*!

Interesting to hear about this difference between platforms. On Linux (and apparently OS X) it just eats all memory and does not finish.

comment:26 in reply to:  25 Changed 6 years ago by awson

Replying to refold:

Replying to awson:

Most interesting is that, while many people notice that -kc1M cures the problem, they still insist it runs much slower than the OCaml version. But it does *NOT*!

Interesting to hear about this difference between platforms. On Linux (and apparently OS X) it just eats all memory and does not finish.

AFAIU, on Linux and OS X, -kc1M on 7.6.x makes memory consumption very good, but performance is still several times worse than that of OCaml, both for 7.6.x with -kc1M and for HEAD (where this bug is declared to be fixed).

Still, the performance is *several times* better under Windows for 7.6.x (sadly, there are no HEAD builds for Windows to try).

comment:27 Changed 6 years ago by dolio

I can't speak for anyone else, but performance here (Linux, 64-bit) is much better with 1M stack chunks as well.

Also, with the program I posted recently... It finishes very shortly after running into my swap with the given parameters. With depth=4050, it takes a little over 5 seconds. With depth=4051, it takes a little over 5 seconds. With depth=4052, it takes around 26 seconds. And this is still true on HEAD.

Anyhow, the difference is that 4052 has to allocate a second stack chunk on each iteration, and then toss it away almost immediately. This activity dramatically slows down the program. -kc1M makes the same difference for the ackermann example. I wouldn't be surprised if OCaml doesn't have to do any similar fooling on this example (due to a larger initial stack), which is why it would perform comparably to C by default.

comment:28 Changed 6 years ago by lelf

Cc: anton.nik@… added

comment:29 Changed 6 years ago by simonmar

Milestone: 7.6.3
Status: new → patch

As noted earlier, it's already fixed in master. I don't remember when I noticed the bug and fixed it; the fix seems to have come in with the giant patch that rewrote the RTS Cmm files in the new syntax.

Anyway, attached is a patch for the 7.6 branch that fixes it. I'll let the release engineer decide what to do with it :-)

comment:30 Changed 6 years ago by igloo

Status: patch → merge

comment:31 Changed 6 years ago by igloo

Test Case: T7850

I've added a test.

comment:32 Changed 6 years ago by thoughtpolice

Resolution: fixed
Status: merge → closed

comment:33 Changed 6 years ago by thoughtpolice

Fixed and merged in HEAD.
