Opened 6 years ago

Last modified 2 years ago

#8457 new bug

-ffull-laziness does more harm than good

Reported by: errge
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 7.7
Keywords: FloatOut
Cc: mihaly.barasz@…, tkn.akio@…, bos@…, johan.tibell@…, edsko
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets: #9520, #12620
Differential Rev(s):
Wiki Page:

Description

In this bug report I'd like to argue that -ffull-laziness shouldn't be turned on automatically with either -O or -O2, because it's dangerous and can cause serious memory leaks which are hard to debug or prevent. I'll also try to show that its optimization benefits are negligible. Actually, my benchmarks show that it's beneficial to turn it off even in the cases where we don't hit a space leak.

We ran into this issue last week, but it has been reported several times before, e.g. #917 and #5262.

A typical example is the following:

main :: IO ()
main = task () >> task ()

task :: () -> IO ()
task () = printvalues [1..1000000 :: Int]

printvalues :: [Int] -> IO ()
printvalues (x:xs) = print x >> printvalues xs
printvalues [] = return ()

We succeed with -O0, but fail with -O:

errge@curry:~/tmp $ ~/tmp/ghc/inplace/bin/ghc-stage2 -v0 -O0 -fforce-recomp lazy && ./lazy +RTS -t >/dev/null
<<ghc: 1620098744 bytes, 3117 GCs, 32265/42580 avg/max bytes residency (3 samples), 2M in use, 0.00 INIT (0.00 elapsed), 1.28 MUT (1.28 elapsed), 0.02 GC (0.02 elapsed) :ghc>>
errge@curry:~/tmp $ ~/tmp/ghc/inplace/bin/ghc-stage2 -v0 -O -fforce-recomp lazy && ./lazy +RTS -t >/dev/null
<<ghc: 1444098612 bytes, 2761 GCs, 3812497/13044272 avg/max bytes residency (7 samples), 28M in use, 0.00 INIT (0.00 elapsed), 1.02 MUT (1.03 elapsed), 0.12 GC (0.12 elapsed) :ghc>>

28M? What the leak!? Well, it's -ffull-laziness:

errge@curry:~/tmp $ ~/tmp/ghc/inplace/bin/ghc-stage2 -v0 -O -fno-full-laziness  -fforce-recomp lazy && ./lazy +RTS -t >/dev/null
<<ghc: 1484098612 bytes, 2835 GCs, 34812/42580 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.04 MUT (1.04 elapsed), 0.02 GC (0.02 elapsed) :ghc>>

We get constant space and the fastest run-time too, since we save some cycles on GC.

Note that in this instance we are trying to explicitly disable sharing by using () as a fake argument for the function. Also note that this function may easily be a utility function in a larger code base or in a library, so it's impractical to say that you shouldn't use it twice "too close together".
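To make the mechanism concrete, here is a hand-written sketch of what we believe the float-out pass effectively does to the example above. It is not actual GHC Core output, and the name values is made up for illustration.

```haskell
main :: IO ()
main = task () >> task ()

-- Hypothetical name: the constant list after being floated to the top level.
-- As a top-level constant (a CAF) it is evaluated once and shared by every
-- call to task.
values :: [Int]
values = [1..1000000]

-- The () argument no longer protects us: the body only refers to values.
task :: () -> IO ()
task () = printvalues values

printvalues :: [Int] -> IO ()
printvalues (x:xs) = print x >> printvalues xs
printvalues [] = return ()
```

If this reading is right, the list built during the first task () stays reachable until the second call has consumed it, which would explain the jump in residency seen with -O.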

Quoting from the GHC user guide:

 -O2:

    Means: “Apply every non-dangerous optimisation, even if it means
    significantly longer compile times.”

    The avoided “dangerous” optimisations are those that can make
    runtime or space worse if you're unlucky. They are normally turned
    on or off individually.

    At the moment, -O2 is unlikely to produce better code than -O.

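The "non-dangerous" part seems to be false at the moment: -ffull-laziness is enabled by -O and, as shown above, it can make space usage much worse.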

We decided to make a broader investigation into this issue and wanted to know if we can disable this optimization without too much pain. We came up with this benchmark plan:

  • let's benchmark GHC,
  • compile all stages with -O, but hack the stage1 compiler to emit -t statistics for every file compiled,
  • gather these statistics while compiling the libraries and the stage2 compiler.

On the second run we compile the stage1 compiler with -O -fno-full-laziness, but leave everything else unchanged in the environment.

Once we have both sets of results for the ~1600 compiled files, we match them up and compute the (logarithmic) ratio of CPU time and memory use between the two compilations. These ratios are the final results of our benchmark.

The results and the raw data can be found at https://github.com/errge/notlazy.

The overall compilation time dropped from 26:20 to 25:12, which is a 4% improvement. Investigating the full matching shows that this overall result comes from small improvements all over the place.

The results plotted:

The graphs show the logarithmic (100*log_10(new/orig)) ratio of change in CPU and memory consumption. Negative values therefore mean that the new compilation method is faster (or uses less memory).
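As a worked example of the metric, the following sketch applies the formula to two figures quoted in this report: the overall build time and the memory use of DsListComp.lhs. The function name logRatio is ours, chosen for illustration.

```haskell
-- 100 * log_10 (new / orig); a negative value means the new build used less.
logRatio :: Double -> Double -> Double
logRatio new orig = 100 * logBase 10 (new / orig)

main :: IO ()
main = do
  -- Overall build time in seconds: 26:20 before, 25:12 after.
  print (logRatio (25 * 60 + 12) (26 * 60 + 20))
  -- Memory for DsListComp.lhs in MB: 69 before, 103 after.
  print (logRatio 103 69)
```

The first value comes out at roughly -1.9 and the second at roughly 17.4, which gives a feeling for the scale of the plots: a 4% speedup is about two units to the left of zero.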

As can be seen on the CPU graph, in most cases the difference is negligible (actually smaller than what can be measured on small files, which is why we have the spike at 0). Overall we see a small improvement in CPU, and there are some outliers in both directions, but there are more drastic improvements than drastic regressions.

On the memory graph the results are much closer to zero. There is one big positive memory outlier: DsListComp.lhs. It originally used 69M and now uses 103M. But it compiles in 2 seconds both ways, and there are files in the source tree which require 400M to compile, so this is not an issue.

After all this, I'd like to hear other opinions about just disabling this optimization in -O and -O2 and leaving it as an option that can be turned on when needed. My reasons once more:

  • it's unsafe,
  • it's hard to debug when you hit its issues,
  • the optimization doesn't seem to be very productive,
  • it's always easy to force sharing, but it's not easy to force copying.
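The last point can be illustrated with a small sketch. Sharing is requested simply by naming the value at the top level; to keep GHC from introducing sharing we did not ask for, the only simple tool we know of is switching the pass off, for example per module with an OPTIONS_GHC pragma. The function names below are made up for illustration.

```haskell
{-# OPTIONS_GHC -fno-full-laziness #-}

-- Sharing requested explicitly: a top-level binding is evaluated at most
-- once and kept alive as long as something still refers to it.
sharedSquares :: [Int]
sharedSquares = map (\x -> x * x) [1 .. 1000]

-- Sharing not requested: the list is built inside the function body, and
-- with full laziness switched off it is not floated out to the top level.
freshSquares :: () -> [Int]
freshSquares () = map (\x -> x * x) [1 .. 1000]

main :: IO ()
main = do
  print (sum sharedSquares)
  print (sum (freshSquares ()))
```

Both functions compute the same list; the difference is only in who decides how long it lives.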

Apparently a Haskell programmer should be lazy, but never fully lazy.

Research done by Gergely Risko <errge> and Mihaly Barasz <klao>, confirmed on two different machines with no other running processes.

Change History (17)

comment:1 Changed 6 years ago by rwbarton

Very nice work!

I just want to point out a suggestion I made a little while ago, but haven't followed up on: it may be easy to identify certain occurrences of let-floating that are truly "safe" performance-wise, and control them with a separate flag, which would be on by default under -O. Then remove -ffull-laziness from -O. I would be interested to see the results of your experiment with such a patch.

comment:2 Changed 6 years ago by klao

Cc: mihaly.barasz@… added

comment:3 Changed 6 years ago by simonpj

Interesting. But if I understand right this is just one program you are testing, namely GHC itself. Would you like to do a nofib run, with and without -fno-full-laziness and see what nofib-analyse says? thanks

Simon

comment:4 Changed 6 years ago by akio

Cc: tkn.akio@… added

comment:5 Changed 6 years ago by errge

Hi Simon,

Thanks for pointing me towards this 'Nestedly Organized Future Improving Benchmarks', they seem to be great. :)

I followed http://ghc.haskell.org/trac/ghc/wiki/Building/RunningNoFib and the results are here: https://github.com/errge/notlazy/blob/master/nofib.txt

There are some outliers in both directions, but the overall picture looks to me like "nothing changing". Even more so if you subtract the noise introduced by the outliers.

It may be worth having a look at fulsom, hidden, parser and parstof; these are the examples where the optimization did very well, and it would be interesting to see whether it's easy to reintroduce the necessary sharing by hand. OTOH, these benchmarks are a bit old school, undocumented and some of them are autogenerated.

And there are some outliers on the negative side: constraints, bspt and integer. It'd be interesting to see whether the sharing or other performance bugs introduced by the optimization are easy to work around in these cases.

And in general, we may have status quo bias here: when implementing and submitting these real world benchmarks, people just used GHC as it was at that point and optimized for that. If they had been forced to introduce sharing by hand, they would have done so.

If there are more benchmarks to run, I'm happy to run them.

Gergely

Last edited 6 years ago by errge

comment:6 Changed 6 years ago by errge

Milestone: 7.8.1 → 7.10.1
Priority: high → normal

Also, I've changed the milestone now to 7.10.1 and changed the priority to normal.

I don't want to rush anyone into a decision that is clearly not super important. Everything has been this way for years, and users can easily say -fno-full-laziness themselves.

Since this is only a simple flag switch, if we decide that we should do the switch, we can easily do it for 7.8.2 or later 7.8 releases, no rush.

comment:7 Changed 6 years ago by simonpj

Cc: bos@… johan.tibell@… added

Thanks.

A couple of nofib programs get a lot worse (e.g. parstof allocates 20x more), and the geometric mean of allocation I speculate that some of these extreme effects are because the programs have static data, but I'm not sure. The question is whether "typical" GHC users will see things getting better or worse.

I'd be happy to know what our GHC Performance Tsars think (or indeed anyone else). Bryan, Johan: any views?

Simon

comment:8 Changed 6 years ago by tibbe

I'd like to see an analysis of the nofib outliers, both positive and negative, before making any changes. Once we know why they changed we can decide whether the benchmarks or the compiler are to blame.

comment:9 Changed 5 years ago by thomie

Type of failure: None/Unknown → Runtime performance bug

comment:10 Changed 5 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:11 Changed 4 years ago by simonpj

See also #917, #1945, #3273, #4276, #5729, #10535

comment:12 Changed 4 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:13 Changed 4 years ago by thomie

Milestone: 8.0.1

comment:14 Changed 3 years ago by edsko

Cc: edsko added

comment:15 Changed 3 years ago by edsko

Just published a blog post on the perils of full laziness when using conduit (see also #9520); it might be relevant to this ticket: http://www.well-typed.com/blog/2016/09/sharing-conduit/ . See also #12620 for a recent alternative proposal to limit the scope of full laziness.

comment:16 Changed 3 years ago by edsko

My blog post underestimated the perils of full laziness. Proposed erratum at https://www.reddit.com/r/haskell/comments/55xk4z/erratum_to_sharing_memory_leaks_and_conduit_and/ .

comment:17 Changed 2 years ago by simonpj

Keywords: FloatOut added