Opened 5 years ago

Last modified 4 years ago

#9675 new bug

Unreasonable memory usage on large data structures

Reported by: Polarina
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 7.8.3
Keywords:
Cc: ekmett
Operating System: Linux
Architecture: x86_64 (amd64)
Type of failure: Compile-time performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

I have attached a file that, when compiled, results in a very large amount of memory usage by GHC.

The memory used is on the order of tens of gigabytes; I've not been able to compile the file successfully due to lack of physical memory.

This is on Debian GNU/Linux (Jessie) using Haskell Platform 2014.2.0.0 with GHC 7.8.3.

Attachments (8)

Data.hs (237.1 KB) - added by Polarina 5 years ago.
plot.png (10.0 KB) - added by nomeata 5 years ago.
Plot of GHC allocation numbers
gen-large-product.hs (1.9 KB) - added by nomeata 5 years ago.
Plot of GHC allocation numbers – generation script
ghc-stage2.pdf (5.8 KB) - added by nomeata 5 years ago.
GHC heap profile
ghc-stage2-verbose-core2core.pdf (25.6 KB) - added by nomeata 5 years ago.
Profile with -dverbose-core2core
ghc-stage2-no-verbose-core2core.pdf (12.6 KB) - added by nomeata 5 years ago.
Profile without -dverbose-core2core
T9675-peak_megabytes_allocated.png (7.3 KB) - added by thomie 4 years ago.
T9675-max_bytes_used.png (7.7 KB) - added by thomie 4 years ago.


Change History (23)

Changed 5 years ago by Polarina

Attachment: Data.hs added

comment:1 Changed 5 years ago by schyler

It's intriguing that this file contains only a single record declaration and no 'actual' code.

comment:2 Changed 5 years ago by rwbarton

Well, the data declaration does define ~3000 record accessors, each of which pattern matches on a record with ~3000 fields... it's easy for some things to become quadratic (intermediate program code size, strictness/demand info) even if, in the end, the code size is only linear.
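
For a small illustration of how the quadratic term arises (hypothetical names, nothing from the attached file):

-- Each selector of an n-field record pattern-matches the whole
-- constructor, so n selectors with n binders each give n^2 binders
-- in the intermediate Core, even though the final code is linear.
data R = R { f1 :: Int, f2 :: Int, f3 :: Int }

-- GHC generates (roughly):
--   f1 = \r -> case r of R x _ _ -> x
--   f2 = \r -> case r of R _ x _ -> x
--   f3 = \r -> case r of R _ _ x -> x
-- except that in Core the wildcards are real (dead) binders.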

comment:3 Changed 5 years ago by nomeata

Interesting extreme benchmark; maybe it will reveal some low-hanging algorithmic improvements. But is there actually a chance that the resulting code will be usable?

comment:4 Changed 5 years ago by Polarina

The idea behind the code is to load all the OpenGL function pointers at program startup and store them in a data structure, rather than load them at runtime as needed; and to support multiple OpenGL contexts as there is no guarantee that two different contexts provide the same set of function pointers for the same set of functions.
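
Roughly, the idea is something like this (field names and types are illustrative; the attached Data.hs has ~3000 such fields):

import Foreign.Ptr (FunPtr)

-- One record per OpenGL context, filled in once at startup, so two
-- contexts can hold different pointers for the same function.
data GLFunctions = GLFunctions
  { glClear      :: FunPtr (IO ())
  , glDrawArrays :: FunPtr (IO ())
  , glGenBuffers :: FunPtr (IO ())
  -- ... one field per OpenGL entry point
  }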

comment:5 Changed 5 years ago by nomeata

I created a plot to see what kind of complexity problem we have to look for. I gotta run now, so it's not as polished as it should be, and I haven't yet had time to actually try to make sense of it.

Changed 5 years ago by nomeata

Attachment: plot.png added

Plot of GHC allocation numbers

Changed 5 years ago by nomeata

Attachment: gen-large-product.hs added

Plot of GHC allocation numbers – generation script

comment:6 Changed 5 years ago by nomeata

Did some GHC profiling, and the problem is clearly the huge number of useless variables in the pattern match in the Core for the record selector functions, here for a three-field-record (I’ll spare us the 3000-field-record):

Test.field3 :: Test.Foo -> GHC.Types.Int -> GHC.Types.Int
[GblId[RecSel], Arity=1, Caf=NoCafRefs, Str=DmdType]
Test.field3 =
  \ (ds_dm2 :: Test.Foo) ->
    case ds_dm2 of _ [Occ=Dead] { Test.Foo ds1_dm3 ds2_dm4 ds3_dm5 ->
    ds3_dm5
    }

See the attached heap profile: IdInfo, Var, IntMap and [] are the biggest chunks.

The problem is that a data type declaration like

data R = R { field1 :: T1, field2 :: T2, field3 :: T3 }

generates Haskell code of the form

field3 (R {field3 = x}) = x

which is then turned into the form above by general code that also handles multiple fields, field puns, and further patterns in the field.

While the submitter's code above may or may not be reasonable, it might be worth improving the situation for the benefit of modules with medium-sized records (including GHC's own DynFlags module).

I see two approaches:

  1. Avoid generating actual Core for these selectors. Make them implicitly available, and only generate them in STG or later. I am not sure if there is precedent for such implicit things.
  1. Extend this code somehow:
    AltCon
      = DataAlt DataCon
      | LitAlt  Literal
      | DEFAULT
    

I have two ideas here. Either special-case the single-field-match, i.e. a new constructor

  | FieldSelAlt DataCon Int

which binds only one variable, namely the field indicated by the Int argument. This would get rid of the quadratic size of the representation. But quite a bit of code would have to handle two cases, and there would be no unique representation of the same match.

The other one is to extend DataAlt by a bit field:

  = DataAlt DataCon [Bool]

(with a more suitable data structure for [Bool], e.g. some kind of BitField). The idea is that this field specifies which fields of the constructor are actually used by the match. The number of variables bound by such an alternative would equal the number of Trues in the list.

This would improve every alternative where not all variables are used. It might make some code more complicated – I could imagine that some compiler phases expect that within a case alternative, there is a locally bound name for every field – but maybe it’s not too bad.

This would still be quadratic (but with a much lower factor – one bit instead of whatever Id + IdInfo + (:) + whatnot use). We can even get rid of that by making the BitField a data type like

data BitField = Empty | Singleton Int | ManyBits ByteString | All

(Empty would be useful for Con {} patterns. Not sure if All is useful. Maybe data BitField = Empty | Singleton Int | All is actually enough, if few-but-more-than-one-field-patterns are rare.)

In both cases, matchOneConLike in compiler/deSugar/MatchCon.lhs needs to be adjusted, where selectConMatchVars generates one variable for every argument up-front, even if not used.
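
To make the second idea concrete, here is a rough, self-contained sketch (DataCon and Literal are stubs standing in for GHC's types; none of this is actual GHC code):

import Data.Bits (popCount)
import qualified Data.ByteString as BS

data DataCon = DataCon   -- stub for GHC's DataCon
data Literal = Literal   -- stub for GHC's Literal

data AltCon
  = DataAlt DataCon BitField  -- bit field: which fields the alt binds
  | LitAlt  Literal
  | DEFAULT

data BitField
  = Empty                  -- Con {} patterns: nothing bound
  | Singleton Int          -- exactly one field bound (record selectors)
  | ManyBits BS.ByteString -- general case, one bit per field
  | All                    -- every field bound (today's behaviour)

-- Number of variables such an alternative binds,
-- given the constructor's arity:
boundVars :: Int -> BitField -> Int
boundVars _ Empty         = 0
boundVars _ (Singleton _) = 1
boundVars _ (ManyBits bs) = BS.foldl' (\n w -> n + popCount w) 0 bs
boundVars n All           = n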

At this point I’m hoping for someone experienced (e.g. SPJ, of course) to tell us which of these ideas are totally whacky, and which are worth pursuing, and if so, what problems to expect. But this is not a promise that I’ll implement it :-)

Last edited 5 years ago by nomeata

Changed 5 years ago by nomeata

Attachment: ghc-stage2.pdf added

GHC heap profile

comment:7 Changed 5 years ago by simonpj

I agree that it is tantalising that so much Core is necessary to generate so little object code. Nevertheless, I am deeply reluctant to fiddle with the representation of Core. Doing so will impose overheads on every pass that deals with Core, and I'm far from convinced that it'll solve the problem.

Anything along these lines would risk breaking optimisations. For example, here is how GHC optimises repeated evaluation of the same data constructor. Consider

case x of
  K a _ _ _ -> ...  (case x of 
                       K _ _ c _ -> ...a..c..)  ...

Clearly we'd like to optimise this to

case x of
  K a _ c _ -> ...  ( ...a..c.. )  ...

And GHC does so by (effecitvely) transforming to

case x of
  K a bb cc dd  -> let x = K a bb cc dd in
                   ...  (case x of 
                             K _ _ c _ -> ...a..c..)  ...

The let shadows the outer binding of x. Now the inner case can see that x is bound to (K a bb cc dd), and so it's easy to do case-of-known-constructor to the inner case, giving

case x of
  K a bb cc dd  -> let x = K a bb cc dd in
                   ...  (let c = cc in  ...a..c..)  ...

(The let doesn't really exist, of course, it's only in the Simplifier's head.)

All this is threatened if the outer case binds only a.

I think a decent alternative approach would be not to generate code for selectors (at least when there are many fields) but rather to inline them at usage sites, via a so-called "compulsory unfolding". That too risks lots of code if you use a lot of record selectors, so it's not exactly a solid fix.
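
Concretely (with a hypothetical three-field Foo; this is an illustration, not real GHC output): there would be no standalone Core for the selector, and each use site would compile as if the match were written inline:

data Foo = Foo Int Int Int

-- The selector as generated today: one full pattern match per field.
field3 :: Foo -> Int
field3 (Foo _ _ x) = x

-- What a use site of field3 would look like once the
-- compulsory unfolding is inlined:
useField3 :: Foo -> Int
useField3 r = case r of Foo _ _ x -> x + 1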

The other thing is that I think it's quite likely that GHC has accumulated a space leak. The compiler makes repeated passes over the Core program. To avoid a leak, all vestiges of the input of each pass should be gone by the time we have computed the output of that pass. But I would not be amazed if that was not true at all -- that vestiges of pass 1 were still in memory when we were on pass 10.

I would love someone to really look at this for us. It's not easy, but it could give a huge payoff for every GHC compilation, not just these weird corner cases. I certainly think we should be sure this is not happening before investing effort in optimising particular cases.

Simon

comment:8 Changed 5 years ago by nomeata

Thanks for your input.

The way it breaks optimisation is what I meant by “making some code more complicated”: with a sparse list of bound variables, it would have to check which fields the optimisation would like to use and then generate them on demand. I see that that might make the code very ugly.

The compulsory unfolding you mention would still be of the shape of a huge pattern match, right? So the quadratic behaviour wouldn't be eliminated.

A comparison of the heap profile with and without -dverbose-core2core (using this as a poor man's deepseq after each phase) shows that there might be a space leak, as you guessed. (Uploading both graphs.)

@Polarina: A work-around for you might be to not use a data constructor, but a newtype around a Vector (Int -> Int) (inner type mostly irrelevant). You store the functions therein and your accessor functions would use unsafeIndex and unsafeCoerce to get them out again. You can wrap that highly unsafe code in a module that can be used in a totally safe way.
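
A minimal sketch of that workaround (all names hypothetical; the Int -> Int element type is a placeholder that is never inspected at that type):

module GLContext (GLContext, glClear, glDrawArrays) where

import qualified Data.Vector as V
import Unsafe.Coerce (unsafeCoerce)

-- One flat vector instead of a 3000-field record; each accessor is a
-- single index instead of a 3000-binder pattern match.
newtype GLContext = GLContext (V.Vector (Int -> Int))

-- Each accessor coerces its slot back to the function's real type.
-- Safe as long as slot indices and types are kept consistent, which
-- this module's export list guarantees.
glClear :: GLContext -> IO ()
glClear (GLContext v) = unsafeCoerce (V.unsafeIndex v 0)

glDrawArrays :: GLContext -> Int -> Int -> IO ()
glDrawArrays (GLContext v) = unsafeCoerce (V.unsafeIndex v 1)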

Changed 5 years ago by nomeata

Profile with -dverbose-core2core

Changed 5 years ago by nomeata

Profile without -dverbose-core2core

comment:9 Changed 5 years ago by nomeata

A space leak seems to be hidden in DmdAnal. Maybe strictness annotation calculation for dead variables, which are then never used? I’m preparing a compiler perf test case so that we can track possible improvements.

comment:10 Changed 5 years ago by Joachim Breitner <mail@…>

In 05f962df2ba028fd304fdada9e68e7199822cbf0/ghc:

Compiler performance benchmark for #9675

so that whoever improves the situation can feel good about it.

comment:11 Changed 5 years ago by nomeata

Heap profiling tells me that most of it is retained by dmdAnalRhs, produced by annotateBndrs/dmdAnalAlt/dmdAnal'. But even after cluttering most of the module with `seq` and bang patterns, the leak is still there...

... and just while I was about to give up, I found the problem: the environment part of a DmdType is not seq'ed. Doing that does great good. Patch coming... :-)
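
For anyone curious, the leak follows this general pattern (simplified stand-ins, not GHC's actual definitions): forcing the spine of the type but not the environment inside it leaves the environment's entries as thunks that retain the previous pass's input.

import qualified Data.Map as M   -- a lazy map, playing the role of the env

type DmdEnv  = M.Map Int Int
data DmdType = DmdType DmdEnv [Int]

-- The leaky version: the demand list is forced, but the demands inside
-- env stay as thunks holding on to whatever computed them.
seqDmdTypeLeaky :: DmdType -> ()
seqDmdTypeLeaky (DmdType _env ds) = seqList ds

-- The fix: force the environment's entries as well.
seqDmdType :: DmdType -> ()
seqDmdType (DmdType env ds) = seqDmdEnv env `seq` seqList ds

seqDmdEnv :: DmdEnv -> ()
seqDmdEnv = M.foldr seq ()

seqList :: [Int] -> ()
seqList = foldr seq ()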

comment:12 Changed 5 years ago by simonpj

That sounds very plausible -- thank you! Do de-clutter when you are done :-).

In fact, if you grep for seq, you'll find a variety of seqs, some of which probably do nothing useful while costing execution time. You are in an excellent position to find out, if you feel up to doing so. Really, each should be accompanied by a story for why it is there.

Simon

comment:13 Changed 5 years ago by Joachim Breitner <mail@…>

In d9db81f4ed8ca6e7262f84347174d6b0e2e9a76a/ghc:

seqDmdType needs to seq the DmdEnv as well

otherwise this can retain large lazy calculations. This fixed one space
leak pointed out in #9675.

comment:14 Changed 5 years ago by ekmett

Cc: ekmett added

Changed 4 years ago by thomie

Attachment: T9675-peak_megabytes_allocated.png added

Changed 4 years ago by thomie

Attachment: T9675-max_bytes_used.png added

comment:15 Changed 4 years ago by Ben Gamari <ben@…>

In b5e1944e2b069a7df3444e57bae0b4ee15bde73c/ghc:

Use `+RTS -G1` for more stable residency measurements (#9675)

Reviewers: ezyang, austin, thomie

Subscribers: thomie, bgamari

Differential Revision: https://phabricator.haskell.org/D1006

GHC Trac Issues: #10557