Opened 4 years ago

Last modified 2 years ago

#10980 new bug

Deriving Read instance from datatype with N fields leads to N^2 code size growth

Reported by: slyfox Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.10.2
Keywords: deriving-perf Cc: simonmar, simonpj, ezyang, mpickering, RyanGlScott, nelhage
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #13213 #14364 #7258 Differential Rev(s):
Wiki Page:

Description

The problem was observed on GHC's own code base while exploring why exactly some object files are so huge.

Let's look at the stats of text code section sizes on today's HEAD (skipping GHCi libraries and stage1 binaries):

~/dev/git/ghc $ size `find -name *.o -type f | grep -v stage1 | grep -v HS` | sort -k1 -n -r | head -n20
   text    data     bss     dec     hex filename
1791068  145448       0 1936516  1d8c84 ./compiler/stage2/build/DynFlags.o
1558637   77464       0 1636101  18f705 ./compiler/stage2/build/PprC.o
1397966   23272       0 1421238  15afb6 ./compiler/stage2/build/PlatformConstants.o
1332048  373344       0 1705392  1a05b0 ./libraries/Cabal/Cabal/dist-boot/build/Distribution/PackageDescription.o
1331238  373352       0 1704590  1a028e ./bootstrapping/Distribution/PackageDescription.o
1271578  246240       0 1517818  1728fa ./libraries/template-haskell/dist-boot/build/Language/Haskell/TH/Syntax.o
1229696  311616       0 1541312  1784c0 ./libraries/Cabal/Cabal/dist-install/build/Distribution/PackageDescription.o
1215082  224288       0 1439370  15f68a ./libraries/template-haskell/dist-install/build/Language/Haskell/TH/Syntax.o
1071746  242664       0 1314410  140e6a ./compiler/stage2/build/HsExpr.o
1015090   40904       0 1055994  101cfa ./compiler/stage2/build/Llvm/PpLlvm.o
 970203  124944       0 1095147  10b5eb ./compiler/stage2/build/Parser.o
 849693   79760       0  929453   e2ead ./compiler/stage2/build/HsDecls.o
 833327   35440       0  868767   d419f ./compiler/stage2/build/X86/CodeGen.o
 819959  127192       0  947151   e73cf ./libraries/Cabal/Cabal/dist-boot/build/Distribution/Simple/Setup.o
 819685  125120       0  944805   e6aa5 ./bootstrapping/Distribution/Simple/Setup.o
 816927  124520       0  941447   e5d87 ./libraries/Cabal/Cabal/dist-install/build/Distribution/Simple/Setup.o
 785398   41536       0  826934   c9e36 ./compiler/stage2/build/CoreLint.o
 772550   42040       0  814590   c6dfe ./compiler/stage2/build/DriverPipeline.o
 766461   36280       0  802741   c3fb5 ./compiler/stage2/build/HscMain.o
 735605   23408       0  759013   b94e5 ./libraries/containers/dist-install/build/Data/Sequence.o

PlatformConstants.o is a very clear example of this problem. A trimmed-down version of this file is:

module M (D) where
data D = D { a0
, a01 , a02 , a03 , a04 , a05 , a06 , a07 , a08 , a09 , a10
, a11 , a12 , a13 , a14 , a15 , a16 , a17 , a18 , a19 , a20
, a21 , a22 , a23 , a24 , a25 , a26 , a27 , a28 , a29 , a30
, a31 , a32 , a33 , a34 , a35 , a36 , a37 , a38 , a39 , a40
, a41 , a42 , a43 , a44 , a45 , a46 , a47 , a48 , a49 , a50
, a51 , a52 , a53 , a54 , a55 , a56 , a57 , a58 , a59 , a60
, a61 , a62 , a63 , a64 , a65 , a66 , a67 , a68 , a69 , a70
, a71 , a72 , a73 , a74 , a75 , a76 , a77 , a78 , a79 , a80
, a81 , a82 , a83 , a84 , a85 , a86 , a87 , a88 , a89 , a90
, a91 , a92 , a93 , a94 , a95 , a96 , a97 , a98 , a99 :: Int } deriving Read

Compiling it results in an ~800KB object file:

$ ghc-stage2 -c -O1 -fforce-recomp M.hs -o M.o
$ size M.o
   text    data     bss     dec     hex filename
 847039   16080       0  863119   d2b8f M.o

The size growth is quadratic:

# fields code-size
1    6263
21   61767
41  173623
61  347503
81  583367
101  881231
121 1241087
141 1662959
161 2146815
181 2692679
201 3300543
221 3970407
241 4702263
261 5496135
281 6351991
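The numbers above were produced with the attached mk_data.bash. The attached script itself is not reproduced in this ticket, so the following is only a hypothetical reconstruction of what such a generator might look like:

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of a generator like the attached mk_data.bash:
# emit a module with one record of N Int fields deriving Read.
n=${1:-101}
{
  printf 'module M (D) where\n'
  printf 'data D = D { a0'
  for i in $(seq 1 $((n - 1))); do
    printf ' , a%d' "$i"
  done
  printf ' :: Int } deriving Read\n'
} > M.hs
# Then compile and measure, e.g.:
#   ghc -c -O1 -fforce-recomp M.hs -o M.o && size M.o
```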

Attachments (4)

mk_data.bash (493 bytes) - added by slyfox 4 years ago.
script to build temp files and stats
size-growth.log.png (18.4 KB) - added by slyfox 4 years ago.
graph for 1 to 281 fields in a simple datatype.
M_200_auto.hs (39.3 KB) - added by slyfox 4 years ago.
by ghc derived as-is
M_200_manual.hs (18.5 KB) - added by slyfox 4 years ago.
factored out repetition

Download all attachments as: .zip

Change History (22)

Changed 4 years ago by slyfox

Attachment: mk_data.bash added

script to build temp files and stats

Changed 4 years ago by slyfox

Attachment: size-growth.log.png added

graph for 1 to 281 fields in a simple datatype.

comment:1 Changed 4 years ago by slyfox

The derived instance looks reasonable (I've picked a data type with 282 fields):

{-
$ ghc -O1 -fforce-recomp -ddump-deriv M.hs
[1 of 1] Compiling M                ( M.hs, M.o )

==================== Derived instances ====================
Derived instances:
-}
  instance GHC.Read.Read M.D where
    GHC.Read.readPrec
      = GHC.Read.parens
          (Text.ParserCombinators.ReadPrec.prec
             11
             (do { GHC.Read.expectP (Text.Read.Lex.Ident "D");
                   GHC.Read.expectP (Text.Read.Lex.Punc "{");
                   GHC.Read.expectP (Text.Read.Lex.Ident "a0");
                   GHC.Read.expectP (Text.Read.Lex.Punc "=");
                   a1_a1ns <- Text.ParserCombinators.ReadPrec.reset GHC.Read.readPrec;
                   GHC.Read.expectP (Text.Read.Lex.Punc ",");
                   GHC.Read.expectP (Text.Read.Lex.Ident "a1");
                   GHC.Read.expectP (Text.Read.Lex.Punc "=");
                   a2_a1nt <- Text.ParserCombinators.ReadPrec.reset GHC.Read.readPrec;
                   GHC.Read.expectP (Text.Read.Lex.Punc ",");
                   GHC.Read.expectP (Text.Read.Lex.Ident "a2");
                   GHC.Read.expectP (Text.Read.Lex.Punc "=");
                   a3_a1nu <- Text.ParserCombinators.ReadPrec.reset GHC.Read.readPrec;
-- ...
                   GHC.Read.expectP (Text.Read.Lex.Punc ",");
                   GHC.Read.expectP (Text.Read.Lex.Ident "a280");
                   GHC.Read.expectP (Text.Read.Lex.Punc "=");
                   a281_a1rY <- Text.ParserCombinators.ReadPrec.reset
                                  GHC.Read.readPrec;
                   GHC.Read.expectP (Text.Read.Lex.Punc ",");
                   GHC.Read.expectP (Text.Read.Lex.Ident "a281");
                   GHC.Read.expectP (Text.Read.Lex.Punc "=");
                   a282_a1rZ <- Text.ParserCombinators.ReadPrec.reset
                                  GHC.Read.readPrec;
                   GHC.Read.expectP (Text.Read.Lex.Punc "}");
                   GHC.Base.return
                     (M.D
                        a1_a1ns
                        a2_a1nt
                        a3_a1nu
                        a4_a1nv
                        a5_a1nw
-- ...
                        a280_a1rX
                        a281_a1rY
                        a282_a1rZ) }))
    GHC.Read.readList = GHC.Read.readListDefault
    GHC.Read.readListPrec = GHC.Read.readListPrecDefault

The generated assembly is full of stack rearrangements, which seem to be the main source of code inflation and compilation slowdown.

Some thoughts:

  • Could GHC do a CSE for the repeated code sequences of expectP (Punc ...); expectP (Ident ...); expectP (Punc ...)?
  • If all fields were strict, could GHC optimise the stack layout better?

comment:2 Changed 4 years ago by slyfox

Cc: simonmar simonpj ezyang added

comment:3 Changed 4 years ago by slyfox

I've tried manually moving out the repeated pieces

    GHC.Read.expectP (Text.Read.Lex.Punc ",");
    GHC.Read.expectP (Text.Read.Lex.Ident "a2");
    GHC.Read.expectP (Text.Read.Lex.Punc "=");
    a3_a1nu <- Text.ParserCombinators.ReadPrec.reset GHC.Read.readPrec;

to a separate binding parameterised by the Ident, and that seems to be enough to make code growth linear in the number of fields.
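As a sketch of what that shared binding might look like (the attached M_200_manual.hs is not reproduced in this ticket, so the exact name and shape below are assumptions, loosely following the comma_field_eq_read name used later in the discussion):

```haskell
import GHC.Read (expectP)
import Text.Read (readPrec)
import qualified Text.ParserCombinators.ReadPrec as RP
import qualified Text.Read.Lex as L

-- One shared binding, parameterised by the field name, replacing each
-- repeated four-line group «"," <ident> "=" <value>» in the derived code.
comma_field_eq_read :: Read a => String -> RP.ReadPrec a
comma_field_eq_read ident = do
  expectP (L.Punc ",")
  expectP (L.Ident ident)
  expectP (L.Punc "=")
  RP.reset readPrec
```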

comment:4 Changed 4 years ago by mpickering

Cc: mpickering added

comment:5 Changed 4 years ago by simonpj

Slyfox, can you give a small example of what the code looks like before and after your change? I can see that turning 4 lines into 1 (by creating a new function definition) is good, but at best that makes it 4 times smaller, which is not the difference between linear and quadratic.

I'd like to understand where the quadratic-ness is coming from!

Thanks for investigating this.

Simon

comment:6 in reply to:  5 Changed 4 years ago by slyfox

Replying to simonpj:

Slyfox, can you give a small example of what the code looks like before and after your change? I can see that turning 4 lines into 1 (by creating a new function definition) is good, but at best that makes it 4 times smaller, which is not the difference between linear and quadratic.

I've rechecked the measurement and you are correct! The quadratic behaviour is still present in the manually converted sources. It seems I was looking at the wrong binary files when tweaking things.

Apologies for the confusion :(

(for completeness) attaching the source files before and after the transformation for the 200-field case:

$ ghc -c -fforce-recomp -c -O1 M_200_auto.hs
$ ghc -c -fforce-recomp -c -O1 M_200_manual.hs 
$ size M_200_auto.o M_200_manual.o 

   text    data     bss     dec     hex filename
3268679   41520       0 3310199  328277 M_200_auto.o
 674695   16424       0  691119   a8baf M_200_manual.o

Changed 4 years ago by slyfox

Attachment: M_200_auto.hs added

by ghc derived as-is

Changed 4 years ago by slyfox

Attachment: M_200_manual.hs added

factored out repetition

comment:7 Changed 4 years ago by rwbarton

(By the way, slyfox's numbers are for 7.10. GHC 7.8.4 ran for 3 times as long and produced 5 times as much code as 7.10.1 on M_200_manual. I wonder why?)

So there is some interesting code generation problem here, where a program of size O(n) involving nested lambdas/closures incurs a quadratic code size cost, due to repeatedly copying all the free variables of one closure into the next closure. I'm not sure whether this problem is solvable in general (or, perhaps, whether it is solvable without building rather fancy data structures into the code generator). (That's not to say it's hard necessarily, I just haven't thought about it at all.)

In this case, however, I think we can avoid the quadratic blowup in code size by restructuring the input code. Basically, since the values bound by <- are never used except to build the final result, just use the ApplicativeDo desugaring to avoid writing any lambdas at all. I don't have a GHC with ApplicativeDo handy yet so I manually rewrote slyfox's

                   a1 <- reset readPrec;
                   a2 <- comma_field_eq_read "a1";
                   a3 <- comma_field_eq_read "a2";
                   ...
                   a201 <- comma_field_eq_read "a200";
                   expectP (Punc "}");
                   return
                     (D a1
                      ...)

to

                   res <- D <$> reset readPrec
                          <*&> comma_field_eq_read "a1"
                          <*&> comma_field_eq_read "a2"
                          ...
                          <*&> comma_field_eq_read "a200";
                   expectP (Punc "}");
                   return res

Here <*&> is just a NOINLINE copy of <*>. That's important, otherwise <*> will be inlined, creating a lambda and defeating the purpose of the rewrite.
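For concreteness, a minimal definition of <*&> matching that description could be (the operator name is taken from the comment above; the fixity declaration is an assumption, mirroring <*>):

```haskell
infixl 4 <*&>

-- A NOINLINE copy of <*>; the pragma stops GHC from inlining it and
-- re-creating the nested lambdas the rewrite is trying to avoid.
(<*&>) :: Applicative f => f (a -> b) -> f a -> f b
(<*&>) = (<*>)
{-# NOINLINE (<*&>) #-}
```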

Now the final code size is

   text	   data	    bss	    dec	    hex	filename
 101839	  24008	      0	 125847	  1eb97	M_200_applicative.o

and reading through the assembly there isn't anything obviously quadratic left (just a linear number of small functions, and one linearly-sized function to build the result--which should just be the constructor for D anyways, and probably shouldn't really count against the size of the Read instance). Presumably factoring out the "<*&> comma_field_eq_read" and unpackCString# pattern would reduce code size a bit more.

Unfortunately the compile time is still quadratic, because type-checking D <$> x1 <*> x2 <*> ... <*> xn results in a Core program with a quadratic number of types (D has type (Int ->)^n -> D, and then D <$> x1 has type (Int ->)^(n-1) -> f D, and D <$> x1 <*> x2 has type (Int ->)^(n-2) -> f D, etc.) It's still decently faster than building slyfox's manual version (about 2/3 the compile time).

Obviously this may affect runtime performance too, but my position is that deriving Read is not intended to be (and is not in practice either) a high-performance parser, so I don't really care if it happens to get a bit slower. As long as read doesn't become asymptotically slower, which seems unlikely considering the generated code was originally asymptotically larger in size, I think it's fine.

comment:8 Changed 4 years ago by simonpj

Reid, interesting! So you think it's all in the code generation and closure construction.

Can you give a simple test case, separate from deriving Read that demonstrates the phenomenon, with the monadic and applicative versions side by side, but with very different perf? That would help to focus our attention.

Thanks

Simon

comment:9 Changed 4 years ago by rwbarton

Type of failure: None/Unknown → Compile-time performance bug

comment:10 Changed 4 years ago by ezyang

Keywords: deriving-perf added

comment:11 Changed 3 years ago by RyanGlScott

Cc: RyanGlScott added

comment:12 Changed 3 years ago by rwbarton

Here's a simple example of the quadratic code size I was talking about in comment:7, not specific to Read or Applicative or Monad.

f a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
  = a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + a10

At the Core and STG stages the size of f is linear in the number of arguments. But the Cmm for f builds a thunk for the first argument a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 and then calls +. In order to build that thunk it has to copy a1 through a9 into the thunk, as they are its free variables. Then the code for that thunk is going to build another thunk for a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8, which requires copying a1 through a8 again, and so on. The total code size, work done and allocations are all quadratic in the number of arguments.

In this particular case it would clearly be better to do all the thunk construction up front in f, like (pseudo-STG)

f a1 a2 a3 a4 a5 a6 a7 a8 a9 a10
  = let t2 = a1 + a2
        t3 = t2 + a3
        t4 = t3 + a4
        ...
        t9 = t8 + a9
    in t9 + a10

which would be linear code size, work and allocations in the number of arguments.

comment:13 Changed 3 years ago by rwbarton

However, the quadratic behavior exhibited by the Read instance is more similar to this example

f :: (Int -> a) -> a
f k = k 0
{-# NOINLINE f #-}

v :: (Int,Int,Int,Int,Int,Int,Int,Int,Int,Int)
v = f (\a1 -> f (\a2 -> f (\a3 -> f (\a4 -> f (\a5 ->
    f (\a6 -> f (\a7 -> f (\a8 -> f (\a9 -> f (\a10 ->
    (a1,a2,a3,a4,a5,a6,a7,a8,a9,a10)))))))))))

v is supposed to correspond to the desugaring of the long do blocks in slyfox's attachments.

Here the quadratic blowup at the Cmm stage is caused by the fact that the free variables of the nth lambda are all the preceding a1, ..., an. We have to copy a1 through an into the (n+1)st lambda that we provide as an argument to the next f.

This is the blowup that I said I didn't know how to solve in the code generator in comment:7. However, (I claimed that) in this case we can avoid generating such nested lambdas in the first place by changing the desugaring to use Applicative methods rather than Monad ones.

comment:14 Changed 3 years ago by nomeata

(In reply to comment:12) Is this a specific +, or just the overloaded method from Num? I guess the latter, as a strict + would yield quite different code.

So you are proposing to add a transformation to STG (or Core) that replaces

let t1 = let t2 = e1
         in e2
in e3

with

let t2 = e1 
    t1 = e2
in e3

Can we give an easy criterion when this transformation is beneficial?

The problem is that this will increase allocation in the case when t1 is not called, as it is more space efficient to pack the union of the free variables of e1 and e2 into one closure, instead of having one for the free variables of e1 and one for those of e2.

We could say “do this transformation if the free variables of e1 and e2 are disjoint”. Then we’d allocate two extra words (one for the info table of t2, and one for the pointer to it in the closure for t1), but have some nice gains if t1 is indeed called.

But this rule would not fire in this case, because the Num a dictionary is also a free variable, which would now have to be copied into both closures!

I guess one could try and see whether real-world programs benefit from this transformation, and how many shared free variables between e1 and e2 are ok.

comment:15 in reply to:  14 Changed 3 years ago by rwbarton

Replying to nomeata:

(In reply to comment:12) Is this a specific +, or just the overloaded method from Num? I guess the latter, as a strict + would yield quite different code.

Right, I meant the latter. It didn't occur to me that this behavior would only occur when (+) is not known by GHC to be strict; I thought it would be enough for GHC to have to perform the call to (+), i.e., not inline it.

So you are proposing to add a transformation to STG (or Core) that replaces

let t1 = let t2 = e1
         in e2
in e3

with

let t2 = e1 
    t1 = e2
in e3

Can we give an easy criterion when this transformation is beneficial?

The problem is that this will increase allocation in the case when t1 is not called, as it is more space efficient to pack the union of the free variables of e1 and e2 into one closure, instead of having one for the free variables of e1 and one for those of e2.

And you're right that it's only a gain when (+) is actually going to evaluate t1, and a loss when it is not going to. In this case the gain is asymptotic and the loss is a constant factor, but if there were a lot of subexpressions and few free variables then the reverse could be true.

We could say “do this transformation if the free variables of e1 and e2 are disjoint”. Then we’d allocate two extra words (one for the info table of t2, and one for the pointer to it in the closure for t1), but have some nice gains if t1 is indeed called.

But this rule would not fire in this case, because the Num a dictionary is also a free variable, which would now have to be copied into both closures!

I guess one could try and see whether real-world programs benefit from this transformation, and how many shared free variables between e1 and e2 are ok.

Perhaps this discussion should move to a new ticket, since I remembered it is actually not the issue with the derived Read instance after all?

comment:16 Changed 3 years ago by nomeata

Perhaps this discussion should move to a new ticket, since I remembered it is actually not the issue with the derived Read instance after all?

Sure, #13213

comment:17 Changed 2 years ago by nelhage

Cc: nelhage added

comment:18 Changed 2 years ago by bgamari

See also: #14364, which tries to knock a factor off of the code size of derived Read instances.

Note: See TracTickets for help on using tickets.