Opened 3 years ago

Closed 3 years ago

Last modified 2 years ago

#13344 closed bug (fixed)

Core string literal patch regresses compiler performance considerably

Reported by: bgamari Owned by: bgamari
Priority: high Milestone: 8.2.1
Component: Compiler Version: 8.1
Keywords: strings Cc: akio, gridaphobe
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #8472 Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

The Core string literals patch (see #8472) regresses compiler performance by nearly 5%.

We should really understand why; this is a huge regression for something that really shouldn't be costly.

Change History (28)

comment:1 Changed 3 years ago by bgamari

Description: modified (diff)
Owner: set to bgamari
Priority: normal → high
Type of failure: None/Unknown → Compile-time performance bug

comment:2 Changed 3 years ago by bgamari

Version: 8.0.1 → 8.1

comment:3 Changed 3 years ago by bgamari

I asked David to look into this a few weeks back; he says that he may have seen duplicated strings in Core.

T5837 looks like it would be a good testcase to evaluate this issue with.

comment:4 Changed 3 years ago by akio

Cc: akio added

I'm sorry that my patch slowed down GHC by so much. I'll try to investigate.

comment:5 Changed 3 years ago by bgamari

I had a quick look. I am seeing significant increases in compiler allocations as early as typecheck/rename. For instance, in the case of T5837 I see,

| Phase               | Before patch (megabytes) | After patch (megabytes) |
|---------------------|--------------------------|-------------------------|
| Parser              | 1.069                    | 1.069                   |
| Renamer/typechecker | 117.168                  | 125.031                 |
| Desugar             | 4.127                    | 4.155                   |
| Simplifier          | 0.746                    | 0.811                   |
| CoreTidy            | 2.277                    | 2.516                   |
| CorePrep            | 0.287                    | 0.299                   |
| CodeGen             | 6.804                    | 6.845                   |

The only difference in the simplified core is due to the strings associated with the Typeable bindings. Before the patch we had,

$trModule4 :: GHC.Types.TrName
$trModule4 = GHC.Types.TrNameS "T5837"#

Now we have,

$trModule3 :: GHC.Prim.Addr#
$trModule3 = "T5837"#

$trModule4 :: GHC.Types.TrName
$trModule4 = GHC.Types.TrNameS $trModule3

While some small fraction of the allocation increase after CorePrep is attributable to the fact that StgSyn got slightly larger (since we now have a separate type for top-level bindings), this doesn't explain the difference during renaming/typechecking. I have verified that the output from renaming and typechecking is identical before and after the change. I wonder whether some of these newly floated string literals have snuck into interface files?

Last edited 3 years ago by bgamari

comment:6 Changed 3 years ago by bgamari

Hmm, it seems that even tc-trace and rn-trace output is unchanged by the patch.

It looks like I'll need to build some ticky'd compilers.

comment:7 Changed 3 years ago by simonpj

It looks like I'll need to build some ticky'd compilers.

Indeed; thank you. This is deeply weird.

comment:8 Changed 3 years ago by bgamari

Hmm, indeed ticky suggests that there is some leakage into interface files,

| Change    | alloc A | alloc B | name                                                                                            |
|-----------|---------|---------|-------------------------------------------------------------------------------------------------|
| +417944.0 | 3978840 | 4396784 | containers-0.5.7.1:Data.IntMap.Base.$winsert (:<no module>)                                     |
| +249216.0 | 1225560 | 1474776 | $wgo (ghc:BinIface)                                                                             |
| +166144.0 | 821968  | 988112  | go7 (ghc:LoadIface)                                                                             |
| +131256.0 | 1767520 | 1898776 | ghc:Module.$sinsertWith_$s$sgo10 (:<no module>)                                                 |
| +75680.0  | 566080  | 641760  | mk_supply (ghc:UniqSupply)                                                                      |
| +68208.0  | 454152  | 522360  | unsafeDupableInterleaveIO1 (base:GHC.IO.Unsafe)                                                 |
| +62016.0  | 1209552 | 1271568 | containers-0.5.7.1:Data.Map.Base.balanceL (:<no module>)                                        |
| +52864.0  | 1592272 | 1645136 | ghc:IfaceType.$w$cget6 (:<no module>)                                                           |
| +45312.0  | 744224  | 789536  | ghc:Binary.$w$cget3 (:<no module>)                                                              |
| +45312.0  | 226728  | 272040  | ghc:IfaceSyn.$w$cget1 (:<no module>)                                                            |
| +45312.0  | 238080  | 283392  | ghc:TcRnMonad.$wforkM_maybe (:<no module>)                                                      |
| +44776.0  | 934504  | 979280  | ghc:FastString.$wmkFastStringWith (:<no module>)                                                |
| +31360.0  | 340240  | 371600  | go (base:Data.OldList)                                                                          |
| +30208.0  | 156096  | 186304  | ghc:Module.$WDefiniteUnitId (:<no module>)                                                      |
| +30208.0  | 156096  | 186304  | ghc:Module.$w$cget2 (:<no module>)                                                              |
| +29008.0  | 1155896 | 1184904 | base:GHC.Base.++ (:<no module>)                                                                 |
| +27976.0  | 39000   | 66976   | ghc:TysWiredIn.$wisBuiltInOcc_maybe (:<no module>)                                              |
| +26432.0  | 130984  | 157416  | $wlvl12 (ghc:HscTypes)                                                                          |
| +26432.0  | 202496  | 228928  | bytestring-0.10.8.1:Data.ByteString.$wcopy (:<no module>)                                       |
| +26432.0  | 355440  | 381872  | ghc:NameCache.extendNameCache1 (:<no module>)                                                   |
| +22768.0  | 71896   | 94664   | lexToken (ghc:Lexer)                                                                            |
| +22656.0  | 137472  | 160128  | ghc:Binary.$w$cget1 (:<no module>)                                                              |
| +22656.0  | 107040  | 129696  | ghc:Binary.$wlazyGet (:<no module>)                                                             |
| +22656.0  | 333696  | 356352  | ghc:FastString.$wmkFastStringByteString (:<no module>)                                          |
| +20544.0  | 0       | 20544   | ghc:TcRnMonad.emitInsoluble3 (:<no module>)                                                     |

comment:9 Changed 3 years ago by bgamari

Ahh, I should have looked at the interface files sooner: there are indeed more bindings here. For example, previously in Prelude we had (in --show-iface output),

  $trModule :: Module
  {- HasNoCafRefs, Strictness: m,
     Unfolding: (Module $trModule2 $trModule1) -}

Now we have,

  $trModule1 :: TrName
  {- HasNoCafRefs, Strictness: m1, Unfolding: (TrNameS $trModule2) -}

  $trModule2 :: Addr#
  {- HasNoCafRefs, Unfolding: ("Prelude"#) -}

comment:10 Changed 3 years ago by bgamari

Just looking at the interface file sizes between the two trees, it appears that many files increased by a few hundred bytes with a few (e.g. GHC.Show, GHC.Generics) increasing by some kilobytes. For future reference I performed the comparison with,

echo "delta after before file"
paste \
    <(ls -l $(find ghc-compare-1/libraries/base/dist-install -iname *.hi) | awk '{print $5,$9}') \
    <(ls -l $(find ghc-compare-2/libraries/base/dist-install -iname *.hi) | awk '{print $5,$9}') \
    | awk '{ print (($1 - $3)), $1, $3, $2}' 
    | sort -rn | less

Which results in

16461 800034 783573 libraries/base/dist-install/build/GHC/Generics.hi
7673 130677 123004 libraries/base/dist-install/build/GHC/RTS/Flags.hi
7555 969695 962140 libraries/base/dist-install/build/Data/Data.hi
7449 103629 96180 libraries/base/dist-install/build/GHC/IO/Exception.hi
7068 268944 261876 libraries/base/dist-install/build/Data/Foldable.hi
6904 136200 129296 libraries/base/dist-install/build/GHC/Show.hi
4830 211720 206890 libraries/base/dist-install/build/Data/Monoid.hi
4767 458028 453261 libraries/base/dist-install/build/Data/Semigroup.hi
4735 498885 494150 libraries/base/dist-install/build/Foreign/C/Types.hi
4651 59316 54665 libraries/base/dist-install/build/GHC/IO/Handle/Types.hi
...

In the case of GHC.Generics we now have things like,

  $fEqV14 :: [Char]
  {- Unfolding: (unpackCString# $fEqV15) -}

  $fEqV15 :: Addr#
  {- HasNoCafRefs, Unfolding: ("error"#) -}

Whereas we previously had,

  $fEqV11 :: [Char]
  {- Unfolding: (unpackCString# "error"#) -}

One potential improvement we can make here (at the expense of some constant folding opportunities) is to drop the unfoldings from Addr# bindings; after all, in most cases there is little benefit to inlining the string literal.

In the case of the Typeable bindings, I'm very tempted to just eliminate unfoldings from all generated TrNames. This would likely cut down on interface file sizes dramatically.

comment:11 in reply to:  3 Changed 3 years ago by gridaphobe

Cc: gridaphobe added

Replying to bgamari:

I asked David to look into this a few weeks back; he says that he may have seen duplicated strings in Core.

T5837 looks like it would be a good testcase to evaluate this issue with.

FWIW, I noticed this weekend that Phab:D2605 did not enable CSE on the floated literals. I think the latest 1% decrease in binary sizes in my Phab:D1259 is largely due to enabling CSE on the floated strings.

comment:12 Changed 3 years ago by gridaphobe

Actually, I retract my comment; I think I grossly misunderstood what the exprIsLiteralString check in tryForCSE was doing.

comment:13 Changed 3 years ago by rwbarton

The patch in #8472 allows floating out unboxed string literals to top level in (at least) three places:

  1. In the "Floats" section in SimplEnv we float a let out of a let. If the body of the let is a lambda then we can increase the arity of the original binding, which is a good thing. This is the optimization that was allowed in the example of #8472 by letting the unboxed string literal float to the top level.
  1. In prepareRhs in Simplify we can float, for example, an argument out of a constructor. In general this brings us closer to ANF, but when the argument is a literal it is trivial already; and the data representation we generate for Ptr "blob"# is just fine.
  1. In the float out pass we can float out an expression to avoid repeated allocation or evaluation. But there is no allocation or evaluation associated to an unboxed literal.
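
To make item 1 concrete, here is a Core-level sketch of the float it describes (the binders f, g, s and the literal are invented for illustration):

  -- before: the let hides the lambda
  f = let s = "blob"# in \x -> g s x

  -- after: the literal has been floated to a fresh top-level binding,
  -- the lambda is manifest in f's RHS, and the arity of f can be increased
  s :: Addr#
  s = "blob"#

  f = \x -> g s x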

In https://github.com/ghc/ghc/commit/8f99cfa49be0fc0767f0c73967ff2472de5cfcd6 I've tried disabling the above items 2 and 3 for unboxed string literals. It still produces the desired code in the example of #8472, but it no longer floats the string literals out of, for example, $trModule bindings. I'm waiting to see what perf.haskell.org thinks about it.

comment:14 Changed 3 years ago by simonpj

There is a good reason for (2). Consider

let x = C "foo"#
in .x..x..x.....case x of { C s -> ...s...s... }....

We'll do case-of-known-constructor on the case expression, binding s to the component. Ah! The component is "foo"# so we'll duplicate "foo"# at every occurrence of s.
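
Schematically (with invented names h and lit), case-of-known-constructor turns

  let x = C "foo"#
  in ... case x of { C s -> h s s } ...

into code in which s is replaced by the constructor argument,

  ... h "foo"# "foo"# ...

so the literal is duplicated at every occurrence of s; whereas with the argument in ANF (x = C lit, with lit = "foo"# bound separately) every occurrence of s just mentions the single binding lit.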

That is why we do ANF for let-bound constructors. I think we should continue to do so. (Unless, just possibly, the let-binding has a NOINLINE pragma...)

For (3) I agree. The only reason to aggressively float literals would be for CSE.


More generally, if it is really the case that these top level bindings

x = "foo#"
y = C x

take a lot longer to compile (code generate?) than

y = C "foo"#

then I think we should seek insight into why. There is no good reason for it to take longer. Let's not bend ourselves out of shape to accommodate what may possibly be an easily-fixed perf bug in the back end.

comment:15 Changed 3 years ago by bgamari

Reid and I had a rather lengthy discussion about this problem. The take-away message is that while there are a variety of things that we could do to avoid moving string literals up to the top-level (and consequently avoid the costs associated with interface file names, as seen in comment:10), it is unfortunately rather tricky to do in a way that doesn't either come at significant implementation cost or risk regressing #12585.

Consequently, we came up with a few ideas for instead reducing the cost of the now-larger interface files. Happily, this effort may also have helpful effects beyond string literals. The gist is that we want to make interface file deserialisation more lazy. Reid is working on a patch.
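
The rough idea is something like the following (only a sketch of the approach, not the contents of Reid's patch; lazyPut and lazyGet are the existing helpers in GHC's Binary module, while the wrapper type and field layout are invented for illustration, and imports from Binary and IfaceSyn are omitted):

  data IfaceDeclWithUnfolding
    = IfaceDeclWithUnfolding IfaceDecl IfaceUnfolding

  instance Binary IfaceDeclWithUnfolding where
    put_ bh (IfaceDeclWithUnfolding decl unf) = do
      put_ bh decl
      lazyPut bh unf       -- write the unfolding through the lazy machinery
    get bh = do
      decl <- get bh
      unf  <- lazyGet bh   -- deserialisation deferred until unf is demanded
      return (IfaceDeclWithUnfolding decl unf)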

comment:16 Changed 3 years ago by simonpj

I'm all for making interface-file deserialisation cheaper.

But I don't feel that we have a well-characterised problem. What exactly is the problem we are trying to solve here?

Is it ONLY that

x :: Addr#
x = "foo"#
y :: T
y = K x

is more expensive than

y :: T
y = K "foo"#

And if it is more expensive, how much more expensive? And does that cost come from simplifying, spitting out an interface file, code generation, reading in an interface file?

I don't think there is much to do with strings here. The same will happen for

x = [True, False, True, ..., True]

Perhaps we just don't want to expose the unfolding for a top level constructor. That would be a bit less good on optimisation, but perhaps not a lot. We could measure it.

I feel as if we are at risk of taking actions without data to guide them. I'd love to have data.

Simon

comment:17 Changed 3 years ago by rwbarton

The increase in interface file size is definitely not the only factor. Another known problem is that CSE is not working properly for top-level unboxed string literals. I found that I could make it work by changing the Lit l case in use_subst below, in CSE:

addBinding :: CSEnv                      -- Includes InId->OutId cloning
           -> InId
           -> OutId -> OutExpr           -- Processed binding
           -> (CSEnv, OutId)             -- Final env, final bndr
-- Extend the CSE env with a mapping [rhs -> out-id]
-- unless we can instead just substitute [in-id -> rhs]
addBinding env in_id out_id rhs'
  | noCSE in_id = (env,                              out_id)
  | use_subst   = (extendCSSubst env in_id rhs',     out_id)
  | otherwise   = (extendCSEnv env rhs' id_expr', zapped_id)
  where
    id_expr'  = varToCoreExpr out_id
    zapped_id = zapIdUsageInfo out_id
       -- Putting the Id into the cs_map makes it possible that
       -- it'll become shared more than it is now, which would
       -- invalidate (the usage part of) its demand info.
       --    This caused Trac #10218.
       -- Easiest thing is to zap the usage info; subsequently
       -- performing late demand-analysis will restore it.  Don't zap
       -- the strictness info; it's not necessary to do so, and losing
       -- it is bad for performance if you don't do late demand
       -- analysis

    -- Should we use SUBSTITUTE or EXTEND?
    -- See Note [CSE for bindings]
    use_subst = case rhs' of
                   Var {} -> True
                   Lit l  -> litIsTrivial l
                   _      -> False

litIsTrivial is False for an unboxed string literal (and for an Integer literal), and True for other literals.

However I don't really understand what is going on here. Is making use_subst return True for unboxed string literals the right fix?
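
Spelled out, the change described above would be something like the following (a sketch only; MachStr is the constructor that Literal uses for unboxed string literals in this GHC version, and this may not be the right final fix):

    -- sketch: treat unboxed string literals as substitutable in CSE
    use_subst = case rhs' of
                   Var {}          -> True
                   Lit (MachStr _) -> True   -- substitute unboxed string literals
                   Lit l           -> litIsTrivial l
                   _               -> False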

I also don't think fixing this is going to make a large change to total build time, since gridaphobe says that Phab:D1259 also fixes CSE for top-level string literals, yet it improved overall build time by less than 1% on my test system.

comment:18 Changed 3 years ago by simonpj

OK so you are pointing out a bug in CSE. Very good! Let's make a ticket (#13367) and fix the bug. I have done that; fix validating now.

(The same fix is in gridaphobe's patch, but it's orthogonal, so best done separately.)

Back to comment:16. Let's assume that this string-CSE thing makes no difference. Then all my questions in comment:16 remain. I just want to be sure that we are fixing the right problem!

comment:19 Changed 3 years ago by rwbarton

Consequently, we came up with a few ideas for instead reducing the cost of the now-larger interface files. Happily, this effort may also have helpful effects beyond string literals. The gist is that we want to make interface file deserialisation more lazy. Reid is working on a patch.

This patch is Phab:D3269.

Back to comment:16. Let's assume that this string-CSE thing makes no difference. Then all my questions in comment:16 remain. I just want to be sure that we are fixing the right problem!

I think it will make some substantial difference, but I suspect there is still something else going on. When your CSE fix is ready (I assume it is a simple change), I will try comparing

  • the commit before the top-level string patch, with
  • the top-level string patch plus your fix to CSE

by doing a validate --build-only with detailed tracking of where time is being spent.

comment:20 Changed 3 years ago by simonpj

When your CSE fix is ready (I assume it is a simple change)

It is, but I got snarled up in something else. Monday, I hope.

comment:21 in reply to:  16 Changed 3 years ago by rwbarton

Replying to simonpj:

Is it ONLY that

x :: Addr#
x = "foo"#
y :: T
y = K x

is more expensive than

y :: T
y = K "foo"#

And if it is more expensive, how much more expensive? And does that cost come from simplifying, spitting out an interface file, code generation, reading in an interface file?

I think I've found the main source of extra cost from this transformation. When we build a module with -split-objs, the code generator emits one assembly file per strongly-connected component of Cmm declarations, and then the driver runs the assembler on each of these files. In the example above, the first program will be translated into two .o files using two assembler invocations, while the second will be translated into a single .o file with a single assembler invocation. Thus each string literal will, with this change, normally result in an extra assembler invocation (assuming it gets floated to top level, and is only referred to from one place).

The builds on https://perf.haskell.org/ghc are done using something like the default (perf) flavour build settings, so when it built this commit (https://perf.haskell.org/ghc/#revision/d49b2bb21691892ca6ac8f2403e31f2a5e53feb3) the libraries were built with -split-objs. I ran similar builds myself and found that the number of .o files (so, assembler invocations, roughly) increased from 100000 to 133000 with the top-level strings patch. The 33000 extra assembler invocations can plausibly explain the extra ~100 seconds of total build time. But rather than hand-waving estimates, there's a better way to confirm what happened.

In fact, we actually mean to build the libraries not with -split-objs, but with -split-sections, a new flag that achieves a similar effect but which requires running the assembler only once per module. However, the top-level strings patch (Jan 18) landed in the window between a commit that temporarily switched the library build back from -split-sections to -split-objs and the later commit that switched it back to -split-sections.

The first of these two commits increased build time from 1891s to 2150s, a change of +259s, and the second decreased it from 2653s to 2311s, a change of -342s. Let's assume that the increase between these commits (from 2150s to 2653s) is caused mainly by ghc getting slower (e.g., join points, core lint), not by ghc getting larger (more modules to compile). The +259s and -342s changes from these two -split-sections toggles are presumably due mainly to adding and then removing roughly 100000 and 133000 assembler invocations respectively; and 342/259 = 1.32 is in remarkably close agreement with the ratio of those counts, 133000/100000 = 1.33. That suggests that 342 - 259 = 83 of the 102 seconds by which top-level unboxed strings increased the build time should be attributed to additional invocations of the assembler.

The good news is that we do not have to do anything about those 83 seconds, as we do not use -split-objs during the GHC build any more. Effectively they were measurement error, due to the build being misconfigured when the top-level strings patch landed.

This leaves an increase of 102-83 = 19 seconds (out of a total ~2000, so about 1%) from the top-level strings patch, which can plausibly be explained by (1) larger interface files and (2) lack of CSE for top-level strings.

Last edited 3 years ago by rwbarton

comment:22 Changed 3 years ago by gridaphobe

Wow, excellent sleuthing!

So is the bottom line that the increase you believe to actually be attributable to the top-level strings patch will be fixed by the combination of the CSE-for-string-literals patch and the lazier-interface-file patch? (I seem to recall you mentioning a gain of roughly 1% from one or the other, but it's not in this ticket; perhaps it was on IRC.)

comment:23 Changed 3 years ago by rwbarton

(I seem to recall you mentioning a gain of roughly 1% from one or the other, but it's not in this ticket, perhaps IRC..)

Yes, I think I mentioned on IRC that your D1259 sped up the build by about 1%; likely because it fixes CSE, but I'm not sure.

comment:24 Changed 3 years ago by simonpj

Great stuff. My CSE patch is done but I have to re-validate having caught up with HEAD. In flight...

comment:25 Changed 3 years ago by simonpj

OK done

commit 9304df5230a7a29d3e992916d133e462b854e55f
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Mar 3 11:00:04 2017 +0000

    Fix CSE (again) on literal strings
    
    Fixes Trac #13367.  See Note [Take care with literal strings]

comment:26 Changed 3 years ago by rwbarton

Resolution: fixed
Status: new → closed

As perhaps expected, the above commit sped up the build by 20 seconds: https://perf.haskell.org/ghc/#revision/9304df5230a7a29d3e992916d133e462b854e55f (click the "=" at the top right to see the build time), which means that we've accounted for the regressions from the top-level strings patch.

comment:27 Changed 2 years ago by bgamari

Description: modified (diff)

comment:28 Changed 2 years ago by bgamari

Keywords: strings added