Opened 4 years ago

Last modified 11 months ago

#10844 patch task

CallStack should not be inlined

Reported by: nomeata Owned by: gridaphobe
Priority: normal Milestone: 8.10.1
Component: Compiler Version: 7.10.2
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D1259
Wiki Page:

Description

The use of CallStack-based error messages (since changeset:6740d70d95cb) has led to some code size increase, and I believe this needs to be improved.

While investigating #10788, I was looking at the Core produced by that code, and I found this:

-- RHS size: {terms: 2, types: 0, coercions: 0}
Main.main17 :: [Char]
[GblId,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 100 0}]
Main.main17 = unpackCString# "vecto_20t0fI93Jj9LUbaZg6e04I"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
Main.main16 :: [Char]
[GblId,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 110 0}]
Main.main16 = unpackCString# "Data.Vector.Primitive.Mutable"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
Main.main15 :: [Char]
[GblId,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 120 0}]
Main.main15 = unpackCString# "./Data/Vector/Primitive/Mutable.hs"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
Main.main14 :: Int
[GblId,
 Caf=NoCafRefs,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
Main.main14 = I# 97#

-- RHS size: {terms: 2, types: 0, coercions: 0}
Main.main13 :: Int
[GblId,
 Caf=NoCafRefs,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
Main.main13 = I# 16#

-- RHS size: {terms: 2, types: 0, coercions: 0}
Main.main12 :: Int
[GblId,
 Caf=NoCafRefs,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
Main.main12 = I# 21#

-- RHS size: {terms: 8, types: 0, coercions: 0}
Main.main11 :: SrcLoc
[GblId,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 80}]
Main.main11 =
  SrcLoc
    Main.main17
    Main.main16
    Main.main15
    Main.main14
    Main.main13
    Main.main14
    Main.main12

This is clearly a CallStack from a library, and there is no point in copying it into (every?) use site of wherever it came from, as it will only increase code size and slow down compilation.

I don’t know how to fix it, though. Float out CallStacks more aggressively? Or is the problem that unfoldings are recorded before that can happen?
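For concreteness, here is the mechanism behind these values: any function with a HasCallStack constraint (the modern spelling of the implicit-parameter CallStack) makes GHC materialise a CallStack value, including the strings seen in the Core above, at every call site. A minimal sketch using today's GHC.Stack API; the helper name is mine, not from the ticket:

```haskell
import GHC.Stack (HasCallStack, callStack, getCallStack)

-- whereAmI is a hypothetical helper: the HasCallStack constraint makes
-- GHC build a CallStack value (function name plus a SrcLoc carrying
-- package, module, file, line and column) at each call site.
whereAmI :: HasCallStack => [String]
whereAmI = map fst (getCallStack callStack)

main :: IO ()
main = print whereAmI
```

Every one of those SrcLoc strings is a candidate for the duplication described above when the enclosing definition gets inlined.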

Change History (34)

comment:1 Changed 4 years ago by nomeata

Owner: set to gridaphobe

comment:2 Changed 4 years ago by gridaphobe

Just copying some notes from our discussion on phabricator for easy reference:

gridaphobe:

  • The increase in binary sizes (at least for the atom benchmark) is coming from a handful of modules in ghc-prim and base
  • GHC/Types.o has increased by about 5K. It's now the home of the SrcLoc and CallStack types, which together define 7-8 record selectors, so I think this may not be too surprising.
  • Control/Applicative.o, Data/Either.o, Data/Foldable.o, and a few others in base have increased by 2-5K each. Every one of these modules contains at least one CallStack in the Core.
  • The other modules I've looked at that have not increased in size do not have a CallStack, so it seems pretty clear that the increase *is* due to the CallStack.
  • Finally, I think you're correct that the CallStacks are being inlined sometimes, because I found a CallStack referencing GHC.Base in Data.Foldable.

nomeata:

I’m not quite sure what the best way is to prevent the inlining. Maybe the type checker, which produces the CallStack values, can be told to add the (Core equivalent of a) NOINLINE pragma to them?

---

I agree that CallStacks should not be inlined. Is there a Core equivalent of a NOINLINE pragma? If so, it ought to be straightforward to have the desugarer attach it.

Also, is there any extra cost to lifting all of the int and string components of a SrcLoc to top-level binders (which seems similarly pointless)? Or does it just clutter the Core output?
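What this comment asks for can be sketched at the source level: a single top-level CallStack constant marked NOINLINE, which use sites reference instead of copying. This uses today's GHC.Stack API and invented names; it illustrates the desired Core shape, not the actual fix:

```haskell
import GHC.Stack (CallStack, SrcLoc (..), fromCallSiteList, getCallStack)

-- A hypothetical shared CallStack constant. The NOINLINE pragma is the
-- source-level analogue of what the desugarer would attach, so that
-- use sites refer to this binder instead of duplicating the strings.
sharedStack :: CallStack
sharedStack = fromCallSiteList
  [ ( "basicUnsafeNew"
    , SrcLoc "vector" "Data.Vector.Primitive.Mutable"
             "Data/Vector/Primitive/Mutable.hs" 97 16 97 21 ) ]
{-# NOINLINE sharedStack #-}

main :: IO ()
main = print (map fst (getCallStack sharedStack))
```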

comment:3 Changed 4 years ago by simonpj

Can anyone give a concrete small example in which too much inlining is happening?

(I agree there is absolutely no point in inlining literal strings.)

comment:4 Changed 4 years ago by nomeata

The source causing the above inlining of a CallStack looks like this:

instance Prim a => G.MVector MVector a where
  {-# INLINE basicUnsafeNew #-}
  basicUnsafeNew n
    | n < 0 = error $ "Primitive.basicUnsafeNew: negative length: " ++ show n
    | n > mx = error $ "Primitive.basicUnsafeNew: length to large: " ++ show n
    | otherwise = MVector 0 n `liftM` newByteArray (n * size)
    where
      size = sizeOf (undefined :: a)
      mx = maxBound `div` size :: Int

It is not surprising to me that an error in an INLINE function causes the CallStack to be inlined (although still rather pointless).

You can reproduce this, if you have vector installed, by compiling the example program given in #10788.

I tried to reproduce this with two smaller modules, i.e.

==> T10844a.hs <==
module T10844a where

foo :: Int -> Int
foo 0 = error "foo"
foo n = n
{-# INLINE foo #-}


==> T10844.hs <==
module T10844 where

import T10844a

n :: Int
n = 0
{-# NOINLINE n #-}

main = print (foo n)

but it did *not* show this behavior. But when I change the first module to

module T10844a where

class Foo a where foo :: a -> a

instance Foo Int where
    foo 0 = error "foo"
    foo n = n
    {-# INLINE foo #-}

then T10844 will contain a CallStack referencing a source location in T10844a.

comment:5 Changed 4 years ago by gridaphobe

Sorry to be slow here.

I've been investigating this more today, trying to get GHC to not inline CallStacks. I can tell GHC to not inline Ids with an IP CallStack type by adding a special case to CoreUnfold.callSiteInline and SimplUtils.{pre,post}InlineUnconditionally, but this doesn't seem to be good enough.

In Joachim's minimal example (thanks by the way!) the CallStack still ends up in T10844. I've posted the output of -dverbose-core2core with what I believe to be the relevant portion highlighted.

https://gist.github.com/gridaphobe/fd9f313d7d91be405f01#file-gistfile1-txt-L773-L981

You can see that the pieces of the CallStack have just been floated out to the top level, but $cfoo's unfolding still contains the whole CallStack instead of just a reference to the newly-created binder.

So it seems that we want to tell the simplifier to update the unfolding when a CallStack is floated out. Is that a reasonable thing to do? (Sorry, my knowledge of this part of GHC is very limited.)

As an alternative, perhaps we could revise how CallStacks are desugared. If we could tell the Desugarer to translate CallStack evidence into exported binders instead of local binders, we could mark those binders as NOINLINE. This approach seems like it would be less brittle.

(My WIP branch is at https://github.com/gridaphobe/ghc.git, branch noinline-callstacks)

comment:6 Changed 4 years ago by gridaphobe

Differential Rev(s): D1259
Status: new → patch

comment:7 Changed 4 years ago by gridaphobe

Differential Rev(s): D1259 → Phab:D1259

comment:8 Changed 4 years ago by simonpj

Eric, I'm very suspicious of this. Making the desugarer more complicated to fix a single instance of what is presumably a generic shortcoming in the float-out machinery seems entirely wrong to me. So you fix CallStacks, but what about other instances of the same problem?

I would far rather characterise the problem more carefully and fix it at source. It's possible that it is Truly Hard to do that, in which case a more ad-hoc fix would be justifiable; but then it would need a careful Note to explain why it could not be done generically.

Could you instead characterise the problem? You don't need to use call-stacks; just write source code that generates the same pattern and let's look at it.

Thanks!

Simon

comment:9 Changed 4 years ago by gridaphobe

Ok, here's a version of Joachim's test case that doesn't involve CallStacks.

==> T10844a.hs <==
module T10844a where

data Foo = Foo Int String

showFoo (Foo i s) = unwords [ "Foo", show i, show s ]

foo n = showFoo $ Foo 0 "foo"
{-# INLINE foo #-}

==> T10844.hs <==
module T10844 where

import T10844a

main = print (foo 0)

With optimizations it produces the following core

% ghc --make -fforce-recomp -ddump-simpl T10844.hs -O
[1 of 2] Compiling T10844a          ( T10844a.hs, T10844a.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 61, types: 45, coercions: 0}

T10844a.showFoo3 :: [Char]
[GblId,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 40 0}]
T10844a.showFoo3 = GHC.CString.unpackCString# "Foo"#

T10844a.showFoo2 :: Char
[GblId,
 Caf=NoCafRefs,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
T10844a.showFoo2 = GHC.Types.C# ' '

T10844a.showFoo1 :: [Char]
[GblId,
 Caf=NoCafRefs,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 30}]
T10844a.showFoo1 =
  GHC.Types.: @ Char GHC.Show.shows6 (GHC.Types.[] @ Char)

T10844a.$wshowFoo [InlPrag=[0]] :: Int -> String -> String
[GblId,
 Arity=2,
 Str=DmdType <L,1*U(U)><L,1*U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [20 0] 220 0}]
T10844a.$wshowFoo =
  \ (ww_sRs :: Int) (ww1_sRt :: String) ->
    ++
      @ Char
      T10844a.showFoo3
      (GHC.Types.:
         @ Char
         T10844a.showFoo2
         (case ww_sRs of _ [Occ=Dead] { GHC.Types.I# ww3_aQt ->
          case GHC.Show.$wshowSignedInt 0 ww3_aQt (GHC.Types.[] @ Char)
          of _ [Occ=Dead] { (# ww5_aQx, ww6_aQy #) ->
          ++
            @ Char
            (GHC.Types.: @ Char ww5_aQx ww6_aQy)
            (GHC.Types.:
               @ Char
               T10844a.showFoo2
               (++
                  @ Char
                  (GHC.Types.:
                     @ Char
                     GHC.Show.shows6
                     (GHC.Show.showLitString ww1_sRt T10844a.showFoo1))
                  (GHC.Types.[] @ Char)))
          }
          }))

showFoo [InlPrag=INLINE[0]] :: Foo -> String
[GblId,
 Arity=1,
 Str=DmdType <S,1*U(1*U(U),1*U)>,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=1,unsat_ok=True,boring_ok=False)
         Tmpl= \ (w_sRp [Occ=Once!] :: Foo) ->
                 case w_sRp
                 of _ [Occ=Dead] { Foo ww1_sRs [Occ=Once] ww2_sRt [Occ=Once] ->
                 T10844a.$wshowFoo ww1_sRs ww2_sRt
                 }}]
showFoo =
  \ (w_sRp :: Foo) ->
    case w_sRp of _ [Occ=Dead] { Foo ww1_sRs ww2_sRt ->
    T10844a.$wshowFoo ww1_sRs ww2_sRt
    }

lvl_rEJ :: Int
[GblId, Caf=NoCafRefs, Str=DmdType]
lvl_rEJ = GHC.Types.I# 0

lvl1_rRU :: [Char]
[GblId, Str=DmdType]
lvl1_rRU = GHC.CString.unpackCString# "foo"#

lvl2_rRV :: String
[GblId, Str=DmdType]
lvl2_rRV = T10844a.$wshowFoo lvl_rEJ lvl1_rRU

foo [InlPrag=INLINE (sat-args=1)] :: forall t_aOn. t_aOn -> String
[GblId,
 Arity=1,
 Str=DmdType <L,A>,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=1,unsat_ok=False,boring_ok=False)
         Tmpl= \ (@ t_aOp) _ [Occ=Dead] ->
                 $ @ Foo
                   @ String
                   showFoo
                   (T10844a.Foo
                      (GHC.Types.I# 0)
                      (GHC.Base.build
                         @ Char
                         (\ (@ b_aPw) -> GHC.CString.unpackFoldrCString# @ b_aPw "foo"#)))}]
foo = \ (@ t_aOp) _ [Occ=Dead] -> lvl2_rRV



[2 of 2] Compiling T10844           ( T10844.hs, T10844.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 31, types: 22, coercions: 3}

T10844.main7 :: Int
[GblId,
 Caf=NoCafRefs,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 20}]
T10844.main7 = GHC.Types.I# 0

T10844.main6 :: [Char]
[GblId,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 40 0}]
T10844.main6 = GHC.CString.unpackCString# "foo"#

T10844.main5 :: String
[GblId,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 30 0}]
T10844.main5 = T10844a.$wshowFoo T10844.main7 T10844.main6

...

The problem is that the unfolding for foo contains the string literal "foo" instead of the floated lvl1_rRU, so we end up inlining the string literal in T10844, which seems rather pointless.

In fact, I'd argue that since Foo 0 "foo" doesn't contain any free variables, the whole thing should be floated to the top level. GHC actually goes further and floats the entire RHS (to lvl2_rRV), but it doesn't update foo's unfolding.

The thing that I'm unsure about is that this example relies on the INLINE pragma. As I understand it, INLINE tells GHC to unconditionally inline the entire definition, which is precisely what it's doing here. (INLINEABLE has a similar effect in this example.) The more general fix, to make the unfolding reflect the result of the FloatOut pass, would subtly change the semantics of INLINE. It would no longer mean "inline the entire RHS" but rather "inline the pieces of the RHS that we think are sensible to inline". Is that a change we want to make? It would certainly decrease binary sizes, but it seems like that could also cause a performance loss due to increased cache misses.

comment:10 Changed 4 years ago by nomeata

The more general fix, to make the unfolding reflect the result of the FloatOut pass, would subtly change the semantics of INLINE

AFAIK, we promise that when something is marked INLINE, the unfolding will closely match the definition and is _not_ already optimized. Floating stuff out here might, for example, prevent rules from firing at the inline site. Here is the quote from the user’s guide:

So GHC guarantees to inline precisely the code that you wrote, no more and no less. It does this by capturing a copy of the definition of the function to use for inlining (we call this the "inline-RHS"), which it leaves untouched, while optimising the ordinary RHS as usual. For externally-visible functions the inline-RHS (not the optimised RHS) is recorded in the interface file.

comment:11 Changed 4 years ago by gridaphobe

Thanks, that makes sense.

I see two ways to think about this:

  1. The real problem is the interaction between INLINE and the invisible CallStacks. Normally the programmer can see the entire definition that would be inlined because he just wrote it. So GHC can expect the programmer to make an informed decision about whether a given definition should be unconditionally inlined.

CallStacks change this because they're invisible to the programmer, GHC inserts them automatically. Worse, they're a sizeable datatype containing three Strings per call-site. For example, a simple call to error goes from the source-level

error "bad"

to something like

error [("error", SrcLoc 1 1 1 8 "file.hs" "Main")] "bad"

We can't reasonably expect programmers to account for such invisible expressions when deciding whether to add an INLINE pragma.

So perhaps a reasonable principle would be that GHC should not inline terms that the programmer did not write. At the moment I think this would only apply to CallStacks and dictionaries, but dictionaries are already top-level values so they shouldn't be dragged along the way CallStacks are.

  2. Perhaps inlining the code to build the CallStack is not so bad (I don't really know how to judge that), but the real problem is inlining the three string literals. It does seem quite silly to inline string literals; why should we need more than one copy? GHC already floats string literals to top-level binders, but it doesn't know that it can reuse literals from imported modules. In this case the principle would be that there should only be one copy of any given string literal (per package?).

So perhaps a Module could export a list of all string literals it allocated. Then a final simplification pass could replace literals with references to the literals that were already allocated in imported modules. This would leave the unfoldings unchanged and should not affect any opportunities for optimization, while still allowing us to clean up after the inliner.

These both seem like reasonable principles to me, but only the second would help with my callstack-free example in comment:9.

comment:12 Changed 4 years ago by simonpj

I think that (2) in comment:11 seems right to me.

One path would be to adjust FloatOut so that

  • it picks up string literals; but remember that these "string literals" actually look like (unpackCString# "foo"#).
  • (more invasively) it looks inside INLINE unfoldings, but in string-literal-only mode.

That seems quite complicated.
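The shape mentioned in the first bullet can be written out in source Haskell, which may make the discussion easier to follow: a String literal in Core is an unboxed Addr# literal handed to unpackCString# from ghc-prim. This module is only an illustration of that shape, not part of any patch:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.CString (unpackCString#)

-- What GHC's desugarer produces for a String literal: an unboxed
-- Addr# literal ("foo"#) wrapped in a call to unpackCString#.
fooLit :: String
fooLit = unpackCString# "foo"#

main :: IO ()
main = putStrLn fooLit
```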

Another path is along the lines of your earlier patch: make the desugarer bind all string literals at top level in the first place. That way they'll be shared from birth. On the whole that sounds like an easier path to me. Particularly as the desugarer already uses a monadic function mkStringExprFS to make a new string literal, so you can make a desugarer-specific version of it which tosses a global binding into the monad. We'd need an extra bunch of monad-carried top-level bindings in the desugarer, but that is probably useful anyway. (It doesn't need to be string-specific.)

Of course this would mean, as Joachim points out (comment:10), that we wouldn't INLINE exactly what the programmer wrote: we'd share the string literal instead of copying it. But I think we should be able to guarantee that the effect is the same, provided we add a CONLIKE pragma to unpackCString#. (When doing this step, do some perf tests to check what happens.)

This is all closely related to #8472, although it concerned unboxed string literals. There's a concrete proposal there, which would be good to execute on.

In short, now that we are focusing on string literals rather than call stacks, I agree that doing some extra work in the desugarer is probably the best path.

BTW #5218 is in the same general area!

comment:13 Changed 4 years ago by gridaphobe

I'm not very impressed with the performance metrics on Phab:D1259 (nofib diff), and I have a small example that I think may illustrate the problem.

foo x = "foo" ++ x

boo = "foo"

Compare what ghc-master does

% ghc -O -fforce-recomp -ddump-simpl -dsuppress-all Foo.hs
[1 of 1] Compiling Foo              ( Foo.hs, Foo.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 8, types: 8, coercions: 0}

boo
boo = unpackCString# "foo"#

foo
foo = \ x_amS -> unpackAppendCString# "foo"# x_amS

to what my branch does

% ./inplace/bin/ghc-stage2 -O -fforce-recomp -ddump-simpl -dsuppress-all Foo.hs
[1 of 1] Compiling Foo              ( Foo.hs, Foo.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 8, types: 9, coercions: 0}

-- RHS size: {terms: 2, types: 0, coercions: 0}
boo
boo = unpackCString# "foo"#

-- RHS size: {terms: 4, types: 3, coercions: 0}
foo
foo = \ x_anI -> ++ boo x_anI

Lifting the boxed string literal to a top-level binder prevents GHC from rewriting the ++ to unpackAppendCString#. Why? Because boo is a String, not an Addr#, so applying the rewrite would require inlining the "foo"# string literal, which GHC does not want to do.

What we really want (I think) is to make a top-level binder for "foo"# :: Addr#, so we'd end up with something like

foo# = "foo"#

boo = unpackCString# foo#

foo = \ x_amS -> unpackAppendCString# foo# x_amS

Concretely, I propose that the desugarer-specific mkStringExpr function create two top-level binders, one for the Addr# and one for the String. That should let us share the unboxed string and the boxed string.

Does this seem reasonable? Or is my sample program simply a non-issue?

(Sorry to keep jumping back and forth between phab and trac, but this seemed to fit better on the ticket than on the diff.)
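The proposal above can be sketched in source Haskell (a top-level Addr# binder is not legal in source code, so the literal is simply repeated here; names are illustrative, not from the patch):

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.CString (unpackAppendCString#, unpackCString#)

-- The desired sharing: one "foo"# Addr# literal feeding both the boxed
-- string and the append. unpackAppendCString# "foo"# ys is the form the
-- built-in rule rewrites (unpackCString# "foo"# ++ ys) into.
boo :: String
boo = unpackCString# "foo"#

fooAppend :: String -> String
fooAppend ys = unpackAppendCString# "foo"# ys

main :: IO ()
main = putStrLn (fooAppend "!")
```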

comment:14 Changed 4 years ago by simonpj

See #7307 and #8472.

But if unpackCString# is marked CONLIKE, the rewrites for ++ should still work. Don't they? That's the whole point of CONLIKE.

comment:15 Changed 4 years ago by gridaphobe

Unfortunately the rules don't seem to fire in the right order. The key step appears to be when we get to the following state.

boo = build (\ @ b_aog -> unpackFoldrCString# "foo"#)

lvl_spB = \ @ b_aot c_aou n_aov -> foldr c_aou n_aov boo

foo = \ x_anI -> augment lvl_spB x_anI

At this point, I would want fold/build to fire on lvl_spB, giving us

lvl_spB = \ @ b_aot c_aou n_aov -> unpackFoldrCString# "foo"# c_aou n_aov

followed by unpack-append on foo, which would finally give us

foo = \ x_anI -> unpackAppendCString# "foo"# x_anI

But instead foldr/app fires, collapsing foo back into a use of ++.

If I add a special rule for ++ on strings

{-# RULES

"++-string"  forall xs ys. unpackCString# xs ++ ys = unpackAppendCString# xs ys

#-}

I can coax GHC into generating the same Core as on master, but this is not very satisfying. It seems like we already have all of the necessary rules available.

comment:16 Changed 4 years ago by simonpj

Why doeesn't foldr/build fire? Becuase build isn't CONLIKE.

You want cheapBuild; see #7206. It's parked there, because it wasn't always a win; but you could use it for this narrower case.

comment:17 Changed 4 years ago by gridaphobe

Thanks, that makes sense.

After adding the cheapBuild patch (only for string literals) I'm finally seeing a nice win on binary sizes. Unfortunately it comes at the cost of runtime performance, which seems like the wrong tradeoff; I'd rather have larger binaries if they run faster.

Here's a selection of the more interesting summary results from nofib. I haven't had a chance to investigate where the performance loss is coming from.

NoFib Results

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           atom          -1.0%      0.0%     +0.9%     +1.6%      0.0%
   binary-trees          -0.9%     -0.0%     +9.4%     +9.2%      0.0%
           bspt          -1.0%     -0.3%     0.004     0.013    -33.3%
      cacheprof          -1.9%     +2.0%     -1.3%     -1.8%     +0.9%
        circsim          -1.0%     +0.0%     +4.9%     +5.1%      0.0%
    constraints          -1.0%      0.0%     +0.9%     +1.1%      0.0%
   cryptarithm1          -1.0%      0.0%     +6.5%     +7.7%      0.0%
            cse          -1.0%     +1.4%     0.000     0.008      0.0%
 fannkuch-redux          -1.0%      0.0%     +3.0%     +3.1%      0.0%
       fibheaps         +12.2%     -0.0%     0.017     0.025      0.0%
           fish          -1.0%     +3.1%     0.006     0.014      0.0%
         hidden          -1.0%     +0.0%     +1.5%     +1.7%      0.0%
        integer          -1.0%      0.0%     +1.7%     +2.0%      0.0%
   k-nucleotide          -0.9%     -0.0%     +2.7%     +2.9%      0.0%
           lcss          -1.0%      0.0%     -0.9%     -2.1%      0.0%
          power          -0.9%     +0.0%     -4.1%     -3.7%      0.0%
            scs          -0.7%     +0.1%     +1.3%     +0.8%      0.0%
--------------------------------------------------------------------------------
            Min          -1.9%     -2.0%     -4.1%     -3.7%    -33.3%
            Max         +12.2%     +3.5%     +9.4%     +9.2%     +0.9%
 Geometric Mean          -0.9%     +0.1%     +1.4%     +1.4%     -0.4%

comment:18 Changed 4 years ago by simonpj

  • Investigate carefully before trusting the nofib runtimes. They can wobble around a lot; you really need criterion, and nofib doesn't use that. E.g. it's very unusual to see a genuine 9% increase in runtime with zero change in allocation on a program that is not spending a lot of time in non-allocating numeric loops.
  • Why is allocation going up? That's unexpected.
  • The fibheaps binary size is also a surprise. 12% is a lot.

comment:19 Changed 4 years ago by gridaphobe

Ok, will do.

The fibheaps increase is also interesting, yes. There's a bit more to it, actually. I had to add -package array to the fibheaps makefile because I was getting a linker error about undefined symbols from Data.Array.Base. This is not too surprising since the point of this whole thing is to prevent strings and callstacks from being inlined across modules. In this particular case, the undefined symbol was a callstack used in an error call in Data.Array.Base. But it does force us to link against array, which we weren't doing before.

comment:20 Changed 3 years ago by gridaphobe

The work in #8472 to float primitive string literals to the top did fix the issue I described in https://phabricator.haskell.org/D1259#72921, but it turns out there's another issue leading to increased allocations elsewhere in nofib. I've minimized the parstof benchmark to

module Foo where

c_the_program=(++) "main ip =\n" ((++) "  i2str (optim (myMain deciem))\n" ((++) ";\n" ((++) "\n" ((++) "TYPE tJobdef    = [ JOBDEF, int, int, int, tJobdef, tJobdef ] ;\n" ((++) "TYPE tJobstat   = [ JOBSTAT, int, int, int, int, tJobdef ] ;\n"
	((++) "TYPE tTree      = [ LEAF, int |\n" ((++) "                    TREE, tTree, tTree ] ;\n" ((++) "TYPE tProc      = [ PROC, int, tJobstat ] ;\n" ((++) "\n" ((++) "\n" ((++) "\n" ((++) "emptyjobdef     = [JOBDEF, 0     , 0 , 0, emptyjobdef, emptyjobdef] ;\n"  ""))))))))))))

which is just a chain of appends (though the number of (++) seems to matter!). GHC HEAD optimizes this into a single string literal, whereas my patch gives

-- RHS size: {terms: 1, types: 0, coercions: 0}
Foo.c_the_program11 :: GHC.Prim.Addr#
[GblId,
 Caf=NoCafRefs,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 120 0}]
Foo.c_the_program11 =
  "main ip =\n\
  \  i2str (optim (myMain deciem))\n\
  \;\n"#

-- RHS size: {terms: 1, types: 0, coercions: 0}
Foo.c_the_program10 :: GHC.Prim.Addr#
[GblId,
 Caf=NoCafRefs,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 620 0}]
Foo.c_the_program10 =
  "TYPE tJobdef    = [ JOBDEF, int, int, int, tJobdef, tJobdef ] ;\n\
  \TYPE tJobstat   = [ JOBSTAT, int, int, int, int, tJobdef ] ;\n\
  \TYPE tTree      = [ LEAF, int |\n\
  \                    TREE, tTree, tTree ] ;\n\
  \TYPE tProc      = [ PROC, int, tJobstat ] ;\n"#

-- RHS size: {terms: 1, types: 0, coercions: 0}
Foo.c_the_program7 :: GHC.Prim.Addr#
[GblId,
 Caf=NoCafRefs,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 190 0}]
Foo.c_the_program7 =
  "emptyjobdef     = [JOBDEF, 0     , 0 , 0, emptyjobdef, emptyjobdef] ;\n"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
Foo.c_the_program6 :: [Char]
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=True,
         WorkFree=False, Expandable=True, Guidance=IF_ARGS [] 20 0}]
Foo.c_the_program6 = GHC.CString.unpackCString# Foo.c_the_program7

-- RHS size: {terms: 3, types: 1, coercions: 0}
Foo.c_the_program5 :: [Char]
[GblId,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False, Guidance=IF_ARGS [] 30 0}]
Foo.c_the_program5 =
  ++ @ Char Foo.c_the_program8 Foo.c_the_program6
...

It's able to eliminate some of the (++) calls, but not all. I'm not yet sure why this is happening, but I imagine it involves a (++) term being floated out before we eliminate it.

comment:21 Changed 3 years ago by simonpj

Well, the append-string rule is in PrelRules.lhs:

match_append_lit :: [Expr CoreBndr] -> Maybe (Expr CoreBndr)
match_append_lit [Type ty1,
                    Lit (MachStr s1),
                    c1,
                    Var unpk `App` Type ty2
                             `App` Lit (MachStr s2)
                             `App` c2
                             `App` n
                   ]
 = ...

Notice that it matches only on actual literal strings. If they are floated to top level they'll be replaced by a Var, probably an Id whose unfolding is the literal string, but still it won't match the above.

Instead you probably need to use exprIsLiteral_maybe as in

match_Word64ToInteger _ id_unf id [xl]
  | Just (MachWord64 x) <- exprIsLiteral_maybe id_unf xl
  = case splitFunTy_maybe (idType id) of

later in the same module.

Try that?

Indeed, every match on a Lit pattern in these rules should go via exprIsLiteral_maybe. See if this works, and then maybe at least open a ticket for the others; preferably do them. Could be a big win!

Simon

comment:22 Changed 3 years ago by gridaphobe

Ah yes, I was wondering about that built-in match. Thanks for the pointer to exprIsLiteral_maybe, I'll give it a try!

comment:23 Changed 3 years ago by gridaphobe

Actually, that's not the problem (though it may well be a profitable change regardless). The problem can be illustrated even more simply.

module Foo where

str = "foo" ++ "\n" ++ "\n"

We would like GHC to optimize this into

str = "foo\n\n"

but the pre-emptive lifting/cse that I've done during desugaring gives us

foo = "foo"
n = "\n"
str = foo ++ n ++ n

At this point the simplifier is in a bind because in order to fully inline everything, it would have to duplicate the "\n" string, and GHC presumably knows this is generally a bad idea.

Removing the pre-emptive CSE from the desugarer resolves this issue :)

comment:24 Changed 3 years ago by simonpj

At this point the simplifier is in a bind because in order to fully inline everything, it would have to duplicate the "\n" string, and GHC presumably knows this is generally a bad idea.

Yes, this is a key choice. Look at CoreSubst.exprIsLiteral_maybe. It calls expandUnfolding_maybe. Look at CoreSyn.expandUnfolding_maybe. It expands the unfolding if uf_expandable is True. That flag says if it's ok to expand an unfolding to make a RULE match.

But in comment:20 I see "Expandable = True" for those top-level string literals. So they should be visible to the rule matcher.

Some link in this chain isn't working. But I hope you can see the links now, so you can see what's not working?

Simon

comment:25 Changed 3 years ago by gridaphobe

Sorry for the confusion Simon, I had already fixed the problem by removing the pre-emptive CSE that I had added to the desugarer. This allows the simplifier to fully optimize the string operations and then common up any remaining literals.

comment:26 Changed 3 years ago by simonpj

OK, but it should work just as well if you do do pre-emptive CSE. If not, it exposes a fragility. Maybe we can find out why and fix it?

comment:27 Changed 3 years ago by gridaphobe

Ah, I see your point now. In comment:24 you weren't necessarily suggesting that GHC should duplicate strings, just that in the example I posted, there's nothing to stop GHC from duplicating these strings because they're marked expandable. That does seem to be worth investigating then.

comment:28 Changed 3 years ago by simonpj

The "expandable" thing means that it's ok to duplicate the let-bound thing if it makes a rule match. In this case it's the string-append rule -- and that rule is probably worth firing even if it does mean a bit of string duplication. (Of course it might also mean that the top-level binding ends up un-referenced in the end, so there's no duplication. But the worst-case is a bit of string duplication which is probably ok.

I don't understand why that wasn't happening. And if it's not happening, maybe other similar Good Things are not happening, all unseen. So fixing this might fix other unreported bugs. Thanks for looking into it.

comment:29 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In a6e13d50/ghc:

Make exprIsConApp_maybe work better for literals strings

There are two things here

* Use exprIsLiteral_maybe to "look through" a variable bound
  to a literal string.

* Add CONLIKE to the NOINLINE pragma for unpackCString# and
  unpackCStringUtf8#

See Trac #13317, Trac #10844, and
Note [exprIsConApp_maybe on literal strings] in CoreSubst

I did a nofib run and got essentially zero change except for one
2.2% improvement in allocation for 'pretty'.

comment:30 Changed 3 years ago by simonpj

Phab:D1259 does floating in the desugarer. We should document carefully here the reason that the FloatOut pass doesn't nail this; I keep forgetting. It's because functions with an INLINE pragma are supposed to inline precisely the RHS, so FloatOut doesn't float anything out of a stable unfolding.

So we are instead doing it in the desugarer, the same place that the stable unfoldings are built in the first place.

But we should pursue your idea from comment:11: GHC should not inline terms that the programmer did not write. In particular, perhaps when you write an incomplete pattern match, the catch-all error message could be plonked (by the desugarer) at top level?

comment:31 Changed 2 years ago by bgamari

Type of failure: None/Unknown → Compile-time performance bug

comment:32 Changed 21 months ago by bgamari

Milestone: 8.6.1

comment:33 Changed 16 months ago by bgamari

Milestone: 8.6.1 → 8.8.1

These won't be fixed for 8.6, bumping to 8.8.

comment:34 Changed 11 months ago by bgamari

Milestone: 8.8.1 → 8.10.1

This will not be happening for 8.10.

Note: See TracTickets for help on using tickets.