Opened 3 years ago

Closed 2 years ago

Last modified 23 months ago

#12150 closed bug (fixed)

Compile time performance degradation on code that uses undefined/error with CallStacks

Reported by: thomie Owned by:
Priority: high Milestone: 8.2.2
Component: Compiler Version: 8.0.1
Keywords: Cc: gridaphobe, dfeuer, saurabhnanda
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case: perf/compiler/T12150
Blocked By: Blocking:
Related Tickets: #10844 Differential Rev(s): Phab:D3753
Wiki Page:

Description (last modified by thomie)

GHC 8 has a lot of trouble compiling the following program:

module Serialize where

data Result a = Success a | Error String

{- 80 guards

   ghc-7.10.3 -O :  0.3s
   ghc-8.0.1 -O  :  1.8s 
-}

instance Functor Result where
    {-# INLINE fmap #-}
    fmap | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f

         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f

         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f

         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f

         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f

         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f

         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f

         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f
         | bool = f

      where
        bool = undefined
        f = undefined

Here are some timing results, depending on the number of | bool = f clauses:

* ghc-8.0.1

N clauses : time (s)
10        : 0.61
20        : 0.78
40        : 1.03
80        : 1.64
160       : 2.83
320       : 5.16
640       : 10.37
1280      : 21.16

* ghc-7.10.3

N clauses : time (s)
10        : 0.33
20        : 0.29
40        : 0.34
80        : 0.30
160       : 0.32
320       : 0.35
640       : 0.48
1280      : 0.80

I think this compile time difference is caused by the CallStack changes introduced in GHC 8.0. When I use a version of undefined that doesn't have a CallStack, there is no difference in compile time when using GHC 7.10 or GHC 8.0.

This is my implementation of undefined without CallStack:

import GHC.Exception (errorCallException)
import GHC.Prim (raise#)
import Prelude (Char)

error :: [Char] -> a
error s = raise# (errorCallException s)

undefined :: a
undefined = error "undefined without callstack"

This is the quick and dirty Python script I used to generate those timing results (ghc version is hardcoded):

import os
import tempfile
import time
import subprocess

def src(n):
    return '''
module Test where

data Result a = Success a | Error String

instance Functor Result where
    {{-# INLINE fmap #-}}
    fmap {0}
      where
        bool = undefined
        f = undefined

'''.format('\n        | bool = f' * n)


tempfile = tempfile.mktemp('.hs')
print('tempfile = {0}'.format(tempfile))

print('N clauses : time (s)')
for i in range(8):
    n = 10 * 2 ** i
    with open(tempfile, 'w') as f:
        f.write(src(n))
        f.flush()

        t0 = time.time()
        subprocess.call(['ghc-8.0.1', '-fforce-recomp', '-v0', '-O', tempfile])
        t1 = time.time()

        print(str(n).ljust(10) + ': %.2f' % (t1 - t0))

os.remove(tempfile)

Change History (25)

comment:1 Changed 3 years ago by gridaphobe

Sorry to be so slow! I've finally had a chance to look at this..

First of all, I can't reproduce the actual slowdown on my laptop

> python test.py
tempfile = /var/folders/d0/j2pt98tx3pvcpdjghyx38c9c0000gn/T/tmpvwxkz4.hs
N clauses : time (s)
10        : 0.27
20        : 0.31
40        : 0.13
80        : 0.13
160       : 0.12
320       : 0.12
640       : 0.12
1280      : 0.13

though the shrinking numbers makes me concerned that there may be some strange caching going on.. Ah, indeed, if I add -fforce-recomp I get something closer to your result

> python test.py
tempfile = /var/folders/d0/j2pt98tx3pvcpdjghyx38c9c0000gn/T/tmpXChceI.hs
N clauses : time (s)
10        : 0.27
20        : 0.31
40        : 0.37
80        : 0.55
160       : 0.88
320       : 1.55
640       : 3.06
1280      : 6.10

Second, I took a look at the generated Core, and while the desugared Core looks correct to me, the simplified Core has had the CallStacks inlined at each callsite, which is utterly pointless.

In an earlier ticket (#10844) I investigated preventing CallStacks (and string literals) from being inlined, but the results were not very impressive so the patch stalled. Looks like I should take another look at it!

comment:2 Changed 3 years ago by gridaphobe

comment:3 Changed 3 years ago by simonpj

I'd be happy to advise on the inlining front, once you have a bit more info.

Simon

comment:4 Changed 3 years ago by thomie

Description: modified (diff)
Milestone: 8.0.2
Priority: normalhigh

comment:5 Changed 3 years ago by simonpj

Eric, might you have time to look at this?

comment:6 Changed 3 years ago by gridaphobe

Yes, sorry! I dusted off the old no-inline patch a couple weeks ago, made a few tweaks, and noticed better results, but I haven't had a chance to clean it up for review yet. I'll try to get it into Phab this week.

comment:7 Changed 3 years ago by gridaphobe

So, although my no-inline patch is now producing better results for the original CallStack-inlining issue, it doesn't seem to be doing much for this issue. Upon closer inspection of the Core I noticed something quite interesting. When we remove the CallStacks from undefined, GHC manages to optimize away the entire set of guards! In stark contrast, leaving the CallStacks gives us the following Core

-- RHS size: {terms: 4, types: 7, coercions: 0}
Test.$fFunctorResult_$cfmap [InlPrag=INLINE (sat-args=0)]
  :: forall a_aFQ b_aFR.
     (a_aFQ -> b_aFR) -> Result a_aFQ -> Result b_aFR
[GblId,
 Str=b,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=False, ConLike=False,
         WorkFree=False, Expandable=False,
         Guidance=ALWAYS_IF(arity=0,unsat_ok=False,boring_ok=False)
         Tmpl= \ (@ a_aFS) (@ b_aFT) ->
                 let {
                   $dIP_s1QH :: GHC.Stack.Types.CallStack
                   [LclId]
                   $dIP_s1QH =
                     GHC.Stack.Types.pushCallStack
                       (Test.$fFunctorResult10, Test.$fFunctorResult8)
                       GHC.Stack.Types.emptyCallStack } in
                 let {
                   bool1_aqS :: forall a1_a1Dw. a1_a1Dw
                   [LclId]
                   bool1_aqS =
                     \ (@ a1_a1Dw) ->
                       undefined
                         @ 'GHC.Types.PtrRepLifted
                         @ a1_a1Dw
                         ($dIP_s1QH
                          `cast` (Sym
                                    (GHC.Classes.N:IP[0]
                                       <"callStack">_N <GHC.Stack.Types.CallStack>_N)
                                  :: (GHC.Stack.Types.CallStack :: *)
                                     ~R#
                                     ((?callStack::GHC.Stack.Types.CallStack) :: Constraint))) } in
                 case bool1_aqS @ Bool of {
                   False ->
                     case bool1_aqS @ Bool of {
                       False ->
                         case bool1_aqS @ Bool of {
                           False ->
                             case bool1_aqS @ Bool of {
                               False ->
                                 case bool1_aqS @ Bool of {
                                   False ->
                                     case bool1_aqS @ Bool of {
                                       False ->
                                         case bool1_aqS @ Bool of {
                                           False ->
                                             case bool1_aqS @ Bool of {
...]
Test.$fFunctorResult_$cfmap =
  \ (@ a_aFS) (@ b_aFT) -> case bool1_aqS of wild_00 { }

So GHC is optimizing the actual definition, but not the unfolding, whereas with the old undefined it also optimized the unfolding.

I also noticed that I can reproduce both behaviors with a simple wrapper around undefined, by choosing whether the wrapper should take a CallStack.

-- myUndef :: HasCallStack => a
-- myUndef :: a
myUndef = undefined

The HasCallStack variant is slow, but the CallStack-free variant behaves just like the old undefined. I suppose this is to be expected, but it seems to suggest that the issue is connected to the fact that a use of undefined is now really an application undefined $call_stack rather than a simple variable. Indeed, if I make one final tweak to the program and use the old error instead of undefined, I get a similar behavior to the new undefined.

N clauses : time (s)
10        : 0.44
20        : 0.48
40        : 0.54
80        : 0.76
160       : 1.13
320       : 1.93
640       : 3.52
1280      : 6.97

I'm not really sure where to go from here. It seems like the solution would be to convince GHC to rewrite the unfolding for the new undefined (or error for that matter), but rewriting the unfolding is not something we're supposed to do, right? Which makes me wonder, why was GHC rewriting the unfolding for the old undefined in the first place?

comment:8 Changed 3 years ago by gridaphobe

Thomie, could you share the program from which you derived your example?

Does it also use undefined or was it using CallStacks in some other manner? And if it was using undefined, do you see a similar slowdown if you replace undefined with the old (callstack-less) error?

comment:9 Changed 3 years ago by thomie

gridaphobe: I was investigating a report of some (other) compile time regression (I don't remember which ticket or which program). I tried to find a small reproducible example, deleting as much code as possible, and replacing function bodies with undefined.

I then noticed something fishy was going on with undefined itself, and came up with the above example. So the above example doesn't look anything like the original program I was investigating.

Constant factor performance degradation

The measurements presented in the description make it seem like 7.10.3 took O(1) seconds to compile the program (where n is the number of guards), while 8.0.1 took O(n) seconds. This is not the case: both versions of GHC need O(n) time.

Running the program from the description a bit longer:

7.10.3:

N clauses : time (s)
10        : 1.39
20        : 0.28
40        : 0.30
80        : 0.28
160       : 0.36
320       : 0.39
640       : 0.53
1280      : 0.74
2560      : 1.17
5120      : 2.11
10240     : 4.21
20480     : 8.18
40960     : 17.15

So we're should really just be looking at constant factor performance differences.

Timing results

Time (s) to compile the program with 1000 bool = f clauses with different versions of undefined and error:

bool/f= ghc-7.10.3 ghc-8.0.1
old undefined 0.6 0.8
new undefined 0.6 18
myUndef :: a; myUndef = old undefined 2.0 2.1
myUndef :: a; myUndef = new undefined 2.0 4.2
old error 6 6.5
new error 6 5

Quotes

the CallStack-free variant [myUndef :: a] behaves just like the old undefined

No, it is still 5x slower (4.2 vs 0.8 seconds).

if I .. use the *old* error instead of undefined, I get a similar behavior to the *new* undefined

No. 8.0.1 still takes 3x longer to compile the (new) undefined than it is to compile the (old or new) error (18 vs 5-6 seconds).

When we remove the CallStacks from undefined, GHC manages to optimize away the entire set of guards!

Hmm, maybe the whole example is flawed.

comment:10 Changed 3 years ago by bgamari

Milestone: 8.0.28.0.3

This won't be fixed for 8.0.2.

comment:11 Changed 3 years ago by bgamari

Milestone: 8.0.38.2.1

At this point it is rather unlikely that there will be an 8.0.3. Re-milestoning.

comment:12 Changed 3 years ago by bgamari

Milestone: 8.2.18.2.2

Hmm, this is an interesting ticket which we should really have a closer look at. Sadly, though, this won't happen for 8.2.1.

comment:13 Changed 2 years ago by dfeuer

I see that GHC HEAD now optimizes the gigantic redundant case away (eventually), but it still goes more slowly than we'd like. For example, in GHC 7.10, -ddump-spec shows that all the redundant crud is already gone, whereas on GHC 8.3, it is still there then. I would like to understand how 7.10 gets rid of the extra cases before demand analysis. Is it using the fact that there are nested matches on the same constructor? Does that test not ignore type arguments it should?

Interestingly, it seems that at a certain point bool actually gets inlined at a bunch of call sites (some in normal terms; some in unfoldings), but not all.

comment:14 Changed 2 years ago by dfeuer

OK, I have a guess about what's going on now. I didn't notice this before, because the shapes were similar, but there's a big difference between how 7.10 and 8.0 handle this code in the very earliest stages. In particular, -ddump-ds for 7.10 shows that bool has already inlined, exposing the undefineds (which it already knows to be bottom), whereas that has not happened in 8.0. Presumably, the slight optimization that takes place along with desugaring sees bool = undefined, concludes that there's no reason to have the extra identifier, and inlines it away altogether.

In 8.0 and 8.2, bool is not just undefined, but undefined applied to a dictionary. So suddenly it needs to actually produce a bool binding, and doesn't figure out until demand analysis that bool is always bottom.

I suspect that what we want to do is to shift just a bit of demand analysis way up. undefined has demand signature

<B,1*U>x

So we really shouldn't need to wait for full demand analysis to conclude that bool has demand signature x; that follows immediately. Could we teach the simplifier how to reduce some demand applications? Perhaps it should even reduce any that it sees; I'm not sure of the full implications of that choice.

comment:15 Changed 2 years ago by dfeuer

Cc: dfeuer added

comment:16 Changed 2 years ago by simonpj

Yes, this looks plausible. (I think it's the call-stack stuff that makes bool inlined less aggressively, but I have not checked.)

I think what we want is for the simplifier to do a "cheap and cheerful" bottom-ness analysis on every iteration. See CoreArity.exprBotStrictness_maybe.

But actually it already very nearly does exactly that. The simplifier already calls CoreArity.findRhsArity, which in turn calls CoreArity.rhsEtaEpandArity, which calls arityType just like exprBotStrictness_maybe. So we are already doing all the work (in arityType); we just aren't getting the benefit!

I have not worked this through, but I think findRhsArity could return a boolean flag for "returns bottom" as well as an arity. The the simplifier could use that to set the strictness of the Id if it didn't already have (a bottoming) strictness.

That way bool would get a bottoming strictness and the case expressions would fall way immediately.

In a bit more detail perhaps SimplUtils.tryEtaExpandRhs could return a new Id (with updated arity and perhpas strictness) as well as a new rhs.

comment:17 Changed 2 years ago by dfeuer

I'll give it a go this weekend.

comment:18 Changed 2 years ago by dfeuer

Differential Rev(s): Phab:D3753

I've made the changes Simon suggested to Phab:D3753. Unfortunately, the redundant cases still aren't removed until the simplification pass after float out, by which point we've already wasted a bunch of time. Consider

module Serialize where

data Result a = Success a | Error String

instance Functor Result where
    {-# INLINE fmap #-}
    fmap | bool = f
         | bool = f

      where
        bool = undefined
        f = undefined

We get

==================== Simplifier ====================
2017-07-18 23:33:20.811559426 UTC
  Max iterations = 4
  SimplMode {Phase = InitialPhase [Gentle],
             inline,
             rules,
             eta-expand,
             no case-of-case}

...

bool_s2b7 :: forall a. a
[LclId,
 Str=x,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=False, ConLike=False,
         WorkFree=True, Expandable=True, Guidance=NEVER}]
bool_s2b7
  = \ (@ a_a1Xw) ->
      undefined
        @ 'GHC.Types.LiftedRep
        @ a_a1Xw
        ($dIP_s2b6
         `cast` (Sym
                   (GHC.Classes.N:IP[0] <"callStack">_N <GHC.Stack.Types.CallStack>_N)
                 :: (GHC.Stack.Types.CallStack :: *)
                    ~R# (?callStack::GHC.Stack.Types.CallStack :: Constraint)))

which looks great (there's an x!), but we still have

$cfmap_aU6 [InlPrag=INLINE (sat-args=0)]
  :: forall a b. (a -> b) -> Result a -> Result b
[LclId,
 Str=x,  -- Yay for the x, of course
 Unf=...]
$cfmap_aU6
  = \ (@ a_aUa) (@ b_aUb) ->
      case bool_s2b7 @ Bool of {
        False ->
          case bool_s2b7 @ Bool of {
            False ->
              Control.Exception.Base.patError
                @ 'GHC.Types.LiftedRep
                @ ((a_aUa -> b_aUb) -> Result a_aUa -> Result b_aUb)
                "Serialize.hs:(7,5)-(21,21)|function fmap"#;
            True ->
              undefined
                @ 'GHC.Types.LiftedRep
                @ ((a_aUa -> b_aUb) -> Result a_aUa -> Result b_aUb)
                ($dIP_s2aW
                 `cast` (Sym
                           (GHC.Classes.N:IP[0] <"callStack">_N <GHC.Stack.Types.CallStack>_N)
                         :: (GHC.Stack.Types.CallStack :: *)
                            ~R# (?callStack::GHC.Stack.Types.CallStack :: Constraint)))
          };
        True ->
          undefined
            @ 'GHC.Types.LiftedRep
            @ ((a_aUa -> b_aUb) -> Result a_aUa -> Result b_aUb)
            ($dIP_s2aW
             `cast` (Sym
                       (GHC.Classes.N:IP[0] <"callStack">_N <GHC.Stack.Types.CallStack>_N)
                     :: (GHC.Stack.Types.CallStack :: *)
                        ~R# (?callStack::GHC.Stack.Types.CallStack :: Constraint)))
      }

For some reason, we're not eliminating the two cases on bool_s2b7 @ Bool. I'm not sure where to look next.

comment:19 Changed 2 years ago by simonpj

For some reason, we're not eliminating the ...

I assume this is just in the first simplifier run, when case-of-case is off? In the second simplifier run, the case gets eliminated, correct?

I know why this is. In Simplify we have:

simplExprF1 env (Case scrut bndr _ alts) cont
  | sm_case_case (getMode env)
  = simplExprF env scrut (Select { sc_dup = NoDup, sc_bndr = bndr
                                 , sc_alts = alts
                                 , sc_env = env, sc_cont = cont })
  | otherwise
  = do { (env', scrut') <- simplExprF (zapFloats env) scrut $
                           mkBoringStop (substTy env (idType bndr))
       ; let scrut'' = wrapJoinFloats (seJoinFloats env') scrut'
             env''   = env `addLetFloats` env'
       ; rebuildCase env'' scrut'' bndr alts cont }

That is, if case-of-case is off, we simplify the scrutinee in a trivial continuation, and then wrap the alternatives around it. So disabling case-of-case has also disabled case-of-bottom. I did this fairly recently, which accounts for the change.

Let me think about a better way.

Meanwhile you are working on that botSig stuff, making it work with functions too, right?

comment:20 in reply to:  19 Changed 2 years ago by dfeuer

Replying to simonpj:

Meanwhile you are working on that botSig stuff, making it work with functions too, right?

Already done. It could probably use a bit of cleanup/optimization, but I was waiting with that to see if I can get to the payoff.

comment:21 Changed 2 years ago by dfeuer

What we're doing here smells an awful lot like what we're doing in the case-of-known-constructor transformation in rebuildCase. Rather than a known constructor, we're dealing with a known bottom. Would it make sense to move the bottom check there, or is that nonsense? I don't have a great sense of how all these pieces fit together yet.

comment:22 Changed 2 years ago by simonpj

Let me think about a better way.

Still thinking (turned out to have ramifications). Work on something else for now!

comment:23 Changed 2 years ago by Simon Peyton Jones <simonpj@…>

In 33452dfc/ghc:

Refactor the Mighty Simplifier

Triggered by #12150, and the knock-on effects of join points, I did a
major refactoring of the Simplifier.  This is a big patch that change
a lot of Simplify.hs: I did a lot of other re-organisation.

The main event
~~~~~~~~~~~~~~
Since the dawn of time we have had
  simplExpr :: SimplEnv -> InExpr -> SimplCont
            -> SimplM (SimplEnv, OutExpr)

What's that SimplEnv in the result?  When simplifying an expression the
simplifier add floated let-bindings to the SimplEnv, extending the
in-scope set appropriately, and hence needs to resturn the SimplEnv at
the end.  The mode, flags, substitution in the returned SimplEnv were
all irrelevant: it was just the floating bindings.

It's strange to accumulate part of the /result/ in the /environment/
argument!  And indeed its leads to all manner of mysterious calls to
zapFloats and transferring of floats from one SimplEnv to another.
It got worse with join points, so I finally bit the bullet and refactored.
Now we have
  simplExpr :: SimplEnv -> InExpr -> SimplCont
            -> SimplM (SimplFloats, OutExpr)
  -- See Note [The big picture]
and the SimplEnv no longer has floats in it.  The code is no shorter,
but it /is/ easier to understand.

Main changes

* Remove seLetFloats field from SimplEnv

* Define new data type SimplFloats, and functions over it

* Change the types of simplExpr, simplBind, and their many variants,
  to follow the above plan

Bottoming bindings
~~~~~~~~~~~~~~~~~~
I made one other significant change in SimplUtils (not just refactoring),
related to Trac #12150 comment:16.  Given
  x = <rhs>
where <rhs> turns out to be a bottoming expression, propagate that
information to x's IdInfo immediately.  That's always good, because
it makes x be inlined less (we don't inline bottoming things), and
it allows (case x of ...) to drop the dead alterantives immediately.
Moreover, we are doing the analysis anyway, in tryEtaExpandRhs, which
calls CoreArity.findRhsArity, which already does simple bottom analysis.
So we are generating the information; all we need do is to atach the
bottoming info to the IdInfo.

See Note [Bottoming bindings]

Smaller refactoring
~~~~~~~~~~~~~~~~~~~
* Rename SimplifierMode to SimplMode
* Put DynFlags as a new field in SimplMode, to make fewer
  monadic calls to getDynFlags.
* Move the code in addPolyBind into abstractFloats
* Move the "don't eta-expand join points" into tryEtaExpandRhs

comment:24 Changed 2 years ago by simonpj

Resolution: fixed
Status: newclosed
Test Case: perf/compiler/T12150

It took longer than I thought but it's done now. The test program compiles fast again.

comment:25 Changed 23 months ago by saurabhnanda

Cc: saurabhnanda added
Note: See TracTickets for help on using tickets.