Opened 13 months ago

Last modified 9 months ago

#15519 new bug

Minor code refactoring leads to drastic performance degradation

Reported by: danilo2 Owned by:
Priority: highest Milestone: 8.10.1
Component: Compiler Version: 8.4.3
Keywords: SpecConstr Cc: sgraf, maoe
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #15578 Differential Rev(s):
Wiki Page:

Description

Hi! I've just observed an important performance problem. This code:

test0 :: Text -> Result
test0 src = let
    s1 = token 't'
    s2 = token 'e'
    s3 = token 's'
    s4 = token 't'
    p  = many $! s1 <> s2 <> s3 <> s4
    in runTokenParser p src
{-# NOINLINE test0 #-}

runs over 10 times faster than this one:

testGrammar1 :: Grammar Char
testGrammar1 = let
    s1 = token 't'
    s2 = token 'e'
    s3 = token 's'
    s4 = token 't'
    in many $! s1 <> s2 <> s3 <> s4
{-# INLINE testGrammar1 #-}

test1 :: Text -> Result
test1 = runTokenParser testGrammar1
{-# NOINLINE test1 #-}

I've also observed another thing here, namely the former code runs also over 10 times faster than this code:

test2 :: Text -> Result
test2 src = let
    s1 = token 't'
    s2 = token 'e'
    s3 = token 's'
    s4 = token 't'
    p  = X $! many $! s1 <> s2 <> s3 <> s4
    in runTokenParser p src
{-# NOINLINE test2 #-}

The only difference here is the X wrapper, while the runTokenParser is defined as runTokenParser (X !a) = runTokenParser a.

I've created sample project for it here: https://github.com/wdanilo/ghc-bug-peg-optimization/blob/master/src/Main.hs

In order to run it execute stack build --exec test.

The results are:

benchmarking test0
time                 420.0 μs   (417.6 μs .. 422.9 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 421.0 μs   (419.2 μs .. 425.3 μs)
std dev              9.286 μs   (4.239 μs .. 15.30 μs)
variance introduced by outliers: 14% (moderately inflated)

benchmarking test1
time                 6.069 ms   (6.022 ms .. 6.123 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 6.065 ms   (6.037 ms .. 6.117 ms)
std dev              114.5 μs   (74.30 μs .. 183.4 μs)

benchmarking test2
time                 6.070 ms   (6.007 ms .. 6.137 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 6.067 ms   (6.039 ms .. 6.129 ms)
std dev              123.0 μs   (63.88 μs .. 220.1 μs)

benchmarking native
time                 428.0 μs   (421.5 μs .. 437.4 μs)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 427.1 μs   (424.1 μs .. 434.7 μs)
std dev              15.18 μs   (5.678 μs .. 26.26 μs)
variance introduced by outliers: 29% (moderately inflated)

Where "native" is just standard Text.takeWhile ....

Attachments (1)

Main.hs (2.7 KB) - added by sgraf 13 months ago.
Variant of Main.hs I used for debugging

Download all attachments as: .zip

Change History (29)

comment:1 Changed 13 months ago by danilo2

Btw it seems like an inlining problem (but tbh I don't get where so big slowdown comes from!). If I replicate the code using type classes (each constructor of Grammar is separate type) and I create multiple instances which have to be resolved during compilation time and I mark them to be INLINED, the performance is the same as native.

comment:2 Changed 13 months ago by danilo2

One more thing - the source code uses -XStrict but even without using it and putting everywhere bang patterns by hand I got exactly the same results.

comment:3 Changed 13 months ago by danilo2

Priority: highhighest

I'm changing the priority to highest here because:

  1. The predictive performance is super important
  2. It's hurting us even more than some of the bugs with highest priority. It's as strange as my other bug #15176 but it's even more common situation.

comment:4 Changed 13 months ago by danilo2

Ok, I've got one more important notice.

If I put the NOINLINE pragma on the runTokenParser function, the test0 is working 14 times slower. And this is interesting because runTokenParser does a simple pattern matching and then evaluates the Text.span function. So the Text.span (mapping a function to 100000 Chars) takes a tiny fraction of time of this computation.

comment:5 Changed 13 months ago by simonpj

How can I demonstrate without Criterion? (trying to install criterion with my in-place HEAD compiler leads to a raft of dependency conflicts that I do not understand.)

I tried just running test0 src1, test1 src1, test2 src1 but they all behaved identically.

comment:6 Changed 13 months ago by danilo2

@simonpj I'll try to prepare you some version without criterion this weekend (sorry for that, I'm working almost literally 24h/day now). Thank you for looking into this issue!

comment:7 Changed 13 months ago by bgamari

Milestone: 8.6.18.8.1
Status: newinfoneeded

comment:8 Changed 13 months ago by danilo2

@simonpj I removed Criterion as dependency here and got the same results. Please compile it and pass the resulting executable a number - 0, 1 or 2 to execute test0, test1 or test2 respectively. Thank you for looking into this issue!

https://github.com/wdanilo/ghc-bug-peg-optimization/blob/master/src/Main.hs

Last edited 13 months ago by danilo2 (previous) (diff)

comment:9 Changed 13 months ago by danilo2

Status: infoneedednew

Changed 13 months ago by sgraf

Attachment: Main.hs added

Variant of Main.hs I used for debugging

comment:10 Changed 13 months ago by sgraf

Thanks, that's much easier to reproduce.

So, it seems that the performance gap stems from the fact that the call to runTokenParser in test0 is specialised to the specific grammar. That is not the case for test1, because its definition doesn't get eta-expanded for some reason. Note that the correct arity 1 is detected:

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
test1 [InlPrag=NOINLINE] :: Text -> Result
[LclIdX,
 Arity=1,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 20 60}]
test1 = runTokenParser testGrammar1

If test1 would be eta-expanded, the call to runTokenParser becomes saturated and could (in theory) be specialised to testGrammar1. Except manual eta-expansion (e.g. test1 t = runTokenParser testGrammar1 t shows that it's not enough for SpecConstr to pick this up.

Ironically, the problem seems to be related to the INLINE pragma on testGrammar1. If you omit it, the call in test1 specialises properly. Even CSE can unify testGrammar1 and the floated out grammar binding from test0, which wasn't possible before because of the different pragmas I suppose.

So, the fix to apply in your situation seems to be to eta-expand test1 and omit the INLINE pragma.

As to why that fixes performance, I'm really at a loss. It's probably related to the fact that the unfolding attached to testGrammar1 isn't CONLIKE, whereas the RHS at the time when SpecConstr runs is. I can't find any relevant code in SpecConstr that looks at unfoldings of local ids, though. Perhaps I'll find time to look into this some more tomorrow.

P.S.: test2 fails to specialise completely in a similar way.

Last edited 13 months ago by sgraf (previous) (diff)

comment:11 Changed 13 months ago by simonpj

(NB: this comment does not respond to test2, which I'm still investigating.)

I can see what is happening in test0 vs test1.

In test0 we get

lvl_sdWb :: Char -> Bool
[LclId, Arity=1]
lvl_sdWb
  = \ (c_X402 :: Char) ->
      case c_X402 of { GHC.Types.C# ipv_sbcB [Dmd=<S,U>] ->
      case ipv_sbcB of {
        __DEFAULT -> GHC.Types.False;
        'e'# -> GHC.Types.True;
        's'# -> GHC.Types.True;
        't'# -> GHC.Types.True
      } }

$wtest0_sdEs
  = \ (ww_sdEo :: GHC.Prim.ByteArray#)
      (ww_sdEp :: GHC.Prim.Int#)
      (ww_sdEq :: GHC.Prim.Int#) ->
      case $sunion_scrz lvl_scJr lvl_sdWa of dt_X4yi [Dmd=<S,U>]
      { __DEFAULT ->
      case $sunion_scrz lvl_scJp dt_X4yi of dt_X4yq { __DEFAULT ->
      $wrunTokenParser_sdDw
        (Main.Many @ Char (Main.Tokens @ Char dt_X4yq lvl_sdWb))
        ww_sdEo
        ww_sdEp
        ww_sdEq
      }
      }

So SpecConstr will specialise $wrunTokenParser thus:

RULES     "SC:$wrunTokenParser1" [2]
            forall (sc_se8A :: GHC.Prim.Int#)
                   (sc_se8z :: GHC.Prim.Int#)
                   (sc_se8y :: GHC.Prim.ByteArray#)
                   (sc_se8x :: Set Char).
              $wrunTokenParser_sdDw (Main.Many
                                       @ Char (Main.Tokens @ Char sc_se8x lvl_sdWb))
                                    sc_se8y
                                    sc_se8z
                                    sc_se8A
              = $s$wrunTokenParser_se8G sc_se8A sc_se8z sc_se8y sc_se8x]

Notice, in particular, lvl_sdWb, which is a top-level constant (not forall'd by the RULE): the specialised runTokenParser knows exactly what that function is, and that makes the inner loop fast. The fact that it is specialised for Many and Tokens is incidental, because runTokenParser is not recursive; it's the loop inside (which comes from span) that is rendered fast by knowing the function given to span.

In contrast, test1 doesn't get any specialisation.

test1 = runTokenParser testGrammar1

Even if you manually eta-expand it, by writing

test3 src = runTokenParser testGrammar1 src

we still get nothing useful

Main.$wtest3
  = \ (ww_sdIT :: GHC.Prim.ByteArray#)
      (ww1_sdIU :: GHC.Prim.Int#)
      (ww2_sdIV :: GHC.Prim.Int#) ->
      $wrunTokenParser_resz testGrammar1_r23E ww_sdIT ww1_sdIU ww2_sdIV

What happened in test0 (the fast case) is that the programmer manually inlined testGrammar1_r23E:

testGrammar1_r23E
  = case Main.$sunion lvl17_resE lvl19_resH of dt_X4B6 { __DEFAULT ->
    case Main.$sunion lvl18_resG dt_X4B6 of dt1_X4Be { __DEFAULT ->
    Main.Many @ Char (Main.Tokens @ Char dt1_X4Be lvl10_ress)
    }
    }

But GHC is super-cautious about doing so in test3, in case we duplicate the work of computing testGrammar1: Those unions might be expensive! And test3 might be applied to many different src arguments. In contrast, in test1 you manually put the grammar inside the \src.


Analysis

There are two problems:

  1. In test3, GHC's caution about inlining testGrammar1 is (in general) reasonable. Perhaps the Right Thing is to ignore the problem of work-duplication if the user supplies an INLINE pragma, which you do in this case, presumably for that exact reason.

Let's see: danilo2, what led you to the INLINE pragma on testGrammar1?

  1. test1 is not eta-expanded, because it's a partial application: see Note [Do not eta-expand PAPs] in SimplUtils. And because it is not eta-expanded, runTokenParser doesn't get enough arguments and SpecConstr doesn't consider under-saturated calls.

Moreover, even test0 is fragile: it's entirely possible that the full-laziness pass will float out all those let-bindings (for s1, s2 etc) to top level, since they are independent of src


Workarounds

A robust improvement is to use oneShot:

test3 = oneShot (runTokenParser testGrammar1)

The magic oneShot function eta-expands its argument to a one-shot lambda:

test3 = \src[one-shot] -> runTokenParser testGrammar1 src

Now at least, if testGrammar1 is inlined, it won't get floated out again.

Next thing: make this the only occurrence of testGrammar1; then there is no duplication issue when inlining it, so we get

test3 = \src[one-shot] -> runTokenParser (...code for the grammar...) src

In the test case, testGrammar1 is used in two different tests, so it currently won't get inlined (lest we have work duplication); but in your real code it is probably used only once, I guess.

Last thing: remove the INLINE pragma on testGrammar1. It may seem silly but if a binding has an INLINE pragma, even if it is used exactly once it is not inlined. Why not? See Note [Stable unfoldings and preInlineUnconditionally]. This really is silly: see Fix 1 below.


Fixing it properly

  • Fix 1: concerning Note [Stable unfoldings and preInlineUnconditionally], perhaps we should not do this for arity-0 bindings.
  • Fix 2: perhaps we should honour INLINE pragmas on 0-arity bindings, and simply inline them at every usage site regardless.

It's not so clear how to solve the eta-expansion problem. Ideas

  • we could try to make SpecConstr deal with under-saturated calls; and/or
  • we could eta-expand PAPs provided no coercions were involved.

comment:12 Changed 13 months ago by simonpj

OK I have figured out test2. Consider

f x p = case x of
          A y -> ..(f y p2)...
          B z -> map z something

main = f (A (B isEven)) blah

SpecConstr will see the call f (A (B isEven)) but, looking at f, it only see a single-level pattern-match on x, so it'll only do a single-level specialisation:

f_spec y p = ...(f y p2)...

{-# RULE f (A y) p = f_spec y p #-}

This is good so far as it goes, but we are left with

f_spec y p = ...(f y p2)...
main = f_spec (B isEven) something

which is not good (because we have not specialised on B. Even if we ran SpecConstr a second time, it won't see any reason to specialise f_spec (since it does not directly scrutinise y), and neither will it see any reason to further specialise f.

It would be much better to specialise on that full call in the first place, giving

f_spec1 p = ...(f (B isEven) p2)...

{-# RULE f (A (B isEven)) p = f_spec1 p #-}

Now we have a call in the body of f_spec1 that we can specialise, and SpecConstr does just that, giving

f_spec2 p = map isEven something

{-# RULE f (B isEven) p = f_spec2 p #-}

And all is good. The final code is

f_spec2 p = map isEven something
f_spec1 y p = ...(f_spec2 y p2)...
main = f_spec1 blah

But specialising on the full call pattern risks over-specialising, duplicating code to no purpose, and making the specialisation less useful to other callers.

I'm not sure what to do about this.

comment:13 Changed 13 months ago by simonpj

Keywords: SpecConstr added

OK, re comment:12, this is an old problem: see Trac #4448. The current (horrible) solution is to take control explicitly, like this

import GHC.Type( SPEC(..) )

runTokenParser :: SPEC -> Grammar Char -> Text -> Result
runTokenParser = \sp grammar stream -> case grammar of
    Tokens _ tst -> let
        head = Text.head stream
        in if tst head
            then Success (Text.tail stream) (Text.singleton head)
            else Fail
    Many (Tokens _ tst) -> let
        (!consumed, !rest) = Text.span tst stream
        in Success rest consumed
    X !grammar -> runTokenParser sp grammar stream

Notice that SPEC argument. It's not actually used, but it tells SpecConstr to specialise the call regardless of whether the function scrutinises the argument (see Trac #4448 for a simpler example).

That indeed makes test2 work nicely. I don't claim that it's nice. At all.

comment:14 Changed 13 months ago by sgraf

Cc: sgraf added

comment:15 Changed 13 months ago by danilo2

@simonpj, @sgraf, first of all, thank you very much for your time, taking look at this issue and investigating it so deeply. Thanks to it I have hope we will fix it!

  1. First of all I'm simply amazed by the amount of workarounds here. I know I was writing this before in other tickets, but I care a lot about predictive performance in GHC. All the described things show me that its worse than I thought, see the following points.
  1. In order to write high performance code we need invariants. Rules that we can follow and we can trust that they allow us get exactly the compile time behavior we want. One of the most important invariants to me (probably the most important one) was always that if I use the INLINE pragma, the code will be inlined if the call is saturated and it will have exactly the same behavior if I just copy paste it there. I always understood that the INLINE pragma is exactly for this - to very precisely guide GHC how to optimize the code. Learning that GHC does not really inline all explicitly marked saturated calls and sometimes it gets better specialization when removing the INLINE pragma is just insane. it breaks the most primitive invariant that we can rely on and without it we cannot predict just anything about the performance of code we write. For me this is critical error.
  1. Moreover I strongly disagree with the sentence that the "fix would be to remove INLINE pragma" because this leaves us in a world where GHC performance is completely unknown and we have to randomly enable / disable things hoping that it will magically get better. I suspect @sgraf that you didn't mean "fix" but instead a "dirty workaround for now", but I preferred to emphasize my worries regarding this matter.
  1. Answering your question @simonpj, I completely understand that GHC is super-cautious about inlining things and making the code bigger, but that is exactly the reason why we can fine tune the behavior by telling GHC that we in reality want it to be inlined, right? Exactly this led me to use INLINE pragma here. When writing this code I know I will have here some very tight loops to be optimized and I know that no matter what it sohuld be inlined.
  1. I'm surprised that test1 is not eta expanded. How can I be sure my functions get eta expanded? I have read the Note [Do not eta-expand PAPs] but it's still not clear to me. What are the "invariants here". If I write high performance code should I always manually eta-expand functions? Should I rewrite it to test1 = \t -> runTokenParser testGrammar1 t? Please correct me if my thinking is wrong here.
  1. Regarding the test2, specialization and over-specialising things. I see the problem and I don't know what approach would be good here. The only thing that is clear to me is that the change to code is so small, that nobody should expect such drastic performance changes and we should have a clear way of preventing such things from happening (again - invariant of how to write high performance code when we want to wrap some data in a compile-known constructor just to refactor things).
  1. Looking at the source code I don't see any mention of the SPEC in GHC.Type. Where it cames from and where can I learn more about it?
Version 1, edited 13 months ago by danilo2 (previous) (next) (diff)

comment:16 Changed 13 months ago by simonpj

I strongly disagree with the sentence that the "fix would be to remove INLINE pragma"

I didn't say that! I said that a workaround is to remove the INLINE pragma :-). I agree that it's insane that adding INLINE makes things worse, and I propose to change GHC so that INLINE is honoured even for 0-ary values, and even if doing so would cause work duplication.

I think that this alone will help you a lot, addressing 2,3,4.

As to (5), yes, eta expansion is a particularly tricky transformation for GHC. In performance critical code, do it by hand.

Re (6) you could put it another way: SpecConstr gives you absolutely stunning performance gains! Without you'd have had less good perf, but reliably so, and you would not be submitting this ticket :-). So there is a positive side here.

The trouble is that it's difficult for SpecConstr to give reliably good perf, and that's why this horrible SPEC business is there. I'm sure it's possible to do better -- but it needs someone to really study the problem carefully. It's not just an easy fix.

Re (7) SPEC is not well documented. The best documentation is probably Note [Forcing specialisation] in SpecConstr itself. It is a HORRIBLE hack, but we were never able to come up with anything more civilized. Again, it would reward some sustained attention. (I'm not addressing you here, rather any GHC hackers or research students looking for an interesting challenge.)

comment:17 Changed 13 months ago by danilo2

@simonpj I know you didn't say that, I'm sorry I was not clear enough. The point 3 was written regarding the answer of @sgraf: So, the fix to apply in your situation seems to be to eta-expand test1 and omit the INLINE pragma., I've even mentioned him there :)

Yes, making GHC always listen to INLINE will help me regarding 2,3,4!

  1. This is very interesting. Would you be so nice and tell me jsut a little more aobut it - where can I learn about it and why such design decisions were made?
  1. :)
  1. Very interesting. Thank you for sharing the knowledge!
Last edited 13 months ago by danilo2 (previous) (diff)

comment:18 Changed 13 months ago by simonpj

This is very interesting. Would you be so nice and tell me jsut a little more aobut it - where can I learn about it and why such design decisions were made?

I'm afraid I have never written a paper about eta-expansion, despite its importance to GHC.

There are extensive Notes in

  • CoreArity.hs (which is all about arity and eta expansion)
  • SimplUtils.hs (around tryEtaExpandRhs and tryEtaReduce)

I think that's the best I can do without a big investment

comment:19 Changed 13 months ago by simonpj

I have created #15578 for the proposed INLINE-of-0-ary things change.

comment:20 in reply to:  15 Changed 13 months ago by sgraf

Replying to danilo2:

I strongly disagree with the sentence that the "fix would be to remove INLINE pragma" because this leaves us in a world where GHC performance is completely unknown and we have to randomly enable / disable things hoping that it will magically get better. I suspect @sgraf that you didn't mean "fix" but instead a "dirty workaround for now", but I preferred to emphasize my worries regarding this matter.

Yes, exactly. I'm not happy with doing this (or rather having to anticipate doing this), either.

comment:21 Changed 13 months ago by simonpj

Yes, exactly. I'm not happy with doing this (or rather having to anticipate doing this), either.

I've got lost. What exactly is "this"? Are you happy with what I propose in #15578?

comment:22 in reply to:  21 Changed 13 months ago by sgraf

Replying to simonpj:

Yes, exactly. I'm not happy with doing this (or rather having to anticipate doing this), either.

I've got lost. What exactly is "this"? Are you happy with what I propose in #15578?

Sorry, I could have been clearer. TLDR; I'm quite happy, the comment was unrelated to anything you wrote.

In comment:10, I wrote (with a little more context):

So, the fix to apply in your situation seems to be to eta-expand test1 and omit the INLINE pragma.

That's what @danilo2's comment:15 alludes to when he writes

I strongly disagree with the sentence that the "fix would be to remove INLINE pragma" [...] I suspect @sgraf that you didn't mean "fix" but instead a "dirty workaround for now", but I preferred to emphasize my worries regarding this matter.

So, by "this" in comment:20, I meant the workaround of manually eta-expanding and (more importantly) omitting the INLINE pragma. I'm quite happy about #15578 :) Not sure about consequences for library authors being used to the old behavior (if they were even aware of it), though.

comment:23 Changed 13 months ago by danilo2

@sgraf I would be truly surprised if 1% of library authors who ever used the INLINE pragma are aware how in reality it gets resolved. Moreover, you cannot rely on it because you don't have any clear guidance how to write high performance code with the current behavior, so I'm pretty sure there is nobody consciously relying on this behavior. I'd love to see this fixed according to SPJ proposal! :)

@simonpj thank you for pointing out the notes, it's very helpful!

Last edited 13 months ago by danilo2 (previous) (diff)

comment:24 Changed 13 months ago by simonpj

To summarise: once we have done #15578, we'll get good perf for

testGrammar1 = <blah>
{-# INLINE testGrammar1 #-}

test3 = oneShot (runTokenParser testGrammar1)
  • The INLINE ensures that testGrammar1 is inlined even if it is used in many places.
  • The oneShot is, I'm afraid, still necessary. (You can import it from GHC.Exts.) It says that GHC doesn't need to worry about sharing the work of testGrammar1 between, say, two calls (test3 src1) and (test3 src2).

You should get reliably good perf with these changes. OK?

comment:25 Changed 12 months ago by George

comment:26 Changed 12 months ago by George

comment:27 Changed 10 months ago by maoe

Cc: maoe added

comment:28 Changed 9 months ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of some high and highest priority tickets.

Note: See TracTickets for help on using tickets.