Opened 4 years ago

Closed 3 years ago

#11731 closed bug (fixed)

Simplifier: Inlining trivial let can lose sharing

Reported by: nomeata Owned by:
Priority: normal Milestone: 8.2.1
Component: Compiler Version: 8.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D2073
Wiki Page:

Description (last modified by nomeata)

While working on #10613, I looked a bit more closely at the ticky numbers of single-entry thunks. In all of nofib, I found exactly one fishy result: A thunk, marked as single entry, but entered multiple times, in gametb.

I’ll write a preliminary analysis in the comments and either found a bug or a misunderstanding on my side.

Attachments (1)

T11731.hs (839 bytes) - added by nomeata 4 years ago.
Test case

Download all attachments as: .zip

Change History (54)

comment:1 Changed 4 years ago by nomeata

    Entries      Alloc    Alloc'd     #Alloc     Single   Multiple  Non-void Arguments      STG Name
       4096          0      65536       2048          0       2048   0                      sat_s2vw{v} (main@main:GamtebMain) (thk,se) in r2uq

The relevant -ddump-stg code is this:

                                      let {
                                        sat_s2vw [Occ=Once, Dmd=<L,1*U(U)>] :: GamtebType.Random
                                        [LclId, Str=DmdType] =
                                            \s srt:SRT:[r4K :-> Utils.$wgenRand] []
                                                case Utils.$wgenRand w_s2ux of _ [Occ=Dead] {
                                                  (#,#) ww5_s2vu [Occ=Once] _ [Occ=Dead] ->
                                                      ww5_s2vu;
                                                };
                                      } in 
                                        TransPort.$wtransPort
                                            lvl2_r2ut
                                            lvl2_r2ut
                                            lvl2_r2ut
                                            lvl2_r2ut
                                            lvl1_r2us
                                            lvl2_r2ut
                                            lvl1_r2us
                                            w1_s2uz
                                            ww2_s2vk
                                            lvl_r2ur
                                            sat_s2vw
                                            ww17_s2vn
                                            ww19_s2vp
                                            ww22_s2vs;

The relevant code for $wtransPort ist

TransPort.$wtransPort [InlPrag=[0], Occ=LoopBreaker]
  :: GamtebType.Coord
     -> GamtebType.Coord
     -> GamtebType.Coord
     -> GamtebType.Coord
     -> GamtebType.Coord
     -> GamtebType.Coord
     -> GamtebType.Weight
     -> GamtebType.Energy
     -> GamtebType.Indx
     -> GHC.Types.Int
     -> GamtebType.Random
     -> GamtebType.Prob
     -> GamtebType.Prob
     -> GHC.Prim.Double#
     -> (# [GamtebType.Result], [GamtebType.Stat] #)
[GblId,
 Arity=14,
 Str=DmdType <L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U><L,U(U)><L,1*U(U)><L,U(U)><L,U(U)><S,U>,
 Unf=OtherCon []] =
    \r srt:SRT:[r4K :-> Utils.$wgenRand, r5x :-> Compton.$wcompton,
                r5O :-> Pair.$wpair, r3Df :-> TransPort.$wtransPort,
                r3Dr :-> lvl11_r3Dr] [ww_s3Dy
                                      ww1_s3Dz
                                      ww2_s3DA
                                      ww3_s3DB
                                      ww4_s3DC
                                      ww5_s3DD
                                      ww6_s3DE
                                      ww7_s3DF
                                      ww8_s3DG
                                      ww9_s3DH
                                      ww10_s3DI
                                      ww11_s3DJ
                                      ww12_s3DK
                                      ww13_s3DL]
        case Utils.$wgenRand ww10_s3DI of _ [Occ=Dead] {
          (#,#) ww15_s3DN [Occ=Once!] ww16_s3DO ->

Note how the strictness signature say that the Random argument is used at most once (1*U(U)). The call to $wgenRand is indeed the only use of ww10_s3DI. So here is the code of $wgenRand:

Utils.$wgenRand [InlPrag=[0]]
  :: GamtebType.Random -> (# GamtebType.Random, GamtebType.Random #)
[GblId, Arity=1, Str=DmdType <L,U(U)>, Unf=OtherCon []] =
    \r srt:SRT:[015 :-> GHC.Integer.Type.timesInteger,
                017 :-> GHC.Integer.Type.negateInteger,
                01j :-> GHC.Integer.Type.divInteger,
                01x :-> GHC.Integer.Type.shiftLInteger] [w_s3BX]
        let {
          sat_s3CT [Occ=Once] :: GamtebType.Random
          [LclId, Str=DmdType] =
              \u srt:SRT:[015 :-> GHC.Integer.Type.timesInteger,
                          017 :-> GHC.Integer.Type.negateInteger,
                          01j :-> GHC.Integer.Type.divInteger,
                          01x :-> GHC.Integer.Type.shiftLInteger] []
                  case w_s3BX of _ [Occ=Dead] {
                    GHC.Types.D# y_s3Cs [Occ=Once] ->
...
                  }; } in
        let {
          sat_s3Cq [Occ=Once] :: GamtebType.Random
          [LclId, Str=DmdType] =
              \u srt:SRT:[015 :-> GHC.Integer.Type.timesInteger,
                          017 :-> GHC.Integer.Type.negateInteger,
                          01j :-> GHC.Integer.Type.divInteger,
                          01x :-> GHC.Integer.Type.shiftLInteger] []
                  case w_s3BX of _ [Occ=Dead] {
                    GHC.Types.D# y_s3BZ [Occ=Once] ->
...
                  };
        } in  (#,#) [sat_s3Cq sat_s3CT];

So $wgenRand may evaluate its argument more than once (as its strictness signature says). On what grounds does $wtransPort then claim the argument is used at most once?

Now if the argument to $wgenRand were a complex expression, this would be valid reasoning, as it would be let-bound and then shared. But trivial expressions, when passed as arguments, are _not_ shared at this point, so the information that the argument may be used multiple times should be propagated out.

Last edited 4 years ago by nomeata (previous) (diff)

comment:2 Changed 4 years ago by nomeata

It seems that on its own, the demand analysis is correct. Here is the relevant bit from transPort, as the demand analysis sees it:

transPort =
  \ (p_ayV [Dmd=<S(SSLLLLL),U(1*U(U(U),U(U),U(U)),1*U(U(U),U(U),U(U)),U(U),U(U),U,U(U),1*U(U))>]
       :: Particle)
    (prob_ayW [Dmd=<S(LLLS(S)),1*U(U(U),A,U(U),U(U))>]
       :: Probability) ->
    let {
      seed_s33f [Dmd=<L,U(U)>] :: Random
      [LclId,
       Str=DmdType {ayV-><S(LLLLLLS),A>},
       Unf=Unf{Src=<vanilla>, TopLvl=False, Value=False, ConLike=False,
               WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 0}]
      seed_s33f =
        case p_ayV
        of _ [Occ=Dead, Dmd=<L,A>]
        { Part pos_X21Q [Dmd=<L,A>] dir_X21S [Dmd=<L,A>] w_X21U [Dmd=<L,A>]
               e_X21W [Dmd=<L,A>] eIndx_X21Y [Dmd=<L,A>] cell_X220 [Dmd=<L,A>]
               seed_X22n [Dmd=<S,1*U(U)>] ->
        seed_X22n
        } } in
    case Utils.$wgenRand seed_s33f
...

Note that the demand on seed is *not* one-shot (because there are two calls to wgenRand seed_s33f in the body below, but the demand on the corresponding member of Particle is. As long as seed_s33f is shared, this is fine.

But after the next simplifier run, which includes worker-wrappering the Particle argument, CSE’ing the various calls to $wgenRand as well as subsequent simplifications, we get:

[LclId,
 Arity=14,
 Str=DmdType <L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U(U)><L,U><L,U(U)><L,1*U(U)><L,U(U)><L,U(U)><S,U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=NEVER}]
$wtransPort_s3nF =
  \ (ww_s3nb :: Coord)
    (ww_s3nc :: Coord)
    (ww_s3nd :: Coord)
    (ww_s3ni :: Coord)
    (ww_s3nj :: Coord)
    (ww_s3nk :: Coord)
    (ww_s3nm :: Weight)
    (ww_s3nn :: Energy)
    (ww_s3no :: Indx)
    (ww_s3np :: Int)
    (ww_s3nq :: Random)
    (ww_s3nu :: Prob)
    (ww_s3nw :: Prob)
    (ww_s3nA :: GHC.Prim.Double#) ->
    case Utils.$wgenRand ww_s3nq
    of _ [Occ=Dead, Dmd=<L,A>]
    { (# ww1_a2WR [Dmd=<S(S),1*U(U)>], ww2_a2WS [Dmd=<L,1*U(U)>] #) ->

and behold: The sharing-ensuring let is gone, the field member ww_s3nq is passed directly to $wgenRand and now the strictness signature is a lie!

Tracing the simplifier is not easy, but I expect the sequence of actions to be roughly this: After WW we have

  \ (ww_s3nb [Dmd=<L,U(U)>] :: Coord)
    (ww_s3nc [Dmd=<L,U(U)>] :: Coord)
    (ww_s3nd [Dmd=<L,U(U)>] :: Coord)
    (ww_s3ni [Dmd=<L,U(U)>] :: Coord)
    (ww_s3nj [Dmd=<L,U(U)>] :: Coord)
    (ww_s3nk [Dmd=<L,U(U)>] :: Coord)
    (ww_s3nm [Dmd=<L,U(U)>] :: Weight)
    (ww_s3nn [Dmd=<L,U(U)>] :: Energy)
    (ww_s3no :: Indx)
    (ww_s3np [Dmd=<L,U(U)>] :: Int)
    (ww_s3nq [Dmd=<L,1*U(U)>] :: Random)
    (ww_s3nu [Dmd=<L,U(U)>] :: Prob)
    (ww_s3nw [Dmd=<L,U(U)>] :: Prob)
    (ww_s3nA [Dmd=<S,U>] :: GHC.Prim.Double#) ->
...
    let {
      w_s3n4 [Dmd=<S(SSLLLLL),U(1*U(U(U),U(U),U(U)),1*U(U(U),U(U),U(U)),U(U),U(U),U,U(U),1*U(U))>]
        :: Particle
      [LclId, Str=DmdType]
      w_s3n4 =
        GamtebType.Part
          ww_s3n8 ww_s3nf ww_s3nm ww_s3nn ww_s3no ww_s3np ww_s3nq } in
...
    case (\ (p_ayV [Dmd=<S(SSLLLLL),U(1*U(U(U),U(U),U(U)),1*U(U(U),U(U),U(U)),U(U),U(U),U,U(U),1*U(U))>]
               :: Particle)
            (prob_ayW [Dmd=<S(LLLS(S)),1*U(U(U),A,U(U),U(U))>]
               :: Probability) ->
            let {
              seed_s33f [Dmd=<L,U(U)>] :: Random
              [LclId,
               Str=DmdType,
               Unf=Unf{Src=<vanilla>, TopLvl=False, Value=False, ConLike=False,
                       WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 0}]
              seed_s33f =
                case p_ayV
                of _ [Occ=Dead, Dmd=<L,A>]
                { Part pos_X21Q [Dmd=<L,A>] dir_X21S [Dmd=<L,A>] w_X21U [Dmd=<L,A>]
                       e_X21W [Dmd=<L,A>] eIndx_X21Y [Dmd=<L,A>] cell_X220 [Dmd=<L,A>]
                       seed_X22n [Dmd=<S,1*U(U)>] ->
                seed_X22n
                } } in
            case Utils.$wgenRand seed_s33f
...
           w_s3n4 w_s3n5

and then (beta-reduction)

            let {
              seed_s33f [Dmd=<L,U(U)>] :: Random
              [LclId,
               Str=DmdType,
               Unf=Unf{Src=<vanilla>, TopLvl=False, Value=False, ConLike=False,
                       WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 10 0}]
              seed_s33f =
                case w_s3n4 -- ← here
                of _ [Occ=Dead, Dmd=<L,A>]
                { Part pos_X21Q [Dmd=<L,A>] dir_X21S [Dmd=<L,A>] w_X21U [Dmd=<L,A>]
                       e_X21W [Dmd=<L,A>] eIndx_X21Y [Dmd=<L,A>] cell_X220 [Dmd=<L,A>]
                       seed_X22n [Dmd=<S,1*U(U)>] ->
                seed_X22n
                } } in
            case Utils.$wgenRand seed_s33f

and then (inlining of constructor application into interesting context, and case of known constructor)

            let {
              seed_s33f [Dmd=<L,U(U)>] :: Random
              seed_s33f = ww_s3nq
            case Utils.$wgenRand seed_s33f

and then (inline trivial let)

            case Utils.$wgenRand seed_s3nq

and there we have the salad (German idiom).

My gut feeling is that in order to fix this problem, either all trivial lets must be preserved, or any simplification that turns a non-trivial let into a trivial let needs to somehow mark this let as “need to be preserved until STG”.

This is very much related to the trap that I fell into with Call Arity in #11064. Seems to be quite a nasty trap.

comment:3 Changed 4 years ago by nomeata

Description: modified (diff)

comment:4 Changed 4 years ago by nomeata

Summary: Demand analysis: Thunk wrongly determined single-entrySimplifier: Inlining trivial let can lose sharing

comment:5 Changed 4 years ago by simonpj

Good point! For example, if the demand analyser saw

let y = factorial v in
let x = y in
x + x

I think it'd conclude that x was demanded many times, but y was demanded only once. (Which is correct). But if we substitute for x, and then use call-by-name for y we'll evaluate the factorial twice.

Urk.

It seems to affect bindings like

x = y    -- Or perhaps y |> gamma etc; exprIsTrivial anyway

where

  • The demand signature on x is not marked "used-once"
  • The demand signature on y is marked "used-once"

Under these circumstances, the binding for x is serving a useful role, to memo-ise the computation of y.

Perhaps we should simply refrain from inlining x under these circumstances, leaving the trivial let in place. The fix would be in postInlineUnonditionally, postInlineUnonditionally, and callSiteInline.

Would you like to try that? I think it'd fix this bug. But it would then be important to know how many trivial lets were thereby retained. Perhaps not many.

Changed 4 years ago by nomeata

Attachment: T11731.hs added

Test case

comment:6 Changed 4 years ago by nomeata

I’ve created a test case for this; if you run T11731, it will print Evaluated twice.

Would you like to try that? I think it'd fix this bug.

I’m not sure I like that path. This means that the strictness signatures will now not only describe the actual semantics of the code, but rather affect it. We would lose the invariant that the semantics of the program does not change if we remove all strictness signatures. Feels wrong to me.

But nevertheless, I see if that indeed fixes the problem.

comment:7 Changed 4 years ago by nomeata

Hmm. Not so easy. Given that problematic code, i.e.

$s$wwwMe_s49K :: Int -> Int -> GHC.Prim.Int# -> (# Int, Int #)
[LclId, Arity=3, Str=DmdType <L,U(U)><L,1*U(U)><S,1*U>]
$s$wwwMe_s49K =
  \ (sc_s49I :: Int) (sc_s49J :: Int) (sc_s49H :: GHC.Prim.Int#) ->
    case sc_s49H of ds_X1WV {
      __DEFAULT -> $wwwMe_s48A (GHC.Prim.-# ds_X1WV 1#) lvl_s1Z6;
      0# ->
        case foo @ Int @ Int (sc_s49I, sc_s49J)
        of _ [Occ=Dead] { GHC.Types.I# ipv_s1XV [Dmd=<L,A>] ->
        let {
          b_s1Z1 [Dmd=<L,U(U)>] :: Int
          [LclId, Str=DmdType]
          b_s1Z1 = sc_s49J } in
        let {
          a_s1Z0 [Dmd=<L,U(U)>] :: Int
          [LclId, Str=DmdType]
          a_s1Z0 = sc_s49I } in
        (# GHC.Num.$fNumInt_$c+ b_s1Z1 a_s1Z0,
           GHC.Num.$fNumInt_$c+ a_s1Z0 b_s1Z1 #)
        }
    }

it seems that the used-once information is not attached to the occurrence of sc_s49J, but only to the function $s$wwwMe_s49J that binds it. This is confirmed by a pprTrace of the idDemandInfo in exprIsTrivial:

exprIsTrivial sc_s49J <L,U>

comment:8 Changed 4 years ago by simonpj

Humph. I was thinking that if a function has a particular strictness signature we can't change it. Certainly we can't change the strictness signature for an imported function.

BUT we CAN change the strictness signatures of local functions and actually those are the ones that matter, because their sharing properties may worsen, as above.

Indeed we already have the idea of running the demand analyser just before code generation: -flate-dmd-anal. (Main motivation: discover used-once thunks that weren't previously used-once.)

Does using -flate-dmd-anal fix the problem? It may not becuase there is a simplifer run afterwards; but cleraly this bug only shows up rarely anyway, so running it late might actually fix it

Thinking about it

  • Until code generation, the used-once info is ignored; only one-shot-call info is used
  • During code generation, the used-once info is used; and the one-shot-call info is ignored.

So perhaps

  • We should not even generate used-once info in the main run of the demand analyser, since it is so easily invalidated.
  • In the run of the simplifier after late demand analysis we should be careful not to eliminate the dangerous lets.

I tried the test program with -O; but it only prints "Evaluated" once. So, strangely, I can't reproduce the bug. What exact commands did you use?

comment:9 Changed 4 years ago by simonpj

it seems that the used-once information is not attached to the occurrence of sc_s49J

Ha, well, that's not very cool. I bet it is easily fixed. Just need to annotate the binders of the worker, perhaps.

Simon

comment:10 Changed 4 years ago by nomeata

I tried the test program with -O; but it only prints "Evaluated" once. So, strangely, I can't reproduce the bug. What exact commands did you use?

Well spotted! I ran with -O2, and indeed, it does not happen with -O, but with -O -fspec-constr. Nothing deep here, it’s just that constructor specialization is required in this particular example to massage the code into the shape we want.

Does using -flate-dmd-anal fix the problem?

Indeed it does. It recalculates the demand signature correctly, and will share the shareMeThunk. But after -flate-dmd-anal, there is another round of worker-wrapper and simplifier. So the principle, the problem could occur again.

I would prefer to take a step back and think about what, at the level of Core, the semantics of

let x = y in foo x

should be, independent of any possible annotations. Here a digression that follows that train of thought:

If this code should indeed guarantee that y is evaluated at most once, then rewriting to foo y is in general wrong (and may only be justified if we know that foo evaluates its argument only once – this is more conservative than not doing the substitution if we happen to know that it can go wrong).

Alternatively, we declare the semantics that a let with a trivial RHS should not (guarantee) sharin (just like a trivial function argument does not indicate sharing!). But then the problem was one step earlier where GHC replaced

let x = case (y,z) of (y,z) -> y
in foo x

with the trivial let above. Maybe it should have changed it to something like

let x = share y
in foo x

where share is a magic id that is semantically the identity, but prevents the let from being trivial. Now x can even be safely inlined then foo (share y). The simplifier could know some rules about when occurrences of share can be removed (e.g. when the argument is not trivial any more, or when it is itself known to be shared).

comment:11 Changed 4 years ago by nomeata

Ha, well, that's not very cool. I bet it is easily fixed. Just need to annotate the binders of the worker, perhaps.

I found that the worker/wrapper code of the demand analyizer already does this annotation. In this instance, it’s a constructor specialization which does not preserve the annotations. I’ll have a look.... it’s easy to do, the new functions strictness signature is already calculated, so I simply pull the demand on the individual arguments out of that: (changeset:8649ac61698c8600f5db64ff7947828bb4715a5d for now, but this is a temporary branch). But still, it shows that it is tricky to try to preserve the analysis results through the compiler.

comment:12 Changed 4 years ago by simonpj

Well, it all amounts to the same thing:

  • Introduce share
  • Use let to create sharing

The rules for eliminating share would be the same as the rules governing inlining of trivial lets. Which is preferable is largely an engineering choice.

comment:13 Changed 4 years ago by nomeata

So you suggest that the desired semantics is that a trivial let should guarantee sharing, right?

In this case, transforming let x = y in foo x to foo y is not a valid transformation any more. Unless we can guarantee that foo y evaluates y only once or unless we can guarantee that y is shared anyways. Assuming we do not want to drag strictness annotations into the semantics, we cannot do that! (But maybe dragging them in is fine?)

The rule “a trivial let does not guarantee sharing” allows the the above transformation unconditionally, but shifts the onus to the earlier simplification of the RHS. Is that less hairy there? Not sure...

Or maybe, as a third option, we should flag a variable with a bit of information that indicates whether it is shared (and safe to evaluate multiple times). How is that different from the demand annotation? The demand annotation says something about how it is used, while this flag contains information about how the variable is going to be created (with or without an update frame) and then guides how it can be used. I don’t see all the consequences, of course, but such a variable would in many cases behave more like a possibly costly expression, i.e. should not be duped, not be floated into a let etc.

comment:14 Changed 4 years ago by Joachim Breitner <mail@…>

In 80d4fdf/ghc:

SpecConstr: Transport strictness data to specialization’s argument’s binders

This is a result of the discussion in ticket:11731#comment:9.

comment:15 Changed 4 years ago by nomeata

Differential Rev(s): Phab:D2064
Status: newpatch

A small change to exprIsTrivial fixes this at least for the test case produced above.

comment:16 Changed 4 years ago by simonpj

For the record, here is the proposed fix (currently on a wip branch):

commit 7ac5610606e8f338cd2eb92eb5d711e054d9d55a
Author: Joachim Breitner <mail@joachim-breitner.de>
Date:   Wed Mar 30 12:55:10 2016 +0200

    Used-once variables are not trivial
    
    The specification for exprIsTrivial demands that we are unconditionally
    happy to duplicate the expression. This is not true for variables where
    we would like to exploit (or already have exploited) that they are used
    at most once. In order to preserve this property, they must not be
    duplicated nilly-willy. This fixes #11731 and comes with a test case.


>---------------------------------------------------------------

7ac5610606e8f338cd2eb92eb5d711e054d9d55a
 compiler/coreSyn/CoreUtils.hs                      | 12 +++++++-
 testsuite/.gitignore                               |  1 +
 testsuite/tests/simplCore/should_run/T11731.hs     | 36 ++++++++++++++++++++++
 testsuite/tests/simplCore/should_run/T11731.stderr |  1 +
 testsuite/tests/simplCore/should_run/all.T         |  1 +
 5 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/compiler/coreSyn/CoreUtils.hs b/compiler/coreSyn/CoreUtils.hs index 1d9b83b..d02b934 100644
--- a/compiler/coreSyn/CoreUtils.hs
+++ b/compiler/coreSyn/CoreUtils.hs
@@ -62,6 +62,7 @@ import DataCon
 import PrimOp
 import Id
 import IdInfo
+import Demand ( isUsedOnce )
 import Type
 import Coercion
 import TyCon
@@ -755,6 +756,13 @@ Note [exprIsTrivial]
 
 Note [Variables are trivial]
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Variables are usually trivial.
+
+Except if 'isUsedOnce (idDemandInfo v) == True':
+In this case we have previously determined that a variable is used at 
+most once, and we likely rely on this information, e.g. during code 
+generation. In this case, we are not unconditionally happy to duplicate, so we don’t.  See #11731.
+
 There used to be a gruesome test for (hasNoBinding v) in the  Var case:
         exprIsTrivial (Var v) | hasNoBinding v = idArity v == 0 @@ -793,7 +801,9 @@ it off at source.
 -}
 
 exprIsTrivial :: CoreExpr -> Bool
-exprIsTrivial (Var _)          = True        -- See Note [Variables are trivial]
+exprIsTrivial (Var v) -- See Note [Variables are trivial]
+  | isUsedOnce (idDemandInfo v) = False
+  | otherwise                  = True
 exprIsTrivial (Type _)         = True
 exprIsTrivial (Coercion _)     = True
 exprIsTrivial (Lit lit)        = litIsTrivial lit

comment:17 Changed 4 years ago by nomeata

The change is also visible at Phab:D2064. (So much duplication of code.. :-()

Unfortunately, while it fixes the problem in the above test case, it does not yet fix the instance observed in nofib. I’m investigating...

comment:18 Changed 4 years ago by nomeata

I think I nailed it. With the above change, demand information becomes something that we should try to keep hold on. But we don’t:

simplExprF1 env expr@(Lam {}) cont
  = simplLam env zapped_bndrs body cont
        -- The main issue here is under-saturated lambdas
        --   (\x1. \x2. e) arg1
        -- Here x1 might have "occurs-once" occ-info, because occ-info
        -- is computed assuming that a group of lambdas is applied
        -- all at once.  If there are too few args, we must zap the
        -- occ-info, UNLESS the remaining binders are one-shot
  where
    (bndrs, body) = collectBinders expr
    zapped_bndrs | need_to_zap = map zap bndrs
                 | otherwise   = bndrs

    need_to_zap = any zappable_bndr (drop n_args bndrs)
    n_args = countArgs cont
        -- NB: countArgs counts all the args (incl type args)
        -- and likewise drop counts all binders (incl type lambdas)

    zappable_bndr b = isId b && not (isOneShotBndr b)
    zap b | isTyVar b = b
          | otherwise = zapLamIdInfo b

Here, we would remove the demand information from the parameters of a top-level function (unless all of them happen to be one-shot-binders). I don’t fully understand the comment here. Does "occurs-once" refer to occurence information only? But zapLamInfo in IdInfo also removes Demand information...sometimes.

This looks delicate. Why does demand information (which is calculated from the body of the lamda, not how it is being used) need to be zapped in zapLamInfo at all?

comment:19 Changed 4 years ago by nomeata

A small change to exprIsTrivial fixes this

And almost validates, with one exception. There is an increase of allocations in T5549 by 10%. The -ddump-simpl differs only slightly: In the code of

    aux (_,0) _ = ([],0)
    aux _ (_,0) = ([],0)
    aux (a@(ha:as),la) (b@(hb:bs), lb)
      | ha == hb  = let (s,n) = aux (as,la-1) (bs,lb-1) in (ha : s, n+1)
      | otherwise =
        let (sa,na) = aux (as,la-1) (b,lb) -- ←
            (sb,nb) = aux (a,la) (bs,lb-1) in
        if na > nb then (sa,na) else (sb,nb)

the marked recursive call to aux re-assembles the second tuple argument, instead of passing the tuple that was passed to aux; this is CSE failing.

CSE uses exprIsTrivial in tryForCSE. From my reading of the code, in the context of

case foo of foo' 
  { (x [Dmd=<L,1*U>], y) -> 
     ... case x of z { ... (x,y) ... }

it would previously *not* CSE the inner mention of x to z, as x is trivial, and then successfully replace (x,y) with foo'. But after my change, x is no longer trivial, so it does CSE it to z, and then leave (z,y) as it is.

Digging further into it, I found some code smell in CSE. Note Case binders 2 says that with trivial scrutinees, we want a mapping case binder -> scrutinee instead of the usual scrutinee -> case binder. Nevertheless, cseExpr in the case for Case adds the unwanted mapping via cseRhs. Only the check for exprIsTrivial in tryForCSE prevents this entry from being used. Fixing this code smell (I’ve just updated Phab:D2064 in a moment) fixes this particular regression.

comment:20 Changed 4 years ago by simonpj

I agreed comment:18 is odd, but first why does "demand information become something we should try to keep hold on"? What, specifically, goes wrong?

The reasoning here is as follows. Consider

let f = \xy. x+y
in ...map (f expensive)....

Now, in f's definition x and y are both marked

  • OneOcc not-in-lambda (syntactically occurring once not inside a lambda; added by the occurrence analyser)
  • Demanded (added by the demand analyser)

These annotations assume that f is fully applied

But when we inline f at this partial application site, we get

...map (let x = expensive in \y. x+y)...

and now x jolly well shouldn't be marked as occurring not inside a lambda; nor should it be marked as strictly demanded.

Hence the zapLamIdInfo.

What I had forgotten (and has somehow never arisen) is that this also happens at f definition site.

Maybe it doesn't matter. Back to the quesion at the beginning: why does it matter?

comment:21 Changed 4 years ago by nomeata

Over at Phab you write

I still think this is the wrong approach.

but now I’m confused. Where did you say what’s wrong with this approach? This approach is my attempt to implement your suggestion “Perhaps we should simply refrain from inlining x under these circumstances” from comment:5!

From this suggestion, we need to detect “these circumstances”, which requires us to detect that “The demand signature on y is marked "used-once"”. And this implies that looking at demand information. And this implies that we need to keep hold to it.

Personally, I quite agree that the annotation should *not* be something we need to hold on to, although I don’t have a convincing solution, only the rough ideas above that boil down to systematically distinguishing between variables that guarantee to have sharing behavior and those where this is not guaranteed.

Anyways, thanks for the explanation about zapLamIdInfo. It makes perfect sense when inlining a function into an expression. But do you agree that it should not be done to f’s definition?

comment:22 Changed 4 years ago by simonpj

Well that's extremely annoying. I'd typed a long comment in, but Trac has binned it somehow. Sigh.

I have concluded that the approach in comment:16 (based on my suggestion in comment:5) is too fragile.

  • As Ben points out there are other exprIsTrivial variants.
  • We'd probably need to do similar surgery for exprIsCheap, lest we change sharing by floating something like
    x = case v of (p,q) -> p
    
    out, aand then discover we know that v is a pair (x,y), so it all turns into x=y

in short, I'm just not confident that we can be sure to maintain this single-entry property.

Moreover, these restrictions cripple otherwise-useful transformations. For example, we might end up with a function that claims to evaluate its argument at most once, but to maintain that claim looks like \x. let y = x in ...body.... So now two thunks get allocated for every call, one single entry, and one indirection-thunk that does the update. This is bad bad.

Much better, I think, to do this:

  • Allow unrestricted transformation, as now, with no guarantees about maintaining single-entry-nes. Indeed maybe the demand analyser should not even record signle-entry-ness in the Ids, since it is so fragile.
  • Just before code generation, perhaps even after CorePrep, do demand analysis but not worker-wrapper. So all it does is pin on the single-entry-ness to each thunk, just before it is needed in CoreToStg. No simplifier pass afterwards to mess it up!

That should nail it in a robust way.

Note that this final demand-analysis is there solely to attach single-entry-ness. We might still want to do "late demand analysis" with -O2, as now, complete with worker/wrapper and subsequent simplification pass, to exploit the usual w/w benefits of strictness exposed at later stages of the pipeline. That becomes an independent choice.

comment:23 Changed 4 years ago by nomeata

Sounds reasonable, but it raises a few questions:

Indeed maybe the demand analyser should not even record signle-entry-ness in the Ids, since it is so fragile.

Yes, I’m all for not storing information that we do not keep up-to-date, as it will just be too likely that a later pass uses them, and things break again in obscure places. But then this should also apply to the strictness-and-demand signature of functions, which should be stripped of any 1* information! But that’s not too bad; the final, non-ww pass could attach full strictness signatures to exported functions (at least as long as we don’t have CSE as a STG-to-STG-transformation as proposed in #9291).

What should be the consequence for OneShot annotations on lambda binders? Don’t all the same considerations apply here?

comment:24 Changed 4 years ago by simonpj

  • The awkwardness of not recording top-level 1* info is that we'd need a demand-analyser flag to tell it whether to do so or not.
  • I've just realised that this final sweep (dmd anal only) must be just before CoreTidy. Reason: currently anyway, CoreTidy establishes the exported strictness signatures. We do want to export f with a "I use my arg at most once" 1* annotation. So we want to pin in reliable info just before we freeze it for export, and then not invalidate it.

One thing I have longed to do for some time is to get CAF info from the STG form, and combine that CAF info into the ModIface bindings. Currently we predict what it'll be at CoreTidy, which is fragile. Maybe the same holds for strictness/usage info.

  • OneShot annotations on lambda binders are a different matter. They say something about the use of the function, not about the use of the binder in the function body. Can a OneShot annotation go wrong? Yes if we call that function twice instead of once. But GHC really is pretty paranoid about duplicating arbitrary amounts of work so we are at least much safer here.

comment:25 Changed 4 years ago by nomeata

The awkwardness of not recording top-level 1* info is that we'd need a demand-analyser flag to tell it whether to do so or not.

Not too awkward, there is AnalEnv that can hold on to it.

But alternatively, we could do a second sweep that goes through the AST after DmdAnal and remove the bits we want removed. This would allow the demand analyzer to set and make use of 1* internally, e.g. during fixed-point iteration and such. This pass would also allow us to set the OneShot annotations on lambda binders based on the demand on the function, before we remove it – the occurrence analyzer cannot do it for us any more.

Or we move the onus of removing invalidated demand signatures onto the pass that actually invalidates them, and just be much more aggressive about this. The occurrence analyzer (if we think of it as the first thing done by the simplifier) would be a natural choice, with the added benefit that it can still use the information to set OneShot information before zapping it.

Can a OneShot annotation go wrong?

I cannot construct an example out of my head that would still work if we eradicated thanks that are wrongly marked as single entry, so we might be safe.

comment:26 Changed 4 years ago by simonpj

Let's be careful to distinguish three kinds of IdInfo, all pinned on an Id:

  • demandInfo: says how the Id is used (demanded) by its scope.
  • strictnessInfo: describes the demand transformer for the Id; it tnansforms a demand on the Id to a demand on its arguments and free variables.
  • oneShotInfo: for lambda-bound Ids only, says whether the function is called at most once. (It says nothing about how the Id is used in the body of the lambda.)

I think we have decided (#11770) that

  • only the demand analyser sets demandInfo and strictnessInfo
  • only the occurrence analyser sets oneShotInfo.

The demandInfo on an Id includes 1* usage info, which is fragile so perhaps we should erase it.

The strictnessInfo on an Id also contains 1* usage info, e.g about the function arguments. That too is fragile to transformations of the function body, so perhaps we should erase it too.

I've just realised that a convenient place to erase it might be the worker/wrapper pass. So then we'd have

  • the demand analyser sets demandInfo and strictnessInfo
  • the worker-wrapper pass erases 1* info from demandInfo and strictnessInfo
  • the occurrence analyser sets oneShotInfo

comment:27 Changed 4 years ago by nomeata

Sounds like a plan. I am currently checking the effect of a very late, non-ww, demand analyzer.

Are you sure that that is the most convenient place?

  • It would no longer get erased with -fstrictness -fno-worker-wrapper. Although we might argue that this is not a supported configuration.
  • Since the worker-wrapper pass happens before the next occurrence analyzer (does it?), the occurrence analyzer no longer sees the 1* demandInfo and thus cannot set the oneShotInfo.

Why not do it in the occurrence analyser?

comment:28 in reply to:  27 Changed 4 years ago by simonpj

  • It would no longer get erased with -fstrictness -fno-worker-wrapper. Although we might argue that this is not a supported configuration.

I'm very dubious about -no-worker-wrapper. Maybe we should kill it off.

  • Since the worker-wrapper pass happens before the next occurrence analyzer (does it?), the occurrence analyzer no longer sees the 1* demandInfo and thus cannot set the oneShotInfo.

False. The occurrence analyser uses call-once C1 info (the Count argument to UCall). It does not use the demanded-once 1* info (the Count field to Use). It is the latter that we are going to erase, NOT the former.

Why not do it in the occurrence analyser?

By "it" I guess you mean the erasure? Because the occurrence analyser runs repeatedly; no sense in repeatedly erasing info that is already erased.

comment:29 Changed 4 years ago by nomeata

It is the latter that we are going to erase, NOT the former.

Ah, thanks for the clarification; I must admit that I did not think it through with sufficient detail.

I’ll give it a shot as outlined by you.

comment:30 Changed 4 years ago by simonpj

Meanwhile, there is a separate issue in comment:20. It seems that for any un-saturated lambda, we nuke (a) the occInfo on the binders (fair enough; the next run of the occurrence analyser will regenerate it), and (b) the demandInfo on the binders. The reasoning is explained in comment:20.

But (b) is permanent; it won't be re-generated until the next run of the demand analyser. And, worse, it applies to function definitions

f = \xy . x+y

Here x and y will be marked as strictly demanded, but that info will get stripped after the first run of the simplifier. Try it on

f :: [Int] -> [Int] -> Int
f x y = head x + head y

After the demand analyser we have

f = \ (x_amY [Dmd=<S,1*U>] :: [Int]) (y_amZ [Dmd=<S,1*U>] :: [Int]) -> ...

but after the first run of the simplifier we get

f = \ (x_amY :: [Int]) (y_amZ :: [Int]) -> ...

(f itself still has a strictness signature that says it is strict in both args.)

Does this matter? Well, the main reason is that if f is inlined, we'd like to get a strict let. And now we won't.

Happily I think it is easily fixed. The key thing is that when doing beta-reduction, which effectively does (\x.e) b into let x=b in e, we must kill off x's demand/occurrence info if the lambda is not saturated.

So, idea, in Simplify.hs:

  • Give an extra Bool to simplLam indictating "saturated".
  • Compute (value) saturation in the simplExprF1 env expr@(Lam {}) cont, before calling simpLam, but do no zapping.
  • In simplLam, in the beta-reduction case,
    simplLam env (bndr:bndrs) body (ApplyToVal { sc_arg = arg, sc_env = arg_se
                                               , sc_cont = cont })
      = do  { tick (BetaReduction bndr)
            ; simplNonRecE env' (zap_unfolding bndr) (arg, arg_se) (bndrs, body) cont }
    
    do the binder-zapping right there, if "saturated" is not true. (That neatly puts it with the unfolding-zapping code.)

Now the non-beta-reduced lambdas won't be zapped, which is right.

Would you like to try that?

comment:31 in reply to:  30 Changed 4 years ago by nomeata

I moved that separate issue to #11778, and will comment after lunch.

comment:32 Changed 4 years ago by nomeata

Sounds like a plan. I am currently checking the effect of a very late, non-ww, demand analyzer.

It does fix the bug here (both as observed in nofib, and the test case I produced here). The numbers for thunks are the same as with -flate-dmd-anal, though.

I wonder: What else (besides the code-generator) uses the used-once information in a critical way? In particular, isUsedOnce is *only* mentioned in CoreToStg. Should we even bother removing the information then?

Anyways, I implemented it (wip/T11731 for now). I did put the code into DmdAnal.hs, after all, that’s where it belongs, and added a boolean flag to CoreDoStrictness. A full validation is running.

comment:33 in reply to:  32 Changed 4 years ago by simonpj

I wonder: What else (besides the code-generator) uses the used-once information in a critical way? In particular, isUsedOnce is *only* mentioned in CoreToStg. Should we even bother removing the information then?

I think nothing. And that's why I said it's optional to erase it. It's just a little smelly to have info in the tree that is wrong.

comment:34 Changed 4 years ago by simonpj

Crumbs. I see that you are doing an entire pass of the program to erase this info. That does seem like overkill, to remove info that is never examined anyway. I'd much rather not do this.

comment:35 Changed 4 years ago by nomeata

Differential Rev(s): Phab:D2064Phab:D2073

Crumbs. I see that you are doing an entire pass of the program to erase this info. That does seem like overkill, to remove info that is never examined anyway. I'd much rather not do this.

Agreed, I reverted it; the final result (with Note) is now in Phab:D2073.

I did look at the worker/wrapper module, but found that it would be quite ugly to do it there. For example, clear “In this case do nothing” code paths would have to turn to “In this case, do nothing (besides this change to all binders that we happen to do here).”

One clear implementation would have been to adjust the calls to setIdDemandInfo in DmdAnal, but that would go wrong because of the Cunning Plan with fixed-point iteration, where later iterations use the existing annotations.

comment:36 in reply to:  35 Changed 4 years ago by simonpj

One clear implementation would have been to adjust the calls to setIdDemandInfo in DmdAnal, but that would go wrong because of the Cunning Plan with fixed-point iteration, where later iterations use the existing annotations.

Well imagine that the demand analyser simply NEVER attributed single-use to anything; e.g. mkOnceUsedDmd returned Use Many .... Then we'd get no useful use-once info from the analyser so we wouldn't need to erase it.

Not sure if it's worth it. There are other occurrence of the string Use One in Demand.

comment:37 Changed 4 years ago by nomeata

That would be a clean approach... but I’m worried that we’d get less useful information also for called-once, and thus needlessly cripple the analysis. Isn’t the information about single-use demand used internally to determine that certain functions are called once?

comment:38 in reply to:  37 Changed 4 years ago by simonpj

Replying to nomeata:

Isn’t the information about single-use demand used internally to determine that certain functions are called once?

No, I don't think so. I'm 85% certain!

comment:39 Changed 4 years ago by nomeata

Then I’ll give it a shot: I’ll re-add the Bool parameter to the demand analysis which indicates whether we want used-once-information, modify the calls to mkOnceUsedDmd, and just see what happens.

comment:40 Changed 4 years ago by simonpj

Trouble is that there are other places where Use One shows up in Demand.hs, and we'd need to look at all of them too. But I think this alone will kill off a lot.

comment:41 Changed 4 years ago by nomeata

With the late demand analysis, suddenly demand signatures involving free variables turn up in -ddump-simpl:

 Roman.foo_$s$wgo [Occ=LoopBreaker]
   :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
-[GblId, Arity=2, Caf=NoCafRefs, Str=<L,U><L,U>]
+[GblId, Arity=2, Caf=NoCafRefs, Str=<L,U><S,U>]
 Roman.foo_$s$wgo =
   \ (sc :: GHC.Prim.Int#) (sc1 :: GHC.Prim.Int#) ->
     let {
       m :: GHC.Prim.Int#
-      [LclId]
+      [LclId, Str={s1Yc-><S,A>}]
...

As this contains uniques, this breaks a few test cases. So it seems that previously, this information has been removed somewhere (in the simplifier or so). With DmdAnal running right at the end, this is not happening. So I wondered if CoreTidy should remove them, but there it says

        -- Note [Tidy IdInfo]
        -- We need to keep around any interesting strictness and
        -- demand info because later on we may need to use it when
        -- converting to A-normal form.
        -- eg.
        --      f (g x),  where f is strict in its argument, will be converted
        --      into  case (g x) of z -> f z  by CorePrep, but only if f still
        --      has its strictness info.
        --
        -- Similarly for the demand info - on a let binder, this tells
        -- CorePrep to turn the let into a case.

Questions:

  • At CorePrep time, should there still be a demand environment in the strictness signatures? (Note that the Binary instance does _not_ serialize the environment, but discards it).
  • If not, should I remove the environment in TidyCore?
  • Where was it removed before?

comment:42 Changed 4 years ago by nomeata

Answers via skype talk:

  • No, should not be there.
  • Yes
  • WorkWrap (tryWW)

comment:43 Changed 4 years ago by nomeata

Very nice results due to the final demand analyzer:

nofib/allocs/cichelli 	80313656 	- 22.92% 	61907656 	bytes
nofib/allocs/mandel2 	1041544 	- 11.40% 	922768 	bytes
nofib/time/cryptarithm1 0.421 	 	 - 5.23% 	0.399 	seconds
tests/alloc/T9233 	1030551552 	 + 3.61% 	1067738016 	bytes
tests/alloc/T9675 	574886360 	 + 5.26% 	605104208 	bytes

(I hope they are not too good to be true).

The regressions in the two test cases are compiler benchmarks; I think this is the expected cost for another run of the demand analyzer.

Left to do for this ticket: Remove the (unused, possibly wrong) 1* annotations in the Worker/Wrapper-Phase.

comment:44 Changed 4 years ago by simonpj

Wow, that's quite remarkable. I hope that the ticky numbers will show why cichelli and mandel2 are improving . Since this is from the "final run" thing, it must presumably be from increasing the number of single entry thunks we find, right? And that should show up in the ticky numbers you gather.

It's curious that other compiler benchmarks don't also worsen. Maybe you are right, but I rather doubt that 5% of compile time is the cost of single run of the demand analyser, unless it's a very pathological program with deeply nested letrecs.

Again, it'd be good to try this with a stage1 compiler to isolate the effects. But we'd expect this to speed up a stage2 compiler, so the effect would be more pronounced with stage1. In HEAD Ben has added a bit more time-tracking to passes which may help.

comment:45 Changed 4 years ago by nomeata

The regressions in the two test cases are compiler benchmarks; I think this is the expected cost for another run of the demand analyzer.

These regressions also happen with a stage1 compiler, and the extra allocations correspond quite well to the the numbers shown by the new per-pass-numbers. They don’t have deeply nested letrecs, but rather data types with lots and lots of record fields.

Now to cichelli. The number of dynamic thunks entered (ENT_DYN_THK_MANY_ctr) goes down drastically (from 576007 to 115857). This is due to Auxil.$whinsert getting a strictness signature in its first argument, which causes the code

  where
  try newAssocs = ( case hinsert (hash newCharAssocs k) keyHashSet of
             Nothing -> (NotEver 1)

in Prog to not allocate a thunk for the argument to hinsert, but rather evaluate hash strictly. The reason the worker $whinsert (or rather the original hinsert) is not stricter earlier is that this is only visible after more inlining and case-of-case transformations. So this is nice.

For mandel2, there are lets (split_y) that now have a strict demand and thus are compiled differently by the code generator. Nice as well.

So the effect was not so much due to single entry thunks, but rather due to strictness (either via exported signatures and knock-on effects on later modules, or via strict lets).

Does that address the concerns?

comment:46 Changed 4 years ago by simonpj

Yes that's good. For the paper we can talk about the find dmd-anal pass, but in showing the payoff from single-entry etc we will have to compare like with like; i.e. not claim the big wins here as wins for single-entry-ness.

We'd also flirted with -flate-dmd-anal (complete with worker/wrapper) which could show greater benefits still (from the w/w). I hypothesise that if we did -flate-dmd-anal then the final dmd-anal pass would buy very little except the correct single-entry info.

comment:47 Changed 4 years ago by Joachim Breitner <mail@…>

In 3a34b5c/ghc:

Add a test case for #11731.

comment:48 Changed 4 years ago by Joachim Breitner <mail@…>

In f4fd98c/ghc:

Add a final demand analyzer run right before TidyCore

in order to have precise used-once information in the exported
strictness signatures, as well as precise used-once information on
thunks. This avoids the bad effects of #11731.

The subsequent worker-wrapper pass is responsible for removing the
demand environment part of the strictness signature. It does not run
after the final demand analyzer pass, so remove this also in CoreTidy.

The subsequent worker-wrapper pass is also responsible for removing
used-once-information from the demands and strictness signatures, as
these might not be preserved by the simplifier. This is _not_ done by
CoreTidy, because we _do_ want this information, as produced by the last
round of the demand analyzer, to be available to the code generator.

Differential Revision: https://phabricator.haskell.org/D2073

comment:49 Changed 4 years ago by nomeata

Alright, pushed this. No outstanding patches related to the cardinality analysis now from my side.

comment:50 Changed 4 years ago by nomeata

Resolution: fixed
Status: patchclosed

comment:51 Changed 3 years ago by simonpj

Milestone: 8.0.3
Status: closedmerge

This patch should go into 8.0.3. See #12996.

comment:52 Changed 3 years ago by bgamari

Milestone: 8.0.38.2.1

At this point it is rather unlikely that there will be an 8.0.3. Re-milestoning.

comment:53 Changed 3 years ago by bgamari

Status: mergeclosed
Note: See TracTickets for help on using tickets.