Opened 5 years ago

Last modified 7 months ago

#9279 new bug

Local wrapper function remains in final program; result = extra closure allocation

Reported by: simonmar Owned by: simonpj
Priority: normal Milestone:
Component: Compiler Version: 7.8.2
Keywords: LateLamLift, Simplifier Cc: nicolas.frisby@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I have a strange problem with a local function binding that is not being inlined. I've attached the repro code, build it like this:

$ ghc -O2 Haxl/Core/Monad.hs

There are several modules, but the one with the problem is Haxl/Core/Monad.hs. In particular Haxl.Core.Monad.$fApplicativeGenHaxl2, which contains this fragment:

            let {
              $wa4_s6YQ
              $wa4_s6YQ =
                \ w_s6YF _ ww1_s6YM _ _ w1_s6YI ->
                  case GHC.Prim.readMutVar# ipv7_X5Ne w1_s6YI
                  of _ { (# ipv10_X5QO, ipv11_X5QQ #) ->
                  case ipv11_X5QQ of _ {
                    Haxl.Core.Monad.IVarFull a2_a3kK -> case lvl6_r7qi of wild3_00 { };
                    Haxl.Core.Monad.IVarEmpty dt2_d4Vs ->
                      case GHC.Prim.writeMutVar#
                             ipv7_X5Ne (Haxl.Core.Monad.IVarFull w_s6YF) ipv10_X5QO
                      of s2#_a5OR { __DEFAULT ->
                      case GHC.Prim.readMutVar# dt2_d4Vs s2#_a5OR
                      of _ { (# ipv12_X5R0, ipv13_X5R2 #) ->
                      case GHC.Prim.readMutVar# ww1_s6YM ipv12_X5R0
                      of _ { (# ipv14_X5SS, ipv15_X5SU #) ->
                      letrec {
                        go_a5ti
                        go_a5ti =
                          \ ds10_a5tj ->
                            case ds10_a5tj of _ {
                              [] -> ipv15_X5SU;
                              : y_a5to ys_a5tp -> GHC.Types.: (y_a5to w_s6YF) (go_a5ti ys_a5tp)
                            }; } in
                      case go_a5ti ipv13_X5R2 of x'_a5P6 { __DEFAULT ->
                      case GHC.Prim.writeMutVar# ww1_s6YM x'_a5P6 ipv14_X5SS
                      of s2#1_a5P7 { __DEFAULT ->
                      (# s2#1_a5P7, Haxl.Core.Monad.cacheRequest6 #)
                      }
                      }
                      }
                      }
                      }
                  }
                  } } in
            let {
              a2_s6lH
              a2_s6lH =
                \ w_s6YF _ w2_s6YH w3_s6YI ->
                  case w2_s6YH
                  of _
                  { Haxl.Core.Monad.SchedState ww1_s6YL ww2_s6YM ww3_s6YN ww4_s6YO ->
                  $wa4_s6YQ w_s6YF ww1_s6YL ww2_s6YM ww3_s6YN ww4_s6YO w3_s6YI
                  } } in

I want $wa4 to be inlined at its single occurrence in a2. I believe the reason it is not being inlined is that a2 is a wrapper, but the situation seems silly because the wrapper isn't going away either, so we have a redundant closure being built (this is in my inner loop).

Attachments (3)

repro.tar.gz (25.9 KB) - added by simonmar 5 years ago.
repro code
llf-ddump-simpl (41.8 KB) - added by nfrisby 5 years ago.
simpl output with the late lambda float
repro2.tgz (16.1 KB) - added by sgraf 12 months ago.
Repro without reference to unordered-containers/hashable

Download all attachments as: .zip

Change History (28)

Changed 5 years ago by simonmar

Attachment: repro.tar.gz added

repro code

comment:1 Changed 5 years ago by simonmar

Owner: set to simonpj

comment:2 Changed 5 years ago by simonpj

So how do I reproduce this? I get

simonpj@cam-05-unx:~/tmp/T9279$  ghc -O2 Haxl/Core/Monad.hs

Haxl/Core/Types.hs:66:8:
    Could not find module ‘Data.Aeson’
    Perhaps you meant Data.Version (from base)
    Use -v to see a list of the files searched for.

Haxl/Core/Types.hs:67:8:
    Could not find module ‘Data.Hashable’
    Use -v to see a list of the files searched for.

Haxl/Core/Types.hs:72:8:
    Could not find module ‘Data.HashMap.Strict’
    Perhaps you meant
      Data.IntMap.Strict (from containers-0.5.5.1)
      Data.Map.Strict (from containers-0.5.5.1)
    Use -v to see a list of the files searched for.

Haxl/Core/Monad.hs:73:8:
    Could not find module ‘Control.Concurrent.STM’
    Perhaps you meant
      Control.Concurrent.QSem (from base)
      Control.Concurrent (from base)
      Control.Concurrent.Chan (from base)
    Use -v to see a list of the files searched for.

Also can you try with -ddump-inlinings -dverbose-core2core -ddump-occur-anal and add the log file?

Simon

comment:3 Changed 5 years ago by simonmar

You also need to install aeson, unordered-containers, stm and text (I think those should be enough).

I've collected the log you asked for, it's too big to attach here so I've put it at http://community.haskell.org/~simonmar/log.bz2.

comment:4 Changed 5 years ago by simonpj

Cc: nicolas.frisby@… added

OK I can see what is going on; it is more or less as you say. We have something like

   let worker  = \x. BIG in
   let {-# INLINE wrapper #-}  
       wrapper = \y. ...worker....  in
   ...map wrapper xs....

(Technically, the wrapper has a "stable inlining" which I have not written above, which is a copy of the body of the wrapper.)

Now, GHC is (in general rightly) unhappy about inlining worker in wrapper's rhs, even though there is currently just one textual occurrence of worker, because there might (now or in the future) be many occurrences of wrapper, and the INLINE pragma means we'll make a copy of wrapper's rhs at each occurrence.

However, in this case, wrapper is not inlined because it's not applied to anything, so nothing is gained. But something is lost.

I can see various possible solutions:

  • Lambda-lift local function definitions, late in the pipeline. This is something Nicolas Frisby was working on, with promising results, but he sadly he never completed the work. I'm copying him on this ticket.
  • Towards the end of the pipeline, forget that wrappers are special by discarding their INLINE pragmas. This is a bit delicate; I think you would not want to do this for things marked INLINE by the programmer. And certainly not for top-level things either (else you'd defeat the entire w/w idea).

I wonder how often this happens. Might not be hard to find out; how often does a local function definition with an INLINE pragma appear in the final Core?

Simon

comment:5 Changed 5 years ago by simonpj

Summary: Optimisation bugLocal wrapper function remains in final program; result = extra closure allocation

comment:6 Changed 5 years ago by nfrisby

Regarding status of late lambda lift, please see ticket:9374#comment:3.

comment:7 Changed 5 years ago by nfrisby

Based only on the eye-balled free variable analysis of $wa4, I think that the current late lambda lift implementation would lift $wa4. ipv7_X5Ne is the only free variable I see in the closure, so replacing $wa4 with ip7_X5Ne in a2_s6lH would not constitute closure growth, which is what would most likely prevent the lift. (I'm assuming $wa4 is not LNE.)

I'll attempt the repro with my local build and let you know for sure.

Changed 5 years ago by nfrisby

Attachment: llf-ddump-simpl added

simpl output with the late lambda float

comment:8 Changed 5 years ago by nfrisby

9c2904c, the tip of wip/llf, using my recommended settings (cf LateLamLift) does lift our $wa4 (or whatever lead to it).

The new attachment is the output of /home/nifr/tmp/llf-for-9279/bin/ghc -fforce-recomp -O2 $(echo $(cat ~/ghc-llf/llf-nr10-r6)) -c Haxl/Core/Monad.hs -ddump-llf -ddump-simpl -dsuppress-all. The subcommand spews out the "recommended settings" that I refer to as llf-nr10-r6.

HTH.

comment:9 Changed 12 months ago by simonpj

Keywords: LateLamLift added

Changed 12 months ago by sgraf

Attachment: repro2.tgz added

Repro without reference to unordered-containers/hashable

comment:10 Changed 12 months ago by sgraf

The current tip of my lambda lifting branch will not lift $wa4, because that would result in a top-level function with 7 parameters, which is two more than available hardware registers.

Since the above link with the reproduction is offline, I had to delete/rename stuff from the original reproduction until it compiled again myself, but I think I have identified the binding which corresponds to the old $wa4:

let {
  $wlvl_smyy =
      sat-only \r [w_smyz
                    ww_smyA
                    ww1_smyB
                    ww2_smyC
                    ww3_smyD
                    void_X1V]
          case
              readMutVar# [ipv7_smyu void#]
          of
          { Unit# ipv11_smyH ->
                case ipv11_smyH of {
                  IVarFull _ -> lvl12_rmt8;
                  IVarEmpty dt2_smyL ->
                      let {
                        sat_smyM =
                            CCCS IVarFull! [w_smyz];
                      } in 
                        case
                            writeMutVar# [ipv7_smyu
                                          sat_smyM
                                          void#]
                        of
                        s2#_smyN
                        { (##) ->
                              case
                                  readMutVar# [dt2_smyL
                                                void#]
                              of
                              { Unit# ipv13_smyQ ->
                                  <the previous go loop was
                                   inlined a few times>
                              };
                        };
                };
          }; } in
let {
  lvl20_smz3 =
      \r [w_smz4 w1_smz5 w2_smz6 void_X1U]
          case w2_smz6 of {
            SchedState ww1_smz9
                        ww2_smza
                        ww3_smzb
                        ww4_smzc ->
                $wlvl_smyy
                    w_smz4
                    ww1_smz9
                    ww2_smza
                    ww3_smzb
                    ww4_smzc
                    void#;
          }; } in

There's also this debug output:

stgLiftLams:goodToLift
  [$wlvl_smyy]
  args spill on stack

Which tells me that it won't lift entirely due to $wlvl taking to many parameters. Which is a shame, because it's pretty clear that 3 parameters are absent. I suspect some Demand Analyser / WW thing at fault here.

comment:11 Changed 12 months ago by simonpj

You could try -flate-dmd-anal, which comes with -O2. It runs the demand analyser late in the pipeline, precisely to eliminate useless absent args.

Does it help?

comment:12 Changed 11 months ago by sgraf

-flate-dmd-anal doesn't seem to have any effect. Weird, considering that DmdAnal definitely sees the opportunity, but WW doesn't seem to exploit it.

comment:13 Changed 11 months ago by sgraf

The absent case in WWLib.hs requires that we can mk_absent_let for that particular type. For the 3 absent parameters here are of type MutVar# s a, MutVar# s a and TVar# s a.

We could extend Literal.absentLiteralOf for these unlifted, boxed cases simply by returning (the equivalent of) NULL. Alternatively, we could make sure that we don't unpack these things in the first place: If we had taken lifted MVars/TVars, mk_absent_let would successfully conjure a lifted let binding with an absent error (so I'd say).

Last edited 11 months ago by sgraf (previous) (diff)

comment:14 Changed 11 months ago by sgraf

If I apply this diff

diff --git a/compiler/basicTypes/Literal.hs b/compiler/basicTypes/Literal.hs
index 21f4a92290..4444f69c7b 100644
--- a/compiler/basicTypes/Literal.hs
+++ b/compiler/basicTypes/Literal.hs
@@ -618,6 +618,8 @@ absentLiteralOf tc = lookupUFM absent_lits (tyConName tc)
 
 absent_lits :: UniqFM Literal
 absent_lits = listToUFM [ (addrPrimTyConKey,    MachNullAddr)
+                        , (mutVarPrimTyConKey,  MachNullAddr)
+                        , (tVarPrimTyConKey,    MachNullAddr)
                         , (charPrimTyConKey,    MachChar 'x')
                         , (intPrimTyConKey,     mkMachIntUnchecked 0)
                         , (int64PrimTyConKey,   mkMachInt64Unchecked 0)

Everything WWs properly and the troubling binding $wlvl gets lifted to top-level. Given that this might bite somewhere else, should we maybe generalise absentLiteralOf by returning MachNullAddr whenever it gets a boxed type?

comment:15 Changed 11 months ago by sgraf

Here's a diff that handles all UnliftedReps uniformly:

diff --git a/compiler/basicTypes/Literal.hs b/compiler/basicTypes/Literal.hs
index 21f4a92290..0e9f25f51e 100644
--- a/compiler/basicTypes/Literal.hs
+++ b/compiler/basicTypes/Literal.hs
@@ -62,6 +62,7 @@ import Binary
 import Constants
 import DynFlags
 import Platform
+import RepType
 import UniqFM
 import Util
 
@@ -614,11 +615,14 @@ literalType (LitNumber _ _ t) = t
 absentLiteralOf :: TyCon -> Maybe Literal
 -- Return a literal of the appropriate primitive
 -- TyCon, to use as a placeholder when it doesn't matter
-absentLiteralOf tc = lookupUFM absent_lits (tyConName tc)
+absentLiteralOf tc
+  | tyConPrimRep tc == [UnliftedRep]
+  = ASSERT (isUnliftedTyCon tc) Just MachNullAddr
+  | otherwise
+  = lookupUFM absent_lits (tyConName tc)
 
 absent_lits :: UniqFM Literal
-absent_lits = listToUFM [ (addrPrimTyConKey,    MachNullAddr)
-                        , (charPrimTyConKey,    MachChar 'x')
+absent_lits = listToUFM [ (charPrimTyConKey,    MachChar 'x')
                         , (intPrimTyConKey,     mkMachIntUnchecked 0)
                         , (int64PrimTyConKey,   mkMachInt64Unchecked 0)
                         , (wordPrimTyConKey,    mkMachWordUnchecked 0)

But now MachNullAddr isn't always a literal of type Addr#. In particular, the definition

-- | Find the Haskell 'Type' the literal occupies
literalType :: Literal -> Type
literalType MachNullAddr      = addrPrimTy
literalType (MachChar _)      = charPrimTy
literalType (MachStr  _)      = addrPrimTy

is probably a lie. But then it also lies for MachStr and MachLabels, so maybe this isn't such a bad thing? The binding should be immediately eliminated by the simplifier, after all.

comment:16 Changed 11 months ago by sgraf

OK, this diff in isolation provokes Core Lint errors everywhere:

*** Core Lint errors : in result of Worker Wrapper binds ***
<no location info>: warning:
    [RHS of ww_scA1 :: Array# b_sczT]
    The type of this binder doesn't match the type of its RHS: ww_scA1
    Binder's type: Array# b_sczT

So, it's pretty clear we would have to add an additional field to MachNullAddr (MachNull would probably be more appropriate) for the type it represents, similar to LitNumber.

comment:17 Changed 11 months ago by sgraf

Note [Absent errors] reads

Note: I did try the experiment of using an error thunk for unlifted
things too, relying on the simplifier to drop it as dead code.
But this is fragile

 - It fails when profiling is on, which disables various optimisations

 - It fails when reboxing happens. E.g.
      data T = MkT Int Int#
      f p@(MkT a _) = ...g p....
   where g is /lazy/ in 'p', but only uses the first component.  Then
   'f' is /strict/ in 'p', and only uses the first component.  So we only
   pass that component to the worker for 'f', which reconstructs 'p' to
   pass it to 'g'.  Alas we can't say
       ...f (MkT a (absentError Int# "blah"))...
   bacause `MkT` is strict in its Int# argument, so we get an absentError
   exception when we shouldn't.  Very annoying!

So absentError is only used for lifted types.

But I suppose having just use NULL for the other unlifted+boxed cases is OK. We've been doing it for Addr# for a long time.

comment:18 Changed 11 months ago by sgraf

Ah, just saw that AddrRep /= UnliftedRep... Also, defining | MachNull Type makes it hard to define Ord and Binary instances. Urgh. Not sure how to proceed.

comment:19 Changed 11 months ago by sgraf

Extending the existing approach by adding e.g. | MachNullMutVar doesn't work, because we still don't know which type to return in literalType. The type arguments s and a to the MutVar# constructor are lost. So there is no way around this issue unless we store Types in Literal that are relevant in Ord and Binary instances.

comment:20 Changed 11 months ago by simonpj

There are two things going on in this ticket

  1. Should we inline wrappers late in the pipeline? See comment:4
  1. Can we do a better job for "absent" arguments of unlifted types.

I'll concentrate on (2) in this comment; but we should not lose sight of (1). In fact it might be better to make (2) a new ticket and leave this one for (1) -- or vice versa.

For a long time, the worker/wrapper splitter has given up on absent arguments of certain unlifted types: see Literal.absentLiteralOf and Note [Absent errors] in WwLib. This is very annoying because it means that we get left with functions that take a bunch of arguments they do not use, as in this ticket.

For lifted types T we build an absent value as a thunk of form

  aBSENT_ERROR_ID @T "Used absent value"

This does two things

  1. It gives us something, of the right type, to use in place of the value

we aren't passing any more.

  1. It gives an extra sanity check: if that value is ever used (a compiler

bug) we'll get a runtime error message.

For unlifted types we don't have thunks, so we can't do this. As you can see in absentLiteralOf, for some types we just make up a silly value: e.g. for Char# we use 'x#'; for Int# we use 0#.

Note, however that

  • Substituting a particular value serves purpose (A) but not purpose (B). A compiler bug would go undetected. This is sad: e.g. #11126 is a real bug that was detected by (B). But I see no way out.
  • It doesn't work for Array#, MutVar#, TVar# etc because we have no available literal values of those types.

So Sebastian is suggesting that we add a new literal value -- call it a rubbish value -- which can work for any (unlifted type), extending Literal something like this

data Literal = ...
  | RubbishLit Type

We need to store the type so we can still do literalType.

Now

  • Maybe we could get rid of MachNullAddr in favour of this new literal.
  • I think -- but I am not sure -- that this literal should never occur in code generation. For example, we should never pass a rubbish value to a function. Before then dead-code elimination should have got rid of it I'm not 100% certain, but if this was true, it'd be a great sanity check.
  • Yes, Literal has Eq and Ord -- but I'm not sure why. Try removing them and seeing what happens! (Generally I think it'd be better to define eqLit and cmpLit and cal them, rather than use == and >; so much easier to grep for!

And in fact, we do have eqType and cmpType.

  • Do we need to spit out a RubbishLit in the Binary instance. This seems more likely, because perhaps these rubbish values can occur in unfoldings, which are serialised as their parse tree. But the we can just serialise the Type. It won't happen much.

comment:21 Changed 11 months ago by simonpj

Concerning comment@4 I have not looked in detail at the example in the Description, but I'm guessing that we have a situation like this

f x xs = let g :: Int -> Int
             g y = y+x
         in map g xs

We see that g is strict, so we w/w it thus:

f x xs = let $wg :: Int# -> Int#
             $wg y# = case x of I# x# -> y# + x#

             g :: Int -> Int
             {-# Stable unfolding = \y -> <as below> #-}
             g y = case y of I# y# ->
                   case $wg y# of r# ->
                   I# r#
         in map g xs

But alas, g never gets to inline, so it is all in vain. Worse, we have lost out, because map calls g which calls $wg and that's slower than what we started with.

What we want is this:

  • If the only call to $wg is from g, then inline it back in.

Currently there are two calls to gw, one in the RHS of g and one in the stable unfolding of g. If we simply nuked the stable unfolding to g (which was added by w/w), then there'd only be one call to gw and we'd inine it happily.

On the other hand, if the body of f was (map g xs, g 7), then the g 7 would by now have turned into a call of $wg, so whether we inlined $wg would depend on how big $wg is, which is absolutely fine.

Arguably should not do this nuking stuff until after TidyCore, which generates bindings to put in the interface file. We want to leave f's RHS undisturbed until then, in case f itself is inlined in other modules. (An alternative view: it'd be ok to dump the w/w split in before TidyCore because we'll rediscover the strictness (and perhaps better strictness) in any module that inlines f.)

My conclusion:

  • for local functions (i.e. bound by a let, or at top level but not exported)
  • that have been w/w'd
  • at a fairly late stage in the pipeline
  • kill off the stable-unfolding introduced by w/w
  • and simplify

I think it'd be interesting to try this.

comment:22 Changed 11 months ago by sgraf

Thanks for your input! I created #15627 to track progress on the mk_absent_let thing.

comment:23 Changed 10 months ago by Krzysztof Gogolewski <krz.gogolewski@…>

In 448b77b9/ghc:

Add RubbishLit for absent bindings of UnliftedRep

Summary:
Trac #9279 reminded us that the worker wrapper transformation copes
really badly with absent unlifted boxed bindings.

As `Note [Absent errors]` in WwLib.hs points out, we can't just use
`absentError` for unlifted bindings because there is no bottom to hide
the error in.
So instead, we synthesise a new `RubbishLit` of type
`forall (a :: TYPE 'UnliftedRep). a`, which code-gen may subsitute for
any boxed value. We choose `()`, so that there is a good chance that
the program crashes instead instead of leading to corrupt data, should
absence analysis have been too optimistic (#11126).

Reviewers: simonpj, hvr, goldfire, bgamari, simonmar

Reviewed By: simonpj

Subscribers: osa1, rwbarton, carter

GHC Trac Issues: #15627, #9279, #4306, #11126

Differential Revision: https://phabricator.haskell.org/D5153

comment:24 Changed 10 months ago by simonpj

The patch in comment:23 nails issue (2) of comment:20, leaving issue (1) open. So we should not close this ticket.

comment:25 Changed 7 months ago by simonpj

Keywords: Simplifier added
Note: See TracTickets for help on using tickets.