Opened 5 years ago

Last modified 6 months ago

#10069 patch bug

CPR related performance issue

Reported by: pacak Owned by:
Priority: normal Milestone: 8.8.1
Component: Compiler Version: 8.6.2
Keywords: DemandAnalysis Cc: tkn.akio@…, maoe@…, ghc.haskell.org@…, Fuuzetsu, sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: T10069
Blocked By: Blocking:
Related Tickets: #13143 Differential Rev(s): !401
Wiki Page:

Description

By default CPR analysis can be too aggressive in trying to pass as much as possible in unboxed tuples. In general that is not a problem, but when one big datatype is passed to several consumers it can end up being pushed onto the stack several times instead of being allocated once on the heap. Things get worse when there are enough fields to cause a stack overflow that could otherwise be avoided: in our codebase, adding one field with ExistentialQuantification (unused, but it prevents GHC from doing the CPR transformation) reduces the number of stack overflows by a factor of 1000 and improves overall performance by 10%.

In the provided example the performance of A and B should be identical, yet B is consistently faster by 3-5%.

It's possible to increase this performance gap by adding more and more fields.

I was able to reproduce this issue with GHC 7.8.3 and 7.10.1-rc2.

{-# LANGUAGE ExistentialQuantification #-}

module Blah where

import Criterion
import Criterion.Main
import Data.Typeable

data A = A ()
    !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int
    !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int

data B = forall rep. (Typeable rep) => B rep
    !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int
    !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int !Int

a :: A
a = A () 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

b :: B
b = B () 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

{-# NOINLINE a1 #-}
a1 :: A -> Int
a1 (A _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f1

{-# NOINLINE a2 #-}
a2 :: A -> Int
a2 (A _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f2

{-# NOINLINE a3 #-}
a3 :: A -> Int
a3 (A _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f3

{-# NOINLINE a4 #-}
a4 :: A -> Int
a4 (A _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f4

{-# NOINLINE b1 #-}
b1 :: B -> Int
b1 (B _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f1

{-# NOINLINE b2 #-}
b2 :: B -> Int
b2 (B _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f2

{-# NOINLINE b3 #-}
b3 :: B -> Int
b3 (B _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f3

{-# NOINLINE b4 #-}
b4 :: B -> Int
b4 (B _ f1 f2 f3 f4 f5 f6 f7 f8 g1 g2 g3 g4 g5 g6 g7 g8 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _) = f4

{-# NOINLINE fa #-}
fa :: A -> Int
fa a = a1 a + a2 a + a3 a + a4 a

{-# NOINLINE fb #-}
fb :: B -> Int
fb b = b1 b + b2 b + b3 b + b4 b

main :: IO ()
main = defaultMain [
   bgroup "single call" [
     bench "A" $ whnf fa a
   , bench "B" $ whnf fb b
   ]
   ]


Attachments (7)

bench.hs (5.1 KB) - added by pacak 5 years ago.
Benchmark file itself
bench.dump-simpl (121.2 KB) - added by pacak 5 years ago.
Produced core file
out.txt (1.3 KB) - added by pacak 5 years ago.
benchmark output
ManualStrategy.dump-simpl (184.7 KB) - added by pacak 5 years ago.
sample.tar.gz (4.6 KB) - added by pacak 5 years ago.
Blah.dump-simpl (70.7 KB) - added by Fuuzetsu 9 months ago.
results.tar.gz (36.0 KB) - added by Fuuzetsu 9 months ago.
Results from runs of multiple GHC versions.


Change History (44)

comment:1 Changed 5 years ago by akio

Cc: tkn.akio@… added

comment:2 Changed 5 years ago by maoe

Cc: maoe@… added

comment:3 Changed 5 years ago by simonpj

For me fa and fb generate identical code. Are you sure about those NOINLINE pragmas?

Maybe give the command line you use for compiling, and show the output of -ddump-simpl?

Simon

Changed 5 years ago by pacak

Attachment: bench.hs added

Benchmark file itself

Changed 5 years ago by pacak

Attachment: bench.dump-simpl added

Produced core file

Changed 5 years ago by pacak

Attachment: out.txt added

benchmark output

comment:4 Changed 5 years ago by pacak

Those NOINLINE pragmas are required to replicate the behaviour I'm getting in the production code. The generated Core looks the same, but the generated assembly is different. With more fields added (see attachment) the performance difference is 4x, and it goes away as soon as I get rid of the forall'ed field.

comment:5 Changed 5 years ago by pacak

It's compiled with ghc -O2 bench.hs

comment:6 Changed 5 years ago by simonpj

Thanks for the -ddump-simpl output. I can see what is happening.

The trouble is that a1, a2 etc are marked NOINLINE, but their strictness information is still visible. That makes fa strict too. So the original code looks like this (I have decreased the number of selectors to make it easy to read):

--------- Before strictness analysis -------------
fa a = case a1 a of I# i1 ->
       case a2 a of I# i2 ->
       I# (i1 +# i2)

--------- After strictness analysis and worker/wrapper -------------
fa a = case a of A _ _ _ _ i1 _ _ _ _ i2 _ _ _ _ ->
       $wfa i1 i2

$wfa i1 i2
  = let a = A bot bot bot bot i1 bot bot bot bot i2 bot bot bot bot
    in case a1 a of I# i1 ->
       case a2 a of I# i2 ->
       I# (i1 +# i2)

The worker $wfa would normally collapse into nice, tight code once the workers for a1 and a2 are inlined. But here they are not! So $wfa reboxes its arguments, which is terrible.
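
For contrast, here is a hedged sketch (continuing the same illustration, and assuming the workers of a1 and a2 were free to inline) of the tight code $wfa would normally collapse to once the reconstructed A cancels with their case matches:

--------- What $wfa would normally become -------------
$wfa i1 i2 = I# (i1 +# i2)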

comment:7 Changed 5 years ago by pacak

There are no NOINLINE pragmas in the code we are using in production, but the behaviour is similar: it works faster with the unused ExistentialQuantification field. So the existential disables not only CPR but some other optimisations as well. What's the conclusion then? Is this fixable, or do I need to look for other examples?

comment:8 Changed 5 years ago by simonpj

That's odd. I don't understand how that is happening. I need more information.

Can you show a -ddump-simpl of the bit of your production code that is allocating more? (With and without the change.) If you compile with -ticky and then run with +RTS -rfoo.ticky the file foo.ticky will show which function is allocating how much. If you compare with and without the existential change you should see an obvious culprit.

Thanks

Simon

comment:9 Changed 5 years ago by pacak

Found a file. It has -fno-spec-constr at the top, with a comment attached to that flag: "this was added because it makes compilation blow up with ghc-7.8.3, and sims were neutral. Consider removing after ghc is fixed."

It contains a relatively simple Parser-based parser and a bunch of functions that each modify different fields of one huge data structure.

507572 1368414112 0 313 >MSMLSSSSSSLLMMSDDDD XXXXXXX-0.1.0.0:XXXXXXX.$wa{v rbsKm}

With -fno-spec-constr removed it takes forever to compile (30 minutes already, for 700 lines of code). Trying to pick out the relevant bits of Core.

comment:10 Changed 5 years ago by simonpj

Leave the -fno-spec-constr. Let's change only one thing at a time.

Does it happen with -O or only with -O2?

Simon

comment:11 Changed 5 years ago by pacak

With -O $wa is still there and looks about the same.

comment:12 Changed 5 years ago by pacak

And performance also suffers.

comment:13 Changed 5 years ago by simonpj

Like I say in comment:8, I can't help without more info. Can you give a -ddump-simpl of code that shows the problem without use of NOINLINE? Maybe your production code.

I understand why it happens with NOINLINE (comment:6). But you say you don't use it.

comment:14 Changed 5 years ago by pacak

22k lines, trying to pick relevant bits.

comment:15 Changed 5 years ago by pacak

Simon, can I send you an email? Core is 1.3Mb so I can't attach it here.

comment:16 Changed 5 years ago by pacak

$wa takes half of the file.

comment:17 Changed 5 years ago by simonpj

send email if you like, yes

comment:18 Changed 5 years ago by pacak

Did you get my email? Can I get you any other extra information?

comment:19 Changed 5 years ago by simonpj

Yes, got it, but I'm just snowed under. It would be super-helpful to be able to reproduce your problem myself, if you were somehow able to boil your production code down into a test case.

comment:20 Changed 5 years ago by pacak

Emailed a 10x smaller version. I'm still trying to untangle some dependencies so I can provide an .hs file.

comment:21 Changed 5 years ago by simonpj

OK, I looked at the 10x smaller version. Can you attach it to the ticket? (180kb is ok I think.) I could do so, but do not want to upload sensitive info.

I have learned some things (see below). But to explore solutions I really do need a reproducible test case. So if you untangle dependencies that would be really helpful. It's hard for me to make more progress without that. E.g. I want to see the code just before strictness analysis. I want to see the code and strictness for Trader.Order.Logging.$wa, which is in a different module.

The rest of this is notes for me or other hackers, to record what I have learned. I'm certain that the difficulty is to do with reboxing (cf. #2289). I see a function processManualOrders that I believe looks something like this (before strictness analysis):

pma _ p q r
  = let help x v = case select_deep_field_of p of
                    True  -> ...(help x v')...
                    False -> log_fn x   -- log_fn is lazy in x
    in
    help p r

So the help function is not strict in x. But pma is strict in p because of the (select_deep_field_of p) in help.

Sadly that means that p is passed to help in boxed form (since help is lazy), but is passed to pma in unboxed form (since pma is strict). The result is that p is re-boxed in the worker for pma. And p is a record with many tens of fields.

Various things would help here:

  • Running the static argument transformation on help would solve the problem completely (see the sketch after this list).
  • It probably does not make sense to do worker/wrapper on records with huge numbers of fields. We end up passing arguments in memory anyway, just on the stack instead of in the heap. And the reboxing penalty is large.
  • In Demand.bothUse I see
    bothUse (UProd {}) (UCall {})       = Used
    -- bothUse (UProd {}) Used             = Used  -- Note [Used should win]
    bothUse Used (UProd ux)             = UProd (map (`bothMaybeUsed` useTop) ux)
    bothUse (UProd ux) Used             = UProd (map (`bothMaybeUsed` useTop) ux)
    bothUse Used _                      = Used  -- Note [Used should win]
    
    Note the commented-out line. If Used really did win, I think most of this fruitless work would go away. But it's delicate, and I'm sure that line was commented out for a reason.
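
Here is the sketch referenced in the first bullet: a minimal, self-contained illustration of the static argument transformation. The names P, deepField and logIt are hypothetical stand-ins for the production record, the deep field selector and the lazy logging function; none of them come from the real code.

module SatSketch where

-- Hypothetical stand-ins (not from the real code): P plays the role of the big
-- record, deepField the selector that makes pma strict, logIt the function
-- that is lazy in its argument.
data P = P !Int !Int !Int      -- imagine many tens of fields here

deepField :: P -> Bool
deepField (P n _ _) = n > 0

logIt :: P -> Int
logIt _ = 0                    -- lazy in its argument

-- Before SAT: help re-passes the record on every recursive call, so it must
-- stay boxed inside help, even though pma itself is strict in p and its worker
-- receives the fields unboxed; the worker then has to rebox them for help.
pmaBefore :: P -> Int -> Int
pmaBefore p r = help p r
  where
    help x v
      | deepField x && v > 0 = help x (v - 1)
      | otherwise            = logIt x

-- After SAT: the static argument is captured from the enclosing scope, so the
-- loop never re-passes (and the worker never has to rebox) the record.
pmaAfter :: P -> Int -> Int
pmaAfter p r = go r
  where
    go v
      | deepField p && v > 0 = go (v - 1)
      | otherwise            = logIt p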

Changed 5 years ago by pacak

Attachment: ManualStrategy.dump-simpl added

Changed 5 years ago by pacak

Attachment: sample.tar.gz added

comment:22 Changed 5 years ago by pacak

The Haskell code that produces the slowdown is in the attached sample.tar.gz. I've managed to get rid of most of the dependencies that are not in the Haskell Platform.

Last edited 5 years ago by simonpj (previous) (diff)

comment:23 Changed 5 years ago by simonpj

The dependencies are still substantial, on lens in particular. pacak tells me that he is working on cutting it down further. (Thank you.)

comment:24 Changed 5 years ago by pacak

Simon, did you check the most recent upload? The only non-base packages it needs are enummapset-th and lifted-base.

comment:25 Changed 4 years ago by liyang

Cc: ghc.haskell.org@… added

comment:26 Changed 9 months ago by Fuuzetsu

Data-point from 8.6.2 using original code (first comment):

[nix-shell:/tmp]$ ghc -O2 Blah.hs && ./Blah
benchmarking single call/A
time                 21.83 ns   (21.55 ns .. 22.09 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 21.77 ns   (21.62 ns .. 22.07 ns)
std dev              705.5 ps   (434.8 ps .. 1.223 ns)
variance introduced by outliers: 53% (severely inflated)

benchmarking single call/B
time                 9.741 ns   (9.581 ns .. 9.937 ns)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 9.762 ns   (9.641 ns .. 9.942 ns)
std dev              490.4 ps   (347.4 ps .. 649.6 ps)
variance introduced by outliers: 74% (severely inflated)

I'm attaching the -ddump-simpl output (Blah.dump-simpl). It seems the ticket is still a problem, and the gap is not 3-5% but more like 100%.

Changed 9 months ago by Fuuzetsu

Attachment: Blah.dump-simpl added

comment:27 Changed 9 months ago by Fuuzetsu

I ran with multiple GHC versions. I'm attaching a tarball with the results above and the Core from each version. It appears the problem went away briefly in 8.0.2 but came back worse than ever afterwards. For exact package sets, I used these:

resolver: lts-6.35 # 7.10.3
resolver: lts-9.21 # 8.0.2
resolver: lts-11.22 # 8.2.2
resolver: lts-12.20 # 8.4.4
resolver: nightly-2018-11-24 # 8.6.2

GHC version 7.10.3

benchmarking single call/A
time                 9.800 ns   (9.730 ns .. 9.890 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 9.907 ns   (9.844 ns .. 9.996 ns)
std dev              250.2 ps   (167.4 ps .. 344.7 ps)
variance introduced by outliers: 42% (moderately inflated)

benchmarking single call/B
time                 10.09 ns   (10.00 ns .. 10.20 ns)
                     0.999 R²   (0.999 R² .. 0.999 R²)
mean                 10.37 ns   (10.25 ns .. 10.49 ns)
std dev              403.4 ps   (353.8 ps .. 475.9 ps)
variance introduced by outliers: 63% (severely inflated)

GHC version 8.0.2

benchmarking single call/A
time                 9.794 ns   (9.695 ns .. 9.914 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 9.884 ns   (9.805 ns .. 10.01 ns)
std dev              308.2 ps   (211.6 ps .. 449.5 ps)
variance introduced by outliers: 52% (severely inflated)

benchmarking single call/B
time                 9.861 ns   (9.784 ns .. 9.946 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 9.834 ns   (9.799 ns .. 9.879 ns)
std dev              133.3 ps   (92.41 ps .. 201.6 ps)
variance introduced by outliers: 17% (moderately inflated)

GHC version 8.2.2

benchmarking single call/A
time                 22.82 ns   (22.59 ns .. 23.16 ns)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 22.83 ns   (22.60 ns .. 23.15 ns)
std dev              890.2 ps   (655.8 ps .. 1.268 ns)
variance introduced by outliers: 62% (severely inflated)

benchmarking single call/B
time                 10.10 ns   (10.06 ns .. 10.15 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 10.11 ns   (10.08 ns .. 10.15 ns)
std dev              120.7 ps   (72.17 ps .. 184.5 ps)
variance introduced by outliers: 14% (moderately inflated)

GHC version 8.4.4

benchmarking single call/A
time                 21.51 ns   (21.33 ns .. 21.72 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 21.55 ns   (21.35 ns .. 22.21 ns)
std dev              1.128 ns   (374.2 ps .. 2.297 ns)
variance introduced by outliers: 75% (severely inflated)

benchmarking single call/B
time                 9.505 ns   (9.449 ns .. 9.568 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 9.509 ns   (9.454 ns .. 9.570 ns)
std dev              197.7 ps   (159.3 ps .. 243.1 ps)
variance introduced by outliers: 33% (moderately inflated)

GHC version 8.6.2

benchmarking single call/A
time                 21.71 ns   (21.44 ns .. 22.06 ns)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 21.93 ns   (21.70 ns .. 22.24 ns)
std dev              884.3 ps   (721.8 ps .. 1.228 ns)
variance introduced by outliers: 64% (severely inflated)

benchmarking single call/B
time                 10.28 ns   (10.20 ns .. 10.40 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 10.31 ns   (10.24 ns .. 10.41 ns)
std dev              286.9 ps   (224.3 ps .. 372.6 ps)
variance introduced by outliers: 46% (moderately inflated)

Changed 9 months ago by Fuuzetsu

Attachment: results.tar.gz added

Results from runs of multiple GHC versions.

comment:28 Changed 9 months ago by pacak

Version: 7.10.1-rc2 → 8.6.2

comment:29 Changed 8 months ago by Fuuzetsu

Cc: Fuuzetsu added

comment:30 Changed 8 months ago by sgraf

Cc: sgraf added

comment:31 Changed 7 months ago by simonpj

Keywords: CPRAnalysis DemandAnalysis added

comment:32 Changed 6 months ago by sgraf

Keywords: CPRAnalysis removed

Looking at https://ghc.haskell.org/trac/ghc/attachment/ticket/10069/Blah.dump-simpl#L1668, I don't think this is related to CPR analysis but to the worker/wrapper transformation having issues with NOINLINE functions.

What happens here is that a1 to a4 can't be inlined (so we don't see the case on the A), but fa still gets a strictness signature saying that all but fields 2 to 5 of the A are dead. WW will now split fa into a wrapper function that scrutinises the A to project out just the 4 fields that aren't dead and passes them on to the worker $wfa unboxed.

So far so good. Now, WW arranges it so that the worker $wfa builds up a new A with dummy values for the absent fields (0# for Int#). Normally this new A binding would cancel out with case matches in $wfa, because the strictness signature must ultimately come from some case expression. These, however, are hidden inside the NOINLINE functions, so no cancelling happens. As a result, we allocate the dummy A for nothing; we could have just passed along the old A.

Here's an example demonstrating this in the small:

data C = C !Int !Int

{-# NOINLINE c1 #-}
c1 :: C -> Int
c1 (C _ c) = c

{-# NOINLINE fc #-}
fc :: C -> Int
fc c = c1 c + c1 c

Relevant Core:

c1_rP
  = \ (ds_d3af :: C) ->
      case ds_d3af of { C dt_d3PA dt1_d3PB -> GHC.Types.I# dt1_d3PB }

Main.$wfc
  = \ (ww_s7DJ :: GHC.Prim.Int#) ->
      case c1_rP (Main.C 0# ww_s7DJ) of { GHC.Types.I# x_a4kS ->
      GHC.Prim.*# 2# x_a4kS
      }

fc
  = \ (w_s7DF :: C) ->
      case w_s7DF of { C ww1_s7DI ww2_s7DJ ->
      case Main.$wfc ww2_s7DJ of ww3_s7DN { __DEFAULT ->
      GHC.Types.I# ww3_s7DN
      }
      }

The problem I see here is that we don't WW c1, or that we don't inline the resulting wrapper into fc before the hypothetical worker $wc1 of c1 gets inlined back into c1 because it's so small. If we inlined $wc1 into $wfc, the case on C would cancel out with the dummy C and everything would be well.

So: If we WW fc, we should also WW c1, otherwise we end up with bad code.
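
By way of illustration, here is a hand-written, hedged sketch of what splitting c1 as well would buy us, reusing the C from the example above. The names wc1 and wfc are made up for the sketch, and the worker takes a boxed Int for readability where GHC's real worker would take an Int#.

data C = C !Int !Int

-- Hypothetical worker for c1: the NOINLINE pragma moves to the worker, which
-- takes the one live field directly.
{-# NOINLINE wc1 #-}
wc1 :: Int -> Int
wc1 x = x

-- The wrapper is tiny and free to inline, so its case on C cancels with the
-- constructor at every call site.
{-# INLINE c1 #-}
c1 :: C -> Int
c1 (C _ x) = wc1 x

-- With the wrapper inlined, the worker for fc only needs the live field and
-- never has to allocate a dummy C just to call c1.
wfc :: Int -> Int
wfc x = wc1 x + wc1 x

fc :: C -> Int
fc (C _ x) = wfc x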

comment:33 Changed 6 months ago by simonpj

So: If we WW fc, we should also WW c1, otherwise we end up with bad code.

Quite right. Why does that not happen? It's because of

tryWW dflags fam_envs is_rec fn_id rhs
  -- See Note [Worker-wrapper for NOINLINE functions]

  | Just stable_unf <- certainlyWillInline dflags fn_info
  = return [ (fn_id `setIdUnfolding` stable_unf, rhs) ]
        -- See Note [Don't w/w INLINE things]
        -- See Note [Don't w/w inline small non-loop-breaker things]

The Notes are informative.

What happens here is that

  • certainlyWillInline returns True for c1, despite the NOINLINE pragma
  • We set a stable unfolding, for reasons described in Note [Don't w/w inline small non-loop-breaker things]
  • But alas the NOINLINE stays there and prevents c1 inlining.

I think we should always create a wrapper for NOINLINE things, just as if they were big. We already have technology for transferring the NOINLINE pragma to the worker; see

  • Note [Worker-wrapper for INLINABLE functions]
  • Note [Worker activation], which even talks about NOINLINE specifically.

So the bottom line is, I think, that the certainlyWillInline test should return False for NOINLINE functions. This call in WorkWrap is its only call site, so the change should be easy.

comment:34 Changed 6 months ago by sgraf

Differential Rev(s): !401
Status: new → patch
Test Case: T10069

comment:35 Changed 6 months ago by sgraf

comment:36 Changed 6 months ago by sgraf

Milestone: 8.8.1

I think this would be nice to have in 8.8.

comment:37 Changed 6 months ago by Marge Bot <ben+marge-bot@…>

In 1675d40a/ghc:

Always do the worker/wrapper split for NOINLINEs

Trac #10069 revealed that small NOINLINE functions didn't get split
into worker and wrapper. This was due to `certainlyWillInline`
saying that any unfoldings with a guidance of `UnfWhen` inline
unconditionally. That isn't the case for NOINLINE functions, so we
catch this case earlier now.

Nofib results:

--------------------------------------------------------------------------------
        Program         Allocs    Instrs
--------------------------------------------------------------------------------
 fannkuch-redux          -0.3%      0.0%
             gg          +0.0%     +0.1%
       maillist          -0.2%     -0.2%
        minimax           0.0%     -0.8%
--------------------------------------------------------------------------------
            Min          -0.3%     -0.8%
            Max          +0.0%     +0.1%
 Geometric Mean          -0.0%     -0.0%

Fixes #10069.

-------------------------
Metric Increase:
    T9233
-------------------------