Opened 3 years ago

Last modified 10 months ago

#13253 new bug

Exponential compilation time with RWST & ReaderT stack with `-02`

Reported by: phadej Owned by: bgamari, osa1
Priority: high Milestone: 8.8.1
Component: Compiler Version: 8.0.1
Keywords: Cc: ocharles
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #15630 Differential Rev(s):
Wiki Page:

Description

See the attached file. The compilation time grows exponentially with additional fields in HuteStruct. Tried with GHC-8.0.1 and quite recent HEAD (8.2)

adding a field doubles the compilation time with -O2.

  • 9: 17s
  • 10: 34s
  • 11: 83s

on my machine. With -01 around 3sec, independently of the struct size.

The original issue: https://github.com/yesodweb/yesod/issues/1348

Attachments (1)

bad.hs (3.2 KB) - added by phadej 3 years ago.
Example file

Download all attachments as: .zip

Change History (48)

Changed 3 years ago by phadej

Attachment: bad.hs added

Example file

comment:1 Changed 3 years ago by mpickering

We suspect the problem is that <*> for RWST is marked as INLINE in transformers. This ticket is to check whether that is the case and then decide on the course of action.

comment:2 Changed 3 years ago by simonpj

Exponential is bad. Why would marking it INLINE make things go exponential?

comment:3 Changed 3 years ago by rwbarton

Looks like the same as #10421. Unfortunately I didn't upload the modules Form and Y used in the reproducer there, and I can't find copies of them on my system.

comment:4 Changed 3 years ago by rwbarton

But note that #10421 doesn't appear to involve transformers; the only monad involved is IO. (Though I don't have the definitions of mreq, mopt or FormResult.)

comment:5 Changed 3 years ago by phadej

rwbarton: I suspect mreq and FormResult are the similar to ones in my attachment (which are originally from yesod, but stripped out of stuff).

comment:6 Changed 3 years ago by ocharles

Cc: ocharles added

comment:7 Changed 3 years ago by simonpj

It would be good to get to the point where we can reproduce this. Exponential is bad.

comment:8 Changed 3 years ago by phadej

Simon, see the bad.hs attached. I can see the expontential behaviour with current HEAD too. Remember to use -O2.

time .../bin/ghc-stage2 -O2 -fforce-recomp       bad.hs # 38sec
time .../bin/ghc-stage2 -O2 -fforce-recomp -DBIG bad.hs # 3min 16sec

time .../bin/ghc-stage2 -O1 -fforce-recomp       bad.hs # 9.8sec
time .../bin/ghc-stage2 -O1 -fforce-recomp -DBIG bad.hs # 9.9sec

comment:9 Changed 2 years ago by simonpj

See #10421 for another example. We should investigate this!

comment:10 Changed 2 years ago by bgamari

Milestone: 8.4.1
Type of failure: None/UnknownCompile-time performance bug

comment:11 Changed 23 months ago by bgamari

Owner: set to dfeuer

David, perhaps you could look into this?

comment:12 Changed 23 months ago by mpickering

I think this comment was bogus before.

The reason why I said I suspected the inline pragma was that if you look at -ddump-simpl-stats then the work is dominated by

4632 UnfoldingDone
  4106 $c<*>_a5bw

which then cascades a lot of additional work. If we simplified before inlining we could probably avoid this.

4374 KnownBranch
  4106 wild_X13
18268 BetaReduction
  4106 a_a5by
  4106 b_a5bz
  4106 ds_d5jY
  4106 ds_d5jZ
8261 PostInlineUnconditionally
  4106 ds_d5jZ
5212 PreInlineUnconditionally
  4106 ds_d5jY
Last edited 23 months ago by mpickering (previous) (diff)

comment:13 Changed 23 months ago by mpickering

Adding an extra field leads to

8714 UnfoldingDone
  8201 $c<*>_a4Ng

So each extra field leads to double the number of <*>s being inlined and hence the exponential behaviour.

comment:14 Changed 23 months ago by mpickering

Adding a NOINLINE to <*> for FormResult predictably fixes the problem.

comment:15 Changed 23 months ago by mpickering

Looking at the n=2 case for simplicity.

At one stage of the compilation we have some core which looks like

($c<*>_a4dL                                                              
               (case x01_a2dZ of {                                                   
                  FormMissing -> FormMissing;                                        
                  FormFailure errs_a2dS -> FormFailure errs_a2dS;                    
                  FormSuccess a_a2dU ->                                              
                    FormSuccess                                                      
                      (\ dt_a2M6 ->                                                  
                         case a_a2dU of dt_X2M8 { PS ipv_s5JI ipv_s5JJ ipv_s5JK ipv_s5JL ->
                         case dt_a2M6 of dt_X2Ma { PS ipv_s5JO ipv_s5JP ipv_s5JQ ipv_s5JR ->
                         HugeStruct dt_X2M8 dt_X2Ma                                  
                         }                                                           
                         })                                                          
                })                                                                   
               x02_a2e0

Next thing we know, the expression has been commuted and the call to $c<*> has been pushed into the branches !?

                                                      
 (case x01_a2dZ of {                                                 
               FormMissing -> $c<*>_a4dL FormMissing x02_a2e0;                  
               FormFailure errs_a2dS ->                                         
                 $c<*>_a4dL (FormFailure errs_a2dS) x02_a2e0;                   
               FormSuccess a_a2dU ->                                            
                 $c<*>_a4dL                                                     
                   (FormSuccess                                                 
                      (\ dt_a2M6 ->                                             
                         case a_a2dU of dt_X2M8 { PS ipv_s5JI ipv_s5JJ ipv_s5JK ipv_s5JL ->
                         case dt_a2M6 of dt_X2Ma { PS ipv_s5JO ipv_s5JP ipv_s5JQ ipv_s5JR ->
                         HugeStruct a_a2dU dt_a2M6                              
                         }                                                      
                         }))                                                    
                   x02_a2e0

and then $c<*>_a4dL is inlined individually in each branch rather than once at the top level.

I don't know where in the compiler performs this transformation, perhaps you know where it happens David?

comment:16 Changed 23 months ago by mpickering

The reason why -O2 is needed is because an extra pass of the simplifer is ran after the -fliberate-case pass. If you don't run this extra pass, there are lots of uninlined calls to c<*> left over.

The core running with -O is here. Starting on about line 950 you can see all the nested calls to <*> which have been duplicated.

https://gist.github.com/bb395581d8c2b4d6c66636875dcfce94

Last edited 23 months ago by mpickering (previous) (diff)

comment:17 Changed 23 months ago by mpickering

The code duplication alluded to in comment:12 is a result of the special case in rebuildCall for a strict argument.

comment:18 Changed 22 months ago by simonpj

The code duplication alluded to in comment:12 is a result of the special case in rebuildCall for a strict argument.

Can you elaborate on how that special case duplicates code?

comment:19 Changed 22 months ago by mpickering

I think it is explained in Note [Duplicating StrictArg]?

I did confirm this by commenting out the case and observing that the program compiled in a reasonable amount of time.

comment:20 Changed 22 months ago by simonpj

Whoa! That comment says

We make a StrictArg duplicable simply by making all its
stored-up arguments (in sc_fun) trivial, by let-binding
them.  Thus:
        f E [..hole..]
        ==>     let a = E
                in f a [..hole..]

So the f a is duplicated. For example we'll transform

  f E (case x of
         True -> e1
         False -> e2)
---->
  let a = E
  in case x of
       True -> f a e1
       False -> f a e2

I see no code duplication except of the f a, which is by-design. But that's pretty modest.

Can you explain more?

comment:21 Changed 22 months ago by mpickering

I think this is what happens but I've sinced paged out the details.

We start with ((FormSuccess f <*> x0) <*> x1)

You have

((FormSuccess f <*> x0) <*> x1) <*> x2
=>
((<*>) (FormSuccess f) x0) <*> x1 <*> x2
=> (Inline (<*>))
 (case x0 of {                                                   
                  FormMissing -> FormMissing;                                        
                  FormFailure errs_a2dS -> FormFailure errs_a2dS;                    
                  FormSuccess a_a2dU -> FormSuccess ...;                                                        
                }) <*> x1 <*> x2
=> (prefixify)
((<*>) (case x0 of {                                                   
                  FormMissing -> FormMissing;                                        
                  FormFailure errs_a2dS -> FormFailure errs_a2dS;                    
                  FormSuccess a_a2dU -> FormSuccess ...;                                                      
                }) x1) <*> x2
=> (StrictArg)
(case x0 of {                                                   
                  FormMissing -> (<*>) FormMissing x1;                                        
                  FormFailure errs_a2dS -> (<*>) (FormFailure errs_a2dS) x1;                    
                  FormSuccess a_a2dU -> (<*>) (FormSuccess ...) x1     ;                                                   
                }) <*> x2
=> (Inline)

(case x0 of {                                                   
                  FormMissing -> (case x1...)                                        
                  FormFailure errs_a2dS -> case x1....                  
                  FormSuccess a_a2dU -> case x1...     ;                                                   
                }) <*> x2

=> (Strict arg for x2)
(case x0 of {                                                   
                  FormMissing -> (<*>) (case x1...) x2;                                        
                  FormFailure errs_a2dS -> (<*>) (case x1...) x2;                  
                  FormSuccess a_a2dU -> (<*>) (case x1...) x2     ;                                                   
                }) 
=> (Strict arg fires again pushing the <*> into the inner cases and now we have 9 copies of (<*>).

Does that sound plausible?

comment:22 Changed 22 months ago by simonpj

Ah, yes I think you may be on to something. Suppose we have

case2 (case1 e of
       True -> e1
       False -> e2) of
   True -> r1
   False -> r2

and suppose that r1 and r2 are small. (If they aren't they get bound as join points.) I've put numbers on the case1 and case2 so we can talk about them, but they are just ordinary Core case expressions.

Then we push the outer case2 into the right hand sides of case1 thus

case1 e of
  True -> case2 e1 of
            True -> r1
            False -> r2
  False -> case2 e2 of
            True -> r1
            False -> r2

We have (by design) duplicated outer case2.

Now suppose that entire expression E was surrounded by case3 E of { True -> s1; False -> s2 }. Again s1 and s2 are small. Then we'll duplicate that into alternatives of case1 and then into the alternatives of case2, to get this

case1 e of
  True -> case2 e1 of
            True  -> case3 r1 of { True -> s2; False -> s2 }
            False -> case3 r2 of { True -> s2; False -> s2 }
  False -> case2 e2 of
            True  -> case3 r1 of { True -> s2; False -> s2 }
            False -> case3 r2 of { True -> s2; False -> s2 }

Now we have four copies of case3. You can see how this may go exponential.

How can we get these deepyly nested cases? Suppose

f x = case x of { True -> e1; False -> e2 }

and we have f (f (f (f (f (f blah))))). If we inline f we'll get exactly such a deep nest of cases.

Here is a concrete example

f :: Int -> Bool -> Bool
{-# INLINE f #-}
f y x = case x of { True -> y>0 ; False -> y<0 }

foo y x = f (y+1) $
          f (y+2) $
          f (y+3) $
          f (y+4) $
          f (y+5) $
          f (y+6) $
          f (y+7) $
          f (y+8) $
          f (y+9) $
          f y x

Sure enough, adding one more line to foo doubles the size of the optimised code. And this is very similar to the chain of <*> applications that seems to trigger the problem in the Description.

So this looks like the root cause of the problem, which is great progress. And now we have a tiny repro case, which is also super-helpful.

comment:23 Changed 22 months ago by phadej

I might be pointing out the obvious, but given Simon's example

module Expfoo (f, foo, bar) where

f :: Int -> Bool -> Bool
{-# INLINE f #-}
f y x = case x of { True -> y>0 ; False -> y<0 }

foo :: Int -> Bool -> Bool
foo y x = f (y+1) $
          f (y+2) $
          f (y+3) $
          f (y+4) $
          f (y+5) $
          f (y+6) $
          f (y+7) $
          f (y+8) $
          f (y+9) $
          f y x

bar :: Bool
bar = foo 10 True

The -ddump-simpl -ddump-simpl-stats say

2157 PostInlineUnconditionally
  2060 x_a28M

2129 KnownBranch
  2060 wild_a28K

1024 FillInCaseDefault
  512 wild_X1l
  256 wild_X1n
  128 wild_X1p
  64 wild_X1r
  32 wild_X1t
  16 wild_X1v
  8 wild_X1x
  4 wild_X1z
  2 wild_Xc
  2 wild_X1y

bar = GHC.Types.True

I.e. simplifier is able to calculate bar value, which is great! But it does exponential job while trying to figure out what's foo.

However if we only export

module Expfoo (f, bar) where

then bar is still True, but stats looks way better, all Grand total simplifier statistics are under 12, e.g.

8 CaseOfCase 8 wild_Xb
73 KnownBranch
  11 wild1_a28w
  10 wild_a28s
  9 wild1_a28m
  9 wild_a28K
  9 wild1_a28O
  8 wild_a28i
  2 wild_Xb
  2 wild_X17
  2 wild_X19
  2 wild_X1b
  2 wild_X1d
  2 wild_X1f
  2 wild_X1h
  2 wild_X1j
  1 wild_X1m
1 FillInCaseDefault 1 wild_Xb
12 SimplifierDone 12

I'm quite sure that while fixing the exponential time issue, we should "break" the fact that bar is fully simplified to True.

comment:24 Changed 22 months ago by simonpj

This patch

modified   compiler/simplCore/SimplUtils.hs
@@ -1255,7 +1255,8 @@ postInlineUnconditionally env top_lvl bndr occ_info rhs
         -- in allocation if you miss this out
       OneOcc { occ_in_lam = in_lam, occ_int_cxt = int_cxt }
                -- OneOcc => no code-duplication issue
-        ->     smallEnoughToInline dflags unfolding     -- Small enough to dup
+        ->   not (isJoinId bndr)
+          && smallEnoughToInline dflags unfolding     -- Small enough to dup
                         -- ToDo: consider discount on smallEnoughToInline if int_cxt is true
                         --
                         -- NB: Do NOT inline arbitrarily big things, even if one_br is True

makes a big difference.

It makes my reproducer work in linear time.

Nofib says

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
       fibheaps          -2.6%     +1.1%     0.033     0.033      0.0%
         gamteb          -2.3%     +4.0%     0.062     0.062      0.0%
            ida          -2.8%     +2.0%     0.107     0.107    +33.3%
           mate          -2.3%    -19.6%     -5.8%     -5.8%      0.0%
           para          -2.8%     +0.7%     -2.1%     -2.4%      0.0%
--------------------------------------------------------------------------------
            Min          -4.9%    -19.6%     -8.6%     -8.6%      0.0%
            Max          -1.5%     +4.0%     +3.5%     +3.4%    +33.3%
 Geometric Mean          -2.5%     -0.1%     -0.5%     -0.5%     +0.5%

testsuite perf improves too

Unexpected stat failures:
   perf/compiler/ManyConstructors.run  ManyConstructors [stat too good] (normal)
   perf/compiler/T13056.run            T13056 [stat too good] (optasm)
Last edited 22 months ago by simonpj (previous) (diff)

comment:25 Changed 22 months ago by simonpj

Questions

  • Are those perf/compiler improvements happening because (a) GHC is generating less code of (b) GHC's code is running faster?
  • Is compilation generally faster, or are those two cases exceptional? (The testsuite only reports changes when they break through a threshold.)
  • Where are the gains in mate and the losses in gameteb? (Use -ticky to see.) I'd like to be sure that the loss in gameteb isn't for some silly reason that could readily be fixed.

Fundamentally this change looks good, but I don't want to commit it and forget it because I'd like to understand the reasons a bit better.

comment:26 Changed 22 months ago by bgamari

Owner: changed from dfeuer to bgamari, osa1

comment:27 Changed 22 months ago by osa1

I checked Core outputs of gameteb generated with and without this patch. What happens with this patch is we don't inline some join points that allocates Ints, so we end up allocating more Ints than before.

In gameteb project these files change with this patch:

  • InitTable
  • Output
  • Utils

Changes in InitTable and Output do not cause any increase in allocations. In Utils we generate join points like this:

eIndx_s686 [Dmd=<L,U(U)>] :: Int
[LclId]
eIndx_s686
  = let {
      wild_a4If [Dmd=<L,U(U)>] :: Int
      [LclId, Unf=OtherCon []]
      wild_a4If = GHC.Types.I# 1# } in
    case logE_s687 of { GHC.Types.D# x_a364 ->
    case ergs of { GHC.Arr.Array l_a5Ty u_a5Tz dt_a5TA ds_a5TB ->
    case l_a5Ty of wild3_s6zo { GHC.Types.I# m_s6zp ->
    case u_a5Tz of wild4_s6zr { GHC.Types.I# n_s6zs ->
    join {
      $j_s68y [Dmd=<L,1*U(U)>] :: Int
      [LclId[JoinId(0)], Str=x]
      $j_s68y = exit_r6GT wild_a4If wild3_s6zo wild4_s6zr } in
    case GHC.Prim.<=# m_s6zp 1# of {
      __DEFAULT -> jump $j_s68y;
      1# -> ...

notice the Int allocation in wild_a4If, which is referenced by the join point.

Without this patch, this code is simplified to:

eIndx_s68j [Dmd=<L,U(U)>] :: Int
[LclId]
eIndx_s68j
  = case logE_s68k of { GHC.Types.D# x_a362 ->
    case ergs of { GHC.Arr.Array l_a5TL u_a5TM dt_a5TN ds_a5TO ->
    case l_a5TL of wild2_s6zD { GHC.Types.I# m_s6zE ->
    case u_a5TM of wild3_s6zG { GHC.Types.I# n_s6zH ->
    case GHC.Prim.<=# m_s6zE 1# of {
      __DEFAULT -> exit_r6Hk 1# wild2_s6zD wild3_s6zG;
      1# -> ...

In this version the exit function called with 1# instead of an Int.

There are a few changes that all look like this.

I tried running both versions with -ticky, but couldn't make sense of the output yet. I think the code explains the increase though so maybe we don't need -ticky output.

I'll now look at what reduced allocationsin mate.

comment:28 Changed 22 months ago by osa1

In mate there are various simple changes that doesn't change allocation, but Move.hs changes quite significantly so I'll need to investigate more to see what join point inlining is causing this reduction in allocations and why.

comment:29 Changed 22 months ago by osa1

ManyConstructors outputs are exactly the same with and without this patch, so any changes in the stats should be because of the compiler itself doing the same work more or less efficiently.

In T13056 there's only a small difference, with this patch there's this join point:

join {
  ds1_s5PK [Dmd=<L,1*U>] :: base-4.11.0.0:Data.Semigroup.Internal.Any
  [LclId[JoinId(0)]]
  ds1_s5PK = jump go_a5kA ys_a5kH }

this is inlined without this patch. I doubt this will make any difference though because this is just a single jump.

So to answer

Are those perf/compiler improvements happening because (a) GHC is generating less code of (b) GHC's code is running faster?

I think these are because GHC's code is running faster.

FWIW, on my laptop this patch does not cause any perf failures. When I look at the stats files for these two tests I see that with this patch GHC allocates less in T13056 but more in ManyConstructors (just a few MB difference). Other stats differ (sometimes better with this patch, sometimes worse) but only very small amounts. Overall I don't get any perf failures.

comment:30 Changed 22 months ago by bgamari

Milestone: 8.4.18.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:31 Changed 17 months ago by bgamari

This will not be addressed in GHC 8.6.

comment:32 Changed 17 months ago by bgamari

Milestone: 8.6.18.8.1

These will not be addressed in GHC 8.6.

comment:33 Changed 14 months ago by simonpj

I think #15630 is another example.

comment:34 Changed 14 months ago by simonpj

#15488 may be another case.

comment:35 Changed 14 months ago by tdammers

#15488 is not a case of this particular problem, but it does provide some related insights.

comment:36 Changed 14 months ago by tdammers

A quick test suggests that the patch from comment:24 makes things better, but not by the amount we would expect from actually solving the excessive inlining / Core blowup issue. On my machine, compilation time between GHC 8.4.2 and HEAD with the patch applied goes down from 2:00 minutes to 1:45 approximately; significantly faster, but not the orders-of-magnitude improvement we would expect from a proper fix.

comment:37 Changed 14 months ago by tdammers

I went wild on the debug logging when compiling bad.hs, with GHC HEAD and the patch from comment:24 in place. One thing that sticks out is that at some point, total Core size goes from about 7,700 to 685,000:

Result size of Simplifier iteration=4
  = {terms: 7,759, types: 9,295, coercions: 847, joins: 9/84}
Result size of Simplifier
  = {terms: 7,759, types: 9,295, coercions: 847, joins: 9/84}
!!! Simplifier [Main]: finished in 1123.65 milliseconds, allocated 1251.533 megabytes
*** SpecConstr [Main]:
Result size of SpecConstr
  = {terms: 685,052,
     types: 613,224,
     coercions: 22,309,
     joins: 1,571/11,551}

Subsequent simplifier runs bring this down to 70,000 eventually, and then it jumps up to 96,000 again at the CorePrep stage, but it never goes back down to anywhere near those 7,000.

If, however we enable the {-#NOINLINE#-} pragma on mhelper, then that same SpecConstr step maintains Core size exactly (suggesting that it doesn't do anything at all).

mhelper is :: MonadIO m => ...; if we inline it, then m can be specialized to Handler, which is a synonym for ReaderT () IO. I don't know why this particular specialization blows up right there and then, but that's what I've figured so far.

comment:38 Changed 14 months ago by tdammers

On further investigation, the MonadIO constraint is not the one that makes the specializer blow up; removing the MonadIO constraints and changing Handler to type Handler = IO retains the blowup. Which leaves us with the Semigroup, Monoid, Functor, Applicative and Monad instances for FormResult.

So I implemented non-polymorphic formMap and formAp to replace fmap and <*>, and used FormSuccess directly instead of pure, allowing me to delete the Functor and Applicative instances for FormResult. Things still blow up in the SpecConst step, until I add a NOINLINE formMap pragma - and suddenly everything is "fine".

Interestingly, changing the Monoid and Semigroup instances such that all methods are implemented in a pathologically trivial way (mappend x y = FormMissing, etc.) doesn't make things faster, so it's not that specializing on either of those is the problem.

comment:39 Changed 14 months ago by tdammers

I've managed to simplify it further:

module LessBad where

import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as Char8

data HugeStruct = HugeStruct
  !ByteString
  !ByteString
  !ByteString
  !ByteString
  !ByteString
  !ByteString
  !ByteString
  !ByteString
  !ByteString -- 9th

data FormResult a = FormMissing
                  | FormFailure [ByteString]
                  | FormSuccess a
    deriving Show

formMap _ FormMissing = FormMissing
formMap _ (FormFailure errs) = FormFailure errs
formMap f (FormSuccess a) = FormSuccess $ f a
infixl 4 `formMap`

formAp :: FormResult (a -> b) -> FormResult a -> FormResult b
(FormSuccess f) `formAp` (FormSuccess g) = FormSuccess $ f g
(FormFailure x) `formAp` (FormFailure y) = FormFailure $ x ++ y
(FormFailure x) `formAp` _ = FormFailure x
_ `formAp` (FormFailure y) = FormFailure y
_ `formAp` _ = FormMissing
infixl 4 `formAp`


mreq :: String -> IO (FormResult ByteString, ())
mreq v = mhelper v (\m l -> FormFailure [Char8.pack "fail"]) FormSuccess

askParams :: IO (Maybe [(String, ByteString)])
askParams = do
    return $ Just []

mhelper
    :: String
    -> (() -> () -> FormResult b)   -- on missing
    -> (ByteString -> FormResult b) -- on success
    -> IO (FormResult b, ())
mhelper v onMissing onFound = do
    mp <- askParams
    (res, x) <- case mp of
        Nothing -> return (FormMissing, ())
        Just p -> do
            return $ case lookup v p of
                Nothing -> (onMissing () (), ())
                Just t -> (onFound t, ())
    return (res, x)
     
-- Either of these fixes the blowup
-- {-# NOINLINE mreq #-}
-- {-# NOINLINE mhelper #-}
-- {-# NOINLINE formMap #-}

sampleForm2 :: IO (FormResult HugeStruct)
sampleForm2 = do
    (x01, _) <- mreq "UNUSED"
    (x02, _) <- mreq "UNUSED"
    (x03, _) <- mreq "UNUSED"
    (x04, _) <- mreq "UNUSED"
    (x05, _) <- mreq "UNUSED"
    (x06, _) <- mreq "UNUSED"
    (x07, _) <- mreq "UNUSED"
    (x08, _) <- mreq "UNUSED"
    (x09, _) <- mreq "UNUSED"

    let hugeStructRes = HugeStruct
          `formMap` x01
          `formAp`  x02
          `formAp`  x03
          `formAp`  x04
          `formAp`  x05
          `formAp`  x06
          `formAp`  x07
          `formAp`  x08
          `formAp`  x09

    return hugeStructRes

There are hardly any constraints left to specialize here, except the Monad instance for IO. And indeed, changing all those IO into Monad m => m gets compilation times down from almost a minute to under one second.

So for some reason, specializing the monadic binds and / or returns for IO here increases code size by a factor of almost 50:

Result size of Simplifier
  = {terms: 6,615, types: 5,827, coercions: 39, joins: 8/70}
!!! Simplifier [LessBad]: finished in 950.46 milliseconds, allocated 1067.498 megabytes
*** SpecConstr [LessBad]:
Result size of SpecConstr
  = {terms: 286,335, types: 251,655, coercions: 39, joins: 960/5,435}
!!! SpecConstr [LessBad]: finished in 10107.67 milliseconds, allocated 10077.732 megabytes

comment:40 Changed 14 months ago by tdammers

Just to verify that SpecConstr is indeed to blame, I did a run with -O1 instead of -O2; no blowup happens then.

comment:41 Changed 13 months ago by tdammers

I've instrumented GHC to dump more intermediate information during SpecConstr. Specifically, I've added the following logic:

In scExpr, capture the expression to be specialized before and after specialization; then compare number of terms to detect a blowup (a factor of 2 is enough to weed out the non-blowups), and if this particular incantation does in fact blow up, dump the expression before the blowup.

The resulting dump is a bit too large to attach, but it shows a typical pattern: the expressions that blow up all look very much alike; as we progress through the compilation, the "before" size stays at ~6000 terms, while the "after" size progressively increases, and it looks like expressions from earlier incantations get inlined into expressions later in the process.

Which fits the hypothesis from the GHC HQ meeting, namely, that the inlining heuristic only looks at the size of the inlinee, but not at the inlining context, so when something gets inlined "top-down", then functions to be inlined (which don't have their own dependencies inlined yet) are all still small, and so the inliner happily copies them many times, and then in the next round, it inlines exponentially more invocations of the inlined functions' dependencies, and so on.

For example, given:

f = g1 + g2 + g3 + g4 + g5

g1 = a1 + a2 + a3 + a4

g2 = b1 + b2 + b3 + b4

...

...the inlining heuristic will first look at f, conclude that each of g1 through g5 is small, and can thus be inlined; then after inlining, it will look at f again, and conclude that each of a1 etc. is small, and inline those; and if those have further dependencies following the same matter, it will happily keep inlining all those small things, not realizing that it is creating a monstrosity. And because all the inlinees involved are different, there will not be any opportunities for optimizations that might shrink things back down, so the resulting program just keeps growing exponentially.

One fun thing I'll try now is this:

Considering that I already have code in place to detect blowups, I can trivially use this logic to just say "if this blows up, then throw away the specialized Core, and use the original expression instead". So I'll try that, and see what that does to various performance metrics.

comment:42 Changed 13 months ago by tdammers

Suggestion from Simon PJ: track nesting depth through SpecConstr, and just stop specializing after exceeding some threshold.

comment:43 Changed 13 months ago by simonpj

Hypothesis:

  • SpecConstr is specialising a function f1
  • In f's RHS there is a local function f2
  • In each specialised copy of f1 we create two specialised copies of f2
  • Alas f2 has a nested function f3 -- and it too specialises...

and so on.

Result: the specialised code is exponential in the nesting depth.

comment:44 Changed 13 months ago by simonpj

Solution 1 (brutal): keep track of nesting depth, and simply stop specialising at some depth.

Solution 2 (nicer): I think the actual code in this example looks like

join j1 x
  = join j2 y
      = join j3 z = blah3
        in blah2
    in blah1
in blah0

(only nested more deeply).

I've been working (slowly) on a join-point-floating patch that would do a kind of local lambda lifting to give

join j3 x y z = blah3 in
join j2 x y = blah2 in
join j1 x = blah1 in
blah0

This would specialise a lot better, I think.

Tobias will try (1); I will get back to (2).

comment:45 Changed 13 months ago by tdammers

I did try (1), but failed. A naive attempt at "simply" stopping the recursion leads to core lint errors, causing uniques to escape their scope. AFAICT this has to do with how the specialiser mixes the "gather usages", "generate uniques" and "specialise" concerns.

Unfortunately, the particular code that does this (scExpr) doesn't come with sufficient documentation, and I can't seem to figure out how to do it properly - it's not even clear to me what "do nothing" should look like in the context of scExpr - return (nullUsage, e), as I originally expected, is certainly not it.

Also, it looks like there is a recursion counter of sorts, sc_count, in place already (or rather, a counter that calculates the number of specialisations), but it doesn't seem to be effective in this particular case.

comment:46 Changed 13 months ago by bgamari

Priority: normalhigh

Bumping in priority so we don't lose track of it.

comment:47 Changed 10 months ago by onslaughtq

This issue has currently bit us as well, after we upgraded from using ghc 7.10.3 to ghc 8.6.3. We have an operator that we marked as INLINE and, when chaining the operator at least 12 times, as <*> is chained in the original bad.hs, compilation blows up with "Simplifier ticks exhausted". Marking the function with NOINLINE works, but using -O1 instead of -O2 doesn't. If necessary, I can provide code examples.

Last edited 10 months ago by onslaughtq (previous) (diff)
Note: See TracTickets for help on using tickets.