Opened 21 months ago

Last modified 20 months ago

#14594 new bug

2 modules / 2500LOC takes nearly 3 minutes to build

Reported by: schyler Owned by: dfeuer
Priority: normal Milestone:
Component: Compiler Version: 8.0.2
Keywords: Cc: kvanberendonck@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I tested this from a brand new MacBook 15" with 8 threads x 2.8 GHz Intel Core i7 / 16 GB RAM on OS X:

$ time stack build
WARNING: /Users/kvanb/git/github-webhooks/github-webhooks.cabal was generated with a newer version of hpack, please upgrade and try again.
github-webhooks-1.0.0: unregistering (local file changes: github-webhooks.cabal package.yaml spec/DecodeEventsSpec.hs spec/Spec.hs src/GitHub/Data/Webhooks...)
github-webhooks-1.0.0: configure (lib)
Configuring github-webhooks-1.0.0...
github-webhooks-1.0.0: build (lib)
Preprocessing library github-webhooks-1.0.0...
[1 of 3] Compiling Paths_github_webhooks ( .stack-work/dist/x86_64-osx/Cabal-1.24.2.0/build/autogen/Paths_github_webhooks.hs, .stack-work/dist/x86_64-osx/Cabal-1.24.2.0/build/Paths_github_webhooks.o )
[2 of 3] Compiling GitHub.Data.Webhooks.Payload ( src/GitHub/Data/Webhooks/Payload.hs, .stack-work/dist/x86_64-osx/Cabal-1.24.2.0/build/GitHub/Data/Webhooks/Payload.o )
[3 of 3] Compiling GitHub.Data.Webhooks.Events ( src/GitHub/Data/Webhooks/Events.hs, .stack-work/dist/x86_64-osx/Cabal-1.24.2.0/build/GitHub/Data/Webhooks/Events.o )
github-webhooks-1.0.0: copy/register
Installing library in
/Users/kvanb/git/github-webhooks/.stack-work/install/x86_64-osx/lts-9.18/8.0.2/lib/x86_64-osx-ghc-8.0.2/github-webhooks-1.0.0-LDmeffpkXvwH6ZXCPE95ke
Registering github-webhooks-1.0.0...

real	2m49.342s
user	2m42.441s
sys	0m5.805s

Source code has about 30 derivations of the form:

type IssueState = Text

data HookIssue = HookIssue
    { whIssueUrl                :: !URL
    , whIssueLabelsUrl          :: !URL
    , whIssueCommentsUrl        :: !URL
    , whIssueEventsUrl          :: !URL
    , whIssueHtmlUrl            :: !URL
    , whIssueId                 :: !Int
    , whIssueNumber             :: !Int
    , whIssueTitle              :: !Text
    , whIssueUser               :: !HookUser
    , whIssueLabels             :: !(Vector HookIssueLabels)
    , whIssueState              :: IssueState
    , whIssueIsLocked           :: !Bool
    , whIssueAssignee           :: !(Maybe HookUser)
    , whIssueMilestone          :: !(Maybe HookMilestone)
    , whIssueCommentCount       :: !Int
    , whIssueCreatedAt          :: !UTCTime
    , whIssueUpdatedAt          :: !UTCTime
    , whIssueClosedAt           :: !(Maybe UTCTime)
    , whIssueBody               :: !Text
    }
    deriving (Eq, Show, Typeable, Data, Generic)

instance NFData HookIssue where rnf = genericRnf

Steps to reproduce

  1. clone https://github.com/onrock-eng/github-webhooks
  2. checkout sha 40a7ecc3a5845717055ee372b89e645a498cf1e2
  3. time stack build

This is really really really slow.

Change History (20)

comment:1 Changed 21 months ago by schyler

Version: 8.2.18.0.2
$ stack ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 8.0.2

comment:2 Changed 21 months ago by mpickering

I think the resulting program after you expand the deriving clauses to source code will be many times bigger than 2500 lines. Is there something concrete you have identified as being unexpectedly slow?

comment:3 Changed 21 months ago by bgamari

Indeed this program produces some moderately large Core.

*** Desugar [GitHub.Data.Webhooks.Events]:
Result size of Desugar (after optimization)
  = {terms: 18,603, types: 76,783, coercions: 21,668, joins: 0/547}

*** Desugar [GitHub.Data.Webhooks.Payload]:
Result size of Desugar (after optimization)
  = {terms: 26,920, types: 195,711, coercions: 41,708, joins: 1/937}

Nearly all of this is derived bindings. It would be interesting to study which classes are contributing the most.

comment:4 Changed 21 months ago by dfeuer

Owner: set to dfeuer

I'll take a look today.

comment:5 Changed 21 months ago by dfeuer

Some good news: this appears to build much more quickly (1m13s real time) using the nightly-2017-12-18 resolver (ghc-8.2.2) as with the 8.0-based lts-9.18 resolver (which took 2m9s). Of course, those resolvers also differ in other ways. I'm still reducing the test file, and I'll report back when I have a better sense of the real difference.

comment:6 Changed 21 months ago by schyler

Hey, thanks for looking into this. I don't have a concrete "X feature is slow", it just seemed so slow as a whole that I thought it might be a good candidate for profiling. Actually, the first few times I ran it I ctrl+c it because I thought it was hung.

It's good to know that the latest resolver is quicker, thanks for that suggestion! When I get home I'll let you know exactly how much quicker it is on my machine.

comment:7 Changed 21 months ago by dfeuer

I've been digging a bit further (with GHC 8.2.2). It appears that the (derived) Data instances are responsible for most of the compilation time. I haven't yet tracked down what's taking the most time (no particular compiler phase seems egregiously bad). After peeking at some Core and thinking about the methods of Data, it doesn't really seem too surprising that the instances for large products might compile slowly.

gfoldl :: Data a =>
       => (forall d b. Data d => c (d -> b) -> d -> c b)
       -> (forall g. g -> c g)
       -> a
       -> c a

gunfold :: Data a
        => (forall b r. Data b => c (b -> r) -> c r)
        -> (forall r. r -> c r)
        -> Constr
        -> c a 

Note the Data constraints on the function arguments. The implementation of gfoldl or gunfold needs to provide Data dictionaries for each field type. Those will initially be local let-bound values. Hopefully the duplicates are eventually eliminated, but if that doesn't happen quickly, we could waste some time.

I'm still building HEAD to see if this has changed any lately.

comment:8 Changed 21 months ago by dfeuer

HEAD (with master branches of primitive and vector) is a little faster than 8.2.2, at least for building Payload, but not by much.

comment:9 Changed 21 months ago by dfeuer

I see also that there's a fair bit of type gunk going on. The signature of gfoldl means that the combining function will be applied at O(n) different types, each a bit bigger than the last. Some other methods are similar. I don't know if this has any performance impact in the optimizer, but it might.

comment:10 Changed 21 months ago by schyler

On a desktop (Intel i5-3470 which is 3.2-3.6ghz, 8gb RAM):

resolver: lts-9.18

real    1m55.157s
user    0m0.000s
sys     0m0.015s
resolver: lts-10.0

real    1m19.322s
user    0m0.000s
sys     0m0.015s

A concrete example where lts-10 is considerably faster (nearly 50%)! Well done everyone!

comment:11 Changed 21 months ago by dfeuer

I've looked in a little more detail. The thing I've found most surprising and suspicious is that we end up with absolutely enormous unfoldings for some dictionaries. For Data (Maybe Int), we get this mouthful:

$s$fDataMaybe_s57H [InlPrag=NOUSERINLINE CONLIKE]
  :: Data (Maybe Int)
[LclId,
 Unf=DFun: \ ->
       Data.Data.C:Data TYPE: Maybe Int
                        (base-4.11.0.0:Data.Typeable.Internal.mkTrApp
                           @ * @ * @ Maybe @ Int Data.Data.$fDataMaybe5 Data.Data.$fDataInt4)
                        `cast` (Sym (base-4.11.0.0:Data.Typeable.Internal.N:Typeable[0]) <*>_N <Maybe
                                                                                                  Int>_N
                                :: (Type.Reflection.TypeRep (Maybe Int) :: *)
                                   ~R# (Typeable (Maybe Int) :: Constraint))
                        \ (@ (c_a5c3 :: * -> *))
                          (k_a5c4 [Occ=Once!]
                             :: forall d b. Data d => c_a5c3 (d -> b) -> d -> c_a5c3 b)
                          (z_a5c7 [Occ=Once*!] :: forall g. g -> c_a5c3 g)
                          (ds_a5c9 [Occ=Once!] :: Maybe Int) ->
                          case ds_a5c9 of {
                            Nothing -> z_a5c7 @ (Maybe Int) (GHC.Base.Nothing @ Int);
                            Just a1_a5ce [Occ=Once] ->
                              k_a5c4
                                @ Int
                                @ (Maybe Int)
                                Data.Data.$fDataInt
                                (z_a5c7 @ (Int -> Maybe Int) (GHC.Base.Just @ Int))
                                a1_a5ce
                          }
                        Data.Data.$fDataMaybe_$cgunfold @ Int Data.Data.$fDataInt
                        \ (ds_a51o [Occ=Once!] :: Maybe Int) ->
                          case ds_a51o of {
                            Nothing -> Data.Data.$cNothing;
                            Just _ [Occ=Dead] -> Data.Data.$cJust
                          }
                        \ _ [Occ=Dead] -> Data.Data.$tMaybe

...
...

Why are we inlining method definitions into the dictionary unfolding at all? I'd expect us to wait for someone to pluck a method from the dictionary unfolding and then consider whether to inline the method.

Now I don't know if this has anything to do with the problem, but it looks weird.

comment:12 Changed 21 months ago by schyler

Could strictness have anything to do with it, perhaps

comment:13 Changed 21 months ago by simonpj

David, can you make a small repro case for comment:11? I agree it's puzzling.

Last edited 21 months ago by simonpj (previous) (diff)

comment:14 Changed 21 months ago by dfeuer

Simon, here's a tiny repro:

{-# language RankNTypes #-}
module T14594 where
import Data.Data

newtype Foo = Foo (Maybe Char)

gfoldlFoo
  :: (forall d b. Data d => c (d -> b) -> d -> c b)
     -> (forall g. g -> c g) -> Foo -> c Foo
gfoldlFoo k z (Foo a) = z Foo `k` a

Compile with -O and -ddump-simpl, and search for Data (Maybe Char).

comment:15 Changed 21 months ago by dfeuer

I would think we'd want a general rule against inlining under a constructor in core2core. The context seems inherently too boring. And indeed we don't inline into the definition of the dictionary itself; we only inline into its unfolding. I don't know why.

comment:16 Changed 20 months ago by dfeuer

There's always a certain amount of non-linearity in the types of multi-field records. For example, given

data Foo = Foo Int Char Integer

gfoldlFoo
  :: (forall d b. Data d => c (d -> b) -> d -> c b)
     -> (forall g. g -> c g) -> Foo -> c Foo
gfoldlFoo c k (Foo x y z) =
  k Foo `c` x `c` y `c` z

we get

Testy.$wgfoldlFoo
  = \ (@ (c_s4UQ :: * -> *))
      (w_s4UR :: forall d b. Data d => c_s4UQ (d -> b) -> d -> c_s4UQ b)
      (w1_s4US :: forall g. g -> c_s4UQ g)
      (ww_s4UW :: Int)
      (ww1_s4UX :: Char)
      (ww2_s4UY :: Integer) ->
      w_s4UR
        @ Integer
        @ Foo
        Data.Data.$fDataInteger
        (w_s4UR
           @ Char
           @ (Integer -> Foo)
           Data.Data.$fDataChar
           (w_s4UR
              @ Int
              @ (Char -> Integer -> Foo)
              Data.Data.$fDataInt
              (w1_s4US @ (Int -> Char -> Integer -> Foo) Testy.Foo)
              ww_s4UW)
           ww1_s4UX)
        ww2_s4UY

Note that we get types Foo, Integer -> Foo, Char -> Integer -> Foo, and Int -> Char -> Integer -> Foo. But does this affect the simplifier? I don't know.

comment:17 Changed 20 months ago by dfeuer

There's always a certain amount of non-linearity in the types of multi-field records. For example, given

data Foo = Foo Int Char Integer

gfoldlFoo
  :: (forall d b. Data d => c (d -> b) -> d -> c b)
     -> (forall g. g -> c g) -> Foo -> c Foo
gfoldlFoo c k (Foo x y z) =
  k Foo `c` x `c` y `c` z

we get

Testy.$wgfoldlFoo
  = \ (@ (c_s4UQ :: * -> *))
      (w_s4UR :: forall d b. Data d => c_s4UQ (d -> b) -> d -> c_s4UQ b)
      (w1_s4US :: forall g. g -> c_s4UQ g)
      (ww_s4UW :: Int)
      (ww1_s4UX :: Char)
      (ww2_s4UY :: Integer) ->
      w_s4UR
        @ Integer
        @ Foo
        Data.Data.$fDataInteger
        (w_s4UR
           @ Char
           @ (Integer -> Foo)
           Data.Data.$fDataChar
           (w_s4UR
              @ Int
              @ (Char -> Integer -> Foo)
              Data.Data.$fDataInt
              (w1_s4US @ (Int -> Char -> Integer -> Foo) Testy.Foo)
              ww_s4UW)
           ww1_s4UX)
        ww2_s4UY

Note that we get types Foo, Integer -> Foo, Char -> Integer -> Foo, and Int -> Char -> Integer -> Foo. But does this affect the simplifier? I don't know.

comment:18 Changed 20 months ago by simonpj

Yes, we can definitely get big type-blowup. See ​http://research.microsoft.com/en-us/um/people/simonpj/papers/variant-f/if.pdf, page 5. cf #5227.

But that would show up clearly through the sizes reported by -dshow-passes. comment:3 shows some large numbers. What numbers do you see now? Do we have a standalone reproducer?

comment:19 Changed 20 months ago by dfeuer

I have not as yet been able to find anything non-linear, but we seem to pay an awful lot per instance. Given

data Foo7 = Foo7 (Int) deriving Data
data Foo8 = Foo8 (Int) deriving Data
-- plus 25 more like this

compiling with -O takes 5 seconds. The single most expensive part is codegen, which takes about a second, but otherwise no particular pass stands out as being particularly expensive.

I experimentally tried deriving more of the Data methods to try to reduce simplification time. Unfortunately, I paid for it in typechecking time. I still want to check if some of those extra methods are worthwhile, but otherwise I really don't know.

comment:20 Changed 20 months ago by dfeuer

I tried generating a bunch of modules with derived data instances. A module with 29 single-field instances desugars to 5201 terms, 13488 types, 145 coercions. This is a similar number of terms as Data.Set.Internal, but about twice as many types.

Core tidy is 11,705 terms, 21,960 types, and 1,479 coercions.

GHC allocates 2,222,336,112 bytes compiling this compared to 1,663,610,072 for compiling Data.Set.Internal.

I also noticed that there's actually a strong non-linearity at the low end: types with small umbers of fields have high incremental costs. For 50 definitions, using 0,1,2,3, and 4 fields each, I get 2.49, 3.85, 4.92, 5.51 and 5.66 GB of GHC allocation, respectively. I need to explore this a bit more.

Note: See TracTickets for help on using tickets.