Opened 13 months ago

Last modified 8 months ago

#15488 new bug

GHC takes up huge amount of memory when compiling accelerate 1.2.0

Reported by: noah Owned by: tdammers
Priority: high Milestone: 8.10.1
Component: Compiler Version: 8.4.3
Keywords: accelerate, memory, compile Cc: tmcdonell, alpmestan, benl@…
Operating System: Linux Architecture: x86_64 (amd64)
Type of failure: Compile-time performance bug Test Case: accelerate 1.2.0
Blocked By: Blocking:
Related Tickets: #15751 Differential Rev(s):
Wiki Page:

Description

Every time I try to run

cabal install accelerate

the compiler take up nearly all 4GB of memory.

Stracing it reveals that it seems to be spending nearly all of the time it runs checking a timer file descriptor, and occasionally checking its memory usage without actually doing anything about the problem.

This makes installing accelerate nearly impossible.

Attachments (3)

stdout.txt (135.1 KB) - added by noah 12 months ago.
stderr.txt.gz (118.0 KB) - added by noah 12 months ago.
Repro.hs (3.6 KB) - added by tdammers 11 months ago.
Further simplified reproduction case

Download all attachments as: .zip

Change History (29)

comment:1 Changed 13 months ago by j.waldmann

Have you tried

   strace cabal install accelerate -v -v -v -j1 --ghc-options=-v2

(or similar) it definitely shows that ghc is working.

Compilation still takes a lot of time, esp. for Data.Array.Accelerate.Analysis.Hash, Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith.

I measured 15 min for compiling accelerate (just the package, including profiling and haddock, with dependencies already built), on i7-6700 CPU @ 3.40GHz

Last edited 13 months ago by j.waldmann (previous) (diff)

comment:2 Changed 13 months ago by tmcdonell

Cc: tmcdonell added

comment:3 Changed 13 months ago by noah

Okay, I've tried that and now it seems to be able to compile accelerate.

Edit: I may have spoken too soon. It will compile now, but it still takes up a huge amount of memory. It spewed out a lot of text but I managed to notice one of lines:

!!! Simplifier [Data.Array.Accelerate.Test.NoFib.Prelude.ZipWith]: finished in 41042.74 milliseconds, allocated 8907.486 megabytes

So it seems to be the Simplifier that is causing problems.

Last edited 13 months ago by noah (previous) (diff)

comment:4 Changed 13 months ago by bgamari

Indeed spending 40 seconds in the simplifier does sound bad. Can you attach the full output from compiling with ghc -v3?

Changed 12 months ago by noah

Attachment: stdout.txt added

Changed 12 months ago by noah

Attachment: stderr.txt.gz added

comment:5 Changed 12 months ago by noah

Okay, Uploaded the files.

comment:6 Changed 12 months ago by tdammers

Owner: set to tdammers

comment:7 Changed 11 months ago by tdammers

Looks like the simplifier blows up core size (compiling Data.Array.Accelerate.Analysis.Hash):

Result size of Desugar (after optimization)
  = {terms: 3,048, ...

This is what we get before the simplifier kicks in - perfectly normal.

And then the first round of simplification happens:

Result size of Specialise
  = {terms: 6,666, types: 10,365, coercions: 1,218, joins: 0/5}
--
                              OverSatApps = False})
  = {terms: 10,863, types: 14,960, coercions: 1,218, joins: 0/5}
--
Result size of Simplifier iteration=1
  = {terms: 137,878,
--
Result size of Simplifier iteration=2
  = {terms: 66,407, types: 72,375, coercions: 25,148, joins: 0/991}
Result size of Simplifier
  = {terms: 66,295, types: 72,319, coercions: 25,148, joins: 0/991}

...increasing core size by a factor of 20 (peaking at 40). And the next round is even worse:

Result size of Simplifier iteration=1
  = {terms: 465,354,
--
Result size of Simplifier iteration=2
  = {terms: 259,229,
--
Result size of Simplifier iteration=3
  = {terms: 469,826,
--
Result size of Simplifier iteration=4
  = {terms: 345,027,
--
Result size of Simplifier
  = {terms: 345,027,

Our perfectly reasonable 3000-something terms now blow up to over 300k.

Now on to figuring out *why* it blows up.

comment:8 Changed 11 months ago by tdammers

So here's an interesting candidate:

{-# INLINE encodeConst #-}
encodeConst :: TupleType t -> t -> Builder
encodeConst TypeRunit         ()    = mempty
encodeConst (TypeRscalar t)   c     = encodeScalarConst t c
encodeConst (TypeRpair ta tb) (a,b) = encodeConst ta a <> encodeConst tb b

After desugaring, we have this Core:

-- RHS size: {terms: 27, types: 44, coercions: 4, joins: 0/0}
encodeConst
encodeConst
  = \ @ t_areG ds_dtLe ds_dtLf ->
      case ds_dtLe of {
        TypeRunit co_areI ->
          case ds_dtLf `cast` <Co:2> of { () -> mempty $fMonoidBuilder };
        TypeRscalar t_aq5y -> encodeScalarConst t_aq5y ds_dtLf;
        TypeRpair @ a1_areO @ b_areP co_areQ ta_aq5A tb_aq5B ->
          case ds_dtLf `cast` <Co:2> of { (a_aq5C, b_aq5D) ->
          <>
            $fSemigroupBuilder
            (encodeConst ta_aq5A a_aq5C)
            (encodeConst tb_aq5B b_aq5D)
          }
      }
end Rec }

Perfectly benign, but after simplification, we get:

-- RHS size: {terms: 81,945, <- !!!
              types: 43,626,
              coercions: 10,191,
              joins: 0/2,111}
encodeConst1_r4XSg
encodeConst1_r4XSg
...

(not pasting the entire monstrosity)

Last edited 11 months ago by tdammers (previous) (diff)

comment:9 Changed 11 months ago by tdammers

Unfortunately, some dependencies (e.g. primitive) no longer build on GHC HEAD, so I'll try and boil it down to a self-contained reproduction case.

Last edited 11 months ago by tdammers (previous) (diff)

comment:10 Changed 11 months ago by tdammers

Here's a reduced version:

{-# LANGUAGE GADTs               #-}
{-# LANGUAGE PatternGuards       #-}
{-# LANGUAGE RankNTypes          #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TemplateHaskell     #-}
{-# OPTIONS_HADDOCK hide #-}
-- |
-- Module      : Repro
--
-- based on Data.Array.Accelerate.Analysis.Hash from accelerate
--
-- Copyright   : [2017] Manuel M T Chakravarty, Gabriele Keller, Trevor L. McDonell
-- License     : BSD3

module Repro where

import Data.ByteString.Builder
import Data.ByteString.Builder.Extra
import Data.Monoid
import Foreign.C.Types
import System.IO.Unsafe                                             ( unsafePerformIO )
import System.Mem.StableName                                        ( hashStableName, makeStableName )
import Prelude                                                      hiding ( exp )

import Data.Array.Accelerate.Type
import Data.Array.Accelerate.Analysis.Hash.TH

{-# INLINE encodeSingleConst #-}
encodeSingleConst :: SingleType t -> t -> Builder
encodeSingleConst (NumSingleType t)    = encodeNumConst t
encodeSingleConst (NonNumSingleType t) = encodeNonNumConst t

{-# INLINE encodeVectorConst #-}
encodeVectorConst :: VectorType t -> t -> Builder
encodeVectorConst (Vector2Type t) (V2 a b)     = intHost $(hashQ "V2") <> encodeSingleConst t a <> encodeSingleConst t b
encodeVectorConst (Vector3Type t) (V3 a b c)   = intHost $(hashQ "V3") <> encodeSingleConst t a <> encodeSingleConst t b <> encodeSingleConst t c
encodeVectorConst (Vector4Type t) (V4 a b c d) = intHost $(hashQ "V4") <> encodeSingleConst t a <> encodeSingleConst t b <> encodeSingleConst t c <> encodeSingleConst t d
encodeVectorConst (Vector8Type t) (V8 a b c d e f g h) =
  intHost $(hashQ "V8") <> encodeSingleConst t a <> encodeSingleConst t b <> encodeSingleConst t c <> encodeSingleConst t d
                        <> encodeSingleConst t e <> encodeSingleConst t f <> encodeSingleConst t g <> encodeSingleConst t h
encodeVectorConst (Vector16Type t) (V16 a b c d e f g h i j k l m n o p) =
  intHost $(hashQ "V16") <> encodeSingleConst t a <> encodeSingleConst t b <> encodeSingleConst t c <> encodeSingleConst t d
                         <> encodeSingleConst t e <> encodeSingleConst t f <> encodeSingleConst t g <> encodeSingleConst t h
                         <> encodeSingleConst t i <> encodeSingleConst t j <> encodeSingleConst t k <> encodeSingleConst t l
                         <> encodeSingleConst t m <> encodeSingleConst t n <> encodeSingleConst t o <> encodeSingleConst t p

{-# INLINE encodeNonNumConst #-}
encodeNonNumConst :: NonNumType t -> t -> Builder
encodeNonNumConst TypeBool{}   x          = intHost $(hashQ "Bool")   <> word8 (fromBool x)
encodeNonNumConst TypeChar{}   x          = intHost $(hashQ "Char")   <> charUtf8 x
encodeNonNumConst TypeCSChar{} (CSChar x) = intHost $(hashQ "CSChar") <> int8 x
encodeNonNumConst TypeCUChar{} (CUChar x) = intHost $(hashQ "CUChar") <> word8 x
encodeNonNumConst TypeCChar{}  (CChar  x) = intHost $(hashQ "CChar")  <> $( [e| int8 |] ) x

{-# INLINE fromBool #-}
fromBool :: Bool -> Word8
fromBool True  = 1
fromBool False = 0

{-# INLINE encodeNumConst #-}
encodeNumConst :: NumType t -> t -> Builder
encodeNumConst (IntegralNumType t) = encodeIntegralConst t
encodeNumConst (FloatingNumType t) = encodeFloatingConst t

{-# INLINE encodeIntegralConst #-}
encodeIntegralConst :: IntegralType t -> t -> Builder
encodeIntegralConst TypeInt{}     x           = intHost $(hashQ "Int")     <> intHost x
encodeIntegralConst TypeInt8{}    x           = intHost $(hashQ "Int8")    <> int8 x
encodeIntegralConst TypeInt16{}   x           = intHost $(hashQ "Int16")   <> int16Host x
encodeIntegralConst TypeInt32{}   x           = intHost $(hashQ "Int32")   <> int32Host x
encodeIntegralConst TypeInt64{}   x           = intHost $(hashQ "Int64")   <> int64Host x
encodeIntegralConst TypeWord{}    x           = intHost $(hashQ "Word")    <> wordHost x
encodeIntegralConst TypeWord8{}   x           = intHost $(hashQ "Word8")   <> word8 x
encodeIntegralConst TypeWord16{}  x           = intHost $(hashQ "Word16")  <> word16Host x
encodeIntegralConst TypeWord32{}  x           = intHost $(hashQ "Word32")  <> word32Host x
encodeIntegralConst TypeWord64{}  x           = intHost $(hashQ "Word64")  <> word64Host x
encodeIntegralConst TypeCShort{}  (CShort x)  = intHost $(hashQ "CShort")  <> int16Host x
encodeIntegralConst TypeCUShort{} (CUShort x) = intHost $(hashQ "CUShort") <> word16Host x
encodeIntegralConst TypeCInt{}    (CInt x)    = intHost $(hashQ "CInt")    <> int32Host x
encodeIntegralConst TypeCUInt{}   (CUInt x)   = intHost $(hashQ "CUInt")   <> word32Host x
encodeIntegralConst TypeCLLong{}  (CLLong x)  = intHost $(hashQ "CLLong")  <> int64Host x
encodeIntegralConst TypeCULLong{} (CULLong x) = intHost $(hashQ "CULLong") <> word64Host x
encodeIntegralConst TypeCLong{}   (CLong x)   = intHost $(hashQ "CLong")   <> $( [e| int64Host |] ) x
encodeIntegralConst TypeCULong{}  (CULong x)  = intHost $(hashQ "CULong")  <> $( [e| word64Host |] ) x

{-# INLINE encodeFloatingConst #-}
encodeFloatingConst :: FloatingType t -> t -> Builder
encodeFloatingConst TypeHalf{}    (Half (CUShort x)) = intHost $(hashQ "Half")    <> word16Host x
encodeFloatingConst TypeFloat{}   x                  = intHost $(hashQ "Float")   <> floatHost x
encodeFloatingConst TypeDouble{}  x                  = intHost $(hashQ "Double")  <> doubleHost x
encodeFloatingConst TypeCFloat{}  (CFloat x)         = intHost $(hashQ "CFloat")  <> floatHost x
encodeFloatingConst TypeCDouble{} (CDouble x)        = intHost $(hashQ "CDouble") <> doubleHost x

In order to compile this, only 3 dependencies need to be downloaded: base-orphans, half, and hashable. You still need the modules from accelerate in place, but not all of them are touched, and most of the dependencies can be skipped. This makes it much easier to build this without Cabal, plug in different compilers, and compile files individually.

Now, it turns out that the above sample, after everything it depends on has been compiled, takes about 30 seconds to compile (with full optimizations), and Core size blows up to about 140k terms. Changing the pragmas for encodeSingleConst and encodeVectorConst to NOINLINE however brings this down to only 3 seconds; changing only encodeVectorConst still gets us 20 seconds.

Tentative hypothesis: simplifier somehow chokes on the TH-generated code in encodeVectorConst.

Last edited 11 months ago by tdammers (previous) (diff)

comment:11 Changed 11 months ago by simonpj

Interesting -- thanks for the smaller code. This is beginning to smell like #13253. See esp comment:22.

encodeVectorConst has stuff like

intHost $(hashQ "V16") <> encodeSingleConst t a
                       <> encodeSingleConst t b 
                       <> etc

where encodeSingleConst is a giant case. So I think we may be over-doing case-of-case as in #13253. I've been trying to work on #13253 but keep getting distracted.

Last edited 11 months ago by simonpj (previous) (diff)

comment:12 Changed 11 months ago by tdammers

Unfortunately, the dump-simpl output is too large to attach (~3 MiB), but almost all of it is indeed one huge case for encodeVectorConst, and it does fit the case-of-case pattern.

So let me see if I understand this correctly: case-of-case is supposed to unwrap constructs of the shape case (case a of b -> c) of d -> e into case a of b -> (case c of d -> e), and normally, the result would then reduce to something simpler than the original case-of-case construct - but in this case, it doesn't. Does that mean what we need is a way to tell when this is going to happen, and in those cases, skip the transform?

And something I don't understand yet is how this relates to #15253 (adding support for type-level integers).

comment:13 Changed 11 months ago by alpmestan

Cc: alpmestan added

comment:14 Changed 11 months ago by simonpj

Darn. I meant #13253, esp comment:22. Sorry about that.

comment:15 Changed 11 months ago by tdammers

Ah, yes, that makes sense. A lot, actually. The situation fits the description like a glove - we have the nested cases, we have small expressions at the leaves, and we have the "-O2 triggers exponential blowup" part.

However, the patch proposed in https://ghc.haskell.org/trac/ghc/ticket/13253#comment:24 does not make things any better - core size still blows up the same (114k terms), and compilation time is about the same. In fact, no matter what conditions I put in that particular spot, I always get the same result, suggesting that this particular program never hits the OneOcc branch at all.

I still think case-of-case is the direction to investigate, but this particular patch does not seem to be the solution.

Changed 11 months ago by tdammers

Attachment: Repro.hs added

Further simplified reproduction case

comment:16 Changed 11 months ago by tdammers

Further simplified reproduction case produces about 25,000 terms as-is, but only 7,000 when marking encodeVectorConst and encodeSingleConst as NOINLINE. I've removed the dependency on TH, some extensions, and all other modules except the .Type one.

comment:17 Changed 11 months ago by tdammers

Here's what the updated example looks like after desugaring:

encodeIntegralConst
encodeIntegralConst
  = \ @ t_aaZ7 ds_dbct x_aaUu ->
      case ds_dbct of {
        TypeInt co_aaZJ _ ->
          <>
            $fSemigroupBuilder
            (intHost (I# 200#))
            (intHost (x_aaUu `cast` <Co:2>));
        TypeInt8 co_aaZR _ ->
          <>
            $fSemigroupBuilder
            (intHost (I# 208#))
            (int8 (x_aaUu `cast` <Co:2>));
        TypeInt16 co_ab00 _ ->
          <>
            $fSemigroupBuilder
            (intHost (I# 216#))
            (int16Host (x_aaUu `cast` <Co:2>));
        -- ...
      }

-- RHS size: {terms: 45, types: 57, coercions: 14, joins: 0/0}
encodeFloatingConst
encodeFloatingConst
  = \ @ t_aaY8 ds_db5K ds_db5L ->
      case ds_db5K of {
        TypeHalf co_aaYn _ ->
          <>
            $fSemigroupBuilder
            (intHost (I# 500#))
            (word16Host
               (((ds_db5L `cast` <Co:2>) `cast` <Co:1>) `cast` <Co:1>));
        TypeFloat co_aaYC _ ->
          <>
            $fSemigroupBuilder
            (intHost (I# 510#))
            (floatHost (ds_db5L `cast` <Co:2>));
        -- ...
      }

-- RHS size: {terms: 10, types: 12, coercions: 0, joins: 0/0}
encodeNumConst
encodeNumConst
  = \ @ t_ab24 ds_dbjh ->
      case ds_dbjh of {
        IntegralNumType t_aaUs -> encodeIntegralConst t_aaUs;
        FloatingNumType t_aaUt -> encodeFloatingConst t_aaUt
      }

-- RHS size: {terms: 5, types: 0, coercions: 0, joins: 0/0}
$trModule
$trModule = Module (TrNameS "main"#) (TrNameS "Repro"#)

-- RHS size: {terms: 11, types: 4, coercions: 0, joins: 0/0}
fromBool
fromBool
  = \ ds_dbjn ->
      case ds_dbjn of {
        False -> fromInteger $fNumWord8 0;
        True -> fromInteger $fNumWord8 1
      }

-- RHS size: {terms: 46, types: 57, coercions: 13, joins: 0/0}
encodeNonNumConst
encodeNonNumConst
  = \ @ t_ab2w ds_dbjr x_aaUn ->
      case ds_dbjr of {
        TypeBool co_ab2I _ ->
          <>
            $fSemigroupBuilder
            (intHost (I# 0#))
            (word8 (fromBool (x_aaUn `cast` <Co:2>)));
        TypeChar co_ab2N _ ->
          <>
            $fSemigroupBuilder
            (intHost (I# 100#))
            (charUtf8 (x_aaUn `cast` <Co:2>));
        -- ...
      }

-- RHS size: {terms: 10, types: 12, coercions: 0, joins: 0/0}
encodeSingleConst
encodeSingleConst
  = \ @ t_ab3l ds_dblJ ->
      case ds_dblJ of {
        NumSingleType t_aaUe -> encodeNumConst t_aaUe;
        NonNumSingleType t_aaUf -> encodeNonNumConst t_aaUf
      }

-- RHS size: {terms: 47, types: 52, coercions: 4, joins: 0/0}
encodeVectorConst
encodeVectorConst
  = \ @ t_ab3r ds_dblP ds_dblQ ->
      case ds_dblP of {
        __DEFAULT ->
          patError "Repro.hs:(29,1)-(30,133)|function encodeVectorConst"#;
        Vector2Type @ a_ab3G co_ab3H t_aaUg ->
          case ds_dblQ `cast` <Co:2> of { V2 a_aaUh b_aaUi ->
          <>
            $fSemigroupBuilder
            (intHost (I# 2#))
            (<>
               $fSemigroupBuilder
               (encodeSingleConst t_aaUg a_aaUh)
               (encodeSingleConst t_aaUg b_aaUi))
          };
        Vector3Type @ a_ab3X co_ab3Y t_aaUj ->
          case ds_dblQ `cast` <Co:2> of { V3 a_aaUk b_aaUl c_aaUm ->
          <>
            $fSemigroupBuilder
            (intHost (I# 3#))
            (<>
               $fSemigroupBuilder
               (encodeSingleConst t_aaUj a_aaUk)
               (<>
                  $fSemigroupBuilder
                  (encodeSingleConst t_aaUj b_aaUl)
                  (encodeSingleConst t_aaUj c_aaUm)))
          }
      }

After inlining, the pattern that we get is something like:

case a of
    A1 b1 ->
        case b1 of
            B1 c -> case c of
                ...
    A2 b2 ->
        case b2 of
            B2 d -> case d of
                ...

But this not the case-of-case pattern at all! It's just a very large nested case, resulting from excessive inlining, and the structure of the whole thing is such that every path through the tree of cases is unique at every step, so statically analyzing the case branches does not lead to any simplification.

In fact, we can achieve the same kind of blowup with the following example code (no dependencies whatsoever):

module SimpleBlowup
where

data Foo
    = Foo1 Bar 
    | Foo2 Bar 
    | Foo3 Bar 
    | Foo4 Bar 
    | Foo5 Bar 
    | Foo6 Bar 
    | Foo7 Bar 
    | Foo8 Bar 
    | Foo9 Bar 
    | Foo10 Bar 
    | Foo11 Bar 
    | Foo12 Bar 
    | Foo13 Bar 
    | Foo14 Bar 
    | Foo15 Bar 
    | Foo16 Bar 
    | Foo17 Bar 
    | Foo18 Bar 
    | Foo19 Bar 
    | Foo20 Bar

data Bar = Bar1 Baz | Bar2 Baz | Bar3 Baz | Bar4 Baz

data Baz
    = Baz1 Int 
    | Baz2 Int 
    | Baz3 Int 
    | Baz4 Int 
    | Baz5 Int 
    | Baz6 Int 
    | Baz7 Int 
    | Baz8 Int 
    | Baz9 Int 
    | Baz10 Int 
    | Baz11 Int 
    | Baz12 Int 
    | Baz13 Int 
    | Baz14 Int 
    | Baz15 Int 
    | Baz16 Int 
    | Baz17 Int 
    | Baz18 Int 
    | Baz19 Int 
    | Baz20 Int

{-#INLINE encodeFoo #-}
encodeFoo :: Foo -> Int
encodeFoo (Foo1 bar) = encodeBar bar + 1
encodeFoo (Foo2 bar) = encodeBar bar + 2
encodeFoo (Foo3 bar) = encodeBar bar + 3
encodeFoo (Foo4 bar) = encodeBar bar + 4
encodeFoo (Foo5 bar) = encodeBar bar + 5
encodeFoo (Foo6 bar) = encodeBar bar + 6
encodeFoo (Foo7 bar) = encodeBar bar + 7
encodeFoo (Foo8 bar) = encodeBar bar + 8
encodeFoo (Foo9 bar) = encodeBar bar + 9
encodeFoo (Foo10 bar) = encodeBar bar + 10
encodeFoo (Foo11 bar) = encodeBar bar + 11
encodeFoo (Foo12 bar) = encodeBar bar + 12
encodeFoo (Foo13 bar) = encodeBar bar + 13
encodeFoo (Foo14 bar) = encodeBar bar + 14
encodeFoo (Foo15 bar) = encodeBar bar + 15
encodeFoo (Foo16 bar) = encodeBar bar + 16
encodeFoo (Foo17 bar) = encodeBar bar + 17
encodeFoo (Foo18 bar) = encodeBar bar + 18
encodeFoo (Foo19 bar) = encodeBar bar + 19
encodeFoo (Foo20 bar) = encodeBar bar + 20

{-#INLINE encodeBar #-}
encodeBar :: Bar -> Int
encodeBar (Bar1 baz) = encodeBaz baz + 1
encodeBar (Bar2 baz) = encodeBaz baz + 2
encodeBar (Bar3 baz) = encodeBaz baz + 3
encodeBar (Bar4 baz) = encodeBaz baz + 4

{-#INLINE encodeBaz #-}
encodeBaz :: Baz -> Int
encodeBaz (Baz1 i) = encodeInt i + 1
encodeBaz (Baz2 i) = encodeInt i + 2
encodeBaz (Baz3 i) = encodeInt i + 3
encodeBaz (Baz4 i) = encodeInt i + 4
encodeBaz (Baz5 i) = encodeInt i + 5
encodeBaz (Baz6 i) = encodeInt i + 6
encodeBaz (Baz7 i) = encodeInt i + 7
encodeBaz (Baz8 i) = encodeInt i + 8
encodeBaz (Baz9 i) = encodeInt i + 9
encodeBaz (Baz10 i) = encodeInt i + 10
encodeBaz (Baz11 i) = encodeInt i + 11
encodeBaz (Baz12 i) = encodeInt i + 12
encodeBaz (Baz13 i) = encodeInt i + 13
encodeBaz (Baz14 i) = encodeInt i + 14
encodeBaz (Baz15 i) = encodeInt i + 15
encodeBaz (Baz16 i) = encodeInt i + 16
encodeBaz (Baz17 i) = encodeInt i + 17
encodeBaz (Baz18 i) = encodeInt i + 18
encodeBaz (Baz19 i) = encodeInt i + 19
encodeBaz (Baz20 i) = encodeInt i + 20

{-#INLINE encodeInt #-}
encodeInt :: Int -> Int
encodeInt i = (i * 47 + 31) `mod` 17

This blows up Core size to about 45,000 terms. Changing all the INLINEs to NOINLINE however, we only get 1,200 terms.

So AFAICT, this isn't case-of-case being overly eager, it's just GHC obediently honoring those INLINE pragmas, which legit produces a lot of Core, and due to the structure of the things being inlined, the usual crossing-off of obvious non-matches isn't possible.

This is probably not something you encounter a lot in the wild, because in order to trigger this in a noticable way, the following conditions must be met:

  • Several layers of function applications forming a nested pattern-match
  • ...each of them marked INLINE
  • ...and essentially consisting of a pattern-matching construct with many (tens or more) branches

I presume that this would typically involve nested data types with many constructors on multiple levels.

comment:18 Changed 11 months ago by simonpj

So AFAICT, this isn't case-of-case being overly eager, it's just GHC obediently honoring those INLINE pragmas, which legit produces a lot of Core

Yes you are right about that.

  • But why did it not happen in earlier versions of GHC, which should have been equally obedient?

I'm worried that this kind of blow-up can happen even without INLINE pragmas:

f1 x = if x>0 then 0 else 1
f2 x = if x>0 then f1 (x-1) else f1 (x-2)
f3 x = if x>0 then f2 (x-1) else f2 (x-2)
f4 x = if x>0 then f3 (x-1) else f3 (x-2)
h x = f x

Now

  • f4 looks small, so we could inline it at its call in h
  • Now we have two calls to f3; but f3 is small so we can inline them both.
  • Now we have four calls to f2; but f2 is small so we can inline them all.

...and so on.

This happens if we inline "bottom up". If instead we did "top-down" we might inline f1 into f2, and f2 into f3... but then f3 would look big so we would not inline it into f4. But we are clearly walking close to the precipice.

Who writes such function nests? Well, accelerate perhaps (but see the above question). But they also arise naturally from the join points created from deeply-nested case-of-case transforms, and that's the relevance of the case-of-case stuff.

So we have two threads to pursue

  • Why does accelerate have INLINE pragmas on these nested functions? Obedience to those pragmas will certainly cause trouble.
  • How can we avoid blow-up when (absent INLINE pragmas) such definition nests occur naturally? This is the thing I've been thinking about.

comment:19 Changed 11 months ago by tdammers

But why did it not happen in earlier versions of GHC, which should have been equally obedient?

Do we have evidence of this? So far, I have only tested accelerate against 8.4 and HEAD, and both show the blow-up. AFAICT, the accelerate situation is rather specific, I don't see people come up with code like this a lot, not with the explicit inlining anyway. Which leaves those situations where GHC decides to inline on its own; but for those cases, my guess would be that the explanation why this happens now and not before is because we simply optimize more aggressively now. I'll run accelerate and the reproduction case against an older GHC just to double check, but I'd expect things to still blow up.

Why does accelerate have INLINE pragmas on these nested functions? Obedience to those pragmas will certainly cause trouble.

Well, what do you know, there's already a ticket: https://github.com/AccelerateHS/accelerate/issues/428. I took the liberty to comment on that, so hopefully we'll get some feedback on the matter from there.

How can we avoid blow-up when (absent INLINE pragmas) such definition nests occur naturally? This is the thing I've been thinking about.

So essentially this boils down to figuring out whether inlining is going to pay off; and the challenge is that just looking at the size before and after inlining isn't going to be enough, because normally we expect the large post-inlining (or post-case-of-case-transform) core to be large but sufficiently shrinkable (by crossing off obvious non-matches). So what we would need is a way to tell whether our code, once blown up, will indeed be shrinkable, either before we blow it up, or right after (but before any further transformations, because those are now potentially expensive).

This is probably something that should be addressed via #13253 though; it's not what causes the problem in this case here.

comment:20 Changed 11 months ago by tdammers

Update: it appears that accelerate is already slow to compile on GHC 8.0.2, so this doesn't seem to be a new issue - more likely, it's one that we run into more often now.

comment:21 Changed 11 months ago by simonpj

Cc: benl@… added

I talked to Ben Lippmeier at ICFP, who said he'd look into the accelerate end of this ticket, and discuss it with Trevor.

comment:22 Changed 11 months ago by George

Shouldn't the milestone be moved to 8.8.1?

comment:23 Changed 11 months ago by bgamari

Milestone: 8.6.18.8.1

Yep!

comment:24 Changed 10 months ago by tdammers

comment:25 Changed 10 months ago by tdammers

The ZipWith module peformance problems seem to have a different cause; we'll look into that one separately in #15751.

comment:26 Changed 8 months ago by osa1

Milestone: 8.8.18.10.1

Bumping milestones of some high and highest priority tickets.

Note: See TracTickets for help on using tickets.