Opened 6 years ago

Closed 6 years ago

Last modified 5 years ago

#7994 closed bug (fixed)

Make foldl into a good consumer

Reported by: simonpj
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 7.6.3
Keywords:
Cc: tkn.akio@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

I really want foldl to be a good consumer, but our arity/cardinality analysis still isn't up to it. Here's a test case, derived from nofib's x2n1:

module Foo( foo ) where
import Data.Complex

foo x = sum [f n | n <- [1 .. x]]

f :: Int -> Complex Double
{-# NOINLINE f #-}
f n = mkPolar 1 ((2*pi)/fromIntegral n) ^ n
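For context: a "good consumer" is a function written in terms of foldr, so that the foldr/build fusion rule can remove the intermediate list altogether. The rule in GHC.Base is essentially:

{-# RULES
"fold/build"    forall k z (g :: forall b. (a -> b -> b) -> b -> b).
                foldr k z (build g) = g k z
 #-}

The enumeration [1 .. x] and the comprehension both produce their list via build, so a foldr-based sum would let the whole pipeline in foo fuse into a single loop with no intermediate list.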

With the patch below (which is what I'd like to use), we get very obviously bad code:

Foo.foo :: GHC.Types.Int -> Data.Complex.Complex GHC.Types.Double
Foo.foo =
  \ (x_aia :: GHC.Types.Int) ->
    case x_aia of _ { GHC.Types.I# y_aye ->
    case GHC.Prim.># 1 y_aye of _ {
      GHC.Types.False ->
        letrec {
          go_aD8 [Occ=LoopBreaker]
            :: GHC.Prim.Int#
               -> Data.Complex.Complex GHC.Types.Double
               -> Data.Complex.Complex GHC.Types.Double
          go_aD8 =
            \ (x_aD9 :: GHC.Prim.Int#) ->
              let {
                ds_doR [Lbv=OneShot]
                  :: Data.Complex.Complex GHC.Types.Double
                     -> Data.Complex.Complex GHC.Types.Double
                ds_doR =
                  case GHC.Prim.==# x_aD9 y_aye of _ {
                    GHC.Types.False -> go_aD8 (GHC.Prim.+# x_aD9 1);
                    GHC.Types.True ->
                      GHC.Base.id @ (Data.Complex.Complex GHC.Types.Double)
                  } } in
              let {
                ds_aCs :: Data.Complex.Complex GHC.Types.Double
                ds_aCs = Foo.f (GHC.Types.I# x_aD9) } in
              \ (ds2_aCu :: Data.Complex.Complex GHC.Types.Double) ->
                ds_doR (Data.Complex.$fFloatingComplex_$s$c+ ds2_aCu ds_aCs); } in
        go_aD8
          1
          (Data.Complex.:+
             @ GHC.Types.Double
             (GHC.Types.D# 0.0)
             Data.Complex.$fFloatingComplex1);
      GHC.Types.True ->
        Data.Complex.:+
          @ GHC.Types.Double
          (GHC.Types.D# 0.0)
          Data.Complex.$fFloatingComplex1   }   }

The local go function should have arity 2.

The patch below is the one I'd like to apply to base:

simonpj@cam-05-unx:~/code/HEAD/libraries/base$ git diff
diff --git a/Data/List.hs b/Data/List.hs
index e7e8602..a2e7ac0 100644
--- a/Data/List.hs
+++ b/Data/List.hs
@@ -1,5 +1,5 @@
 {-# LANGUAGE Trustworthy #-}
-{-# LANGUAGE CPP, NoImplicitPrelude, MagicHash #-}
+{-# LANGUAGE CPP, NoImplicitPrelude, ScopedTypeVariables, MagicHash #-}
 
 -----------------------------------------------------------------------------
 -- |
@@ -995,11 +995,15 @@ unfoldr f b  =
 -- -----------------------------------------------------------------------------
 
 -- | A strict version of 'foldl'.
-foldl'           :: (b -> a -> b) -> b -> [a] -> b
+foldl' :: forall a b. (b -> a -> b) -> b -> [a] -> b
 #ifdef __GLASGOW_HASKELL__
+{-# INLINE foldl' #-}
+foldl' k z xs = foldr (\(v::a) (fn::b->b) (z::b) -> z `seq` fn (k z v)) (id :: b -> b) xs z
+{-
 foldl' f z0 xs0 = lgo z0 xs0
     where lgo z []     = z
           lgo z (x:xs) = let z' = f z x in z' `seq` lgo z' xs
+-}
 #else
 foldl' f a []     = a
 foldl' f a (x:xs) = let a' = f a x in a' `seq` foldl' f a' xs
@@ -1022,6 +1026,17 @@ foldl1' _ []             =  errorEmptyList "foldl1'"
 -- -----------------------------------------------------------------------------
 -- List sum and product
 
+-- | The 'sum' function computes the sum of a finite list of numbers.
+sum                     :: (Num a) => [a] -> a
+-- | The 'product' function computes the product of a finite list of numbers.
+product                 :: (Num a) => [a] -> a
+
+{-# INLINE sum #-}
+sum                     =  foldl (+) 0
+{-# INLINE product #-}
+product                 =  foldl (*) 1
+
+{-
 {-# SPECIALISE sum     :: [Int] -> Int #-}
 {-# SPECIALISE sum     :: [Integer] -> Integer #-}
 {-# INLINABLE sum #-}
@@ -1048,6 +1063,7 @@ product l       = prod l 1
     prod []     a = a
     prod (x:xs) a = prod xs (a*x)
 #endif
+-}
 
 -- -----------------------------------------------------------------------------
 -- Functions on strings
diff --git a/GHC/List.lhs b/GHC/List.lhs
index 049aa2a..87c93ae 100644
--- a/GHC/List.lhs
+++ b/GHC/List.lhs
@@ -1,6 +1,6 @@
 \begin{code}
 {-# LANGUAGE Trustworthy #-}
-{-# LANGUAGE CPP, NoImplicitPrelude, MagicHash #-}
+{-# LANGUAGE CPP, NoImplicitPrelude, ScopedTypeVariables, MagicHash #-}
 {-# OPTIONS_HADDOCK hide #-}
 
 -----------------------------------------------------------------------------
@@ -179,11 +179,17 @@ filterFB c p x r | p x       = x `c` r
 -- can be inlined, and then (often) strictness-analysed,
 -- and hence the classic space leak on foldl (+) 0 xs
 
+foldl :: forall a b. (b -> a -> b) -> b -> [a] -> b
+{-# INLINE foldl #-}
+foldl k z xs = foldr (\(v::a) (fn::b->b) (z::b) -> fn (k z v)) (id :: b -> b) xs z
+
+{-
 foldl        :: (b -> a -> b) -> b -> [a] -> b
 foldl f z0 xs0 = lgo z0 xs0
              where
                 lgo z []     =  z
                 lgo z (x:xs) = lgo (f z x) xs
+-}
 
 -- | 'scanl' is similar to 'foldl', but returns a list of successive
 -- reduced values from the left:

Change History (23)

comment:1 Changed 6 years ago by akio

Cc: tkn.akio@… added

comment:2 Changed 6 years ago by nomeata

Yesterday it looked as if this had magically improved, but with the example from the ticket, made self-contained as follows:

module Foo( foo ) where
import Data.Complex

import Prelude hiding (sum, foldl)

foldl k z xs = foldr (\v fn z -> fn (k z v)) id xs z
{-# INLINE foldl #-}

sum                     =  foldl (+) 0
{-# INLINE sum #-}

foo x = sum [f n | n <- [1 .. x]]

f :: Int -> Complex Double
{-# NOINLINE f #-}
f n = mkPolar 1 ((2*pi)/fromIntegral n) ^ n

I still get bad code:

Foo.$wfoo :: GHC.Prim.Int# -> Data.Complex.Complex GHC.Types.Double
[GblId,
 Arity=1,
 Str=DmdType <L,U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=1, Value=True,
         ConLike=True, WorkFree=True, Expandable=True,
         Guidance=IF_ARGS [0] 225 0}]
Foo.$wfoo =
  \ (ww_s1nN :: GHC.Prim.Int#) ->
    case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.># 1 ww_s1nN)
    of _ [Occ=Dead] {
      GHC.Types.False ->
        letrec {
          go_a1k0 [Occ=LoopBreaker]
            :: GHC.Prim.Int#
               -> Data.Complex.Complex GHC.Types.Double
               -> Data.Complex.Complex GHC.Types.Double
          [LclId, Arity=1, Str=DmdType <L,U>]
          go_a1k0 =
            \ (x_a1k1 :: GHC.Prim.Int#) ->
              let {
                v_ayO [Dmd=<L,U(U(U),U(U))>]
                  :: Data.Complex.Complex GHC.Types.Double
                [LclId, Str=DmdType]
                v_ayO = Foo.f (GHC.Types.I# x_a1k1) } in
              let {
                ds_dWw [Dmd=<L,C(U)>]
                  :: Data.Complex.Complex GHC.Types.Double
                     -> Data.Complex.Complex GHC.Types.Double
                [LclId, Str=DmdType]
                ds_dWw =
                  case GHC.Prim.tagToEnum#
                         @ GHC.Types.Bool (GHC.Prim.==# x_a1k1 ww_s1nN)
                  of _ [Occ=Dead] {
                    GHC.Types.False -> go_a1k0 (GHC.Prim.+# x_a1k1 1);
                    GHC.Types.True ->
                      GHC.Base.id @ (Data.Complex.Complex GHC.Types.Double)
                  } } in
              \ (z_ayQ :: Data.Complex.Complex GHC.Types.Double) ->
                ds_dWw (Data.Complex.$fFloatingComplex_$s$c+ z_ayQ v_ayO); } in
        go_a1k0 1 Foo.foo1;
      GHC.Types.True -> Foo.foo1
    }

comment:3 Changed 6 years ago by nomeata

So, this is what we want to happen, somehow.

We want to transform code of the form

let f = λ x. let h = f (g1 x)
             in λ y... h (g2 y) ...
in ...f 1 2..... f 3 4 ...

into

let f = λ x y. let h = f (g1 x)
               in ... h (g2 y) ...
in ...f 1 2..... f 3 4 ...

If g1 is cheap, this is already done (see [Arity analysis] in CoreArity). But in order to implement foldl in terms of foldr and get list fusion for it (which would be nice), this needs to work also when g1 is expensive.
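
To make this concrete, here is a hypothetical stand-alone rendering of the shape (the names, the accumulator type Int, and the bound 10 are made up for illustration; this is the shape a fused foldl-via-foldr loop takes):

module LoopShape where

-- go has manifest arity 1: building the continuation fn involves the
-- recursive call go (x + 1), which arity analysis treats as expensive,
-- so GHC will not eta-expand go on its own. Giving go arity 2 is only
-- sound if every partial application `go x` is called at most once.
go :: Int -> Int -> Int
go x = let fn = if x == 10 then id else go (x + 1)  -- plays the role of h = f (g1 x)
       in \z -> fn (z + x)                          -- plays the role of λy. h (g2 y)

-- e.g. go 1 0 sums the integers 1..10 into the accumulator z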

In a slightly more general setting, the transformation would not be ok, e.g. in

let f = λ x. let h = f (g1 x)
             in λ y... h (g2 y) ...
in ...map (f 1)....

or

let f = λ x. let h = f (g1 x)
             in λ y... map h...
in ...f 1 2..... f 3 4 ...

because the expensive g1 would no longer be shared.

So we want an analysis that

  • looks at the call-sites of f
  • notices that the demand put on f from the body of the let, C(C¹(S)), is actually better than the vanilla demand C(S),
  • makes sure that with this demand put on f, its definition (the rhs) now also has this demand (i.e. no sharing is lost in the right-hand side),
  • uses this to mark the λ y as one-shot, which will presumably allow the desired optimization.

The demand analyser currently analyses the function before the body. One could do a second round, but there is a risk of exponential blow-up.

comment:4 Changed 6 years ago by nomeata

The demand analyser currently analyses the function before the body. One could do a second round, but there is a risk of exponential blow-up.

In this case, the let is a recursive let anyway, so we are going to do multiple iterations. Do we really need to avoid re-analysis so hard?

comment:5 Changed 6 years ago by nomeata

I gave it a shot, analysing the body and the rhs of a let in alternation until a fixed point is found with regard to

  • demand on the rhs
  • rhs’s demand type
  • body’s demand type

and as expected, the resulting core looks good:

          $wgo_s1oi [Occ=LoopBreaker]
            :: GHC.Prim.Int#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> Data.Complex.Complex GHC.Types.Double
          [LclId, Arity=3, Str=DmdType <S,U><L,U><L,U>]
          $wgo_s1oi =
            \ (w_s1o3 :: GHC.Prim.Int#)
              (ww1_s1oa :: GHC.Prim.Double#)
              (ww2_s1of :: GHC.Prim.Double#) ->
              case Foo.f (GHC.Types.I# w_s1o3)
              of _ [Occ=Dead] { Data.Complex.:+ ww4_s1nS ww5_s1nX ->
              case ww4_s1nS of _ [Occ=Dead] { GHC.Types.D# ww7_s1re ->
              case ww5_s1nX of _ [Occ=Dead] { GHC.Types.D# ww9_s1rg ->
              case GHC.Prim.tagToEnum#
                     @ GHC.Types.Bool (GHC.Prim.==# w_s1o3 ww_s1om)
              of _ [Occ=Dead] {
                GHC.Types.False ->
                  $wgo_s1oi
                    (GHC.Prim.+# w_s1o3 1)
                    (GHC.Prim.+## ww1_s1oa ww7_s1re)
                    (GHC.Prim.+## ww2_s1of ww9_s1rg);
                GHC.Types.True ->
                  Data.Complex.:+
                    @ GHC.Types.Double
                    (GHC.Types.D# (GHC.Prim.+## ww1_s1oa ww7_s1re))
                    (GHC.Types.D# (GHC.Prim.+## ww2_s1of ww9_s1rg))
              }
              }
              }
              }; } in
        $wgo_s1oi 1 0.0 0.0;

But also as expected, for other code, with more nested lets, compilation time becomes unacceptably large.

Has anyone considered doing the fixed-pointing not locally for every let, but across the whole top-level binding? I.e. initialise all demands and strictness signatures with their strongest possible value (absDmd and botDmdType), and then analyse the whole expression in repeated passes until nothing changes any more. (Or, a variant: do that initialisation, but also do local fixed-pointing, starting from the stored values, so that when an inner let is re-analysed and nothing has changed, only one iteration is needed.)

comment:6 Changed 6 years ago by nomeata

(Or, a variant: do that initialisation, but also do local fixed-pointing, starting from the stored values, so that when an inner let is re-analysed and nothing has changed, only one iteration is needed.)

Ah, so that is what ae_virgin is all about... doing the same trick in my code helps keep compilation times down; but I must be introducing bugs somewhere: compiled with my code, GHC itself goes into infinite loops. But that is a problem for another day.

comment:7 Changed 6 years ago by nomeata

Ok, here is a very different approach to solving the original problem of this ticket, namely making foldl a good consumer.

Instead of trying hard to make the compiler sufficiently smart, why not help it a little bit? So here is what I did:

I created a new magic function oneShot (similar to lazy etc.). Semantically, oneShot = λ f λ x. f x, but the unfolding will put a OneShot annotation on the binder for x. So essentially, it allows the programmer to tell the compiler: I promise I will only call this function once (and if I break my promise, I won’t blame you).
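
A minimal model of the idea (a sketch; the real oneShot is compiler-known, added as GHC.Prim.oneShot in comment:22 below and nowadays exported from GHC.Exts):

-- Semantically the identity on functions. The magic is entirely in the
-- compiler: oneShot's unfolding is λ f (x[OneShot]) → f x, so the
-- OneShot annotation on x tells later passes the function is applied
-- at most once, licensing eta-expansion without loss of sharing.
oneShot :: (a -> b) -> a -> b
oneShot f = f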

So now one would define

foldl k z xs = foldr (\v fn -> oneShot (\z -> fn (k z v))) id xs z

(because at this point, we know that the result of foldr ... id xs is not going to be used multiple times) and voilà – good code:

Foo.$wfoo
  :: GHC.Prim.Int# -> (# GHC.Types.Double, GHC.Types.Double #)
[GblId,
 Arity=1,
 Str=DmdType <L,U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=1, Value=True,
         ConLike=True, WorkFree=True, Expandable=True,
         Guidance=IF_ARGS [0] 249 30}]
Foo.$wfoo =
  \ (ww_s1ox :: GHC.Prim.Int#) ->
    case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.># 1 ww_s1ox)
    of _ [Occ=Dead] {
      GHC.Types.False ->
        letrec {
          $wgo_s1ot [Occ=LoopBreaker]
            :: GHC.Prim.Int#
               -> GHC.Prim.Double#
               -> GHC.Prim.Double#
               -> (# GHC.Types.Double, GHC.Types.Double #)
          [LclId, Arity=3, Str=DmdType <S,U><L,U><L,U>]
          $wgo_s1ot =
            \ (w_s1oc :: GHC.Prim.Int#)
              (ww1_s1oj :: GHC.Prim.Double#)
              (ww2_s1oo :: GHC.Prim.Double#) ->
              case Foo.f (GHC.Types.I# w_s1oc)
              of _ [Occ=Dead] { Data.Complex.:+ ww8_aYl ww9_aYm ->
              case ww8_aYl of _ [Occ=Dead] { GHC.Types.D# ww11_s1sa ->
              case ww9_aYm of _ [Occ=Dead] { GHC.Types.D# ww13_s1sc ->
              case GHC.Prim.tagToEnum#
                     @ GHC.Types.Bool (GHC.Prim.==# w_s1oc ww_s1ox)
              of _ [Occ=Dead] {
                GHC.Types.False ->
                  $wgo_s1ot
                    (GHC.Prim.+# w_s1oc 1)
                    (GHC.Prim.+## ww1_s1oj ww11_s1sa)
                    (GHC.Prim.+## ww2_s1oo ww13_s1sc);
                GHC.Types.True ->
                  (# GHC.Types.D# (GHC.Prim.+## ww1_s1oj ww11_s1sa),
                     GHC.Types.D# (GHC.Prim.+## ww2_s1oo ww13_s1sc) #)
              }
              }
              }
              }; } in
        $wgo_s1ot 1 0.0 0.0;
      GHC.Types.True -> (# Foo.foo1, Data.Complex.$fFloatingComplex1 #)
    }
Last edited 6 years ago by nomeata

comment:8 Changed 6 years ago by nomeata

Hmm, it seems that the oneShot information is thrown away before generating the interface file, so the use of oneShot does not help across module boundaries for now; this is the unfolding:

  foldl :: forall a b. (b -> a -> b) -> b -> [a] -> b
    {- Arity: 3, HasNoCafRefs, Strictness: <L,C(C1(U))><L,U><S,1*U>,
       Inline: INLINE (sat-args=3),
       Unfolding: InlineRule (3, False, False)
                  (\ @ a @ b k :: b -> a -> b z :: b xs :: [a] ->
                   GHC.Base.foldr
                     @ a
                     @ (b -> b)
                     (\ ds :: a ds1 :: b -> b tpl :: b -> ds1 (k tpl ds))
                     (GHC.Base.id @ b)
                     xs
                     z) -}

comment:9 Changed 6 years ago by nomeata

The trick with oneShot is neat, and it works for foo x = sum $ [f i | i <- [1 .. x]] and foo x = sum $ [f i | i <- [1 .. x], odd i] (note the filter), but does not yield optimal code for nested iterations like foo x = sum $ concat [[f i | i <- [1 .. n]] | n <- [1 .. x]], where we get:

        letrec {
          go_a1mU [Occ=LoopBreaker]
            :: GHC.Prim.Int#
               -> Data.Complex.Complex GHC.Types.Double
               -> Data.Complex.Complex GHC.Types.Double
          [LclId, Arity=1, Str=DmdType <L,U>]
          go_a1mU =
            \ (x_a1mV :: GHC.Prim.Int#) ->
              let {
                n_X1nv
                  :: Data.Complex.Complex GHC.Types.Double
                     -> Data.Complex.Complex GHC.Types.Double
                [LclId, Str=DmdType]
                n_X1nv =
                  case GHC.Prim.tagToEnum#
                         @ GHC.Types.Bool (GHC.Prim.==# x_a1mV ww_s1ro)
                  of _ [Occ=Dead] {
                    GHC.Types.False -> go_a1mU (GHC.Prim.+# x_a1mV 1);
                    GHC.Types.True ->
                      GHC.Base.id @ (Data.Complex.Complex GHC.Types.Double)
                  } } in
              case GHC.Prim.tagToEnum# @ GHC.Types.Bool (GHC.Prim.># 1 x_a1mV)
              of _ [Occ=Dead] {
                GHC.Types.False ->
                  letrec {
                    go1_X1nG [Occ=LoopBreaker]
                      :: GHC.Prim.Int#
                         -> Data.Complex.Complex GHC.Types.Double
                         -> Data.Complex.Complex GHC.Types.Double
                    [LclId, Arity=2, Str=DmdType <L,U><L,1*U(U(U),U(U))>]
                    go1_X1nG =
                      \ (x1_X1nI :: GHC.Prim.Int#)
                        (eta_B1 :: Data.Complex.Complex GHC.Types.Double) ->
                        let {
                          a_s1pu [Dmd=<L,U(U(U),U(U))>]
                            :: Data.Complex.Complex GHC.Types.Double
                          [LclId, Str=DmdType]
                          a_s1pu =
                            case eta_B1 of _ [Occ=Dead] { Data.Complex.:+ ww2_a10M ww3_a10N ->
                            case ww2_a10M of _ [Occ=Dead] { GHC.Types.D# ww5_s1ub ->
                            case ww3_a10N of _ [Occ=Dead] { GHC.Types.D# ww7_s1ud ->
                            case Foo.f (GHC.Types.I# x1_X1nI)
                            of _ [Occ=Dead] { Data.Complex.:+ ww9_a10Z ww10_a110 ->
                            case ww9_a10Z of _ [Occ=Dead] { GHC.Types.D# ww12_s1uf ->
                            case ww10_a110 of _ [Occ=Dead] { GHC.Types.D# ww14_s1uh ->
                            Data.Complex.:+
                              @ GHC.Types.Double
                              (GHC.Types.D# (GHC.Prim.+## ww5_s1ub ww12_s1uf))
                              (GHC.Types.D# (GHC.Prim.+## ww7_s1ud ww14_s1uh))
                            }
                            }
                            }
                            }
                            }
                            } } in
                        case GHC.Prim.tagToEnum#
                               @ GHC.Types.Bool (GHC.Prim.==# x1_X1nI x_a1mV)
                        of _ [Occ=Dead] {
                          GHC.Types.False -> go1_X1nG (GHC.Prim.+# x1_X1nI 1) a_s1pu;
                          GHC.Types.True -> n_X1nv a_s1pu
                        }; } in
                  go1_X1nG 1;
                GHC.Types.True -> n_X1nv
              }; } in
        go_a1mU 1 Foo.foo1;

I guess we’d want n_X1nv to have arity one here, and be eta-expanded, so that it turns into a join point, don’t we?

comment:10 Changed 6 years ago by nomeata

Interesting: Even without the oneShot marker surviving the unfolding, nofib shows some extreme changes in allocations:

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
     bernouilli          +0.1%     -4.2%      0.12      0.12     +0.0%
           fft2          +0.0%   +125.4%      0.04      0.05     +0.0%
    gen_regexps          +0.0%     -9.4%      0.00      0.00     +0.0%
      integrate          +0.1%   +144.2%      0.14      0.14     +1.0%
        minimax          +0.0%     -3.8%      0.00      0.00     +0.0%
         simple          +0.1%     -7.5%      0.15      0.15    -14.7%
           x2n1          +0.1%    -45.2%      0.00      0.00    -50.0%
--------------------------------------------------------------------------------
            Min          -0.9%    -45.2%    -27.5%    -27.7%    -50.0%
            Max          +0.2%   +144.2%     +3.1%     +3.1%   +100.0%
 Geometric Mean          +0.0%     +0.8%     -5.9%     -5.7%     -0.3%

comment:11 Changed 6 years ago by nomeata

Heh, in the previous numbers, I passed the arguments to nofib-analyze in the wrong order. So please swap signs.

I now made sure that the oneShot makes it into the unfolding for foldl, which is a clear win. Now, relative to master again, I get this very nice result:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
     bernouilli          -0.1%     +1.5%      0.14      0.14     +0.0%
   cryptarithm2          -0.0%     -0.5%      0.01      0.01     +0.0%
            fem          -0.0%     -2.8%      0.02      0.02     +0.0%
           fft2          -0.1%    -55.9%      0.04      0.04     +0.0%
    gen_regexps          -0.0%    -33.6%      0.00      0.00     +0.0%
      integrate          -0.1%    -59.4%      0.13      0.13     -1.0%
        minimax          -0.0%    -15.6%      0.00      0.00     +0.0%
       nucleic2          +0.9%     +1.7%      0.05      0.05     +0.0%
            scs          -0.0%     -0.9%     +1.0%     +0.5%     +0.0%
         simple          -0.1%     -9.4%      0.14      0.14     +0.0%
           x2n1          -0.1%    -74.5%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          -0.2%    -74.5%     -3.0%     -3.0%    -50.0%
            Max          +0.9%     +1.7%     +4.0%     +3.4%    +10.4%
 Geometric Mean          -0.0%     -3.7%     -0.2%     -0.3%     -0.6%

So it seems to be well worth turning foldl into a good consumer, even if the resulting code is not perfect for complicated cases like foldl .. .. $ concat $ ...

The increase for bernouilli comes from this code line:

sum $ zipWith (*) powers (tail $ tail combs)

where originally, because sum is not a good consumer, no list fusion happens and a sum specialised to Integer is used. After the change, sum is fused with zipWith, but no further, so now we have

foldr2 (\x y r eta -> r (eta + (x * y))) id powers (tail $ tail combs) 0

This means that we have eliminated one intermediate list, but we are now allocating PAPs on each iteration, which is more expensive (three arguments instead of two!). This is verified by looking at the ticky numbers. Maybe foldr2 should have been inlined here. Or we just live with it.

Is this (explicit one-shot annotations using a magic function) a path we want to go down?

comment:12 Changed 6 years ago by nomeata

I have come up with a stand-alone analysis that tries to find out the “caller arity” of a function. It is quite a bit like the usage demand, but differs from the demand analyser in a few aspects: it does not bother with the demand a function puts on its arguments (so much less fixed-pointing is needed), and it currently looks only for tail calls (or almost-tail-calls: some post-processing of the return value after the “tail call” is ok, as long as it does not do any relevant calls).
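
A hypothetical illustration of the “almost-tail-call” notion (names made up; my reading of the description above):

f :: Int -> Int
f n = n * n   -- some function whose caller arity we are interested in

g :: Int -> Int
g x = f (x * 2) + 1   -- f's result is post-processed with (+ 1), but no
                      -- further calls happen afterwards, so the call to f
                      -- still counts as tail-like for this analysis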

It is an analysis phase that puts the result in the IdInfo, and the simplifier will eta-expand functions where the “caller arity” is higher than the manifest arity.

It has zero effect on the current code base.

If I implement foldl via foldr (but without using oneShot), I get essentially the same result as without the analysis but with the explicit oneShot:

            Min          -0.2%    -74.5%     -3.0%     -3.0%    -50.0%
            Max          +0.9%     +1.8%     +2.5%     +2.5%    +14.8%
 Geometric Mean          -0.0%     -3.7%     -0.3%     -0.3%     -0.5%

It still does not help with bernouilli, and probably nothing will: implementing foldl via foldr will always yield bad code if list fusion does not work all the way and the build (or in this case, build2) is not completely gone. But if list fusion works, the results can be very good.

The oneShot change is much simpler, has little risk of introducing bugs and does not increase compilation times. The “caller arity” pass, OTOH, may optimise other code as well (but at least nofib does not seem to contain much such code), and can likely be made more precise. But the conclusion is that foldl can be made a good consumer!

comment:13 Changed 6 years ago by nomeata

JFTR: Simon discussed this with me and encouraged me to pursue the analysis (but don’t merge it before 7.8 has been branched off).

But even more so I’d like to record a few things about the oneShot approach:

  • Would also be nice, and can still be revived if we find cases where the analysis fails.
  • It should always be inlined when passed two arguments.
  • Could be made to inline very very late, but then it should have a strictness signature.
  • In that case, one would have to make sure it does not get in the way of CPR or other analysis.

comment:14 Changed 6 years ago by Joachim Breitner <mail@…>

In 4a8ffcf55c897851b534c700951a0b5bdd43eb97/base:

go-ify foldr2

This helps with the changes in #7994, but might also generally be a good
idea (ignore the runtime):

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
           fft2          -0.1%     -1.5%      0.07      0.07     +0.0%
       fibheaps          +0.0%    -17.2%      0.03      0.03     +0.0%
          fluid          +0.5%     -0.7%      0.01      0.01     +0.0%
      integrate          +0.0%     -0.9%      0.16      0.16     +0.0%
        rewrite          +0.0%     -1.1%      0.02      0.02     +0.0%
--------------------------------------------------------------------------------
            Min          -0.1%    -17.2%     -1.6%     +0.0%     +0.0%
            Max          +0.5%     +0.0%   +107.7%   +106.2%    +11.3%
 Geometric Mean          +0.0%     -0.2%    +23.7%    +23.9%     +0.1%
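
For reference, a go-ified foldr2 presumably looks roughly like this (paraphrased; the authoritative version is in the commit itself):

foldr2 :: (a -> b -> c -> c) -> c -> [a] -> [b] -> c
foldr2 k z = go
  where
    go []     _      = z
    go _      []     = z
    go (x:xs) (y:ys) = k x y (go xs ys)
-- The point of "go-ifying": k and z are captured once in go's closure
-- instead of being passed along on every recursive call, which tends to
-- produce a tighter loop after the simplifier has run.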

comment:15 Changed 6 years ago by nomeata

With the patch above, the benefits from foldl-via-foldr and the call arity analysis are

            Min          -0.2%    -74.5%     -9.6%    -10.6%    -50.0%
            Max          +0.9%     +1.7%     +4.1%     +4.1%     +6.3%
 Geometric Mean          -0.0%     -4.0%     -0.6%     -0.7%     -0.7%

with only one increase: nucleic2. It comes from a use of maximum, and it goes away when maximum is changed from NOINLINE [1] to INLINE [1] (GHC was inlining maximum while foldl was implemented naively, but not any more, so let's give it a push). Then we get the very pleasing result that not a single benchmark shows an allocation increase:

            Min          -0.1%    -74.5%     -6.8%     -8.3%    -50.0%
            Max          +0.2%      0.0%    +38.5%    +38.5%      0.0%
 Geometric Mean          -0.0%     -4.1%     +7.7%     +7.7%     -0.8%

This is the current state in the branch wip/T7994-calledArity of ghc, and the branch wip/T7994 of base.

The effect on compilation times needs to be measured, and quite possibly improved (e.g. not a full simplifier run after it, but only eta-expansion where possible and then a gentle simplification of the RHS).

comment:16 Changed 6 years ago by nomeata

Previously I said that my analysis makes no change in nofib if one does not also change foldl to use foldr. That is no longer quite true with the latest code: there are a few allocation improvements, up to -1.3% (in anna, stemming from take 30 [... | ...]). Not a surprise: take is defined via the continuation-passing takeFB :: (a -> b -> b) -> b -> a -> (Int# -> b) -> Int# -> b.

comment:17 Changed 6 years ago by nomeata

Despite my initial worries, this analysis is actually quite cheap: a 1.8% increase in compile time (with unchanged libraries), according to nofib (is that the best way to measure it?).

Is it worth improving that at the expense of more code in GHC? E.g. replace the simplifier pass after CallArity by a custom one that just does eta-expansion, and only gently simplifies the expression therein? Or should I just leave it like this?

comment:18 Changed 6 years ago by nomeata

Another measurement of compile-time changes, this time running make in a clean, configured GHC tree: the time goes from 31m34s to 32m07s, an increase of 1.75%, confirming the numbers measured with nofib.

comment:19 Changed 6 years ago by nomeata

Resolution: fixed
Status: new → closed

Heh, forgot to reference the ticket in the push. This has now been implemented in b63facef165b957183b65604ef99b2b8574747a5/base and requires the call arity analysis (cdceadf365335fdee5ffa8e6364f7fb00bc4b0f7/ghc).

This should not stop us from using superior approaches like Takano’s when they are ready.

comment:20 Changed 6 years ago by nomeata

JFTR: I’m currently working on a presentation and short paper on this, and did some measurements, and found that while making foldl a good consumer has a nice effect on allocations (-4.1%), the overall effect on runtime is negligible.

Effect of Call Arity (with the original foldl):

                         min    mean     max
      Allocations      -1.3%   -0.1%    0.0%
      Runtime         -10.8%   -0.9%   +2.0%

Effect of Call Arity + defining foldl with foldr:

                         min    mean     max
      Allocations     -74.5%   -4.1%    0.0%
      Runtime         -10.4%   -0.9%   +1.7%

(Wishful thinking: maybe there are real-world programs that would benefit more noticeably, but they are just not included in nofib.)

comment:21 Changed 6 years ago by nomeata

Updated numbers, now running nofib in slow mode; the results look slightly better. Guess that is what benchmarking is about: tweaking the tests until the results are what we want them to be:

Effect of Call Arity (with the original foldl):

                        min    mean     max
      Allocations     -1.3%   -0.0%    0.0%
      Runtime         -4.0%   -0.0%   +4.9%

Effect of Call Arity + defining foldl with foldr:

                         min    mean     max
      Allocations     -79.0%   -5.2%    0.0%
      Runtime         -47.4%   -1.9%   +3.0%

comment:22 Changed 5 years ago by Joachim Breitner <mail@…>

In c271e32eac65ee95ba1aacc72ed1b24b58ef17ad/ghc:

Add GHC.Prim.oneShot

to allow the programmer to explicitly set the oneShot flag. This helps
with #7994 and will be used in left folds. Also see
https://ghc.haskell.org/trac/ghc/wiki/OneShot

This commit touches libraries/base/GHC/Event/Manager.hs (which used to
have a local definition of the name oneShot) to avoid a shadowing error.

Differential Revision: https://phabricator.haskell.org/D392

comment:23 Changed 5 years ago by Joachim Breitner <mail@…>

In 072259c78f77d6fe7c36755ebe0123e813c34457/ghc:

Use oneShot in the definition of foldl etc.

This increases the chance of good code after fusing a left fold. See
ticket #7994 and the new Note [Left folds via right fold]

Differential Revision: https://phabricator.haskell.org/D393
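
For reference, the resulting definition, paraphrased from GHC.List after this commit (ScopedTypeVariables is needed for the pattern signatures):

{-# LANGUAGE ScopedTypeVariables #-}
import GHC.Exts (oneShot)
import Prelude hiding (foldl)

foldl :: forall a b. (b -> a -> b) -> b -> [a] -> b
{-# INLINE foldl #-}
foldl k z0 xs =
  foldr (\(v :: a) (fn :: b -> b) -> oneShot (\(z :: b) -> fn (k z v)))
        (id :: b -> b) xs z0
-- oneShot promises that each continuation λz is applied at most once,
-- so after fusion the loop can be eta-expanded to take the accumulator
-- as a proper argument (see Note [Left folds via right fold]).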