Opened 3 years ago

Closed 21 months ago

#13001 closed bug (fixed)

EnumFromThenTo is is not a good producer

Reported by: George Owned by: akio
Priority: low Milestone: 8.4.1
Component: libraries/base Version: 8.0.1
Keywords: Cc: akio
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D2951
Wiki Page:


{-# OPTIONS_GHC -Wall #-}

module Foo where
testFromTo :: Int -> Int
testFromTo n = length ([0..(10^n)] :: [Int])

testFromThenTo :: Int -> Int
testFromThenTo n = length ([0,2..(10^n)] :: [Int])

$ ghci -fobject-code -O 
GHCi, version 8.0.1:  :? for help
Prelude> :load Foo
:load Foo
[1 of 1] Compiling Foo              ( Foo.hs, Foo.o )
Ok, modules loaded: Foo (Foo.o).
Prelude Foo> :set +s
:set +s
Prelude Foo> testFromTo 5
testFromTo 5
(0.02 secs, 97,992 bytes)
Prelude Foo> testFromThenTo 5
testFromThenTo 5
(0.00 secs, 5,694,424 bytes)
Prelude Foo> testFromThenTo 6
testFromThenTo 6
(0.02 secs, 56,095,288 bytes)
Prelude Foo> 

I set the Type to bug rather than feature request as the source code in seems to be trying to do list fusion:

-- efdInt and efdtInt deal with [a,b..] and [a,b..c].
-- The code is more complicated because of worries about Int overflow.

-- See Note [How the Enum rules work]
"efdtInt"       [~1] forall x1 x2 y.
                     efdtInt x1 x2 y = build (\ c n -> efdtIntFB c n x1 x2 y)
"efdtIntUpList" [1]  efdtIntFB (:) [] = efdtInt

Change History (14)

comment:1 Changed 3 years ago by akio

Cc: akio added

I have a fix to this problem locally, but I've never got around to submitting a patch. I believe the problem is that efdtIntFB is not marked INLINE [0].

comment:2 Changed 3 years ago by akio

I tested a a patch and indeed marking *FB functions INLINE[0] seems to help.

However this change increases binary sizes of some programs in nofib:

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
    gen_regexps          -0.3%      0.0%     0.000     0.000      0.0%
         puzzle          +0.8%      0.0%     0.089     0.090      0.0%
        reptile          +0.8%     -0.0%     0.008     0.008      0.0%
            Min          -0.3%     -0.0%     -7.3%     -7.1%      0.0%
            Max          +0.8%     +0.0%     +7.8%     +7.7%     +1.8%
 Geometric Mean          +0.0%     -0.0%     +0.2%     +0.2%     +0.0%

I have only looked at the code for puzzle so far. The increase in the code size seems to come from inlining efdtInt*FB into a derived Enum instance declaration.

I wonder if this change is acceptable.

comment:3 Changed 3 years ago by nomeata

I wonder if this change is acceptable.

I’d say yes. The mean changes by <0.1%, and we really want fusion to be as reliable as possible.

comment:4 Changed 3 years ago by simonpj

It's true that doing more inlining will generally improve things, but we could be a bit more forensic about this. Just sprinkling more inline pragmas, some of which may do no good, is a bit of a blunt instrument.

I checked the code for testFromThenTo, compiled with -O, and got

Foo.$wtestFromThenTo =
  \ (ww_s2vE :: GHC.Prim.Int#) ->
    case GHC.Prim.tagToEnum# @ Bool (GHC.Prim.<# ww_s2vE 0#) of {
      False ->
        case ww_s2vE of wild2_a2uk {
          __DEFAULT ->
            case GHC.Real.$wf1 10# wild2_a2uk of ww4_a2uE { __DEFAULT ->
              @ (Int -> Int)
              (GHC.List.lengthFB @ Int)
              (id @ Int)
          0# -> Foo.testFromThenTo1
      True -> GHC.Real.^2

Fusion has happened (there is a use of the foldr/build rule), but the call to edftIntUpFB remains. That's not necessarily bad. As you'll see, it's not a tiny function:

efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntUpFB c n x1 x2 y    -- Be careful about overflow!
 | isTrue# (y <# x2) = if isTrue# (y <# x1) then n else I# x1 `c` n
 | otherwise = -- Common case: x1 <= x2 <= y
               let !delta = x2 -# x1 -- >= 0
                   !y' = y -# delta  -- x1 <= y' <= y; hence y' is representable

                   -- Invariant: x <= y
                   -- Note that: z <= y' => z + delta won't overflow
                   -- so we are guaranteed not to overflow if/when we recurse
                   go_up x | isTrue# (x ># y') = I# x `c` n
                           | otherwise         = I# x `c` go_up (x +# delta)
               in I# x1 `c` go_up x2

BUT the bad thing is that it allocated loads of Int values! Why does it do that? Becuase the function passed to it (c in the above definition) takes an Int as its argument. So if efdtIntUpFB isn't inlined, it must allocate an Int box for every iteration. Bad!

But this is an internal function. Suppose we gave it this signature:

efdtIntUpFB :: (Int# -> r -> r) -> r -> Int# -> Int# -> Int# -> r
            --  ^^^ NB Int# not Int

Now it won't allocate! Can we make the rest of the pieces fit together now? Would you like to have a try?

I also looked at the call to lengthFB in the above optimised code. It's defined like this:

lengthFB :: x -> (Int -> Int) -> Int -> Int
lengthFB _ r = \ !a -> r (a + 1)

Uh-ho! More Int boxes! So I tried rewriting the length moving parts (in GHC.List) like this:

{-# INLINE xlength #-}
xlength                  :: [a] -> Int
xlength xs               = I# (xlenAcc xs 0#)

xlenAcc          :: [a] -> Int# -> Int#
xlenAcc []     n = n
xlenAcc (_:ys) n = xlenAcc ys (n +# 1#)

"xlength" [~1] forall xs . xlenAcc xs = foldr xlengthFB idXlength xs
"xlengthList" [1] foldr xlengthFB idXlength = xlenAcc

-- The lambda form turns out to be necessary to make this inline
-- when we need it to and give good performance.
{-# INLINE [0] xlengthFB #-}
xlengthFB :: x -> (Int# -> Int#) -> Int# -> Int#
xlengthFB _ r = \ a -> r (a +# 1#)

{-# INLINE [0] idXlength #-}
idXlength :: Int# -> Int#
idXlength x = x

That compiles fine. Even if it generates the same code as before, GHC will have to do less work to optimise it, so it's a win.

Would you like to try that change to length and see if it is an improvement?

Maybe you can do the same for efdtIntUpFB?

comment:5 Changed 3 years ago by akio

Hmm, you could eliminate allocation of Ints that way, but the code still has to allocate one function closure for each element in the list, right? It seems to me that inlining efdtIntUpFB is the only way to eliminate this allocation, if the final list consumer is length or foldl'.

comment:6 Changed 3 years ago by simonpj

I think perhaps you mean that in the call

I# x `c` go_up (x +# delta)

since GHC doesn't know c is strict, it'll allocate a thunk for the second argument. Quite true. And that does mean it'd be good to inline efdtIntUpFB, I agree.

But still, fewer Int boxes is better regardless.

comment:7 Changed 3 years ago by akio

since GHC doesn't know c is strict, it'll allocate a thunk for the second argument.

This is true. I was actually referring to another source of allocation: since the call to c is under-saturated, evaluating this call involves allocating a closure for the partial-application of c. This is also the case when the final consumer is foldl' instead of length.

I think this is a general argument for marking *all* FB functions INLINE [0]. If p = build pFB is advertised as a good producer, a user would expect foldl' f z p to allocate no intermediate data structure. However, if pFB is not inlined, this expression will necessarily allocate a few closures for each eliminated cons cell.

comment:8 Changed 3 years ago by simonpj

I think that's a reasonable point. If you follow it up, could you include a Note to explain the reasoning? And refer to the Note from the relevant INLINE pragmas?

comment:9 Changed 3 years ago by akio

Component: Compilerlibraries/base
Differential Rev(s): Phab:D2951
Operating System: MacOS XUnknown/Multiple
Owner: set to akio

comment:10 Changed 3 years ago by Ben Gamari <ben@…>

In 09bce7ac/ghc:

Mark *FB functions INLINE[0] (Fixes #13001)

When fusion rules successfully fire, we are left with calls to
*FB functions. They are higher-order functions, and therefore they
often benefit from inlining. This is particularly important when
then final consumer is a strict fold (foldl', length, etc.), because
not inlining these functions means allocating a function closure
for each element in the list, which often is more costly than what
fusion eliminates.

Nofib shows a slight increase in the binary size:

       Program           Size    Allocs   Runtime   Elapsed  TotalMem
   gen_regexps          -0.3%      0.0%     0.000     0.000      0.0%
        puzzle          +0.8%      0.0%     0.089     0.090      0.0%
       reptile          +0.8%     -0.0%     0.008     0.008      0.0%
           Min          -0.3%     -0.0%     -7.3%     -7.1%      0.0%
           Max          +0.8%     +0.0%     +7.8%     +7.7%     +1.8%
Geometric Mean          +0.0%     -0.0%     +0.2%     +0.2%     +0.0%

Reviewers: simonpj, austin, hvr, bgamari

Reviewed By: simonpj

Subscribers: simonpj, thomie

Differential Revision:

GHC Trac Issues: #13001

comment:11 Changed 3 years ago by bgamari

Milestone: 8.2.1

comment:12 Changed 2 years ago by bgamari


Given that 8.2.1-rc1 is imminent, I'm bumping these off to the 8.4

comment:13 Changed 21 months ago by George

although this is low priority, it looks like it is done. Is there any reason not to put it into 8.4.1?

comment:14 Changed 21 months ago by bgamari

Resolution: fixed
Status: newclosed

Indeed it's already in 8.4.1; it looks like the ticket was simply never closed. Thanks George!

Note: See TracTickets for help on using tickets.