Opened 3 years ago

Closed 3 years ago

#12354 closed bug (fixed)

Word foldl' isn't optimized as well as Int foldl'

Reported by: kjslag Owned by:
Priority: normal Milestone: 8.0.2
Component: Compiler Version: 8.0.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D2376
Wiki Page:

Description

import Data.List

test :: Int -> Int
test n = foldl' (+) 0 [1..n]

main :: IO ()
main = do
  print $ test $ 10^8

GHC optimizes the above code to the point that the garbage collector doesn't even have to do anything:

$ ghc -rtsopts -O2 testInt && ./testInt +RTS -s
[1 of 1] Compiling Main             ( testInt.hs, testInt.o )
Linking testInt ...
5000000050000000
          51,752 bytes allocated in the heap
           3,480 bytes copied during GC
          44,384 bytes maximum residency (1 sample(s))
          17,056 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.101s  (  0.101s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.103s  (  0.102s elapsed)

  %GC     time       0.1%  (0.1% elapsed)

  Alloc rate    511,162 bytes per MUT second

  Productivity  99.8% of total user, 100.9% of total elapsed

However, if I change the type of test to test :: Word -> Word, then a lot of garbage is produced and the code runs 40x slower:

ghc -rtsopts -O2 testWord && ./testWord +RTS -s
[1 of 1] Compiling Main             ( testWord.hs, testWord.o )
Linking testWord ...
5000000050000000
  11,200,051,784 bytes allocated in the heap
       1,055,520 bytes copied during GC
          44,384 bytes maximum residency (2 sample(s))
          21,152 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     21700 colls,     0 par    0.077s   0.073s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    4.551s  (  4.556s elapsed)
  GC      time    0.077s  (  0.073s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    4.630s  (  4.630s elapsed)

  %GC     time       1.7%  (1.6% elapsed)

  Alloc rate    2,460,957,186 bytes per MUT second

  Productivity  98.3% of total user, 98.3% of total elapsed

I expected the performance to be nearly identical. I'm using GHC version 8.0.1 on x86_64 Arch Linux.

I asked about this on stackoverflow, and the issue appears to be related to rewrite rules: http://stackoverflow.com/a/38113639/6531137

Change History (11)

comment:1 Changed 3 years ago by nomeata

As pointed out by dfeuer on stackexchange, the Enum instance for Int is better than the one for Word:

Int:

instance  Enum Int  where
    {-# INLINE enumFromTo #-}
    enumFromTo (I# x) (I# y) = eftInt x y

{-# RULES
"eftInt"        [~1] forall x y. eftInt x y = build (\ c n -> eftIntFB c n x y)
"eftIntList"    [1] eftIntFB  (:) [] = eftInt
 #-}
{- Note [How the Enum rules work]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Phase 2: eftInt ---> build . eftIntFB
* Phase 1: inline build; eftIntFB (:) --> eftInt
* Phase 0: optionally inline eftInt
-}

{-# NOINLINE [1] eftInt #-}
eftInt :: Int# -> Int# -> [Int]
-- [x1..x2]
eftInt x0 y | isTrue# (x0 ># y) = []
            | otherwise         = go x0
               where
                 go x = I# x : if isTrue# (x ==# y)
                               then []
                               else go (x +# 1#)

{-# INLINE [0] eftIntFB #-}
eftIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> r
eftIntFB c n x0 y | isTrue# (x0 ># y) = n
                  | otherwise         = go x0
                 where
                   go x = I# x `c` if isTrue# (x ==# y)
                                   then n
                                   else go (x +# 1#)
                        -- Watch out for y=maxBound; hence ==, not >
        -- Be very careful not to have more than one "c"
        -- so that when eftInfFB is inlined we can inline
        -- whatever is bound to "c"

Now Word actually uses the implementation for Integer

    enumFromTo n1 n2       = map integerToWordX [wordToIntegerX n1 .. wordToIntegerX n2]

which uses

instance  Enum Integer  where
    enumFromTo x lim       = enumDeltaToInteger x 1     lim

Now enumDeltaToInteger has rewrite rules set up, but it turns out that Word’s enumFromTo is never inlined, so this setup has no chance of fusing here.

Inlining this function into my test code causes fold/build to fire, cutting down allocation severely, but the conversion from and to Integer remains.

One could of course write similar hand-written code such as for Int also for Word. But what about Word8, Word16, Word32 and Word64 then? Where does it stop?

comment:2 Changed 3 years ago by simonpj

it turns out that Word’s enumFromTo is never inlined, so this setup has no chance of fusing here

So the short term solution is to add an INLINE for Word's enumFromTo?

Where does it stop?

Well you might hope that you could write eftInt, eftIntFB once, at type a and SPECIALISE them. That would save copying them manually.

Writing rewrite rules that also specialise is something I have not thought much about though. I suppose that, conceivably, all the fusion could happen generically (i.e. on the class-overloaded functions), before we specialise to the particular type. To achieve that, we'd have to delay the class-op selection e.g. eunmFrom dEnumInt --> enumFrom_Int. But that ought to be possible.

A good project here.

comment:3 Changed 3 years ago by bgamari

I think it's reasonable to expect that Word and Int optimize roughly as well. I might even go so far to say that the same should be expected of the fixed width Word and Int variants.

Last edited 3 years ago by bgamari (previous) (diff)

comment:4 Changed 3 years ago by bgamari

Well you might hope that you could write eftInt, eftIntFB once, at type a and SPECIALISE them. That would save copying them manually.

Indeed, unfortunately this would require that eftInt be runtime-representationally polymorphic since Word# :: RuntimeRep WordRep and Int# :: RuntimeRep IntRep. This is a rather unfortunate limitation since we would like to ensure that the polymorphism is resolved at compile-time via inlining.

Last edited 3 years ago by bgamari (previous) (diff)

comment:5 Changed 3 years ago by simonpj

I think the generic versions could work with boxed values and we could fuse there. After fusion the unboxed versions could kick in.

Short term: just replicate Int stuff for Word. Medium term: let's hope someone takes up the challenge.

comment:6 Changed 3 years ago by bgamari

Differential Rev(s): Phab:D2376
Status: newpatch

Here is a quick attempt. Still need to test.

Last edited 3 years ago by bgamari (previous) (diff)

comment:7 Changed 3 years ago by Ben Gamari <ben@…>

In 0bd7c4b/ghc:

Enum: Ensure that operations on Word fuse

Test Plan: Validate, verify fusion

Reviewers: austin, hvr

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2376

GHC Trac Issues: #12354

comment:8 Changed 3 years ago by bgamari

Merged to master. Seems like this would be an easy thing to merge for 8.0.2.

comment:9 Changed 3 years ago by bgamari

Milestone: 8.0.2
Status: patchmerge

comment:10 Changed 3 years ago by Ben Gamari <ben@…>

In f26e58b/ghc:

Enum: Ensure that operations on Word fuse

Test Plan: Validate, verify fusion

Reviewers: austin, hvr

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2376

GHC Trac Issues: #12354

(cherry picked from commit 0bd7c4b4240a27d4e26290741394b31b48db7671)

comment:11 Changed 3 years ago by bgamari

Resolution: fixed
Status: mergeclosed
Note: See TracTickets for help on using tickets.