Opened 6 years ago

Last modified 4 years ago

#8313 new task

Poor performance of higher-order functions with unboxing

Reported by: dolio Owned by:
Priority: low Milestone:
Component: Test Suite Version: 7.6.3
Keywords: slow unboxed higher order Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: #6084 Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I was testing out some code to see how GHC handled some unboxed higher-order functions, and was suprised by the results I found. Here is some sample code:

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE MagicHash #-}

import GHC.Exts
import System.Environment

rel# :: Int# -> Int# -> Int# -> Bool
rel# i# j# k# = (i# +# j# +# k#) ># 100000000#

rel :: Int -> Int -> Int -> Bool
rel (I# i#) (I# j#) (I# k#) = rel# i# j# k#

manual :: (Int# -> Int# -> Int# -> Bool) -> (Int, Int, Int)
manual r# = go 0# 0# 0#
 where
 go i# j# k# | r# i# j# k# = (I# i#, I# j#, I# k#)
             | otherwise   = go j# k# (i# +# 1#)
{-# NOINLINE manual #-}

auto :: (Int -> Int -> Int -> Bool) -> (Int, Int, Int)
auto r = go 0 0 0
 where
 go !i !j !k | r i j k   = (i, j, k)
             | otherwise = go j k (i+1)
{-# NOINLINE auto #-}

main = getArgs >>= \case
  "manual" : _ -> print $ manual rel# -- This case is significantly slower.
  "auto"   : _ -> print $ auto rel    -- Why?

A loop that has to box its loop parameters to call a predicate turns out to be significantly faster than one that uses a predicate that takes unboxed values directly.

The answer turns out to be (I believe) in ghc/utils/genapply/GenApply.hs. applyTypes has an entry [P,P,P], but only [N]. This means that the manual loop has to use a slower calling convention than the boxed loop.

I'm not sure whether this should be 'fixed,' as my encounter with it was experimental in nature, and I don't have a real use case. The comment on applyTypes says its cases cover 99% of uses, and mine is not a real use. This ticket may serve as documentation at least, though.

Change History (6)

comment:1 Changed 6 years ago by simonmar

Apply the patch from #6084 and see if that helps. If it's a win, let's commit it.

comment:2 Changed 6 years ago by simonmar

Blocked By: 6084 added

comment:3 Changed 6 years ago by simonmar

The optimisation from #6084 is now committed.

Todo: turn this program into a performance test, so that we can tell if the optimisation stops firing for any reason.

comment:4 Changed 6 years ago by nomeata

Checking if this is really fixed, but here, manual is still slower than auto, so it does not seem to be fixed (although it might have been even slower before). Also, manual allocates much more – is that the symptom of this problem, or is it something else?

I had slightly change the test due to [f6e2398adb63f5c35544333268df9c8837fd2581/base] to

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE MagicHash #-}

import GHC.Exts
import System.Environment

rel# :: Int# -> Int# -> Int# -> Int#
rel# i# j# k# = (i# +# j# +# k#) ># 100000000#

rel :: Int -> Int -> Int -> Bool
rel (I# i#) (I# j#) (I# k#) = tagToEnum# (rel# i# j# k#)

manual :: (Int# -> Int# -> Int# -> Int#) -> (Int, Int, Int)
manual r# = go 0# 0# 0#
 where
 go i# j# k# | tagToEnum# (r# i# j# k#) = (I# i#, I# j#, I# k#)
             | otherwise                = go j# k# (i# +# 1#)
{-# NOINLINE manual #-}

auto :: (Int -> Int -> Int -> Bool) -> (Int, Int, Int)
auto r = go 0 0 0
 where
 go !i !j !k | r i j k   = (i, j, k)
             | otherwise = go j k (i+1)
{-# NOINLINE auto #-}

main = getArgs >>= \case
  "manual" : _ -> print $ manual rel# -- This case is significantly slower.
  "auto"   : _ -> print $ auto rel    -- Why?

and I get these numbers:

$ ./T8313 manual +RTS -t
(33333333,33333334,33333334)
<<ghc: 7200055256 bytes, 13828 GCs, 36364/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 3.03 MUT (3.05 elapsed), 0.03 GC (0.03 elapsed) :ghc>>
$ ./T8313 auto +RTS -t
(33333333,33333334,33333334)
<<ghc: 4800054800 bytes, 9192 GCs, 36364/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.43 MUT (1.43 elapsed), 0.02 GC (0.02 elapsed) :ghc>>

comment:5 Changed 5 years ago by thomie

After compiling the program from comment:4 with -O2, manual is now 4x faster.

$ ghc-7.8.3 -O2 T8313.hs

$ ./T8313 manual +RTS -t
(33333333,33333334,33333334)
<<ghc: 54888 bytes, 1 GCs, 44312/44312 avg/max bytes residency (1 samples), 1M in use, 0.00 INIT (0.00 elapsed), 0.65 MUT (0.67 elapsed), 0.00 GC (0.00 elapsed) :ghc>>

$ ./T8313 auto +RTS -t
(33333333,33333334,33333334)
<<ghc: 4800054576 bytes, 9192 GCs, 36364/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 2.21 MUT (2.21 elapsed), 0.01 GC (0.06 elapsed) :ghc>>

Same results with ghc-7.9.20141115. Still needs that performance test.

comment:6 Changed 4 years ago by thomie

Component: CompilerTest Suite

Bug is fixed. Just needs a test.

Note: See TracTickets for help on using tickets.