Opened 6 years ago
Last modified 4 years ago
#8313 new task
Poor performance of higher-order functions with unboxing
Reported by: | dolio | Owned by: | |
---|---|---|---|
Priority: | low | Milestone: | ⊥ |
Component: | Test Suite | Version: | 7.6.3 |
Keywords: | slow unboxed higher order | Cc: | |
Operating System: | Unknown/Multiple | Architecture: | Unknown/Multiple |
Type of failure: | Runtime performance bug | Test Case: | |
Blocked By: | #6084 | Blocking: | |
Related Tickets: | Differential Rev(s): | ||
Wiki Page: |
Description
I was testing out some code to see how GHC handled some unboxed higher-order functions, and was suprised by the results I found. Here is some sample code:
{-# LANGUAGE BangPatterns #-} {-# LANGUAGE LambdaCase #-} {-# LANGUAGE MagicHash #-} import GHC.Exts import System.Environment rel# :: Int# -> Int# -> Int# -> Bool rel# i# j# k# = (i# +# j# +# k#) ># 100000000# rel :: Int -> Int -> Int -> Bool rel (I# i#) (I# j#) (I# k#) = rel# i# j# k# manual :: (Int# -> Int# -> Int# -> Bool) -> (Int, Int, Int) manual r# = go 0# 0# 0# where go i# j# k# | r# i# j# k# = (I# i#, I# j#, I# k#) | otherwise = go j# k# (i# +# 1#) {-# NOINLINE manual #-} auto :: (Int -> Int -> Int -> Bool) -> (Int, Int, Int) auto r = go 0 0 0 where go !i !j !k | r i j k = (i, j, k) | otherwise = go j k (i+1) {-# NOINLINE auto #-} main = getArgs >>= \case "manual" : _ -> print $ manual rel# -- This case is significantly slower. "auto" : _ -> print $ auto rel -- Why?
A loop that has to box its loop parameters to call a predicate turns out to be significantly faster than one that uses a predicate that takes unboxed values directly.
The answer turns out to be (I believe) in ghc/utils/genapply/GenApply.hs. applyTypes has an entry [P,P,P], but only [N]. This means that the manual loop has to use a slower calling convention than the boxed loop.
I'm not sure whether this should be 'fixed,' as my encounter with it was experimental in nature, and I don't have a real use case. The comment on applyTypes says its cases cover 99% of uses, and mine is not a real use. This ticket may serve as documentation at least, though.
Change History (6)
comment:1 Changed 6 years ago by
comment:2 Changed 6 years ago by
Blocked By: | 6084 added |
---|
comment:3 Changed 6 years ago by
The optimisation from #6084 is now committed.
Todo: turn this program into a performance test, so that we can tell if the optimisation stops firing for any reason.
comment:4 Changed 6 years ago by
Checking if this is really fixed, but here, manual
is still slower than auto
, so it does not seem to be fixed (although it might have been even slower before). Also, manual
allocates much more – is that the symptom of this problem, or is it something else?
I had slightly change the test due to [f6e2398adb63f5c35544333268df9c8837fd2581/base] to
{-# LANGUAGE BangPatterns #-} {-# LANGUAGE LambdaCase #-} {-# LANGUAGE MagicHash #-} import GHC.Exts import System.Environment rel# :: Int# -> Int# -> Int# -> Int# rel# i# j# k# = (i# +# j# +# k#) ># 100000000# rel :: Int -> Int -> Int -> Bool rel (I# i#) (I# j#) (I# k#) = tagToEnum# (rel# i# j# k#) manual :: (Int# -> Int# -> Int# -> Int#) -> (Int, Int, Int) manual r# = go 0# 0# 0# where go i# j# k# | tagToEnum# (r# i# j# k#) = (I# i#, I# j#, I# k#) | otherwise = go j# k# (i# +# 1#) {-# NOINLINE manual #-} auto :: (Int -> Int -> Int -> Bool) -> (Int, Int, Int) auto r = go 0 0 0 where go !i !j !k | r i j k = (i, j, k) | otherwise = go j k (i+1) {-# NOINLINE auto #-} main = getArgs >>= \case "manual" : _ -> print $ manual rel# -- This case is significantly slower. "auto" : _ -> print $ auto rel -- Why?
and I get these numbers:
$ ./T8313 manual +RTS -t (33333333,33333334,33333334) <<ghc: 7200055256 bytes, 13828 GCs, 36364/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 3.03 MUT (3.05 elapsed), 0.03 GC (0.03 elapsed) :ghc>> $ ./T8313 auto +RTS -t (33333333,33333334,33333334) <<ghc: 4800054800 bytes, 9192 GCs, 36364/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 1.43 MUT (1.43 elapsed), 0.02 GC (0.02 elapsed) :ghc>>
comment:5 Changed 5 years ago by
After compiling the program from comment:4 with -O2
, manual
is now 4x faster.
$ ghc-7.8.3 -O2 T8313.hs $ ./T8313 manual +RTS -t (33333333,33333334,33333334) <<ghc: 54888 bytes, 1 GCs, 44312/44312 avg/max bytes residency (1 samples), 1M in use, 0.00 INIT (0.00 elapsed), 0.65 MUT (0.67 elapsed), 0.00 GC (0.00 elapsed) :ghc>> $ ./T8313 auto +RTS -t (33333333,33333334,33333334) <<ghc: 4800054576 bytes, 9192 GCs, 36364/44312 avg/max bytes residency (2 samples), 1M in use, 0.00 INIT (0.00 elapsed), 2.21 MUT (2.21 elapsed), 0.01 GC (0.06 elapsed) :ghc>>
Same results with ghc-7.9.20141115
. Still needs that performance test.
Apply the patch from #6084 and see if that helps. If it's a win, let's commit it.