#5237 closed bug (fixed)
Inefficient code generated for x^2
| Reported by: | scpmw | Owned by: | daniel.is.fischer |
|---|---|---|---|
| Priority: | normal | Milestone: | 7.4.1 |
| Component: | libraries/base | Version: | 7.0.3 |
| Keywords: | | Cc: | |
| Operating System: | Linux | Architecture: | x86_64 (amd64) |
| Type of failure: | Runtime performance bug | Test Case: | |
| Blocked By: | | Blocking: | |
| Related Tickets: | | Differential Rev(s): | |
| Wiki Page: | | | |
Description
Okay, this one is quite surprising to me. Consider the following trivial Haskell program:
```haskell
main = print (2.0 ^ 2)
```
This innocent example actually ends up generating multiple pages of Core! To see what's going on we just have to look at a few type signatures:
```haskell
$wg1_r11f :: Double# -> Type.Integer -> Double# -> Double#
$wf1      :: Double# -> Type.Integer -> Double#
```
The function `$wf1` is what is called to perform the calculation. The first thing to note is that it actually compiles `^2` into a loop (that's what `$wg1_r11f` is for). Second, to make matters worse, it ends up using Integer of all things as the index type (defaulting at work, probably).
This obviously has drastic performance implications. A more complicated example program actually got about 4 times faster just by replacing all instances of `x^2` with `x*x`.
I guess this probably is not easy to fix in the general case (speculative loop unrolling?). But couldn't we have, say, a rule that catches the worst cases?
Attachments (2)
Change History (25)
Changed 8 years ago by
| Attachment: | 0001-Fix-5237.patch added |
|---|---|
comment:1 Changed 8 years ago by
comment:2 Changed 8 years ago by
| Milestone: | → 7.4.1 |
|---|---|
| Status: | new → patch |
comment:3 Changed 8 years ago by
| Component: | Compiler → libraries/base |
|---|---|
comment:4 Changed 8 years ago by
I've tested the rules in the last days, with mixed success. For Int and Integer, where the rules are in the same place as `(^)`, they fire okay, unless specialisation gets in the way (so far, that affects `Integer -> Integer -> Integer`, `Integer -> Int -> Integer` and `Int -> Int -> Int`).
To get the rules for `Word` exponents to fire, I had to specify the exponent as `W# n##`, so those rules seem pretty useless.
For the types we have specialisations for, specialising is done before rule-rewriting, apparently. I think it might be good to have a rule-rewriting pass before any other optimisations are tried, based on the assumption that rules generally are written to give large benefits, while general optimisations often give only a small benefit. If it's easy enough to add a rule-scanning pass before other simplifications, it's worth thinking about it.
comment:5 Changed 8 years ago by
The usual approach is to say `{-# NOINLINE [1] #-}` on functions that you want to match on the LHS of rules. Did you try that? Or can you be more specific about what isn't firing?
Thanks for looking at this.
Simon
comment:6 Changed 8 years ago by
Not sure I understand correctly. What we want is that occurrences of `expr ^ 2` or `expr ^ (2 :: Type)` in source code are rewritten to `expr * expr` (similarly for exponents 3 and 4).
With the rules, that works fine if the exponent isn't given an explicit type or the type is `Int` or `Integer` - unless the type of `expr` is `Integer` (or `Int` if the exponent is specified as `Int`). For those types, `(^)` is specialised and replaced in the Core by the specialised version; then `GHC.Real.^_^2 expr 2` no longer matches the rule^{1}.
Making `(^)` `{-# NOINLINE [1] #-}` instead of `{-# INLINABLE #-}` would cost a lot in other places, so that's not a good option; out of sheer morbid curiosity I tried it anyway, and it didn't help. But if `INLINABLE` supports phase control (it isn't mentioned in the users guide; the syntax is accepted, however), perhaps `(^)` should become `{-# INLINABLE [1] #-}`?
With `Word` exponents, the problem seems to be that the exponent is moved to a top-level binding and the rule-matcher sees `expr ^ lvl_fooBar` instead of `expr ^ (3 :: Word)`.
But I don't think specifying the exponent as `Word` is common; mostly the exponent will not be given an explicit type and be defaulted to `Integer`, or be specified as `Int` for better performance. So I'd just leave out the rules for `Word`.
^{1} Perhaps the compiler just considers the specialisation more specific than the rules and hence that wins? Compiling with extra rules for the specialised types to check.
comment:7 Changed 8 years ago by
Indeed, it seems the specialisations were considered a better match at those types. Extra rules for the specialised types fire.
comment:8 Changed 8 years ago by
I'm still unclear what the problem is. Could you boil it down to a standalone case that illustrates the problem you are trying to solve? The numeric system is complicated and my brain is small. Thanks.
Simon
comment:9 Changed 8 years ago by
The problem is that if you write `expr ^ 2` in the source, that's exactly what you get. But it's not what you want.
Consider the programme

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)

fun :: Double -> Double
fun x = go 0 0.5
  where
    go !acc z
      | x < z     = acc
      | otherwise = go (acc + z^2) (z+0.25)

main :: IO ()
main = getArgs >>= mapM_ (print . fun . read)
```
Compiling it with ghc-7.2.1, I get nearly 32K of Core and an executable delivering
```
dafis@schwartz:~/Haskell/BeginnersTesting> ./squareTest721 +RTS -s -RTS 1.2e7
2.3040000720013998e21
   2,304,132,792 bytes allocated in the heap
         186,544 bytes copied during GC
          28,992 bytes maximum residency (1 sample(s))
          26,288 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4411 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0004s    0.0004s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.53s  (  2.53s elapsed)
  GC      time    0.02s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.54s  (  2.54s elapsed)

  %GC     time       0.6%  (0.6% elapsed)

  Alloc rate    910,944,778 bytes per MUT second

  Productivity  99.4% of total user, 99.4% of total elapsed
```
Compiling it with ghc-7.3.20110926 with rewrite rules, I get 6.6K of Core and an executable delivering
```
dafis@schwartz:~/Haskell/BeginnersTesting> ./squareTest73R +RTS -s -RTS 1.2e7
2.3040000720013998e21
         132,184 bytes allocated in the heap
           3,304 bytes copied during GC
          44,200 bytes maximum residency (1 sample(s))
          17,240 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.07s  (  0.07s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.07s  (  0.07s elapsed)

  %GC     time       0.1%  (0.1% elapsed)

  Alloc rate    1,767,189 bytes per MUT second

  Productivity  99.5% of total user, 103.1% of total elapsed
```
Since `(^)` is `{-# INLINABLE #-}`, GHC will most of the time create a type-specialised version of the exponentiation-by-repeated-squaring algorithm (a wrapper to test that the exponent isn't negative and to unpack the arguments if applicable, and two loops for the work). That function is then called with an exponent of 2. This creates a lot of code, much of which is never used (one loop runs twice, the other not at all). And it is an out-of-line function call, which can cost a lot of time in a loop.
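For reference, here is a simplified sketch of that repeated-squaring scheme: a wrapper checking the exponent, plus two worker loops. This is a paraphrase of the algorithm as described above, not the actual GHC.Real source, and `power` is an illustrative name:

```haskell
-- Sketch of exponentiation by repeated squaring (illustrative only,
-- not the actual GHC.Real definition of (^)).
power :: Num a => a -> Integer -> a
power x0 y0
  | y0 < 0    = error "Negative exponent"
  | y0 == 0   = 1
  | otherwise = f x0 y0
  where
    -- first loop: square the base while the exponent is even
    f x y
      | even y    = f (x * x) (y `quot` 2)
      | y == 1    = x
      | otherwise = g (x * x) (y `quot` 2) x
    -- second loop: like f, but carries an accumulator of odd factors
    g x y z
      | even y    = g (x * x) (y `quot` 2) z
      | y == 1    = x * z
      | otherwise = g (x * x) (y `quot` 2) (x * z)

main :: IO ()
main = print (power (2.0 :: Double) 2)  -- prints 4.0
```

Called with a constant exponent of 2, `f` takes a single step and `g` is never entered, yet all of that code gets generated and invoked through an out-of-line call.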
By having GHC rewrite `expr ^ 2` into `expr * expr`, a) no code for `(^)` has to be generated (or linked), and b) for many types you get an inline multiplication instead of a function call.
Theoretically, a compiler could in such a situation, when one argument is statically known at compile time, try to evaluate the function a few steps to see what gives (the speculative loop unrolling mentioned by scpmw in the ticket). In this case, it'd find
```
2 < 0 ?  No  => calculate f x 2
even 2 ? Yes => calculate f (x*x) 1
even 1 ? No  => 1 == 1 ? Yes => result is (x*x)
```
I would expect it to be tremendously hard to implement such a speculative evaluation in a way that would often yield useful results and not unduly increase compile times, though.
So while no true magic is available, let's add a few rewrite rules to catch the cases where using `(^)` hurts most:

```haskell
{-# RULES
"^2/Integer" forall x. x ^ (2 :: Integer) = x*x
"^3/Integer" forall x. x ^ (3 :: Integer) = (x*x)*x
"^4/Integer" forall x. x ^ (4 :: Integer) = let y = x*x in y*y
  #-}
```
Fine. Now scpmw's `2.0 ^ 2` gets rewritten to `2.0 * 2.0` (and that is then evaluated to `4.0`); generally, occurrences of `expr ^ 2` get rewritten as desired.
Unless the type of `expr` is `Integer`, because in `GHC.Real`, `(^)` is specialised for the type `Integer -> Integer -> Integer`. Then GHC has two rules to choose from: `^2/Integer`, with matching exponent (and exponent type), and the specialisation, with matching argument and exponent type.
Specialisation wins in this case. Since `Integer` multiplication isn't a primop but a function call, the effect isn't nearly as dramatic, but for the analogue of the above programme, I find a reduction of allocation and running time by a factor of roughly 2.3.
The more specific rewrite rule, with the types of base and exponent both specified as `Integer`, fires though, since it's more specific than the specialisation.
Now, if you're compiling with `-Wall`, GHC will tell you

```
Warning: Defaulting the following constraint(s) to type `Integer'
           (Integral b0) arising from a use of `^' at squareTest.hs:11:32
           (Num b0) arising from the literal `2' at squareTest.hs:11:33
```
and you might make it `^(2 :: Int)` to have warning-free code, and besides, `Int` is more efficient than `Integer`, isn't it? At least I have done that.
But now the above rules don't match and you get the expensive `(^)` again. So add rules for `Int` exponents. Since there are also specialisations of `(^)` for the types `Integer -> Int -> Integer` and `Int -> Int -> Int`, we also need extra rules for these types to win over specialisation.
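Spelled out, such type-specific rules would look something like the sketch below, where the base type is pinned down so the rule is more specific than the specialisation. The rule names and module wrapper are illustrative, not necessarily what the patch uses:

```haskell
-- Illustrative rules whose base type is fixed, making them more
-- specific than the (^) specialisations in GHC.Real.
module SquareRules where

{-# RULES
"^2/Int^Int"         forall (x :: Int).     x ^ (2 :: Int)     = x*x
"^2/Integer^Int"     forall (x :: Integer). x ^ (2 :: Int)     = x*x
"^2/Integer^Integer" forall (x :: Integer). x ^ (2 :: Integer) = x*x
  #-}
```

Since each rule's left- and right-hand sides are semantically equal, the programme's results are unchanged whether or not a rule fires.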
Finally, what if the defaulting is resolved by making it `^(2 :: Word)`? After all, the exponent mustn't be negative, so why use a type with negative values?
Therefore the first patch included rules for `Word` exponents. Since `Word` isn't available in `GHC.Real`, I added them to `GHC.Word`, but when testing, the only way to get them to fire was giving the exponent in the form `W# n##`, so I left them out of the second patch.
Afterthought: What about `(^^)`?
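By analogy, rules for `(^^)` would presumably look like the sketch below. This is purely hypothetical: such rules are not part of either patch, and whether they would run into the same specialisation problems would need testing. The let-bindings share the base expression, as per Ian's sidenote further down:

```haskell
-- Hypothetical rules for (^^), analogous to the (^) rules above.
-- For positive exponents, x ^^ n equals repeated multiplication,
-- so the rules preserve semantics.
module PowRules where

{-# RULES
"^^2/Integer" forall x. x ^^ (2 :: Integer) = let v = x in v*v
"^^3/Integer" forall x. x ^^ (3 :: Integer) = let v = x in v*(v*v)
  #-}
```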
comment:10 Changed 8 years ago by
Status: | patch → new |
---|
So if I understand correctly, the problem is that with `q.hs`:

```haskell
{-# RULES "^^^2/Integer" forall x. x ^^^ (2 :: Integer) = x * x #-}
{-# SPECIALISE (^^^) :: Integer -> Integer -> Integer #-}
{-# NOINLINE (^^^) #-}
(^^^) :: (Num a, Integral b) => a -> b -> a
x ^^^ y = 1

v :: Integer
v = 8 ^^^ 2

main :: IO ()
main = print v
```
we get the specialisation matching and not the rule, and thus the result is 1 rather than 64:
```
$ ghc -O -ddump-rule-firings q.hs
[1 of 1] Compiling Main             ( q.hs, q.o )
Rule fired: Class op fromInteger
Rule fired: SPEC Main.^^^
Rule fired: Class op show
Linking q ...
$ ./q
1
```
If we comment out the specialisation then we get the desired:
```
$ ghc -O -ddump-rule-firings q.hs
[1 of 1] Compiling Main             ( q.hs, q.o )
Rule fired: ^^^2/Integer
Rule fired: Class op *
Rule fired: timesInteger
Rule fired: Class op show
Linking q ...
$ ./q
64
```
(As a sidenote, I think the rule should be

```haskell
{-# RULES "^^^2/Integer" forall x. x ^^^ (2 :: Integer) = let v = x in v * v #-}
```

in case `x` is a large expression.)
comment:11 Changed 8 years ago by
Yes. And with

```haskell
{-# RULES "^^^2/Both" forall (x :: Integer). x ^^^ (2 :: Integer) = x #-}
```

alongside the two others, that fires.
Re sidenote: yes, it should be, for safety. That occurred to me today too. I've done a few tests, and so far GHC was always clever enough to share the expression, but it's better not to rely on that.
With regard to the rule/specialisation competition, what if the `{-# SPECIALISE #-}` pragma is removed? Due to the `{-# INLINABLE #-}`, with optimisations, specialisations will (probably) be created at the call sites, won't they? Then we wouldn't need the extra rules. However, I think the specialisations would be created in each module and not once per programme/package, and that would mean code bloat. So which would be worse, more rules or the code bloat?
Changed 8 years ago by
| Attachment: | 0001-Add-rules-for-powers-with-small-exponents-fixes-5237.patch added |
|---|---|
comment:13 Changed 8 years ago by
Is adding the more specific rules the right thing to do?
Or would it make more sense for the optimiser to always choose rules over specialisations?
comment:14 Changed 8 years ago by
I think it would be better if the optimiser chose rules over specialisations. If that's easily enough done, great. Otherwise adding the more specific rules is the right thing.
comment:15 Changed 8 years ago by
Actually, I realise that we already have a good mechanism for prioritisation: namely the simplifier's "phases".
- If you want a RULE to match on a function, you had better make sure the function doesn't inline too early. Typically, use a NOINLINE pragma:

  ```haskell
  {-# RULES "f" forall x. f [x] = blah #-}
  {-# NOINLINE [1] f #-}
  f x = ...blah2...
  ```
- The simplifier works in phases, currently phase 2, then 1, then 0. The RULE is always active. The NOINLINE stops f inlining until phase 1. So the rule gets a chance to fire in the earlier phase 2.
- You can attach phases to rules too, which say in which phases they are active. Ditto SPECIALISE pragmas
- GHC automatically generates specialisations for overloaded functions. The consequent rules are active precisely when the original function's INLINE pragma (if any) is active. So if you say `NOINLINE [1] f`, then any automatically generated specialisations of f will be active only in phase 1 and later.
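A toy example of that interplay, with a hypothetical function `g`: the rule's right-hand side equals `g [x]` semantically, so the result is the same whether or not it fires; under `-O`, only the generated code differs.

```haskell
module Main (main) where

-- The rule is active in every phase; NOINLINE [1] keeps g from
-- inlining before phase 1, giving the rule all of phase 2 to fire.
{-# RULES "g/singleton" forall x. g [x] = x + 1 #-}
{-# NOINLINE [1] g #-}
g :: [Int] -> Int
g xs = sum xs + 1  -- on a singleton list this is x + 1, so the rule is sound

main :: IO ()
main = print (g [41])  -- prints 42
```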
Does that give you enough to go on?
Simon
comment:16 Changed 8 years ago by
I'm afraid not. Take Ian's programme from above: we have a `{-# RULES #-}`, a `{-# SPECIALISE #-}` and a `{-# NOINLINE #-}` pragma. Both rule and specialisation match. We want the rule to fire, but it's always `SPEC` that fires. If adding phase control to specialise pragmas works without making it `{-# SPECIALISE [NO]INLINE [k] #-}`, I haven't found out how. Every variation of that I tried, with or without `{-# NOINLINE [k] #-}`, still had the specialisation firing.
The only way to get the rule to fire was to remove the `SPECIALISE` pragma. But if we remove that from `GHC.Real`, we get code bloat, because a specialisation will be generated in every module using `(^)`.
Interestingly, adding a phase to the `RULES` pragma prevented that rule from firing even without the `SPECIALISE`.
comment:17 Changed 8 years ago by
I finally get it. What we want to say is

```haskell
{-# RULES "^^^2/Integer" forall x. x ^^^ (2 :: Integer) = x * x #-}
{-# SPECIALISE [1] (^^^) :: Integer -> Integer -> Integer #-}
```

Note the `[1]` for SPECIALISE, saying "only run the specialise rule in phase 1 and later". So the RULE gets a chance to run in the earlier phase 2.
This is the right way to solve this, not by making rules override specialisation; that's far too fragile.
But on looking at GHC I see that SPECIALISE pragmas don't let you specify a phase in which they become active. That's just a stupid oversight; like any RULE or INLINE pragma, it should definitely have phase control. That would let you solve the problem, right?
Simon
comment:19 Changed 8 years ago by
```
commit 814d864125bdd03d8bc6c3fc551f393b21942c6c
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Thu Nov 24 12:35:33 2011 +0000

    Support "phase control" for SPECIALISE pragmas

    This featurelet allows Trac #5237 to be fixed. The idea is to allow
    SPECIALISE pragmas to specify the phases in which the RULE is active,
    just as you can do with RULES themselves.

        {-# SPECIALISE [1] foo :: Int -> Int #-}

    This feature is so obvious that not having it is really a bug.

    There are, needless to say, a few wrinkles. See
    Note [Activation pragmas for SPECIALISE] in DsBinds

 compiler/deSugar/DsBinds.lhs      | 60 +++++++++++++++++++++++++++++++++++-
 compiler/parser/Parser.y.pp       | 11 ++++---
 compiler/parser/RdrHsSyn.lhs      |  3 +-
 docs/users_guide/glasgow_exts.xml | 40 ++++++++++++++++++++++++
 4 files changed, 106 insertions(+), 8 deletions(-)
```
comment:20 Changed 8 years ago by
| Owner: | set to daniel.is.fischer |
|---|---|
Daniel: right, it's done. Can you now fix the original bug using the new phase-control on specialisation? Thanks.
comment:22 Changed 8 years ago by
| Resolution: | → fixed |
|---|---|
| Status: | new → closed |
Works great.
Fixed by
```
commit 088c9661321de044b4092756e2b563ffd23c6f96
Author: Daniel Fischer <daniel.is.fischer@googlemail.com>
Date:   Fri Nov 25 03:46:46 2011 +0100

    Rules for powers with small exponents (fixes #5237)

    Calculating small powers by direct multiplication is more efficient
    than using (^). For small exponents known at compile time, there are
    now rewrite rules.
```
Regression test: perf/should_run/T5237.hs
Some rules for small exponents. Maybe rules for more exponents and exponent types would be good, but I didn't want to increase the number of rules too much.
Yes, if the type isn't otherwise restricted, that's what happens.