Opened 12 years ago

Closed 9 years ago

# floor, ceiling, round :: Double -> Int are awesomely slow

Reported by: Owned by: dons daniel.is.fischer low 7.2.1 libraries/base 7.1 performance, math, double alexey.skladnoy@…, michal.terepeta@… Unknown/Multiple Unknown/Multiple None/Unknown

### Description

We have super-naive implementations of the RealFrac class for Double.

Consider:

```{-# RULES "truncate/Double->Int" truncate = double2Int #-}
instance  RealFrac Double  where

properFraction x
= case (decodeFloat x)      of { (m,n) ->
let  b = floatRadix x     in
if n >= 0 then
(fromInteger m * fromInteger b ^ n, 0.0)
else
case (quotRem m (b^(negate n))) of { (w,r) ->
(fromInteger w, encodeFloat r n)
}
}

floor x     = case properFraction x of
(n,r) -> if r < 0.0 then n - 1 else n

```

So now, we *do* have a good rule for truncate, but floor, ceiling and round turn out to be awesomely slow.

```main = print . sumU
. mapU (floor :: Double -> Int)
\$ enumFromToFracU 0 100000000

```

Runs in 1 minute, 10 seconds:

```\$ time ./henning
5000000050000000
./henning  70.25s user 0.17s system 99% cpu 1:10.99 total
```

Now, if we just replace that with a ccall to math.h:floor, we get:

```main = print . sumU
. mapU (floor' :: Double -> Int)
\$ enumFromToFracU 0 100000000

floor' :: Double -> Int
floor' x = (truncate :: Double -> Int) (c_floor x)
{-# INLINE floor' #-}

foreign import ccall unsafe "math.h floor"
c_floor :: Double -> Double

```

Which runs in 1.8 seconds:

```\$ time ./henning
5000000050000000
./henning  1.88s user 0.00s system 99% cpu 1.884 total

```

Similar results for ceiling and round (see the main ticket for RealFrac, http://hackage.haskell.org/trac/ghc/ticket/1434)

## Action

Use math.h versions of round, floor and ceiling for Double and Float?

### comment:1 Changed 12 years ago by dons

An old complaint:

Was due to this issue.

The original poster was complaining about:

quant8
Double -> Word8 quant8 x = floor \$ x * 0xFF

being too slow.

Of course, it turns into a properFraction call,

```quant8_rha :: Double -> Word8

quant8_rha =
\ (x_ahf :: Double) ->
case properFraction5
@ Word8
\$f32
(case x_ahf of wild_aVL { D# x1_aVN ->
D# (*## x1_aVN 255.0)
})
of wild_aUN { (n_aUP, r_aUQ) ->
case r_aUQ of wild1_aVn { D# x1_aVp ->
case <## x1_aVp 0.0 of wild11_aUS {
False -> n_aUP;
True ->
case n_aUP of wild2_aVx { W8# x#_aVz ->
W8# (narrow8Word# (minusWord# x#_aVz __word 1))
}
}
```

Rather than an expected unboxed double floor(). The properFraction call there is going to kill things. We'd need a C floor call to get something like:

```\$wccall_r182 :: Double#
-> State# RealWorld
-> (# State# RealWorld, Double# #)

\$wccall_r182 =
{__ccall floor Double#
-> State# RealWorld
-> (# State# RealWorld, Double# #)}_dRy

quant8_rha :: Double -> Word8

quant8_rha =
\ (x_ahf :: Double) ->
case x_ahf of wild_aUI { D# x1_aUK ->
case \$wccall_r182 (*## x1_aUK 255.0) realWorld#
of wild1_XB { (# ds_dRx, ds1_dRw #) ->
W8#
(narrow8Word# (int2Word# (double2Int# ds1_dRw)))
}
```

With the result we can turn this program:

```import Data.Word
import Data.Array.Vector

main = print . sumU
. mapU (quant8 :: Double -> Word8)
\$ enumFromToFracU 0 100000000

```

From running in 1 minute, 14 seconds, to one that runs in 4.1 seconds. But note, with the pure Haskell quant, we get a fold of this form:

```\$wfold :: Word# -> Double# -> Word#
```

While with the FFI, we get:

```\$wfold :: Word# -> Double# -> Word8
```

Is this related to Neil's issue with FFI and unboxing results?

So, anyway, more support that the Double floor/ceiling functions need to be turned into ccalls if we're going to use them for anything with meaningful performance.

The full program

```{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Word
import Data.Array.Vector

main = print . sumU
. mapU (quant8 :: Double -> Word8)
\$ enumFromToFracU 0 100000000

quant8 :: Double -> Word8
quant8 x = floor' \$ x * 0xFF

floor' :: Double -> Word8
floor' x = (fromIntegral . (truncate :: Double -> Int)) \$ c_floor x

foreign import ccall unsafe "math.h floor"
c_floor :: Double -> Double
```

### comment:2 Changed 12 years ago by simonpj

difficulty: → Unknown

Thanks for following this up, Don. But can you be more specific?

• Do you have a clear idea about what exactly changes are required? Or are there alternatives to consider?
• Can you make concrete proposals?
• Or, if it's clear what to do, can you offer a patch?

thanks

Simon

### comment:3 Changed 12 years ago by igloo

Milestone: → 6.10.1

### comment:4 Changed 12 years ago by igloo

Component: Compiler → libraries/base

### comment:5 Changed 11 years ago by simonmar

Architecture: Unknown → Unknown/Multiple

### comment:6 Changed 11 years ago by simonmar

Operating System: Unknown → Unknown/Multiple

### comment:7 Changed 11 years ago by igloo

Milestone: 6.10.1 → 6.10.2

### comment:8 Changed 11 years ago by igloo

Milestone: 6.10.2 → 6.12.1

### comment:10 Changed 10 years ago by igloo

Milestone: 6.12.1 → 6.12 branch dons@… deleted → None/Unknown

### comment:11 Changed 10 years ago by igloo

Milestone: 6.12 branch → 6.12.3

### comment:12 Changed 10 years ago by igloo

Milestone: 6.12.3 → 6.14.1 normal → low

### comment:13 Changed 9 years ago by daniel.is.fischer

I have put together a package to test possible implementations of the RealFrac methods for Double and Float (base-2 IEEE754).

On the one hand, pure Haskell implementations, on the other hand implementations calling out to rint[f], trunc[f], floor[f] and ceil[f] from math.h.

Both ways go via Integer by default, with a specialised faster implementation for Int (and narrower types, but those RULES haven't yet been written) enabled by a rewrite rule.

Overall, the pure Haskell implementations don't fare badly on my computer. All give a speedup compared to the current implementation, for most conversions, pure Haskell is on par with or faster than the C-call (although that would probably change if the C functions were made primops).

The FFI calls are significantly faster for properFraction :: Double -> (Integer, Double) and for round (except round :: Integral a => Float -> a when compiled via C, then native and FFI are on par).

Sample results for the speedups against the current implementation (note: for truncate :: x -> Int, the Prelude value is fst . properFraction, not the rewritten float2Int or double2Int) are included in the tarball.

I would appreciate feedback from your tests/benchmarks on other platforms, especially 64-bit platforms (mine is x86 linux, 32 bit).

To run the QuickCheck tests, you need QuickCheck-2.*, to run the benchmarks, criterion.

Thanks, Daniel

### Changed 9 years ago by daniel.is.fischer

Benchmarks and QC tests for faster methods

### comment:14 Changed 9 years ago by daniel.is.fischer

Owner: set to daniel.is.fischer

Okay, GHC's new CodeGen kicks some serious behind. For Float we only get a speedup for the Int-special versions (and for round, we should call out to C). For Double, a few of the general versions (via Integer) run slower than with 6.12, but everything is faster than the current implementation. I still need to do a bit of Core-digging to see if I can fix the regressions vs. 6.12.

Patch should be ready tomorrow or Wednesday.

### comment:15 Changed 9 years ago by daniel.is.fischer

Version: 6.8.2 → 7.1

I'm almost on the verge of giving up. I have the code, the code works and is faster than the old code, but I cannot get the rewrite rules to fire. When I compile some code, for example

```main :: IO ()
main = print (properFraction (encodeFloat (2^23) 16 :: Float) :: (Int, Float))
```

-ddump-rule-firings says

```Rule fired: SPEC GHC.Real.^
Rule fired: Class op encodeFloat
Rule fired: Class op properFraction
Rule fired: SPEC \$cproperFraction [GHC.Types.Int]
Rule fired: Class op show
Rule fired: <#
Rule fired: Class op showsPrec
Rule fired: Class op showsPrec
```

```Rule fired: properFraction/Float->Int
```

Note that there's no SPECIALISE nor INLINE pragma anywhere concerning the RealFrac instance.

What's going on? I'm currently building with explicit NOINLINE pragmas, if that doesn't help, I'm lost.

### Changed 9 years ago by daniel.is.fischer

patch for faster RealFrac methods

### comment:16 Changed 9 years ago by daniel.is.fischer

Status: new → patch

Works now :)

Please test on 64 bits to make sure I haven't bungled the CPP stuff.

I also have a lot of rewrite rules for !IntN and !WordN, but I'm not sure where to put them. Probably GHC.Int and GHC.Word, but we'd have to import GHC.Float and perhaps GHC.Float.RealFracMethods then.

### comment:17 follow-up:  27 Changed 9 years ago by daniel.is.fischer

Oh, btw. due to the rewrite rules, in arith005, we get different nonsense for the overflowing values with optimisations than a) before, b) without optimisations.

### comment:18 Changed 9 years ago by daniel.is.fischer

Status: patch → infoneeded

Great, I just was notified that there's a bug in rintf for some versions of glibc (apparently only on x86_64): http://sources.redhat.com/bugzilla/show_bug.cgi?id=10089

And thus round :: Float -> Int gets some values wrong there.

```Cround/Int
*** Failed! Falsifiable (after 2333 tests):
2788790.8
Failure {usedSeed = 678847334 2020505879, usedSize = 2332, reason =
"Falsifiable", labels = []}

Now the funny part is that this seems to be rintf() bug in glibc
(only x86_64), see
http://sources.redhat.com/bugzilla/show_bug.cgi?id=10089
(I've write a small C program to check the output for this number and it
does return the wrong result, i.e. 2788790).

So I thought that I'll let you know about it before anyone else runs
into the same issue. All the other tests passed just fine, btw.
```

I'll see whether I can cook up a fast bit-fiddler in C then.

### comment:19 Changed 9 years ago by daniel.is.fischer

Going via Double and using rint doesn't work either: http://bugs.centos.org/view.php?id=3326

Who would've thought they'd ship glibc's with such bugs :(

### comment:20 Changed 9 years ago by michalt

The patch has a simple mistake in type signatures that makes GHC fail to build on x86_64. It's enough to change

```uncheckedIShiftRA64# :: Int# -> Int#

uncheckedIShiftL64# :: Int# -> Int#
```

to

```uncheckedIShiftRA64# :: Int# -> Int# -> Int#

uncheckedIShiftL64# :: Int# -> Int# -> Int#
```

Oops, thanks.

### Changed 9 years ago by daniel.is.fischer

new patch with our own rint versions

### comment:22 Changed 9 years ago by daniel.is.fischer

Status: infoneeded → patch

The new patch contains implementations of `rint` and `rintf`, so we don't depend on the installed glibc version. The double version is somewhat slower than glibc's `rint`, but still a lot faster than what we have now. The float version is even a bit faster than glibc's `rintf` here.

Since the bugs for `double rint(double)` seem to have been fixed pretty quickly, in contrast to the `rintf` bug, we could call out to math.h's `rint` instead of using our own if we're willing to risk a very few users with buggy versions and also the rare cases where the platforms rounding mode has been set to something other than round to nearest, ties to even.

I've tested and benchmarked the patch on my box (32-bit, i686), test also on 64-bits and other platforms before merging.

Re the rewrite rules for IntN and WordN, any recommendations where to stick them?

### comment:23 Changed 9 years ago by simonpj

Great!

Rewrite rules are best placed either (a) with the types they concern or (b) more typically, with the functions that they are a rewrite function for.

Simon

### comment:24 Changed 9 years ago by daniel.is.fischer

(b) is not a very good option here. I tried it, but if you include GHC.Int and GHC.Word in GHC.Float, you get an import cycle

```GHC.Int/Word < GHC.Read < GHC.Float < GHC.Int/Word
```

Could be broken with .hs-boot, of course, but I'm not too keen on that.

So GHC.Int/Word, where the types are defined, seems the best place, although adding GHC.Float to their imports seems a bit strange.

### comment:25 Changed 9 years ago by simonpj

Can you describe an example proposed rule, and the import cycle it creates?

### comment:26 Changed 9 years ago by daniel.is.fischer

Either

```"truncate/Float->Int8"
forall x. truncate x = (fromIntegral :: Int -> Int8) (truncateFloatInt x)
```

or

```"truncate/Float->Int8"
forall x. truncate x = (fromIntegral :: Int -> Int8) ((truncate :: Float -> Int) x)
```

Well, I guess we could leave the type signature off truncate in the latter and rely on type inference.

The import cycle is not created by the rules per se, but to define the rules, we need the involved types and functions in scope. IntN and WordN aren't in scope in GHC.Float, bringing them into scope there creates the import cycle. So the other obvious solution is to bring Double, Float and the functions into scope in GHC.Int and GHC.Word.

### comment:27 in reply to:  17 Changed 9 years ago by michalt

With the new patch GHC builds without any problems. I've run the testsuite for HEAD with and without the patch and the only difference is failure in arith005 (more about it below). I haven't played with the QC tests, as quickcheck apparently doesn't build with HEAD atm (maybe I'll do it when I have more time).

Oh, btw. due to the rewrite rules, in arith005, we get different nonsense for the overflowing values with optimisations than a) before, b) without optimisations.

I think the difference that the tests were catching was the fact that `-0.0` is handled differently by `properFraction`:

```module Main where

main = do
print \$ (properFraction (-0.0) :: (Int,Float))
print \$ (properFraction (-0.0) :: (Integer,Float))
print \$ (properFraction (-0.0) :: (Int,Double))
print \$ (properFraction (-0.0) :: (Integer,Double))
```

with the patch:

```> ~/develop/ghc-mod/inplace/bin/ghc-stage2 --make -fforce-recomp Test
[1 of 1] Compiling Main             ( Test.hs, Test.o )
> ./Test
(0,0.0)
(0,0.0)
(0,0.0)
(0,0.0)
> ~/develop/ghc-mod/inplace/bin/ghc-stage2 -O2 --make -fforce-recomp Test
[1 of 1] Compiling Main             ( Test.hs, Test.o )
> ./Test
(0,-0.0)
(0,0.0)
(0,-0.0)
(0,0.0)

```

Without the patch the result is always `(0,0.0)`.

### comment:28 Changed 9 years ago by daniel.is.fischer

QuickCheck doesn't know about the new SplittableGen class in System.Random yet, so split is not in scope in a couple of modules.

Good catch, the `-0.0`. I mimicked `rint[f]`'s behaviour, didn't think about that. Easy change. We don't have good support for `-0.0` in any conversion functions anyway.

I was referring to what happens with out-of-range values in arith005, though. `double2Int` and `float2Int` return minBound for those. You can't see that in the tests because there are no values outside 64-bit int range there, only outside 32-bit range :)

### comment:29 Changed 9 years ago by daniel.is.fischer

Ah, no. The `-0.0` doesn't come from the rounding, it's because of

```    case float2Int# x of
n -> (I# n, F# (x `minusFloat#` int2Float# n))
```

(analogous for Double) and (-0.0) - 0.0 gives (-0.0). The extra check for zero will cost a bit of performance :(

### Changed 9 years ago by daniel.is.fischer

patch avoiding negative zero

### comment:30 Changed 9 years ago by daniel.is.fischer

More than a bit, unfortunately. The test and branch makes the properFractionXXXInt methods slower by a factor of about 1.7. Is it actually a problem having a negative zero as part of the result?

### comment:31 Changed 9 years ago by simonpj

So the other obvious solution is to bring Double, Float and the functions into scope in GHC.Int and GHC.Word.

That looks best, yes. It's best if a rule appears in the same module as at least one thing on the LHS of the rule. And if you look at these rules, they look like

```  truncate Float Int8 x = (fromIntegral :: Int -> Int8) (truncate (x::Float))
```

and indeed `Int8` appears on the left.

Simon

### comment:32 Changed 9 years ago by michalt

I've built GHC with the new patch on x86_64 and it doesn't introduce any failures in the testsuite. :)

As for the negative zero I have absolutely no idea whether it's really important. Though functions having different results with and without optimisations make me feel uneasy.

### comment:33 Changed 9 years ago by daniel.is.fischer

There are a few functions which give different results with and without optimisations. It's not nice, but what can one do? You can't change the primops to have the non-optimised behaviour (without sacrificing performance) and changing the non-optimised functions to match the primops is perhaps not always possible (`realToFrac :: Float -> Double` would be very hard to make `fromRational . toRational` match `float2Double`) and practically always make the non-optimised functions unusably slow.

Regarding -0.0, since 0.0 == (-0.0), I don't see a strong reason to keep the tests just to match prior behaviour in this case.

Patch for the rewrite rules is at #1434 btw.

### comment:34 Changed 9 years ago by igloo

Milestone: 7.0.1 → 7.0.2

### comment:35 Changed 9 years ago by igloo

Milestone: 7.0.2 → 7.2.1

### comment:36 Changed 9 years ago by igloo

Resolution: → fixed patch → closed

Sorry for the delay. Now applied, thanks.

Note: See TracTickets for help on using tickets.