Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#10067 closed feature request (fixed)

The Read Integer instance is too slow

Reported by: redneb Owned by:
Priority: normal Milestone: 8.0.1
Component: Core Libraries Version: 7.11
Keywords: Cc: hvr, ekmett, core-libraries-committee@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D645
Wiki Page:

Description

The current implementation of the Read Integer instance has quadratic complexity and thus performs badly on large inputs. On my system, it takes 65 seconds to perform the following:

read (take 1000000 $ cycle "1234567890") :: Integer

Note that we already provide an ad-hoc instance for Show Integer, so maybe we can do the same for Read Integer.

Change History (17)

comment:1 Changed 5 years ago by redneb

Differential Rev(s): Phab:D645

comment:2 Changed 5 years ago by redneb

Status: newpatch

comment:3 Changed 5 years ago by simonpj

Cc: core-libraries-committee@… added
Component: libraries/baseCore Libraries

comment:4 Changed 5 years ago by ekmett

I have no principled objection to dramatically reducing the asymptotic cost of parsing integers assuming of course in good faith that the patch works.

That said, it seems like it'd be a good idea to apply the same reasoning to the new Natural type that Herbert added in 7.10 as well.

comment:5 Changed 5 years ago by redneb

I only ommited Natural because the read instance for it relies on the read instance of Integer and therefore it will befenefit from my patch anyway. But I agree that it would be a good idea to start treating Natural as a first class citizen.

So what's the policy here? Do I need to CPP/#ifdef this, or is that unnecessary since they are from the same package anyway?

comment:6 Changed 5 years ago by redneb

One problem with Natural is that it does not follow the conventions of other numeric types. For example, the Show instances of all other types are given in GHC.Show and the Read instances in GHC.Read, while for Natural all instances are declared in GHC.Natural.

Herbert, do you have any objection moving the Read Natural instance to GHC.Read? This will make it easier to treat Natural like any other type in the future.

Otherwise, there will be an module import cycle and will require an hs-boot file.

comment:7 Changed 5 years ago by ekmett

I'm pretty sure we'd be willing to bend over backwards to avoid adding another import cycle, given all that we've done over the course of 7.10 to break the previous ones up. ;)

comment:8 Changed 5 years ago by thoughtpolice

Milestone: 7.10.1
Priority: lowhigh

Uh, my first impression honestly is this has the potential to be really bad - I'd guess most people constrain Read to Int in cases like this, but I could fathom a case where e.g. a webserver took an integer somewhere in an HTTP request and used Read, which could lead to an easy denial of service.

I'm fine with shuffling to avoid boot files/cycles and whatnot. Do update the patch please.

comment:9 in reply to:  6 Changed 5 years ago by hvr

Replying to redneb:

Herbert, do you have any objection moving the Read Natural instance to GHC.Read? This will make it easier to treat Natural like any other type in the future.

I have no hard objection, only a soft one:

Right now, GHC.Natural is a neat independent leaf-module in the import-graph. By defining the Natural instance in GHC.Read, it would lose that property, and I'm not sure what else you'd need to transform in the import graph to make that happen.

Is there a real benefit from moving the instance from the data-type module to the class module? Or is this just about aesthetics?

Otherwise, there will be an module import cycle and will require an hs-boot file.

In any case, if possible make the Natural change a separate commit if possible so we can see more easily what import-related shuffling was needed to accomplish that.

comment:10 Changed 5 years ago by redneb

So let me explain what the problem is. My patch modifies the module Text.Read.Lex. This module defines among other thinks a ReadP parser that is used to implement the Read instances of all integral types. The actual instances though, are given in GHC.Read (except for Natural of course). My patch works by providing an optimized version of an internal function of Text.Read.Lex. If we also want to provide a version of that function for Natural, then we need to import GHC.Natural. But this creates multiple import cycles, exactly because that module was designed to be a leaf module. Namely, all of the following imports of GHC.Natural are problematic: GHC.Read, Data.Data, GHC.Exception, Data.Int, and Data.Word. So now we have 3 options:

  1. Use my patch as it is, without the Natural specialization. Remeber that the Read Natural instance relies on the Read Integer instance and it will benefit too from my patch.
  2. Use a simple hs-boot file for GHC.Natural.
  3. Break the cycle by moving stuff out of GHC.Natural. This requires extensive changes (just look at the list of offending imports above) and honestly, my patch is not the only thing in base that does not treat Natural as an equal to Integer right now; for example the Show Integer has an ad-hoc implementation that has not been extended for Natural (instead Show Natural relies on Show Integer again).

Given all of the above, I think 2 is the best option here. I really don't like hs-boot files, but I do think that the way Naturals have been implemented and possible large scale changes to that, is largely unrelated to my patch and should be discussed separately.

comment:11 Changed 5 years ago by Austin Seipp <austin@…>

In a5a4c25626e11e8b4be6687a9af8cfc85a77e9ba/ghc:

Provide a faster implementation for the Read Integer instance

Summary:
The current implementation of the Read Integer instance has quadratic
complexity and thus performs badly on large inputs. This patch provides a
rather simple sub-quadratic algorithm. For small inputs, we use the old
algorithm (there is a small penalty for that). The gains for large
inputs can be dramatic: on my system, the time to perform
    read (take 1000000 $ cycle "1234567890") :: Integer
drops from 65 seconds to less than a second.

Note that we already provide an ad-hoc instance for Show Integer, so this
patch essentially does the same thing for Read Integer.

Test Plan: Check that read :: String -> Integer returns correct results for inputs of various sizes.

Reviewers: austin, hvr

Reviewed By: austin, hvr

Subscribers: ekmett, thomie

Differential Revision: https://phabricator.haskell.org/D645

GHC Trac Issues: #10067

comment:12 Changed 5 years ago by thoughtpolice

Status: patchnew

Merged to ghc-7.10 (via 0e0a0b4c9b2bd79f675014769e1a4777fc0e96f4). (Moving back to new state to continue discussion.)

comment:13 Changed 5 years ago by ekmett

For now I think option 1, which basically just merged is probably the path forward, conversion between the new Natural and Integer is quite fast.

Introducing more hs-boot files just makes more work for us going forward. We're trying, steadily, to eliminate them from base to enable a more granular system in the end.

Last edited 5 years ago by ekmett (previous) (diff)

comment:14 Changed 5 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to the 7.12.1 milestone, as these tickets won't be fixed in time for the 7.10.1 release (unless you, the reader, help write a patch :)

comment:15 Changed 4 years ago by bgamari

Priority: highnormal
Resolution: fixed
Status: newclosed

As far as I can tell the critical issue of quadratic parsing of integers is by Phab:D645, which has been merged. The only work that remains here is to specialize for Natural but as Edward points out this is now a reasonable cheap operation as we can show,

import GHC.Natural
import Criterion.Main

main = defaultMain
    [ bench "integer" $ nf (read :: String -> Integer) (take 1000000 $ cycle "1234567890")
    , bench "natural" $ nf (read :: String -> Natural) (take 1000000 $ cycle "1234567890")
    ]
$ ghc Test.hs -O -fforce-recomp
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...
$ ./Test
benchmarking integer
time                 435.5 ms   (410.6 ms .. 484.4 ms)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 441.3 ms   (434.3 ms .. 445.5 ms)
std dev              6.383 ms   (0.0 s .. 7.230 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking natural
time                 428.2 ms   (415.5 ms .. 446.6 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 434.6 ms   (432.0 ms .. 436.0 ms)
std dev              2.261 ms   (0.0 s .. 2.387 ms)
variance introduced by outliers: 19% (moderately inflated)

For this reason I'm going to close this issue..

Last edited 4 years ago by bgamari (previous) (diff)

comment:16 Changed 4 years ago by redneb

Yes, as I mentioned above, the Read instance of Natural relies on the Read instance of Integer so we get a speed-up for Natural as well. Unfortunately, this is not the case for the lexers from Text.Read.Lex, namely readIntP, readOctP, readDecP, readHexP: they only use the new algorithm for Integer but not for Natural.

comment:17 Changed 4 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

Note: See TracTickets for help on using tickets.