Opened 6 years ago

Last modified 3 years ago

#8814 new bug

7.8 optimizes attoparsec improperly

Reported by: joelteon
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 7.8.1-rc1
Keywords: Inlining
Cc: bos@…, johan.tibell@…, idhameed@…
Operating System: Unknown/Multiple
Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets: #8763
Differential Rev(s):
Wiki Page:

Description

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T

testParser :: Text -> Either String Int
testParser f = fmap length
    . A.parseOnly (many (A.char 'b' <|> A.anyChar))
    $ f

main :: IO ()
main = print . testParser $ T.replicate 50000 "a"

On GHC 7.6.3 with -O2:

real    0m0.062s
user    0m0.022s
sys     0m0.007s

On GHC 7.8 tip with -O2:

real    0m12.700s
user    0m12.504s
sys     0m0.165s

On GHC 7.6.3 with -O0:

real    0m0.077s
user    0m0.025s
sys     0m0.007s

On GHC 7.8 tip with -O0:

real    0m0.055s
user    0m0.026s
sys     0m0.007s

This seems to be related to the use of <|>; if I change the program so that the second branch (A.anyChar) is never taken, 7.8 behavior is roughly the same as 7.6 under any optimization level.
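For concreteness, a variant of the kind described (a reconstruction, not the reporter's exact code) feeds the parser input on which the first branch always succeeds, so A.anyChar is never reached:

-- Hypothetical variant: every input character is 'b', so A.char 'b'
-- always succeeds and <|> never falls through to A.anyChar.
main :: IO ()
main = print . testParser $ T.replicate 50000 "b"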

Change History (20)

comment:1 Changed 6 years ago by tibbe

Cc: bos@… johan.tibell@… added

comment:2 Changed 6 years ago by tibbe

That's a huge regression. Are you sure all the libraries in the GHC tree were compiled with optimization turned on?

comment:3 Changed 6 years ago by joelteon

Yeah it is pretty serious. I just cleared out ~/.ghc and reinstalled with "optimization: 2" in my cabal config. Same result.

comment:4 Changed 6 years ago by joelteon

Talked with some people in #ghc. -fno-full-laziness eliminates the issue.

comment:5 Changed 6 years ago by simonpj

Can you identify on which module -fno-full-laziness makes the difference? Is it just the module above, or do you need to set that flag for attoparsec itself?

comment:6 Changed 6 years ago by joelteon

Using it when compiling the module above is enough.

comment:7 in reply to:  6 Changed 6 years ago by tibbe

Replying to joelteon:

Using it when compiling the module above is enough.

Could you dump the core and see if you can spot what difference the flag makes?

comment:8 Changed 6 years ago by simonpj

I have not had much time to devote to this, but I tried

ghc -O T8814.hs -ddump-simpl -o T8814

with and without -fno-full-laziness. Indeed I see the perf difference.

The Core from -ddump-simpl looks very different in the two cases. Inside Main.$wa you'll see a call to runSTRep, and the function to which runSTRep is applied is where the two versions diverge:

  • Without full laziness, it consists of a call to newArray# followed by a couple of memcpy calls
  • With full laziness, it has a rather complicated local recursive function that allocates a LOT of memory.

I have no idea why. I think it must be to do with optimisations being done by RULES in the text library. If I add -ddump-rule-firings and grep for TEXT in the rule names, I get

-- With full laziness
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT tail -> unfused

-- Without full laziness
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> fused
Rule fired: TEXT append -> unfused
Rule fired: TEXT append -> unfused
Rule fired: TEXT append -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT tail -> unfused
Rule fired: TEXT append -> unfused

So there is clearly a difference. Should that difference have such a massive performance impact? Ask the author of the text library! Why does full laziness have this effect? Well, if you have (\x. map (f x) (map g ys)), say, full laziness may float the (map g ys) out of the lambda, and then the map/map fusion won't happen.
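A minimal sketch of that floating, with made-up f, g and ys:

-- Before full laziness the two maps are adjacent, so a map/map fusion
-- rule of the form  map f (map g xs) = map (f . g) xs  can match:
unfloated :: (Int -> Int -> Int) -> (Int -> Int) -> [Int] -> (Int -> [Int])
unfloated f g ys = \x -> map (f x) (map g ys)

-- Full laziness notices that (map g ys) does not mention x and floats
-- it out of the lambda so it is computed once and shared; after that
-- the rule's left-hand side no longer appears, so it cannot fire:
floated :: (Int -> Int -> Int) -> (Int -> Int) -> [Int] -> (Int -> [Int])
floated f g ys = let gys = map g ys in \x -> map (f x) gys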

At this point I hope that someone else will take over debugging to find out more.

Simon

comment:9 Changed 6 years ago by darchon

comment:10 Changed 6 years ago by bos

I wrote a workaround for this problem that makes this particular performance regression vanish.

comment:11 Changed 6 years ago by simonpj

We're a bit stalled here. There is something mysterious going on, which a single INLINE pragma (Bryan's patch) fixes. But why? My guess is that it's something to do with the interaction between inlining and attoparsec's RULES. For example, if a rule optimises (f (g x)) and g gets inlined, the rule won't fire. Or, if

h x = f x + 1

then the expression h (g x) won't fire the rule unless h is first inlined.

The "phase" annotations on INLINE pragmas and RULES let you control this stuff. So the bug might not be in GHC; it might just be a missing phase annotation. Or there might be a bug in GHC. We won't know until someone digs further.

Apart from lack of time, the difficulty is that I have no clue how attoparsec's RULES are supposed to work. So I think we are stalled unless/until someone feels able to do some digging to isolate what is going on.

Thanks!

Simon

comment:12 Changed 6 years ago by joelteon

I still think this bug is in GHC. It shouldn't take 15GB of RAM to compile a program with a large chain of <|>.

Edit: just realized I never added this to the bug. Compiling the thyme package makes GHC consume all available RAM (15.96GB for me) and stall. I have to kill -9 it. Offending section is here: https://github.com/liyang/thyme/blob/master/src/Data/Thyme/Format.hs#L591-L693. I cloned the package, replaced the function body with undefined, and it compiled normally.

Last edited 6 years ago by joelteon (previous) (diff)

comment:13 Changed 6 years ago by simonpj

But this ticket is about the run-time of the particular program given in the description above.

There may well be another bug, to do with the compile-time of an entirely different program. If so, could you open a ticket for that, including a way to reproduce? (Having checked that there isn't one already.)

It's confusing to mix up two bugs into one ticket!

Thanks

Simon

comment:14 Changed 6 years ago by joelteon

My apologies. I've created #8852 for the compile-time bug.

comment:15 in reply to:  11 Changed 6 years ago by bos

Replying to simonpj:

We're a bit stalled here. There is something mysterious going on, which a single INLINE pragma (Bryan's patch) fixes. But why? My guess is that it's something to do with the interaction between inlining and attoparsec's RULES.

I assume you're referring to text's RULES, as attoparsec doesn't contain any (it does contain a lot of inlining, though).

The goal behind the RULES in text is to expose opportunities to perform stream fusion and, if the rewrite phases do not reveal any, to drop from the fusion style of programming back to thwacking directly on an array (which is typically a lot faster).

For example, here are the two main RULES that you spotted in the output above:

{-# RULES
"TEXT append -> fused" [~1] forall t1 t2.
    append t1 t2 = unstream (S.append (stream t1) (stream t2))
"TEXT append -> unfused" [1] forall t1 t2.
    unstream (S.append (stream t1) (stream t2)) = append t1 t2
 #-}

If we see an unadorned use of append in an early phase, we rewrite it so that a fusion rule can have a chance to transform it. Once we get closer to the end of our phases, if we find that a fusion-style append hasn't yet been gobbled up, we transform it back to the direct style.
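Put schematically, the whole fuse-then-fall-back idiom can be modelled like this (a toy sketch with made-up names and list-based placeholders, not text's actual internals):

module ToyFusion where

newtype Stream a = Stream { unstream' :: [a] }   -- stand-in for a real fusion stream

stream :: [a] -> Stream a
stream = Stream
{-# NOINLINE [1] stream #-}

unstream :: Stream a -> [a]
unstream = unstream'
{-# NOINLINE [1] unstream #-}

sAppend :: Stream a -> Stream a -> Stream a
sAppend (Stream xs) (Stream ys) = Stream (xs ++ ys)
{-# NOINLINE [1] sAppend #-}

append :: [a] -> [a] -> [a]
append = (++)
{-# NOINLINE [1] append #-}

{-# RULES
"toy append -> fused" [~1] forall xs ys.
    append xs ys = unstream (sAppend (stream xs) (stream ys))
"toy append -> unfused" [1] forall xs ys.
    unstream (sAppend (stream xs) (stream ys)) = append xs ys
"toy stream/unstream" forall s.
    stream (unstream s) = s
 #-}

The point of the round trip through stream/unstream is that it only pays off if a cancellation or fusion rule removes it before phase 1; otherwise the [1] rule restores the direct implementation.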

There's a tangential wrinkle here: for the longest time, the definition of append had an INLINE annotation, which I just removed. I don't believe this is implicated in the slowdown.

Please let me know what else it would be helpful to explain. As you know, forests of rewrite rules are rather fragile affairs, so it's entirely possible that there's been a longstanding bug in those rules (and perhaps not in the compiler) that is merely now being exposed in 7.8.

comment:16 Changed 6 years ago by simonpj

What would be really helpful would be if you, or someone else, could diagnose exactly why the slow-down is happening. If you compile with -ticky you can very quickly zero in on the code that is taking more time, and compare one with the other.

It would be remarkable if fusion were really responsible for such an enormous change in time, but perhaps it is. Maybe it would be worth commenting out RULES one at a time, to see which (if any) is responsible, and to reduce accidental differences between the two. Maybe the test case can be boiled down quite a lot further.

You can switch off ALL rule rewrites with -fno-enable-rewrite-rules.

Switching off full laziness might also be a good thing to try -fno-full-laziness.

Reducing the inlining threshold (which doesn't affect INLINE pragmas) would mean less inlined code to wade through: -funfolding-use-threshold=N (the default is 60).
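For concreteness, a ticky comparison might look something like this (file names are illustrative):

ghc -O2 -ticky -rtsopts T8814.hs -o T8814
./T8814 +RTS -rT8814.ticky

Comparing the resulting .ticky files from the two compilers gives per-closure entry and allocation counts, which should point at the code that suddenly allocates so much.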

It's a bit fiddly and time consuming, which is why I was appealing for help. I'll gladly explain anything you (or whoever) wants to know about the GHC end of things.

comment:17 Changed 6 years ago by simonpj

See also #8835, which we believe to be a dup of this one.

comment:18 Changed 6 years ago by ihameed

Cc: idhameed@… added

comment:19 Changed 5 years ago by thomie

Operating System: MacOS X → Unknown/Multiple

comment:20 Changed 3 years ago by mpickering

Keywords: Inlining added