Opened 4 years ago

Last modified 4 years ago

#11029 new bug

Performance loss due to eta expansion

Reported by: NeilMitchell Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.10.2
Keywords: Cc: ndmitchell@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by NeilMitchell)

Given the attached file, at both -O2 and -O0, GHC translates the function:

test1 [1,2,3,4,5,6,7,8,9,10] = \x -> x
test1 _ = \x -> negate x

To be equivalent to:

test0 [1,2,3,4,5,6,7,8,9,10] x = x
test0 _ x = negate x

When applied in a loop with something like:

map (test1 [1..]) [1..1000]

The eta-expanded variant is 3x slower. Adding a trace breaks that transformation, and then the code goes 3x faster. Specifically:
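The attached Main.hs is not reproduced here, but a minimal self-contained sketch of the two variants (definitions copied from the description, harness assumed) shows why partial application matters: with the lambda form, the list pattern match can happen once when `test1 big` is built, whereas the eta-expanded form repeats it for every element of the mapped list.

```haskell
-- Sketch only: the benchmark harness below is assumed, not the attached file.
test0 :: [Int] -> Int -> Int
test0 [1,2,3,4,5,6,7,8,9,10] x = x
test0 _                      x = negate x

test1 :: [Int] -> Int -> Int
test1 [1,2,3,4,5,6,7,8,9,10] = \x -> x
test1 _                      = \x -> negate x

main :: IO ()
main = do
  let f = test1 [1..]   -- the pattern match against [1..] can be done here, once
  print (sum (map f [1..1000]))            -- lambda form
  print (sum (map (test0 [1..]) [1..1000]))  -- eta-expanded form, re-matches per element
```

Both print -500500 (the infinite list fails the exact-ten-element pattern, so both fall through to `negate`); the difference is purely operational.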

test2 [1,2,3,4,5,6,7,8,9,10] = \x -> x
test2 _ = trace "here" $ \x -> negate x

Timings, as reported by Criterion under O2 with GHC 7.10.2, are:

benchmarking test0 = 40.99 ns   (40.96 ns .. 41.02 ns)
benchmarking test1 = 41.09 ns   (41.06 ns .. 41.14 ns)
benchmarking test2 = 17.74 ns   (17.68 ns .. 17.81 ns)

Attachments (1)


Main.hs (561 bytes) - added by NeilMitchell 4 years ago.


Change History (8)

Changed 4 years ago by NeilMitchell

Attachment: Main.hs added

comment:1 Changed 4 years ago by NeilMitchell

Description: modified (diff)

comment:2 Changed 4 years ago by nomeata

Given the attached file, at both -O2 and -O0

does this imply that it is not happening with -O (which I thought was the most common optimization level)?

comment:3 Changed 4 years ago by simonpj

I have a large collection of eta-related tickets. See "Arity" in Status/SLPJ-Tickets.

This one is awkward. If we have

f True = \x -> ..
f False = \y -> ..

and f is usually called applied to two arguments, it's really much better to eta-expand. But if we have map (f v) xs it probably isn't better to eta-expand. Generally, GHC will eta-expand if the only duplication is pattern matching; but as you point out, that is not always right. Currently there is no notion of how much pattern matching is duplicated, though it might not be hard to add one.
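The tension Simon describes can be sketched concretely (a toy example, assuming saturated and partial call sites like those below):

```haskell
-- Eta-expanded form: good when always called saturated (no intermediate
-- closure), but the Bool match is re-done on every call made through a
-- partial application such as `map (f v) xs`.
f :: Bool -> Int -> Int
f True  x = x + 1
f False x = x - 1

-- Lambda form: the Bool match happens once when `g v` is built, so
-- `map (g v) xs` matches once, but saturated calls allocate a closure.
g :: Bool -> Int -> Int
g True  = \x -> x + 1
g False = \x -> x - 1

main :: IO ()
main = do
  print (map (f True) [1,2,3])
  print (map (g True) [1,2,3])
```

The two are semantically identical; only the operational behaviour differs.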

Another avenue would be to give programmers a way to control the behaviour, along the lines of the existing oneShot docs.
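For reference, the existing oneShot combinator (exported from GHC.Exts since GHC 7.10) is semantically the identity on functions; it only annotates the lambda as being called at most once, which licenses certain floating decisions. A hypothetical dual for this ticket would similarly annotate a lambda as not to be eta-expanded through:

```haskell
import GHC.Exts (oneShot)

-- oneShot :: (a -> b) -> (a -> b)
-- Behaves as the identity; the annotation only affects optimisation.
f :: Int -> Int
f = oneShot (\x -> x + 1)

main :: IO ()
main = print (f 41)
```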

I wish I knew a really good way to deal with this.

comment:4 Changed 4 years ago by nomeata

JFTR, the relevant code is this bit in CoreArity.lhs:

arityType env (Case scrut _ _ alts)
  | exprIsBottom scrut || null alts
  = ABot 0     -- Do not eta expand
               -- See Note [Dealing with bottom (1)]
  | otherwise
  = case alts_type of
      ABot n  | n > 0     -> ATop []    -- Don't eta expand
              | otherwise -> ABot 0     -- if RHS is bottoming
                                        -- See Note [Dealing with bottom (2)]

      ATop as | not (ae_ped_bot env)    -- See Note [Dealing with bottom (3)]
              , ae_cheap_fn env scrut Nothing -> ATop as
              | exprOkForSpeculation scrut    -> ATop as
              | otherwise                     -> ATop (takeWhile isOneShotInfo as)
  where
    alts_type = foldr1 andArityType [arityType env rhs | (_,_,rhs) <- alts]

and you can prevent this behaviour with -fpedantic-bottoms.

And it looks like I made this eta-expansion more aggressive in changeset:2931d19e90d2366f2ce308d65a36333336ca6059 when trying to fix #2915.

comment:5 Changed 4 years ago by NeilMitchell

@nomeata: I confirm -O1 has exactly the same behaviour.

As for fixes, my first thought is that eta-expansion is valuable even if you are duplicating more than patterns, and sometimes duplicating patterns is too much. My hypothetical solution would be:

  • Make each function have multiple versions at different arities, so you'd generate core for test1/1 and test1/2. (The functions might refer to each other, or not be generated if it didn't seem valuable, to keep the size down.)
  • Have the call site select between test1/1 and test1/2 based on the number of arguments available.
  • After that, you might want to change the desugarer, so that the two functions above produce the same Core. Pattern matching on the first argument could happen after the first argument is supplied, before the second argument. It would be nice if the 2 argument version still went 3x faster when used partially applied.
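A hand-written illustration of this multi-arity idea (names like test1_1 and test1_2 are invented for this sketch; the real proposal would have the compiler generate and select them):

```haskell
-- Arity-1 version: partial applications use this, so the list pattern
-- match is done once when the closure is built.
test1_1 :: [Int] -> (Int -> Int)
test1_1 [1,2,3,4,5,6,7,8,9,10] = \x -> x
test1_1 _                      = \x -> negate x

-- Arity-2 version: saturated call sites use this and build no
-- intermediate closure.
test1_2 :: [Int] -> Int -> Int
test1_2 [1,2,3,4,5,6,7,8,9,10] x = x
test1_2 _                      x = negate x

main :: IO ()
main = do
  print (map (test1_1 [1..]) [1,2,3])  -- partial application: match once
  print (test1_2 [1..] 5)              -- saturated: no closure built
```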

Trying to figure out the arity of a function at its definition site seems difficult, and can never be optimal for all uses.

For context, I was led down this path starting from view patterns, which have similar considerations but with arbitrarily expensive "patterns", which GHC never duplicates: in Shake someone submitted a patch to refactor and move a lambda to the left of an =, which, somewhat surprisingly, can make the code much slower.

Adding a oneShot equivalent would be very useful, to make such optimisations explicit and robust.

comment:6 Changed 4 years ago by NeilMitchell

Cc: ndmitchell@… added

comment:7 Changed 4 years ago by nomeata

Code duplication is an easy way out of many optimization problems; unfortunately, you easily get into exponential blow ups.
