Opened 10 years ago

Closed 9 years ago

#3736 closed bug (fixed)

GHC specialising instead of inlining

Reported by: guest Owned by: igloo
Priority: normal Milestone: 7.0.1
Component: Compiler Version: 6.10.4
Keywords: Cc: ghc@…
Operating System: Linux Architecture: x86
Type of failure: Runtime performance bug Test Case: T3736
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

In the attached module I demonstrate the following:

mainMonolithic1Generator performs the same computation as mainMonolithic1Compose but the former is more than two times slower than latter. This is serious since in more complex signal processing programs this factor seems to multiply. I assume that the problem is that mixGen is not inlined. Instead GHC seems to have decided to specialise mixGen. In contrast to mainMonolithic1Compose, mainMonolithic1Generator uses a data type with existential quantification. But this alone is not the problem, since mainMonolithic0 and mainMonolithic0Generator run with the same speed.

http://code.haskell.org/storablevector/speedtest/SpeedTestChorus.hs

How can I tell GHC to prefer inlining to specialisation? How about a pragma like {-# INLINE FORCE #-} or so?

Attachments (2)

SpeedTestChorus.hs (10.3 KB) - added by guest 10 years ago.
Main.hs (5.9 KB) - added by igloo 10 years ago.

Download all attachments as: .zip

Change History (16)

Changed 10 years ago by guest

Attachment: SpeedTestChorus.hs added

comment:1 Changed 10 years ago by simonpj

I think agree that this should just work.

I can't reproduce this because I can't find Data.StorableVector.Lazy.Builder.

Can you make the test case more self-contained? It's not the end of the world if it depends on a Hackage package, but if so say which version. And I think in this case it depends on something you have not supplied.

Thanks

Simon

comment:2 Changed 10 years ago by guest

Thank you for looking into this issue!

The example works with Darcs-HEAD of

http://code.haskell.org/storablevector/

This version is named 0.2.5. I may also tag and upload this version to Hackage in order to make the problem reproducable in future.

The StorableVector construction using mainChunky0Builder is also three times slower than its counterpart mainChunky0 and I have no idea why. Maybe I should file a separate ticket for that, but I'm uncertain whether this is my fault or GHC's.

Btw. don't miss to listen to the 40 MB of cool sound that those functions generate using

play -r 44100 speed.f32

(play is part of the SoX project)

:-)

comment:3 Changed 10 years ago by guest

I assume that JHC's SUPERINLINE pragma means the same as my suggested INLINE FORCE: http://repetae.net/computer/jhc/manual.html#pragmas

comment:5 Changed 10 years ago by simonpj

Can you make a standalone case? At the moment I just get stuck.

  • The storablevector package you point to depends on utility-ht and transformers.
  • In turn transformers depends on special-functors
  • special-functors requires base < 2, and I don't have that, since base is now up to version 4.

I'm sure that most of this clutter is irrelevant to the problem you have. The easy way to remove unnecessary dependencies is to make stub modules with definitions like

 thingy :: Int -> Int
 thingy = error "urk"

and than you don't need a separate package to define thingy.

Simon

comment:6 Changed 10 years ago by guest

Sorry for the inconvenience! I'll prepare a package with minimal dependencies.

Only for clarification: special-functors is a compatibility package that provides Control.Applicative and the like for base<2. 'transformers' should only depend on it for base<2, but it may be that Cabal tries 'special-functors' because 'transformers' is not prepared for the current version of 'base'. I assume that extending the version range for base in transformer.cabal would solve this problem.

comment:7 Changed 10 years ago by guest

I uploaded a package that should be self-contained: http://code.haskell.org/storablevector/ghc-profile/dist/storablevector-profile-0.2.tar.gz

The run-time didn't change in any of the examples from GHC-6.10.4 to GHC-6.12.1. That made me uncertain, but I double checked that I actually used GHC-6.12.1 in the newer tests.

(Trac did not let me add this archive as attachment, thus I have uploaded it to code.haskell.org)

Changed 10 years ago by igloo

Attachment: Main.hs added

comment:8 Changed 10 years ago by igloo

If I understand correctly Main.hs is a self-contained file containing the code you're talking about, and can be built with just bootlibs:

$ time ./Main 1
./Main 1  2.88s user 0.15s system 99% cpu 3.030 total
$ time ./Main 2
./Main 2  1.07s user 0.16s system 99% cpu 1.233 total

comment:9 Changed 10 years ago by guest

Yes, this seems to do the same. In general it is problematic to have more than one test case in an executable, since GHC decides between specialisation and inlining of a function depending on the number of calls. But in this case it looks like GHC makes the same decision.

Btw. how did you manage to create this self-contained module? Manually? I remember there was a program that performs assembling one module from several ones in order to get whole program analysis.

comment:10 in reply to:  9 Changed 10 years ago by igloo

Replying to guest:

Btw. how did you manage to create this self-contained module? Manually?

Yes, helped by -Wall -Werror to easily find unnecessary code.

comment:11 Changed 10 years ago by igloo

Milestone: 6.14.1

comment:12 Changed 10 years ago by simonpj

Great stuff. Thank you for cutting down the test case Ian. That made it easy to find the bug. And it was easy to fix too:

Fri Mar  5 17:27:59 GMT 2010  simonpj@microsoft.com
  * Fix Trac #3736: do not preInlineUnconditionally with INLINE
  
  preInlineUnconditionally was, in effect, nuking an INLINE pragma, with
  very bad effect on runtime in this program.  Fortunately the fix is
  very simple.
  
  See Note [InlineRule and preInlineUnconditionally] in SimplUtils.

    M ./compiler/simplCore/SimplUtils.lhs +24
  • Can you see if it is indeed fixed, and if so close this ticket?
  • I believe this may in fact fix #3737 as well. Can one of you try?
  • I'm not sure it's worth making a test case, but Ian's Main.hs is small enough to try. Ian, I'm not sure how to make a regression test that says "does ./main 1 run faster than ./main 2" or whatever. If you think it's worth it, pls add one.

Do not merge to 6.12; it's all 6.14 code.

Simon

comment:13 Changed 10 years ago by igloo

The two versions seem to perform the same now, but the stable branch is faster:

$ ghc-head --make Main -O
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
$ time ./Main 1
./Main 1  1.19s user 0.12s system 99% cpu 1.314 total
$ time ./Main 2
./Main 2  1.18s user 0.15s system 100% cpu 1.323 total

$ ghc-stable --make Main -O
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
$ time ./Main 1
./Main 1  3.02s user 0.19s system 99% cpu 3.218 total
$ time ./Main 2
./Main 2  0.33s user 0.16s system 99% cpu 0.493 total

When looking at the core generated, I noticed that HEAD was making a binding:

a_r1Ld :: GHC.Types.Float
[GblId, Caf=NoCafRefs, Str=DmdType m]
a_r1Ld = GHC.Types.F# __float 0.6

whereas stable made no such binding. Here's a much smaller example:

module Main (main) where

import System.IO.Unsafe (unsafePerformIO)

main :: IO ()
main = (fst $ unfoldrN (fst initPhase2)) `seq` return ()

{-# INLINE initPhase2 #-}
initPhase2 :: (Float, Float)
initPhase2 = (0.2, 0.6)

unfoldrN :: a -> ((), a)
unfoldrN x0 = unsafePerformIO $ createAndTrim x0

createAndTrim :: b -> IO ((), b)
createAndTrim f = return ((), f)
$ ghc-head -Wall --make -O q.hs -fforce-recomp -ddump-simpl | grep '0\.6'
[1 of 1] Compiling Main             ( q.hs, q.o )
a_ri8 = GHC.Types.F# __float 0.6
         Tmpl= (GHC.Types.F# __float 0.2, GHC.Types.F# __float 0.6)}]
Linking q ...
$ ghc-stable -Wall --make -O q.hs -fforce-recomp -ddump-simpl | grep '0\.6'
[1 of 1] Compiling Main             ( q.hs, q.o )
Linking q ...

I don't know if this is actually related to the slowdown, but it seems suspicious anyway.

comment:14 Changed 9 years ago by simonpj

Owner: set to igloo

I looked again at this. The HEAD is slower than 6.12 because we have

{-# INLINE arguments2 #-}
arguments2 :: (a -> b -> x) -> a -> b -> x
arguments2 f a b = (f $! a) $! b

...let go = arguments2 $ \p i -> \x -> blah

In 6.12, arguments2 would inline regardless of the number of arguments, whereas in HEAD things marked INLINE only inline when they have as many args as are on the LHS in the source program. So we need

{-# INLINE arguments2 #-}
arguments2 :: (a -> b -> x) -> a -> b -> x
arguments2 f = \ a b -> (f $! a) $! b

I think this is overall an improvement in control, but you do have to be careful. With that change the HEAD is just as fast either way round. Here are user times in seconds:

               6.12          7.0
------------------------------------
./T3736 1      1.84          0.32
./T3736 2      0.32          0.32

I don't know whether it's worth making a performance-regression test to make sure the original problem doesn't re-surface. Ian, you decide!

Simon

comment:15 Changed 9 years ago by igloo

Resolution: fixed
Status: newclosed
Test Case: T3736

Test added.

Note: See TracTickets for help on using tickets.