Opened 3 years ago

Closed 3 years ago

#13025 closed bug (fixed)

Type family reduction irregularity (change from 7.10.3 to 8.0.1)

Reported by: acowley Owned by:
Priority: normal Milestone: 8.2.1
Component: Compiler Version: 8.0.1
Keywords: TypeFamilies Cc: kosmikus, int-index, RyanGlScott, goldfire
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: simplCore/should_compile/T13025
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I was recently made aware that vinyl's performance dramatically deteriorated with GHC 8.0.1 when compared to GHC 7.x. Vinyl is an extensible records library that's been around for about four years; the aspect of its design relevant here is the basic HList-style indexed GADT. Care was taken in the past so that working with, say, a Storable Vector of records would suffer no overhead from vinyl when compared with standard records, and we have a benchmark suite to spot check this.

In the move to GHC-8.0.1, it turns out that we do introduce overhead. Inspecting the Core reveals that the benchmark inner loop includes a case match on an HEq_sc value whose presence I would not expect, and that is not present when compiling with GHC-7.10.3:

case HEq_sc
         @ Nat
         @ Nat
         @ ('S 'Z)
         @ (RIndex
              '("normal", V3 Float)
              '['("tex", V2 Float), '("normal", V3 Float)])
         ($s$fRElemar:S_$s$fRElemar:S_$cp1RElem
          `cast` (N:~[0] <Nat>_N <'S 'Z>_N <RIndex
                                              '("normal",
                                                V3 Float)
                                              '['("tex",
                                                  V2 Float),
                                                '("normal",
                                                  V3 Float)]>_N
                  :: (('S 'Z :: Nat)
                      ~
                      (RIndex
                         '("normal", V3 Float)
                         '['("tex", V2 Float),
                           '("normal",
                             V3 Float)] :: Nat) :: Constraint)
                     ~R#
                     (('S 'Z :: Nat)
                      ~~
                      (RIndex
                         '("normal", V3 Float)
                         '['("tex", V2 Float),
                           '("normal",
                             V3 Float)] :: Nat) :: Constraint)))
  of cobox0 { __DEFAULT ->

I have since made an effort to reproduce the issue, and discovered more fragility than I expected. I am attaching two modules: Rec.hs defines a kind of record type, Main.hs is a test program that I will reproduce here,

{-# LANGUAGE DataKinds #-}
module Main where
import Rec

type MyRec = Rec '[ '("A",Int), '("B",Int), '("C",Int) ]

getC :: MyRec -> Int
getC = getField (Proxy::Proxy '("C",Int))

doubleC :: MyRec -> MyRec
doubleC r = setC (2 * (getC r)) r
  where setC = set . (Field :: Int -> Field '("C",Int))

main :: IO ()
main = print (getC (Field 1 :& Field 2 :& Field 3 :& Nil :: MyRec))

If the doubleC definition is present, the Core emitted (with -O2) includes an HEq_sc case in the RHS of getC. If doubleC is commented out, that case HEq_sc ... is no longer present. In this example, the offending piece of Core is,

case HEq_sc
       @ Nat
       @ Nat
       @ (Index '("C", Int) '['("B", Int), '("C", Int)])
       @ ('S 'Z)
       ($s$fHasFieldkr:S_$s$fHasFieldkr:S_$cp1HasField
        `cast` (N:~[0] <Nat>_N <Index
                                  '("C", Int) '['("B", Int), '("C", Int)]>_N <'S 'Z>_N
                :: ((Index '("C", Int) '['("B", Int), '("C", Int)] :: Nat)
                    ~
                    ('S 'Z :: Nat) :: Constraint)
                   ~R#
                   ((Index '("C", Int) '['("B", Int), '("C", Int)] :: Nat)
                    ~~
                    ('S 'Z :: Nat) :: Constraint)))
of cobox1 { __DEFAULT ->

If the contents of Rec.hs are included in Main.hs, the case HEq_sc ... is not present in the resulting Core.

The result of what looks like a failure to normalize the Index type family (or RIndex in vinyl) manifests as a 2x slowdown in the benchmark available in the vinyl repository.

Attachments (2)

Rec.hs (1.1 KB) - added by acowley 3 years ago.
Main.hs (378 bytes) - added by acowley 3 years ago.

Download all attachments as: .zip

Change History (19)

Changed 3 years ago by acowley

Attachment: Rec.hs added

Changed 3 years ago by acowley

Attachment: Main.hs added

comment:1 Changed 3 years ago by kosmikus

Cc: kosmikus added

comment:2 Changed 3 years ago by int-index

Cc: int-index added

comment:3 Changed 3 years ago by RyanGlScott

Cc: RyanGlScott added

comment:4 Changed 3 years ago by RyanGlScott

Cc: simonpj added

Simon, this regression was caused by commit ff752a1a5eb69955c0e4fda8647f495409a2c384. Thoughts?

comment:5 Changed 3 years ago by RyanGlScott

Cc: goldfire added; simonpj removed

Ack, I was looking at the wrong commit previously. Apologies.

It turns out the real culprit behind this regression is 1722fa106e10e63160bb2322e2ccb830fd5b9ab3 (by Richard).

comment:6 Changed 3 years ago by RyanGlScott

acowley, you mentioned that there was a "2x slowdown in the benchmark available in the vinyl repository". But I see two benchmarks. Which benchmark suffered the slowdown? I'd like to verify the change in runtime speed before and after commit 1722fa106e10e63160bb2322e2ccb830fd5b9ab3.

comment:7 Changed 3 years ago by acowley

Thank you for taking a look, @RyanGIScott! It is the first benchmark there, with the unfortunate name bench-builder-all.

comment:8 Changed 3 years ago by RyanGlScott

There does appear to be a difference in the bench-builder-all benchmark suite. The full results are available at https://gist.github.com/RyanGlScott/386da12bc71fd2fadae2362b257c41bc. The most noticeable differences come in the vinyl and vinyl-plus benchmarks. On GHC HEAD, I get:

benchmarking vinyl
time                 106.9 µs   (106.7 µs .. 107.0 µs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 106.7 µs   (106.5 µs .. 106.9 µs)
std dev              653.4 ns   (506.9 ns .. 847.7 ns)

benchmarking vinyl-lens
time                 83.98 µs   (83.97 µs .. 84.00 µs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 83.97 µs   (83.96 µs .. 83.98 µs)
std dev              40.09 ns   (33.48 ns .. 47.87 ns)

After reverting 1722fa106e10e63160bb2322e2ccb830fd5b9ab3, you get:

benchmarking vinyl
time                 62.27 µs   (62.25 µs .. 62.28 µs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 62.28 µs   (62.26 µs .. 62.30 µs)
std dev              58.65 ns   (47.95 ns .. 71.78 ns)

benchmarking vinyl-lens
time                 62.07 µs   (62.01 µs .. 62.13 µs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 62.02 µs   (61.98 µs .. 62.05 µs)
std dev              122.0 ns   (106.2 ns .. 144.8 ns)

comment:9 Changed 3 years ago by acowley

All the benchmarks in that suite should result in similar times. The idea is that it tests different representations and ways of accessing fields of records. The times you have there look somewhat suspicious to me. Here are my results,

With GHC-8.0.1,

benchmarking flat
time                 6.851 μs   (6.658 μs .. 7.040 μs)
                     0.995 R²   (0.993 R² .. 0.996 R²)
mean                 6.912 μs   (6.757 μs .. 7.093 μs)
std dev              577.7 ns   (480.4 ns .. 799.9 ns)
variance introduced by outliers: 82% (severely inflated)
             
benchmarking vinyl
time                 7.081 μs   (6.912 μs .. 7.240 μs)
                     0.995 R²   (0.992 R² .. 0.997 R²)
mean                 7.116 μs   (6.970 μs .. 7.308 μs)
std dev              555.6 ns   (437.9 ns .. 710.6 ns)
variance introduced by outliers: 80% (severely inflated)
             
benchmarking vinyl-lens
time                 17.55 μs   (17.15 μs .. 18.01 μs)
                     0.995 R²   (0.994 R² .. 0.997 R²)
mean                 17.84 μs   (17.42 μs .. 18.47 μs)
std dev              1.709 μs   (1.244 μs .. 2.326 μs)
variance introduced by outliers: 84% (severely inflated)
             
benchmarking reasonable
time                 7.250 μs   (7.069 μs .. 7.403 μs)
                     0.996 R²   (0.994 R² .. 0.997 R²)
mean                 7.070 μs   (6.931 μs .. 7.245 μs)
std dev              502.5 ns   (428.5 ns .. 586.9 ns)
variance introduced by outliers: 77% (severely inflated)

While with GHC-7.10.3, I get,

benchmarking flat
time                 7.107 μs   (6.964 μs .. 7.265 μs)
                     0.995 R²   (0.993 R² .. 0.997 R²)
mean                 7.241 μs   (7.087 μs .. 7.431 μs)
std dev              592.3 ns   (498.5 ns .. 754.0 ns)
variance introduced by outliers: 82% (severely inflated)
             
benchmarking vinyl
time                 7.426 μs   (7.295 μs .. 7.586 μs)
                     0.995 R²   (0.993 R² .. 0.997 R²)
mean                 7.404 μs   (7.230 μs .. 7.612 μs)
std dev              648.7 ns   (530.0 ns .. 800.1 ns)
variance introduced by outliers: 83% (severely inflated)
             
benchmarking vinyl-lens
time                 7.256 μs   (7.097 μs .. 7.414 μs)
                     0.995 R²   (0.992 R² .. 0.998 R²)
mean                 7.281 μs   (7.119 μs .. 7.418 μs)
std dev              507.4 ns   (410.6 ns .. 651.8 ns)
variance introduced by outliers: 76% (severely inflated)
             
benchmarking reasonable
time                 7.231 μs   (7.055 μs .. 7.385 μs)
                     0.995 R²   (0.993 R² .. 0.997 R²)
mean                 7.221 μs   (7.069 μs .. 7.388 μs)
std dev              540.9 ns   (453.3 ns .. 635.3 ns)
variance introduced by outliers: 79% (severely inflated)

The vinyl-lens time going from 7.2 μs to 17.84 μs is representative of what I referred to as a 2x slowdown. This is on a 1.3 GHz Intel Core i5. It doesn't seem likely that you are running on that much of a slower CPU, but it is possible.

EDIT: I see from your complete results that the other values are more what I'd expect. If running this benchmark is a better tool than the reproduction case included here, I can set up a branch in the vinyl repo to establish a version of the code for benchmarking purposes to make sure we're running the same thing.

Last edited 3 years ago by acowley (previous) (diff)

comment:10 Changed 3 years ago by simonpj

This HEq_sc stuff is thoroughly bogus. Nice small test case thank you. I'll investigate.

comment:11 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In b4c3a66/ghc:

Push coercions in exprIsConApp_maybe

Trac #13025 showed up the fact that exprIsConApp_maybe isn't
clever enough: it didn't push coercions through applicatins, and that
meant we weren't getting as much superclass selection as we should.

It's easy to fix, happily.

See Note [Push coercions in exprIsConApp_maybe]

comment:12 Changed 3 years ago by simonpj

Milestone: 8.0.3
Status: newmerge
Test Case: simplCore/should_compile/T13025

Fixed. Probably worth pushing the fix to 8.0.3.

comment:13 Changed 3 years ago by goldfire

I don't think your patch is quite right, Simon. Here is the main payload:

pushCoArg :: Coercion -> CoreArg -> Maybe (CoreArg, Coercion)
-- We have (fun |> co) arg, and we want to transform it to
--         (fun arg) |> co
-- This may fail, e.g. if (fun :: N) where N is a newtype
-- C.f. simplCast in Simplify.hs

pushCoArg co arg
  = case arg of
      Type ty | isForAllTy tyL
        -> ASSERT2( isForAllTy tyR, ppr co $$ ppr ty )
           Just (Type ty, mkInstCo co (mkNomReflCo ty))

      _ | isFunTy tyL
        , [co1, co2] <- decomposeCo 2 co
              -- If   co  :: (tyL1 -> tyL2) ~ (tyR1 -> tyR2)
              -- then co1 :: tyL1 ~ tyR1
              --      co2 :: tyL2 ~ tyR2
        -> ASSERT2( isFunTy tyR, ppr co $$ ppr arg )
           Just (mkCast arg (mkSymCo co1), co2)

      _ -> Nothing
  where
    Pair tyL tyR = coercionKind co

This implements two of the push rules from every FC paper. But note that forall-types can now be heterogeneous. That is, fun and fun |> co might expect types of different kinds. Your patch does not take this possibility into account. I recommend this:

case arg of
  Type ty | isForAllTy tyL
    -> ASSERT2( isForAllTy tyR, ppr co $$ ppr ty )
       -- co :: (forall (a1 :: k1). ty1) ~ (forall (a2 :: k2). ty2)
       let co1 = mkSymCo (mkNthCo 0 co)
           -- co1 :: k2 ~ k1

           ty' = ty `mkCastTy` co1
           
           co2 = mkInstCo co (mkCoherenceLeftCo (mkNomReflCo ty) co1)
           -- co2 :: ty1[ty |> co1 / a1] ~ ty2[ty / a2]

       in Just (Type ty', co2)

Note that NthCo can extract an equality between the kinds of the types related by a coercion between forall-types. See the NthCo case in CoreLint.

Last edited 3 years ago by goldfire (previous) (diff)

comment:14 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In b4f2afe7/ghc:

Fix the implementation of the "push rules"

Richard pointed out (comment:12 of Trac #13025) that my
implementation of the coercion "push rules", newly added
in exprIsConAppMaybe by commit b4c3a66, wasn't quite right.

But in fact that means that the implementation of those same
rules in Simplify.simplCast was wrong too.

Hence this commit:

* Refactor the push rules so they are implemented in just
  one place (CoreSubst.pushCoArgs, pushCoTyArg, pushCoValArg)
  The code in Simplify gets simpler, which is nice.

* Fix the bug that Richard pointed out (to do with hetero-kinded
  coercions)

Then compiler performance worsened, which led mt do discover
two performance bugs:

* The smart constructor Coercion.mkNthCo didn't have a case
  for ForAllCos, which meant we stupidly build a complicated
  coercion where a simple one would do

* In OptCoercion there was one place where we used CoherenceCo
  (the data constructor) rather than mkCoherenceCo (the smart
  constructor), which meant that the the stupid complicated
  coercion wasn't optimised away

For reasons I don't fully understand, T5321Fun did 2% less compiler
allocation after all this, which is good.

comment:15 Changed 3 years ago by simonpj

OK I fixed comment:13. Well spotted Richard!

comment:16 Changed 3 years ago by bgamari

Milestone: 8.0.38.2.1

At this point it is rather unlikely that there will be an 8.0.3. Re-milestoning.

comment:17 Changed 3 years ago by bgamari

Resolution: fixed
Status: mergeclosed
Note: See TracTickets for help on using tickets.