Opened 18 months ago

Closed 17 months ago

Last modified 17 months ago

#14955 closed bug (fixed)

Musings on manual type class desugaring

Reported by: mpickering Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.2.2
Keywords: SpecConstr Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

I recently wrote a short post explaining why manual type class desugaring was different to actually writing a type class because of how they are optimised. http://mpickering.github.io/posts/2018-03-20-recordsvstypeclasses.html

I implement 4 different equivalent programs which are all optimised differently. I paste the whole file below as it is not very big.

Implementation 1 is in terms of a type class. Implementation 2 is in terms of explicit dictionary passing. Implementation 3 wraps a dictionary in a type class Implementation 4 wraps a dictionary in a type class with an additional dummy argument.

Naively, a user would expect all 4 implementations to be as fast as each other.

{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE AllowAmbiguousTypes #-}
module Prop where

import Prelude (Bool(..), (||), (&&))

-- Implementation 1

class Prop r where
  or :: r -> r -> r
  and :: r -> r -> r
  true :: r
  false :: r

instance Prop Bool where
  or = (||)
  and = (&&)
  true = True
  false = False

-- Implementation 2

data PropDict r = PropDict {
  dor :: r -> r -> r
  , dand :: r -> r -> r
  , dtrue :: r
  , dfalse :: r
  }

boolDict = PropDict {
  dor = (||)
  , dand = (&&)
  , dtrue = True
  , dfalse = False }

-- Implementation 3

class PropProxy r where
  propDict :: PropDict r

instance PropProxy Bool where
  propDict = boolDict

-- Implementation 4

class PropProxy2 r where
  propDict2 :: PropDict r
  dummy :: ()

instance PropProxy2 Bool where
  propDict2 = boolDict
  dummy = ()


ors :: Prop r => [r] -> r
ors [] = true
ors (o:os) = o `or` ors os
{-# INLINABLE ors #-}

dors :: PropDict r -> [r] -> r
dors pd [] = dtrue pd
dors pd (o:os) = dor pd o (dors pd os)

pors :: PropProxy r => [r] -> r
pors [] = dtrue propDict
pors (o:os) = dor propDict o (pors os)
{-# INLINABLE pors #-}

porsProxy :: PropProxy2 r => [r] -> r
porsProxy [] = dtrue propDict2
porsProxy (o:os) = dor propDict2 o (porsProxy os)
{-# INLINABLE porsProxy #-}

Then using the 4 different implementations of ors in another module implementations 1 and 4 are fast whilst 2 and 3 are slow.

https://github.com/mpickering/rtcwrao-benchmarks/blob/master/Prop2.hs

benchmarking tc/Implementation 1
time                 3.510 ms   (3.509 ms .. 3.512 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.976 ms   (2.886 ms .. 3.060 ms)
std dev              241.1 μs   (195.4 μs .. 293.1 μs)
variance introduced by outliers: 51% (severely inflated)

benchmarking tc/Implementation 2
time                 25.05 ms   (21.16 ms .. 30.43 ms)
                     0.912 R²   (0.849 R² .. 0.984 R²)
mean                 19.18 ms   (16.20 ms .. 21.45 ms)
std dev              5.627 ms   (4.710 ms .. 6.618 ms)
variance introduced by outliers: 89% (severely inflated)

benchmarking tc/Implementation 3
time                 20.06 ms   (15.33 ms .. 23.57 ms)
                     0.856 R²   (0.755 R² .. 0.934 R²)
mean                 18.43 ms   (16.92 ms .. 19.85 ms)
std dev              3.490 ms   (3.003 ms .. 4.076 ms)
variance introduced by outliers: 74% (severely inflated)

benchmarking tc/Implementation 4
time                 3.498 ms   (3.484 ms .. 3.513 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.016 ms   (2.935 ms .. 3.083 ms)
std dev              205.7 μs   (162.6 μs .. 261.8 μs)
variance introduced by outliers: 42% (moderately inflated)

I compiled the module with -O2. If I turn off -fno-worker-wrapper and -fno-spec-constr then implementation 3 is also fast. Implementation 2 is always slow.

This ticket is querying what could be done to improve the robustness of these different refactorings.

Change History (5)

comment:1 Changed 18 months ago by Simon Peyton Jones <simonpj@…>

In 034c32f/ghc:

Improve shortOutIndirections slightly

I found (when investigating Trac #14955) a binding looking like

   Rec { exported_id = ....big...lcl_id...
       ; lcl_id = exported_id }

but bizarrely 'lcl_id' was chosen as the loop breaker, and never
inlined.  It turned out to be an unintended consequence of the
shortOutIndirections code in SimplCore.  Easily fixed.

comment:2 Changed 18 months ago by simonpj

(2) is slow becuase we don't (yet) do cross-module SpecConstr. If dors was INLINEABLE then in principle SpecConstr could specialise it in Main, in a similar way to Specialise. But that is a whole new thing. (A very good thing, incidentally: that is Trac #10346.

(3) jolly well ought to be fast. That's a bug. Here is what is happening.

The INLINABLE pragma on pors (for (3)) or porsProxy (for (4)) should arrange that these functions can be speicalised in a separate module. But in fact pors is worker/wrappered by the demand analyser! We get this in the .hi file:

  pors :: PropProxy r => [r] -> r
  {- Arity: 2, HasNoCafRefs,
     Strictness: <S,1*U(C(C1(U)),A,1*U,A)><S,1*U>, Inline: [0],
     Unfolding: InlineRule (2, True, False)
                (\ @ r (w :: PropProxy r) (w1 :: [r]) ->
                 case w `cast`
                      (N:PropProxy[0] <r>_N) of ww { PropDict ww1 ww2 ww3 ww4 ->
                 $wpors @ r ww1 ww3 w1 }) -}

  porsProxy :: PropProxy2 r => [r] -> r
  {- Arity: 2, HasNoCafRefs,
     Strictness: <S(SL),U(U(C(C1(U)),A,U,A),A)><S,1*U>, Inline:,
     Unfolding(loop-breaker): <stable> (\ @ r
                                          ($dPropProxy2 :: PropProxy2 r)
                                          (ds :: [r]) ->
                                        case ds of wild {
                                          []
                                          -> case propDict2
                                                    @ r
                                                    $dPropProxy2 of wild1 { PropDict ds1 ds2 ds3 ds4 ->
                                             ds3 }
                                          : o os
                                          -> case propDict2
                                                    @ r
                                                    $dPropProxy2 of wild1 { PropDict ds1 ds2 ds3 ds4 ->
                                             ds1 o (porsProxy @ r $dPropProxy2 os) } }) -}

So porsProxy can be specialised in the calling file -- good. And so can pors -- but the specialised version of its code is silly, just a call to the un-specialised $wpors; still a higher order function, which is bad bad bad.

But I fixed this more 6 years ago. See this Note in WwLib:

Note [Do not unpack class dictionaries]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If we have
   f :: Ord a => [a] -> Int -> a
   {-# INLINABLE f #-}
and we worker/wrapper f, we'll get a worker with an INLINABLE pragma
(see Note [Worker-wrapper for INLINABLE functions] in WorkWrap), which
can still be specialised by the type-class specialiser, something like
   fw :: Ord a => [a] -> Int# -> a

BUT if f is strict in the Ord dictionary, we might unpack it, to get
   fw :: (a->a->Bool) -> [a] -> Int# -> a
and the type-class specialiser can't specialise that.  An example is
Trac #6056.

Moreover, dictionaries can have a lot of fields, so unpacking them can
increase closure sizes.

Conclusion: don't unpack dictionaries.

So why isn't this preventing the w/w for pors? Because the bit that prevents the unpacking is here:

deepSplitProductType_maybe fam_envs ty
  | let (co, ty1) = topNormaliseType_maybe fam_envs ty
                    `orElse` (mkRepReflCo ty, ty)
  , Just (tc, tc_args) <- splitTyConApp_maybe ty1
  , Just con <- isDataProductTyCon_maybe tc
  , not (isClassTyCon tc)  -- See Note [Do not unpack class dictionaries]
  , let arg_tys = dataConInstArgTys con tc_args
        strict_marks = dataConRepStrictness con
  = Just (con, tc_args, zipEqual "dspt" arg_tys strict_marks, co)

Ha! We do topNormaliseType_maybe before checking isClassTyCon. And when the class is a newtype, that topNormaliseType_maybe unwraps it.

It's easy to fix: just move the test earlier. Patch coming -- but maybe after next week when I'm on holiday.

comment:3 Changed 17 months ago by Simon Peyton Jones <simonpj@…>

In 3d38e82/ghc:

Do not unpack class dictionaries with INLINABLE

Matthew Pickering uncovered a bad performance hole in the way
that single-method dictionaries work, described in Trac #14955.

See Note [Do not unpack class dictionaries] in WwLib.

I tried to fix this 6 years ago, but got it slightly wrong.  This patch
fixes it, which makes a dramatic improvement in the test case.

Nofib highlights: not much happening:

  Program           Size    Allocs   Runtime   Elapsed  TotalMem
-----------------------------------------------------------------
      VSM          -0.3%     +2.7%     -7.4%     -7.4%      0.0%
cacheprof          -0.0%     +0.1%     +0.3%     +0.7%      0.0%
  integer          -0.0%     +1.1%     +7.5%     +7.5%      0.0%
      tak          -0.1%     -0.2%     0.024     0.024      0.0%
-----------------------------------------------------------------
      Min          -4.4%     -0.2%     -7.4%     -7.4%     -8.0%
      Max          +0.6%     +2.7%     +7.5%     +7.5%      0.0%
Geom Mean          -0.1%     +0.0%     +0.1%     +0.1%     -0.2%

I investigated VSM.  The patch unpacks class dictionaries a bit more
than before (i.e. does so if there is no INLINABLE pragma). And that
gives better code in VSM (less dictionary selection etc), but one closure
gets one word bigger.

I'll accept these changes in exchange for more robust performance.

Some ghci.debugger output wobbled around (order of bindings
being displayed). I have no idea why; but I accepted the changes.

comment:4 Changed 17 months ago by simonpj

Resolution: fixed
Status: newclosed

Thanks for highlighting this, Matthew.

comment:5 Changed 17 months ago by Simon Peyton Jones <simonpj@…>

In 56bbe1e/ghc:

Add missing stdout file for T14955

Accidentally omitted from Trac #14955 commit.
Note: See TracTickets for help on using tickets.