Opened 2 years ago

Closed 2 years ago

Last modified 16 months ago

#13623 closed bug (fixed)

join points produce bad code for stream fusion

Reported by: choenerzs Owned by:
Priority: highest Milestone: 8.2.1
Component: Compiler Version: 8.2.1-rc1
Keywords: JoinPoints Cc: lukemauer, asr, george, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case: perf/should_runt/T13623
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by choenerzs)

Below, I am generating to stream fusion streams xs and ys. Both parameterized on k l. The two streams are then concatenated. Finally I do a strict left fold.

This example needs the 'vector' package but nothing else.

module Test where

import Data.Vector.Fusion.Stream.Monadic as S


foo :: Int -> Int -> IO Int
foo = \i j -> S.foldl' (+) 0 $ xs i j S.++ ys i j
  where xs k l = S.enumFromStepN k l 2
        ys k l = S.enumFromStepN k l 3
        {-# Inline xs #-}
        {-# Inline ys #-}
{-# Inline foo #-}

With ghc-8.0.1 I get nice core:

$wfoo_r1Ai
$wfoo_r1Ai =
  \ ww_s1q5 ww1_s1q9 w_s1q2 ->
    letrec {
      $s$wfoldlM'_loop_s1xc
      $s$wfoldlM'_loop_s1xc =
        \ sc_s1x7 sc1_s1x5 sc2_s1x6 sc3_s1x4 ->
          case tagToEnum# (># sc2_s1x6 0#) of _ {
            False -> (# sc_s1x7, I# sc3_s1x4 #);
            True ->
              $s$wfoldlM'_loop_s1xc
                sc_s1x7
                (+# sc1_s1x5 ww1_s1q9)
                (-# sc2_s1x6 1#)
                (+# sc3_s1x4 sc1_s1x5)
          }; } in
    letrec {
      $s$wfoldlM'_loop1_s1x3
      $s$wfoldlM'_loop1_s1x3 =
        \ sc_s1x2 sc1_s1x0 sc2_s1x1 sc3_s1wZ ->
          case tagToEnum# (># sc2_s1x1 0#) of _ {
            False -> $s$wfoldlM'_loop_s1xc sc_s1x2 ww_s1q5 3# sc3_s1wZ;
            True ->
              $s$wfoldlM'_loop1_s1x3
                sc_s1x2
                (+# sc1_s1x0 ww1_s1q9)
                (-# sc2_s1x1 1#)
                (+# sc3_s1wZ sc1_s1x0)
          }; } in
    $s$wfoldlM'_loop1_s1x3 w_s1q2 ww_s1q5 2# 0#

Now the same with ghc-8.2-rc1. Here, Stream.++ function is not fully optimized away (Left and Right constructors!). Instead we have a join point that executes either of the two parts (xs or ys) based on a case w2_s1U2 of {Left -> ; Right ->}.

$wfoo_r23R
$wfoo_r23R
  = \ ww_s1Ue ww1_s1Ui w_s1Ub ->
      let {
        x1_a1tj
        x1_a1tj = I# ww_s1Ue } in
      let {
        tb_a1wC
        tb_a1wC = (x1_a1tj, lvl1_r23Q) } in
      let {
        lvl2_s1Yh
        lvl2_s1Yh = Right tb_a1wC } in
      joinrec {
        $wfoldlM'_loop_s1U8
        $wfoldlM'_loop_s1U8 w1_s1U0 ww2_s1U6 w2_s1U2 w3_s1U3
          = case w1_s1U0 of { __DEFAULT ->
            case w2_s1U2 of {
              Left sa_a1yP ->
                case sa_a1yP of { (w4_a1zr, m1_a1zs) ->
                case m1_a1zs of { I# x2_a1zw ->
                case tagToEnum# (># x2_a1zw 0#) of {
                  False -> jump $wfoldlM'_loop_s1U8 SPEC ww2_s1U6 lvl2_s1Yh w3_s1U3;
                  True ->
                    case w4_a1zr of { I# y_a1xT ->
                    jump $wfoldlM'_loop_s1U8
                      SPEC
                      (+# ww2_s1U6 y_a1xT)
                      (Left (I# (+# y_a1xT ww1_s1Ui), I# (-# x2_a1zw 1#)))
                      w3_s1U3
                    }
                }
                }
                };
              Right sb_a1z3 ->
                case sb_a1z3 of { (w4_a1zr, m1_a1zs) ->
                case m1_a1zs of { I# x2_a1zw ->
                case tagToEnum# (># x2_a1zw 0#) of {
                  False -> (# w3_s1U3, I# ww2_s1U6 #);
                  True ->
                    case w4_a1zr of { I# y_a1xT ->
                    jump $wfoldlM'_loop_s1U8
                      SPEC
                      (+# ww2_s1U6 y_a1xT)
                      (Right (I# (+# y_a1xT ww1_s1Ui), I# (-# x2_a1zw 1#)))
                      w3_s1U3
                    }
                }
                }
                }
            }
            }; } in
      jump $wfoldlM'_loop_s1U8 SPEC 0# (Left (x1_a1tj, lvl_r23P)) w_s1Ub

For my stream-fusion heavy code, this yields a slowdown of approximately x4 (10 seconds with ghc-8.2-rc1, 2.5 seconds with ghc-8.0.1).

===

ghc-options:

-O2 -ddump-to-file -ddump-simpl -dsuppress-all -dshow-passes

Change History (15)

comment:1 Changed 2 years ago by bgamari

Cc: lukemauer added
Description: modified (diff)
Milestone: 8.2.1
Priority: normalhigh

comment:2 Changed 2 years ago by bgamari

Keywords: JoinPoints added

comment:3 Changed 2 years ago by choenerzs

Description: modified (diff)

Added the ghc-options I used in the (trivial) cabal file.

Last edited 2 years ago by choenerzs (previous) (diff)

comment:4 Changed 2 years ago by choenerzs

Pure conjecture: $wfoldlM'_loop_s1U8 takes SPEC and should probably then not be a join point. On the other hand, the vector people probably need join points to happen if they want to remove Skip from from the Step data type.

comment:5 Changed 2 years ago by RyanGlScott

For the sake of convenience, here's a version which brings in the relevant code from vector to avoid dependencies:

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE CPP #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE RankNTypes #-}
module Test where

import GHC.Types (SPEC(..))

foo :: Int -> Int -> IO Int
foo = \i j -> sfoldl' (+) 0 $ xs i j +++ ys i j
  where xs k l = senumFromStepN k l 2
        ys k l = senumFromStepN k l 3
        {-# Inline xs #-}
        {-# Inline ys #-}
{-# Inline foo #-}

-------------------------------------------------------------------------------
-- vector junk
-------------------------------------------------------------------------------

#define PHASE_FUSED [1]
#define PHASE_INNER [0]

#define INLINE_FUSED INLINE PHASE_FUSED
#define INLINE_INNER INLINE PHASE_INNER

data Stream m a = forall s. Stream (s -> m (Step s a)) s

data Step s a where
  Yield :: a -> s -> Step s a
  Skip  :: s -> Step s a
  Done  :: Step s a

senumFromStepN :: (Num a, Monad m) => a -> a -> Int -> Stream m a
{-# INLINE_FUSED senumFromStepN #-}
senumFromStepN x y n = x `seq` y `seq` n `seq` Stream step (x,n)
  where
    {-# INLINE_INNER step #-}
    step (w,m) | m > 0     = return $ Yield w (w+y,m-1)
               | otherwise = return $ Done

sfoldl' :: Monad m => (a -> b -> a) -> a -> Stream m b -> m a
{-# INLINE sfoldl' #-}
sfoldl' f = sfoldlM' (\a b -> return (f a b))

sfoldlM' :: Monad m => (a -> b -> m a) -> a -> Stream m b -> m a
{-# INLINE_FUSED sfoldlM' #-}
sfoldlM' m w (Stream step t) = foldlM'_loop SPEC w t
  where
    foldlM'_loop !_ z s
      = z `seq`
        do
          r <- step s
          case r of
            Yield x s' -> do { z' <- m z x; foldlM'_loop SPEC z' s' }
            Skip    s' -> foldlM'_loop SPEC z s'
            Done       -> return z

infixr 5 +++
(+++) :: Monad m => Stream m a -> Stream m a -> Stream m a
{-# INLINE_FUSED (+++) #-}
Stream stepa ta +++ Stream stepb tb = Stream step (Left ta)
  where
    {-# INLINE_INNER step #-}
    step (Left  sa) = do
                        r <- stepa sa
                        case r of
                          Yield x sa' -> return $ Yield x (Left  sa')
                          Skip    sa' -> return $ Skip    (Left  sa')
                          Done        -> return $ Skip    (Right tb)
    step (Right sb) = do
                        r <- stepb sb
                        case r of
                          Yield x sb' -> return $ Yield x (Right sb')
                          Skip    sb' -> return $ Skip    (Right sb')
                          Done        -> return $ Done

comment:6 Changed 2 years ago by asr

Cc: asr added

comment:7 Changed 2 years ago by George

Cc: george added

comment:8 Changed 2 years ago by michalt

Cc: michalt added

comment:9 Changed 2 years ago by bgamari

Priority: highhighest

I think we should regard this as highest as it breaks existing stream fusion users.

comment:10 Changed 2 years ago by Simon Peyton Jones <simonpj@…>

In 9e47dc45/ghc:

Fix loss-of-SpecConstr bug

This bug, reported in Trac #13623 has been present since

  commit b8b3e30a6eedf9f213b8a718573c4827cfa230ba
  Author: Edward Z. Yang <ezyang@cs.stanford.edu>
  Date:   Fri Jun 24 11:03:47 2016 -0700

      Axe RecFlag on TyCons.

SpecConstr tries not to specialise indefinitely, and had a
limit (see Note [Limit recursive specialisation]) that made
use of info about whether or not a data constructor was
"recursive".  This info vanished in the above commit, making
the limit fire much more often -- and indeed it fired in this
test case, in a situation where specialisation is /highly/
desirable.

I refactored the test, to look instead at the number of
iterations of the loop of "and now specialise calls that
arise from the specialisation".  Actually less code, and
more robust.

I also added record field names to a couple of constructors,
and renamed RuleInfo to SpecInfo.

comment:11 Changed 2 years ago by simonpj

Status: newmerge
Test Case: perf/should_runt/T13623

Nice catch! This has been wrong for eight months.

Now fixed.

Simon

comment:12 Changed 2 years ago by bgamari

Resolution: fixed
Status: mergeclosed

comment:13 Changed 2 years ago by choenerzs

With ghc 8.2.-rc2 my original program is now fast again. However, compared to ghc-8.0.1 it is now necessary to explicitly give -fspec-constr-count=100 as otherwise I end up with insufficient specializations.

(The 100 is rather arbitrary, but the default is definitely not enough)

comment:14 Changed 2 years ago by simonpj

Yes, it's hard for GHC to predict what "sufficient specialisation" is, and its heuristics are a bit ad hoc. Improvements there would be welcome.

I'm not quite sure what changed since 8.0, but it's a squishy area so I'm not surprised that something has. Seeing where and why GHC 8.2 stopped sort of doing the full job, and what heuristic might help it continue, would be a useful piece of work.

comment:15 Changed 16 months ago by Ben Gamari <ben@…>

In 1126e69/ghc:

testsuite: Fix overflow in T13623 on 32-bit machines

We simply truncate the result to 32-bits to ensure that the test passed
under both environments.

Test Plan: Validate on 32-bit

Subscribers: thomie, carter

GHC Trac Issues: #13623

Differential Revision: https://phabricator.haskell.org/D4615
Note: See TracTickets for help on using tickets.