Opened 4 years ago

Closed 4 years ago

Last modified 15 months ago

#11284 closed bug (duplicate)

Lambda-lifting fails in simple Text example

Reported by: bgamari Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.10.3
Keywords: Cc: sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #5945, #11318 Differential Rev(s):
Wiki Page:

Description (last modified by sgraf)

Consider the example (which uses Text; I'm working on finding a more minimal example),

{-# LANGUAGE BangPatterns #-}
module T11284 where

import Data.Char (isSpace)
import Data.List (foldl')
import GHC.Exts (build)
import qualified  Data.Text as T
import qualified  Data.Text.Array as A

longestWord :: T.Text -> Int
longestWord t = foldl' max 0 $ map T.length $ fusedWords t

fusedWords :: T.Text -> [T.Text]
fusedWords t0 = build $ \cons nil ->
  let go !t
        | T.null t  = nil
        | otherwise = let (w, rest) = T.span (not . isSpace) t
                      in cons w (go $ T.dropWhile isSpace rest)
  in go t0

-- For reference
data Text = Text
    {-# UNPACK #-} !A.Array          -- payload (Word16 elements)
    {-# UNPACK #-} !Int              -- offset (units of Word16, not Char)
    {-# UNPACK #-} !Int              -- length (units of Word16, not Char)

longestWord here produces the simplified Core,

Ticket.$wgo :: GHC.Prim.ByteArray# -> GHC.Prim.Int# -> GHC.Prim.Int# -> [T.Text]
Ticket.$wgo = ...

-- > $wgo1 xs n = foldl' (\a b -> max a $ T.length b) n xs
Ticket.$wgo1 :: [T.Text] -> GHC.Prim.Int# -> GHC.Prim.Int#
Ticket.$wgo1 =
  \ (w_s4GJ :: [T.Text]) (ww_s4GN :: GHC.Prim.Int#) ->
    case w_s4GJ of _ {
      [] -> ww_s4GN;
      : y_a4vC ys_a4vD ->
        case y_a4vC
        of _ { Data.Text.Internal.Text dt_a4jP dt1_a4jQ dt2_a4jR ->
        let {
          a_a4jO :: GHC.Prim.Int#
          a_a4jO = GHC.Prim.+# dt1_a4jQ dt2_a4jR } in
        letrec {
          -- For the love of all that is good, why must you allocate?
          --
          -- This loop is essentially `T.length`, the first argument being
          -- the length accumulator and the second being an index into the
          -- ByteArray#
          $wloop_length_s4GI :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
          $wloop_length_s4GI =
            \ (ww1_s4Gz :: GHC.Prim.Int#) (ww2_s4GD :: GHC.Prim.Int#) ->
              -- Have we reached the end of the Text?
              case GHC.Prim.tagToEnum# @ Bool (GHC.Prim.>=# ww2_s4GD a_a4jO)
              of _ {
                False -> {
                  ...
                  -- in this body there are few cases analyses which
                  -- classify the code-points we encounter. The branches
                  -- are recursive calls of the form
                  $wloop_length_s4GI (GHC.Prim.+# ww1_s4Gz 1) (GHC.Prim.+# ww2_s4GD 1)
                  ...
                True -> ww1_s4Gz
              }; } in
        case $wloop_length_s4GI 0 dt1_a4jQ of ww1_s4GH { __DEFAULT ->
        case GHC.Prim.tagToEnum# @ Bool (GHC.Prim.<=# ww_s4GN ww1_s4GH)
        of _ {
          False -> Ticket.$wgo1 ys_a4vD ww_s4GN;
          True -> Ticket.$wgo1 ys_a4vD ww1_s4GH
        }
        }
        }
    }

longestWord :: T.Text -> Int
longestWord =
  \ (w_s4GT :: T.Text) ->
    case w_s4GT
    of _ { Data.Text.Internal.Text ww1_s4GW ww2_s4GX ww3_s4GY ->
    case Ticket.$wgo1 (Ticket.$wgo ww1_s4GW ww2_s4GX ww3_s4GY) 0
    of ww4_s4H2 { __DEFAULT ->
    GHC.Types.I# ww4_s4H2
    }
    }

Notice $wloop_length_s4GI: It should be a nice tight loop counting Unicode characters in the array dt_a4jP until it arrives at its end. However, GHC fails to lambda-lift this closure, thereby turning it into an allocating operation! Oh no!

Change History (21)

comment:1 Changed 4 years ago by bgamari

Description: modified (diff)
Type of failure: None/UnknownRuntime performance bug

comment:2 Changed 4 years ago by bgamari

Description: modified (diff)

comment:3 Changed 4 years ago by bgamari

Description: modified (diff)

comment:4 Changed 4 years ago by bgamari

Description: modified (diff)

comment:5 Changed 4 years ago by bgamari

Description: modified (diff)

comment:7 Changed 4 years ago by bgamari

Description: modified (diff)

comment:8 Changed 4 years ago by bgamari

Description: modified (diff)

comment:9 Changed 4 years ago by bgamari

Description: modified (diff)

comment:10 Changed 4 years ago by bgamari

Description: modified (diff)

comment:11 Changed 4 years ago by bgamari

For what it's worth, this binding is caught by the AnnRec pattern of SetLevels.lvlBinds by the not (profitableFloat ...) guard. dest_lvl = <1,4> and bind_lvl = <1,5>. -ffloat-all-lams and -ffloat-lam-args=10 here doesn't seem to make any difference on the produced Core; while the binding gets floated out as one would expect, later simplifier phases later push it back in. Even after float-out, however, we still have an inner binding,

poly_$wloop_length_s7fV
  :: GHC.Prim.ByteArray# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
poly_$wloop_length_s7fV =
  \ (dt_a6OV :: GHC.Prim.ByteArray#)
    (a_a6OU :: GHC.Prim.Int#)
    (ww_s7bR :: GHC.Prim.Int#)
    (ww_s7bV :: GHC.Prim.Int#) ->
    (letrec {
       $wloop_length_s7fT :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
       $wloop_length_s7fT =
         \ (ww_X7c8 :: GHC.Prim.Int#) (ww_X7cd :: GHC.Prim.Int#) ->
           case GHC.Prim.tagToEnum# @ Bool (GHC.Prim.>=# ww_X7cd a_a6OU)
           of wild1_a6P4 {
             False -> ...
             True  -> ww_X7c8
           }; } in
     $wloop_length_s7fT)
      ww_s7bR ww_s7bV

Which later gets inlined back into $wgo.

Looking at this Core, I suppose it's plausible that there really are too many free variables in this function to beneficially lambda-lift. It would be nice to have a second opinion here.

Version 0, edited 4 years ago by bgamari (next)

comment:12 Changed 4 years ago by bgamari

Description: modified (diff)

comment:13 Changed 4 years ago by bgamari

This looks very similar to #5945.

comment:14 Changed 4 years ago by bgamari

It turns out that even just Data.Text.length along allocates. See #11318.

Edit: I had confused myself.

Last edited 4 years ago by bgamari (previous) (diff)

comment:15 Changed 4 years ago by simonpj

See #9476 (late lambda-lifting) which is all about this.

Last edited 4 years ago by simonpj (previous) (diff)

comment:16 Changed 4 years ago by bgamari

Resolution: duplicate
Status: newclosed

Closing as a duplicate given that this is a consequence of #9476.

comment:17 Changed 16 months ago by simonpj

Keywords: LateLamLift added

comment:18 Changed 16 months ago by sgraf

Description: modified (diff)

comment:19 Changed 16 months ago by sgraf

Keywords: LateLamLift removed

There's nothing to lift here anymore, all occuring functions turned into join points and the only actual let bindings in STG output are thunks, which can't be lifted.

Here's an example run on /usr/share/dict/words (with an appropriate main):

$ ./Main +RTS -s < /usr/share/dict/words
23
      33,832,464 bytes allocated in the heap
          24,088 bytes copied during GC
       2,011,952 bytes maximum residency (2 sample(s))
         163,024 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        28 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0001s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.012s  (  0.012s elapsed)
  GC      time    0.001s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.013s  (  0.013s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    2,822,899,212 bytes per MUT second

  Productivity  95.0% of total user, 95.5% of total elapsed

Strange enough, the Data.Text.Lazy variant of this is much slower and allocates much more, when I actually thought it would cope better with getContents. Whatever, this doesn't seem relevant to LateLamLift anymore.

comment:20 Changed 15 months ago by simonpj

I suppose we didn't have join points 3 yrs ago. Suppose $wloop wasn't a join point; would LLF lift it? I expect so, and I suppose we could find out by switching off join points. (We don't have a way to do that right now, alas.)

comment:21 Changed 15 months ago by sgraf

Cc: sgraf added

Sorry, I didn't pick your reply up in the ticket noise.

I think we would pick it up. This is the STG binding for $wloop_length: https://gist.github.com/sgraf812/fcfda9e55004d19881314e31fdea4423

As this turned into a join point anyway, no closure mentions $wloop, so there will be no positive closure growth (e.g. penalties in allocations). $wloop also has only 3 free variables, which adds with the 2 arguments to a fortunate 5, so there would be enough hardware registers for the lift.

I don't see any reason we wouldn't have lifted it.

Note: See TracTickets for help on using tickets.