Opened 12 months ago

Last modified 12 months ago

#16040 new bug

Unboxing-Related Performance Issue with Polymorphic Functions

Reported by: _recursion Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.6.2
Keywords: CPRAnalysis Cc: sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by _recursion)

My team has observed a 2x performance degradation in code that makes use of StateT that appears to be related to strictness and unboxing, even when built with -O2. Our code makes heavy use of the state monad, and when GHC fails to optimise this usage, the performance becomes untenably slow.

We've managed to minimise the behaviour to the following reproducer (that ignores state entirely), consisting of two files. It depends on criterion


{-# LANGUAGE BangPatterns #-}

module Lib where

-- A type to take the place of state
data X a = X { runX :: !a }

test1 :: Int -> Int
test1 = \(!i) -> go i where
    go = \(!i) -> if i > 0
        then go $! i - 1
        else i
{-# NOINLINE test1 #-}

test2 :: Int -> Int
test2 = \(!i) -> runX (go i) where
    go = \(!i) -> if i > 0
        then go $! i - 1
        else X i
{-# NOINLINE test2 #-}


{-# LANGUAGE Strict #-}
module Main where

import Lib
import Criterion
import Criterion.Main

main :: IO ()
main = defaultMain
    [ bgroup "main"
        [ bgroup "small"
            [ bench "without state" $ whnf test1 100000000
            , bench "with state"    $ whnf test2 100000000

Run as above, the code takes twice as long to execute test2 as it does test1. However, when the signature for runX is changed to runX :: !Int, the code in test2 exhibits identical performance to test1.

It has been reproduced across multiple linux64 machines, but not tested on any other architecture or operating system.

Please find the full (stack - yes, I know) project as an attachment. You can simply stack run to observe the issue. If you have any further queries please let me know.

Attachments (3)

bug.tar.gz (1.1 KB) - added by _recursion 12 months ago.
test1.log (949 bytes) - added by _recursion 12 months ago.
test1 allocation report
test2.log (949 bytes) - added by _recursion 12 months ago.
test2 allocation report

Download all attachments as: .zip

Change History (14)

Changed 12 months ago by _recursion

Attachment: bug.tar.gz added

comment:1 Changed 12 months ago by simonpj

I have not yet tried with criterion, but I tried with

main = print (test1 10000000)

or test2, and ran with +RTS -s. I got the same amount of allocation either way. Do you see that too? If so, it's not a heap-allocation issue.

comment:2 Changed 12 months ago by _recursion

Good point! I forgot to check that. Allocations are very close to each other but not identical. Both of the outputs are attached.

They use test* 10000000.

Last edited 12 months ago by _recursion (previous) (diff)

Changed 12 months ago by _recursion

Attachment: test1.log added

test1 allocation report

Changed 12 months ago by _recursion

Attachment: test2.log added

test2 allocation report

comment:3 Changed 12 months ago by mpickering

I think this would be fixed by nested cpr, see #1600.

comment:4 Changed 12 months ago by _recursion

A cursory read looks like it would, yes!

Nevertheless, this is pretty concerning. I now have no idea why most of our codebase isn't getting hit by this issue, and suddenly I've run into a place where it occurs. I'm a little afraid for the performance stability of our code across GHC upgrades now.

Is it something that could only be fixed by nested CPR or could a band-aid fix be put in place, do you think?

comment:5 Changed 12 months ago by simonpj

Interesting. For test1 (the fast one) we get

Rec {
$wgo_r2xK :: GHC.Prim.Int# -> GHC.Prim.Int#
$wgo_r2xK = \ (ww_s2w4 :: GHC.Prim.Int#) ->
            case GHC.Prim.># ww_s2w4 0# of {
              __DEFAULT -> ww_s2w4;
              1# -> $wgo_r2xK (GHC.Prim.-# ww_s2w4 1#) }
end Rec }

T16040.$wtest1 :: GHC.Prim.Int# -> GHC.Prim.Int#
T16040.$wtest1 = \ (ww_s2wd :: GHC.Prim.Int#) -> $wgo_r2xK ww_s2wd

test1 :: Int -> Int
test1 = \ (w_s2wa :: Int) ->
        case w_s2wa of { GHC.Types.I# ww1_s2wd ->
        case T16040.$wtest1 ww1_s2wd of ww2_s2wh { __DEFAULT ->
        GHC.Types.I# ww2_s2wh  } }

Notice that loop $wgo does not allocation at all.

For test2 we get

Rec {
$wgo1_r2xL :: GHC.Prim.Int# -> (# Int #)
$wgo1_r2xL = \ (ww_s2wm :: GHC.Prim.Int#) ->
             case GHC.Prim.># ww_s2wm 0# of {
               __DEFAULT -> (# GHC.Types.I# ww_s2wm #);
               1#        -> $wgo1_r2xL (GHC.Prim.-# ww_s2wm 1#) }
end Rec }

T16040.$wtest2 :: GHC.Prim.Int# -> Int
T16040.$wtest2 = \ (ww_s2wv :: GHC.Prim.Int#) ->
                 case $wgo1_r2xL ww_s2wv of { (# ww2_s2wy #) -> ww2_s2wy }

test2 :: Int -> Int
test2 = \ (w_s2ws :: Int) ->
        case w_s2ws of { GHC.Types.I# ww1_s2wv -> T16040.$wtest2 ww1_s2wv }

Here the $wgo1 loop does no allocation on its hot path, but does allocate an I# ww box as it returns.

I think that $wgo1 is doing a heap-check on every iteration, at the start of the function. It would be better to do the check only on the last iteration, in the DEFAULT branch. I bet these redundant heap checks are what is taking the time.

We have long-standing tickets about this; see the summary in comment:2 of Trac #14791.

I would love someone to work on this. To catch cases like this should not be very hard. By "like this" I mean

  • A primitive case
  • Which has no allocation "upstream" (i.e. before it)
  • And at least one alternative does no allocation.

Matthew is right that Nested CPR would also fix this. The trouble is that the recursive go from test2 returns an X Int, whose representation has two levels of box; eg X (I# 3#). The current CPR transform can't optimise that. Nested-CPR can, and would make neither branch of $wgo2 allocate

Tantalizingly, we have a nested-cpr patch nearly ready to go. (See Trac #1600.) But it needs someone to pay it some sustained attention.

I don't see an obvious band-aid. This is a long-standing issue, not a regression.

comment:6 Changed 12 months ago by _recursion

Thank you for the detailed explanation, Simon! I took a look at the core but missed seeing the allocation previously.

Regarding fixing this or CPR, would nested CPR fix all cases 'like this', as you put it, or even if the nested CPR patch lands would there still be a need for functionality to catch cases like this?

The reason I ask is that I'd be quite willing to take a look at either over the holiday in some free time, assuming I can find some guidance, and I'm wondering which would be best to attack.

Regarding the nested-cpr patch, what kind of attention is needed? Do you think that, with said attention, it could be landed for 8.8?

Last edited 12 months ago by _recursion (previous) (diff)

comment:7 Changed 12 months ago by simonpj

Regarding fixing this or CPR, would nested CPR fix all cases 'like this', as you put it, or even if the nested CPR patch lands would there still be a need for functionality to catch cases like this?

I think we want both nested CPR and better placement of heap checks. They do different things -- it just so happens that in this case either would solve it.

Great that you can work on it. I'd suggest the heap-check placement one, because it's a well-contained problem, I think. Nested CPR may have a longer on-ramp.

Version 0, edited 12 months ago by simonpj (next)

comment:8 Changed 12 months ago by _recursion

“You think” - sounds promising!

But yes if both are going to be useful then I think I’ll try my hand at the heap-check placemement one when I next find some time! It sounds like a decently well-contained problem, and I’ve been meaning to contribute more than just filing tickets for a while.

comment:9 Changed 12 months ago by _recursion

Description: modified (diff)

Unfinished sentence in the report because I am sometimes unobservant.

comment:10 Changed 12 months ago by simonpj

Keywords: CPRAnalysis added

comment:11 Changed 12 months ago by sgraf

Cc: sgraf added
Note: See TracTickets for help on using tickets.