Opened 7 years ago

Last modified 4 years ago

#6040 new bug

Adding a type signature changes heap allocation into stack allocation without changing the actual type

Reported by: tibbe Owned by: simonpj
Priority: normal Milestone:
Component: Compiler Version: 7.4.1
Keywords: Cc: fox@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

According to Milan Straka, changing

insert :: Ord k => k -> a -> Map k a -> Map k a
insert = go
  where
    STRICT_1_OF_3(go)
    go kx x Tip = singleton kx x
    go kx x (Bin sz ky y l r) = ...

to

insert :: Ord k => k -> a -> Map k a -> Map k a
insert = go
  where
    go :: Ord k => k -> a -> Map k a -> Map k a
    STRICT_1_OF_3(go)
    go kx x Tip = singleton kx x
    go kx x (Bin sz ky y l r) = ...

changes how GHC allocates the argument, from heap to stack. Here's the relevant commit: https://github.com/haskell/containers/commit/32d84ba5eb82f34dbb8a8fabf07077d848cdb408

It includes this comment:

-- [Note: Type of local 'go' function]
-- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-- If the local 'go' function uses an Ord class, it must be given a type
-- which mentions this Ord class. Otherwise it is not passed as an argument and
-- it is instead heap-allocated at the entry of the outer method.

I find this quite alarming. The type of k above is already Ord k, so the extra type signature shouldn't make a difference in my opinion.

Change History (9)

comment:1 Changed 7 years ago by simonpj

difficulty: Unknown

Can you give a repo case that shows a performance effect?

I'm guessing, but I think that the difference is something like this:

f1 x = g 100                   f2 x = g x 0
  where                          where
    g 0 = x                        g x 0 = x
    g n = g (n-1)                  g x n = g x (n-1)
  • In f1 we'll heap-allocate a function closure capturing the free variable x, but the recursive calls have just one argument (plus one for the function closure itself).
  • In f2 we'll pass x as an argument. Indeed g will be floated out to be a top-level function. (And in the example you give I don't undersatnd why go and insert are separate functions, but that's a stylistic thing.)

If there were 100 free varaibles instead of 1, it might well be a good plan to use f1, to save passing 100 arguments in each call. But since there is only one it'd be better to turn the free variable into an argument, by lambda lifting. This is the reverse of the "static argument transformation".

The transformation should be done right at the end, because in general turning a free varaible into an argument is a bad idea (loss of information).

I've been meaning to do this for some time; if you have data to show it's important I could up the priority!

comment:2 Changed 7 years ago by igloo

Milestone: 7.8.1
Owner: set to simonpj

comment:3 Changed 7 years ago by tibbe

Cc: fox@… added

I can't reproduce this with a smaller example so I'm deferring this to Milan Straka, who first discovered the issue.

comment:4 Changed 7 years ago by milan

I was asked by Johan to demonstrate the issue.

I would like to note that I am not sure it is even a bug -- I understand what is going on and I am not sure what a better behaviour would be -- but maybe the transformation Simon suggests is a good idea.

Consider two methods delete1 and delete2 which delete all occurences of a given key in the list:

module Test ( delete1, delete2 ) where

delete1 :: Eq k => k -> [k] -> [k]
delete1 = go
  where go _ [] = []
        go k (x:xs) | k == x = go k xs
                    | otherwise = x : go k xs

delete2 :: Eq k => k -> [k] -> [k]
delete2 = go
  where go :: Eq k => k -> [k] -> [k]
        go _ [] = []
        go k (x:xs) | k == x = go k xs
                    | otherwise = x : go k xs

The difference is that in the first case, go is a closure which captures the Eq dictionary of the outer function. In the second case, go is being given the Eq dictionary as a parameter and therefore floats out of delete2. STG illustrates this nicely:

==================== STG syntax: ====================
Test.delete1 =
    \r [$dEq_sql eta_sqs eta1_sqt]
        let {
          go_sqq =
              sat-only \r [ds_sqm ds1_sqh]
                  case ds1_sqh of wild_sqO {
                    [] -> [] [];
                    : x_sqn xs_sqp ->
                        case GHC.Classes.== $dEq_sql ds_sqm x_sqn of wild1_sqP {
                          GHC.Types.False ->
                              let { sat_sqN = \u [] go_sqq ds_sqm xs_sqp;
                              } in  : [x_sqn sat_sqN];
                          GHC.Types.True -> go_sqq ds_sqm xs_sqp;
                        };
                  };
        } in  go_sqq eta_sqs eta1_sqt;
SRT(Test.delete1): []
Test.delete2_go =
    \r [$dEq_sqB ds_sqC ds1_sqx]
        case ds1_sqx of wild_sqQ {
          [] -> [] [];
          : x_sqD xs_sqF ->
              case GHC.Classes.== $dEq_sqB ds_sqC x_sqD of wild1_sqS {
                GHC.Types.False ->
                    let { sat_sqR = \u [] Test.delete2_go $dEq_sqB ds_sqC xs_sqF;
                    } in  : [x_sqD sat_sqR];
                GHC.Types.True -> Test.delete2_go $dEq_sqB ds_sqC xs_sqF;
              };
        };
SRT(Test.delete2_go): []
Test.delete2 =
    \r [$dEq_sqK eta_sqL eta1_sqM]
        Test.delete2_go $dEq_sqK eta_sqL eta1_sqM;
SRT(Test.delete2): []

The result is that delete1 has to heap-allocate the dictionary while delete2 does not.

 Test_delete1_info()
         { update_frame: <none>
           label: Test_delete1_info
           rep:HeapRep static { Fun {arity: 3 fun_type: ArgSpec 20} }
         }
     crz:
         Hp = Hp + 8;
         if (Hp > HpLim) goto crC;
         I32[Hp - 4] = sqq_info;
         I32[Hp + 0] = I32[Sp + 0];
         R1 = Hp - 2;
         Sp = Sp + 4;
         jump sqq_info ();
 ...

The performance difference is very minor -- delete1 needs one more two-word heap allocation. The time difference will not probably be measurable. I notised this only because I monitor heap allocations in containers code as they cause ghc perf tests which monitor heap allocation to fail.

BTW, we use identical insert and go functions in containers because the outer is marked INLINE in pre-7.0 GHC.

comment:5 Changed 7 years ago by simonpj

Right. This is an example of my first comment. If we do a final lambda-floating phase at the end the two would become identical. Thanks for confirming.

comment:6 Changed 5 years ago by thoughtpolice

Milestone: 7.8.37.10.1

Moving to 7.10.1.

comment:7 Changed 5 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:8 Changed 4 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:9 Changed 4 years ago by thomie

Milestone: 8.0.1
Note: See TracTickets for help on using tickets.