Opened 18 months ago

Last modified 18 months ago

#14951 patch task

SpecConstr needs two runs when one should suffice

Reported by: nomeata Owned by:
Priority: normal Milestone:
Component: Compiler Version: 8.2.2
Keywords: SpecConstr Cc: sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #14844 Differential Rev(s): Phab:D4519
Wiki Page:

Description (last modified by simonpj)

This is a spin-off of #14844, which is a spin-off of #14068, but applies on its own.

Consider this code:

module T14844Example (topLvl) where

topLvl large = (bar1, bar2, foo)
  where
    foo :: Integer -> (a -> b -> Bool) -> (a,b) -> Bool
    foo 0 _ _ = False
    foo s f t = l s' t
       where
         l 0 t = False
         l 1 t = case t of (x,y) -> f x y
         l n (x,y) = l (n-1) (x,y)
         s' = large s

    bar1 :: Integer -> (a -> b -> Bool) -> a -> b -> Bool
    bar1 s f x y = foo s f (x,y)

    bar2 :: Integer ->  (a -> b -> Bool) -> a -> b -> Bool
    bar2 s f x y = foo (s + 1) f (x,y)

Status quo: l gets specialized, because of the two call patterns

  • l s' t and
  • l (n-1) (x,y)

The second one is interesting *and* its second argument gets scrutinized (the scu_occs field reports ScrutOcc for t). But foo does not get specialized: It does have an interesting call pattern, but scu_occs reports UnkOcc, because foo’s parameters are just passed to t.

When we decide to SpecConstr l, we know that one of the calls to l is of the shape s' t0. This is a boring call, and we do not create a specialization for it. But we create a specialization for l using the the other call pattern. This means we know that it would be beneficial if t0 were a constructor. So can we, at this point, decide to include t0 ↦ ScrutOcc in scu_occs?

First experiments look good, so I am working on this.

Change History (19)

comment:1 Changed 18 months ago by sgraf

That sounds like you want to associate 'argument occurence signatures' with functions, analogous to demand signatures, that transport ArgOcc information to arguments of a call.

I think that would be a great idea, especially if it would work reliably for recursive functions. Although that will probably run into the same limitations as Let Up (usage from body decides over what to specialise for) vs. Let Down (signature needed where the function is in scope).

comment:2 Changed 18 months ago by sgraf

Cc: sgraf added

comment:3 Changed 18 months ago by nomeata

You are thinking one step ahead. But it does not need something analogous to demand signatures: The SpecConstr code already gathers all calls to f and remembers them when it specializes f, so we can add the ArgOcc information there, without having the LetUp vs. LetDown dilemma.

comment:4 Changed 18 months ago by simonpj

Description: modified (diff)

comment:5 Changed 18 months ago by simonpj

So can we, at this point, decide to include t0 ↦ ScrutOcc in scu_occs?

I'm not at all sure. There might be many specialisations of l, with many RULES. Which of them would you like to use when gathering occurrence info foo's argument.

Sebastian's thought is interesting though. Perhaps we want the fact that l's argument is scrutinesed to flow from l to l's call sites, via some kind of SpecConstr signature. It might be as if we'd inlined one "layer" of l at the call site.

I wonder about fixpointing such signatures. Consider

f True x y = case x of (p,q) -> ...
f False x y = f True y x

We'd get a specialisation for x, but (I think) not for y. But one for y would be good!

SpecConstr is jolly compilicated. Splitting it into an analysis followed by exploitation woudl be a good thing. Just like demand analysis.

comment:6 Changed 18 months ago by simonpj

BTW, what is the connection to the title of the ticket? Why does two runs help?

comment:7 Changed 18 months ago by nomeata

See `wip/T14951` for what I have in mind. If fixes the problem with the example code in the ticket.

BTW, what is the connection to the title of the ticket? Why does two runs help?

Ah, right: If you simplify, then l gets inlined, and suddenly the body of foo _does_ scrutinize t and a second run of SpecConstr would specialize l.

Last edited 18 months ago by nomeata (previous) (diff)

comment:8 Changed 18 months ago by sgraf

Indeed we can only add t ↦ ScrutOcc if we know that a matching specialisation will apply. Consider what would happen if we regarded (n-1) as a constructor form (it isn't , of course, but imagine inductive nats), too:

Then we would have [ScrutOcc [UnkOcc], ScrutOcc [UnkOcc, UnkOcc] for ls RHS and would have a more specific specialisation for l (n-1) (x,y). If we bubble out a usage of t ↦ ScrutOcc within foo from the l s' t call, this will attempt to do a specialisation of foo when it hits a call site like in bar1. But now there's no matching specialisation of l anymore: The one which was a candidate before also assumes n-1 as an argument. This can potentially make things much worse. So the whole signature thing seems like not such a good idea after all.

comment:9 Changed 18 months ago by nomeata

This can potentially make things much worse.

What do you mean with worse? We can construct cases where we would create specializations of the outer functions that are not beneficial, but it wound’t make things worse, right?

In the end it's all heuristic. But I’d like to be able to make small steps forwards to unblock loopification rather than wait for a complete rewrite of SpecConstr (as much as I appreciate such a thing, the module is a beast :-))

And in general we try hard to make our transformations as idempotent as possible, so we should do that here as well.

comment:10 Changed 18 months ago by sgraf

What do you mean with worse?

Well, assuming we would specialise foo for foo s f (x,y) without calling a specialised l, then we would just defer constructing the pair. So, you're right: not actually worse at all, but unnecessary code bloat, still.

Come to think of it, have you tried to weave in GHC.Types.SPEC into foos signature instead? That essentially makes SpecConstr forget to find matching ArgOccs and blindly specialises for call sites and is probably what would help here, too. Probably at the expense of even worse code bloat when there are many different call patterns, but there is only one in your specific case.

comment:11 Changed 18 months ago by nomeata

That sounds like it requires user intervention; that’s certainly not an option here, where I am trying to fix regressions due to loopification.

comment:12 Changed 18 months ago by sgraf

You could thread the SPEC value through your RHS. This will require an additional INLINE binding, unless you want to fix up call sites, too. Assuming you loopified foo:

foo :: Integer -> (a -> b -> Bool) -> (a,b) -> Bool
foo = $wfoo SPEC
  where
    $wfoo !_ 0 _ _ = False
    $wfoo !_ f t = l s' t
       where
         l 0 t = False
         l 1 t = case t of (x,y) -> f x y
         l n (x,y) = l (n-1) (x,y)
         s' = large s
{-# INLINE foo #-}

bar1 :: Integer -> (a -> b -> Bool) -> a -> b -> Bool
bar1 s f x y = foo s f (x,y)

bar2 :: Integer ->  (a -> b -> Bool) -> a -> b -> Bool
bar2 s f x y = foo (s + 1) f (x,y)

Provided the simplifier inlines foo before SpecConstr runs, this should make sure that $wfoo will be specialised for its call pattern. This would not need SpecConstr to change at all, I think.

comment:13 Changed 18 months ago by nomeata

If I understand this correctly, then you are proposing that loopification should tell SpecConstr to very aggressively specialize a loopified binding, independent of whether its arguments are actually scrutinzed anywhere? But isn’t that strictly less precise than my proposal, which only specializes foo if there is some indication that it will be beneficial?

comment:14 Changed 18 months ago by sgraf

That's what I was referring to by 'at the expense of even worse code bloat'. The difference is that you might have a better time being less precise for loopified bindings (have more bloat there) than to have potentially little more bloat for every non-recursive binding.

Also depending on how far you are with your changes to SpecConstr, this could be simpler.

comment:15 Changed 18 months ago by nomeata

potentially little more bloat for every non-recursive binding.

It’s not just non-recursive bindings. It’s gonna bloat

  • every binding
  • that has interesting calls
  • that itself calls a local function
  • which itself is being specialized
  • and where the specialization of the inner function match the calls to the outer
  • but matches them only partially (the case where they match completely is the one we are interested in).

which seems pretty narrow to me.

Also depending on how far you are with your changes to SpecConstr, this could be simpler.

They were pretty simple it needs just a bit more cleanup, see wip/T14951. I am basically waiting for perf.haskell.org to report the results.

Last edited 18 months ago by nomeata (previous) (diff)

comment:16 Changed 18 months ago by sjakobi

Summary: SpecContsr needs two runs when one should sufficeSpecConstr needs two runs when one should suffice

comment:17 Changed 18 months ago by nomeata

Differential Rev(s): Phab:D4519
Status: newpatch

I put the code up for review and discussion at Phab:D4519

comment:18 Changed 18 months ago by nomeata

Ok, here is some very motivation data:

If I run SpecConstr twice, with simplification in between (and without the patch in Phab:D4519), then there are significant performance improvements:

Nofib allocations
Benchmark name          previous      change now
nofib/allocs/event     129683224   +   8.44% 140627312   bytes
nofib/allocs/fulsom    243329632   -   7.83% 224287920   bytes
nofib/allocs/mandel2      922640   -  13.89%    794448   bytes
nofib/allocs/minimax     5371584   -   8.73%   4902576   bytes
nofib/allocs/parstof     3038584   -   3.63%   2928248   bytes

Nofib instruction
Benchmark name              previous    change now
nofib/instr/compress2     549006516 -   6.36%    514104497
nofib/instr/fulsom        755145483 -   3.41%    729394470
nofib/instr/ida           261218740 -   3.98%    250820469
nofib/instr/k-nucleotide 2140743692 -   3.97%   2055743369
nofib/instr/minimax         9099200 -   4.38%      8700706

I checked that these are really due to the second SpecConstr (and not due to the extra simplification).

I did not investigate them individually, and unfortunately, Phab:D4519 does _not_ achieve any of these improvements. Is there a reasonably simple way to make SpecConstr more idempotent?

comment:19 Changed 18 months ago by simonpj

SpecConstr is already trying quite hard to be idempotent, but it is manifestly failing.

It would be extremely helpful to distil poster-child examples of when it's not idempotent. I think the example in the Description is such a case, although the Description doesn't really make that clear.

mandel2 might be a good starting point because it has a very small kernel. It's also a surprise that the best instruction count improvements seem unrelated to the allocations! I don't know what to make of that.

We need more insight.

Last edited 18 months ago by simonpj (previous) (diff)
Note: See TracTickets for help on using tickets.