Opened 4 years ago

Last modified 4 years ago

#10606 new feature request

avoid redundant stores to the stack when examining already-tagged data

Reported by: rwbarton Owned by:
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.11
Keywords: Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

GHC compiles a function that performs case analysis on a value of an ADT like

bool :: a -> a -> Bool -> a
bool f t b = case b of
  False -> f
  True  -> t

to Cmm of the form

     {offset
       cwV:
           if ((Sp + -24) < SpLim) goto cwW; else goto cwX;
       cwW:
           R4 = R4;
           R3 = R3;
           R2 = R2;
           R1 = Bool.bool_closure;
           call (stg_gc_fun)(R4, R3, R2, R1) args: 8, res: 0, upd: 8;
       cwX:
           I64[Sp - 24] = cwL;     -- (*)
           R1 = R4;
           P64[Sp - 16] = R2;      -- (†1)
           P64[Sp - 8] = R3;       -- (†2)
           Sp = Sp - 24;           -- (‡)
           if (R1 & 7 != 0) goto cwL; else goto cwM;
       cwM:
           call (I64[R1])(R1) returns to cwL, args: 8, res: 8, upd: 8;
       cwL:
           if (R1 & 7 >= 2) goto cwT; else goto cwU;
       cwT:
           R1 = P64[Sp + 16];
           Sp = Sp + 24;
           call stg_ap_0_fast(R1) args: 8, res: 0, upd: 8;
       cwU:
           R1 = P64[Sp + 8];
           Sp = Sp + 24;
           call stg_ap_0_fast(R1) args: 8, res: 0, upd: 8;
     }

Statement (*) stores a return address for the evaluation of b to return to, and statements (†1), (†2) save local variables that are live in case alternatives, since they cannot be held in registers across the evaluation of b. But in the event that b is already evaluated and represented by a tagged pointer, all these stores are unnecessary: the return address written by (*) is simply dead, and the values saved in (†1), (†2) are still available in whatever locations they were copied to the stack from.

In many cases the data we examine is mostly tagged, and while the active part of the stack is likely to be in L1 cache, the cost of these stores and reads is probably still positive (barring secondary effects from changes to pipelining, branch prediction, and so on).

In this case we could certainly move the return address store (*) into block cwM, and possibly move the local variable stores (†1), (†2) into cwM as well, though it's then not clear to me how to recover the values in the alternatives (does Cmm have something like phi nodes?) I don't propose to move the statement (‡), as arithmetic on registers is essentially free anyways.

I tried implementing the part of this pertaining to the return address (*) and ran into two complications.

  • For some reason, when I moved the return address store (*) into the "data is not tagged" branch in the Stg->Cmm translation, this also resulted in both the local variable stores (†1), (†2) and the update to Sp (‡) being sunk into both branches of the "is the data tagged" conditional at some point in the Cmm optimization pipeline. This was useless since they couldn't be pushed further past the branch on the returned tag value, so the result was enlarged code size that outweighed the savings of avoiding a single store. I didn't investigate exactly why this sinking was dependent on the location of the store (*), but this should be fixable.
  • There may be heap checks in the alternatives. In that case, the code generator currently cleverly reuses the stack frame and info table set up for the evaluation of b in the heap failure branches. If we move some of the stores (*), (†1), (†2) into the evaluation branch cwM, then we either have to duplicate them in heap failure branches, or set up a new stack frame and info table, or do some other clever thing. Or in the worst case, only do this optimization when performing the heap check before the case (which may then become slightly more attractive).

I'm attaching the current version of my patch mainly for my own future reference; it seems to produce correct, but larger and marginally slower code, I believe for the reasons described above.

Attachments (1)

0001-XXX-Sink-writing-the-return-address-into-the-branch-.patch (5.3 KB) - added by rwbarton 4 years ago.

Download all attachments as: .zip

Change History (4)

comment:1 Changed 4 years ago by rwbarton

Type of failure: None/UnknownRuntime performance bug

comment:2 Changed 4 years ago by simonpj

Moreover, the code for each alternative has to work regardless of whether we arrive at it by doing an eval-and-return, or just jump to it for the fully-tagged case. So the code for the alternative needs to know where t,f are going to be.

You say that you can use them "wherever they are", and that may be true if they are in local variables (= registers). But then the return-from-eval code will need to re-load them from the stack into the agreed registers, before going to the shared code for the alternative. And that could be bad if the first thing the alternative does is to save them on the stack for another eval!

Morover, an unboxed-tuple-return might use up a bunch of registers.

This looks tricky to me. Happy to Skype about it if you are keen to pursue. But I think there is lower-hanging fruit: see "Cmm and code generation" on Status/SLPJ-Tickets.

comment:3 Changed 4 years ago by simonmar

This is basically the same as #8905.

It's not hard to rearrange the code to optimise the already-evaluated path, but as you noticed it will increase code size due to not being able to share the saving code with the heap-check failure, and having to reload things from the stack in the unevaluated case. Things are rather delicately arranged at the moment to generate small code.

I believe the reason that you get some duplication when sinking the return address is because there's a special case in the stack allocator to spot this.

One thing I think it would be worth doing is having an option to tune the tradeoff between code size and speed (like gcc's -Os), and the code generated for case expressions would be a prime candidate to be altered by this.

Note: See TracTickets for help on using tickets.