Opened 6 years ago

Last modified 21 months ago

#8272 new task

testing if SpLim=$rbp and Sp=$rsp changed performance at all

Reported by: carter Owned by: carter
Priority: normal Milestone:
Component: Compiler Version: 7.7
Keywords: callingConvention Cc: schyler, michalt, simonmar, rwbarton, bollu
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

testing if SpLim=$rbp and Sp=$rsp changed performance at all

would need a stack check but then push could be used to spill to the stack

Idea via Nathan Howell.

At the very least, the x86 PUSH instruction has a more succinct encoding than MOV.

worth hacking out to see if this can measurable shift ghc perf on nofib or not.

this would be part of a larger effort to explore ways to improve GHC's calling convention for modern hardware

Change History (20)

comment:1 Changed 6 years ago by ezyang

There is a comment in the original STG paper about why PUSH/POP could not be used for GHC's stack. The circumstances may have changed.

comment:2 Changed 6 years ago by carter

@ezyang you mean "Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine, SL Peyton Jones, Journal of Functional Programming 2(2), Apr 1992, pp127-202." ?

comment:3 Changed 6 years ago by carter

hrm... that seems like it might not be the right paper

comment:4 Changed 6 years ago by ezyang

Sorry, I misspoke. I am actually speaking of CALL/RET, and the problem is described in "Faster Laziness Using Dynamic Pointer Tagging." There is one more problem with setting Sp=$rsp, and that is that the register is already taken: execution maintains both a Haskell stack and a C stack. The C stack is used for register spills by the register allocator.

comment:5 Changed 6 years ago by carter

huh. that doesn't rule out experimenting with it, but definitely does make any such experimentation a bit more subtle. We'd basically need to make sure that theres enough scratch space on the current GHC stack segment for the register spills. i believe on sound way to conservatively track that is to compute the maximum size used by live variables in a function body, though with some extra adjustments roughly, right?

comment:6 Changed 6 years ago by carter

difficulty: Project (more than a week)Rocket Science

because of how subtle this may be, and how its moderately likely that the work will ultimately not work out (though its worth exploring), i'm setting the difficulty to "rocket science" :)

comment:7 Changed 6 years ago by schyler

It's possible this could fix #8048 or at least allow better/smarter code to be generated for spills.

comment:8 Changed 6 years ago by schyler

Cc: schyler added

comment:9 Changed 5 years ago by thoughtpolice

Milestone: 7.10.17.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:10 Changed 4 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:11 Changed 4 years ago by thomie

Type of failure: None/UnknownRuntime performance bug

comment:12 Changed 4 years ago by thomie

Milestone: 8.0.1

comment:13 Changed 2 years ago by michalt

Cc: michalt added

comment:14 Changed 2 years ago by michalt

Cc: simonmar rwbarton added

There's been an interesting discussion about this in: https://github.com/ghc-proposals/ghc-proposals/pull/17

If people still think this would be a good idea, I'd like to try things out. But I need some help to get started. :)

First question. It seems to me that there are two slightly different ideas here:

1) using %rsp for managing the Haskell stack (instead of %rbp)

2) using call/ret

Is 1) really a strict requirement for 2)? IIUC we're already using C-stack (%rsp) for spilling things during register allocation. The only problem I can think of is the amount of space for those return addresses. Am I missing something else or is this enough to prevent us from using call/ret? (also, somewhat related, wouldn't using call/ret allow us to get rid of proc-point splitting for LLVM?)

Second question. For every CmmCall that contains cml_cont we'd need to compile this into two instructions: call and jmp <block from cm_cont> (to jump over the info table that preceeds the block where we want to return to). But that means that the return address no longer points to the info table. Does anything else depend on this? Is there some easy way to check that?

Third question. How could this work in LLVM backend? I don't think LLVM even allows direct manipulation of the stack pointer. Also, even if we could manipulate it, I wouldn't be surprised if LLVM wanted to move things around the stack, which again sounds pretty problematic (for, e.g., GC)

PS. CCing simonmar and rwbarton since they were the main contributors to the linked GHC proposal. (hope you don't mind!)

comment:15 Changed 2 years ago by bgamari

In my view (1) is really the more interesting change as it enables use of native profiling tools. Currently we are essentially unable to use systems' native profiling tools to collect callstacks (e.g. perf record --callgraph) since they generally assume that the callstack is tracked by %rsp. This means that we either need to rely on extremely platform specific hacks (e.g. on Linux we may be able to collect the Haskell stack using eBPF) or invent our own silo of profiling tools.

Last edited 2 years ago by bgamari (previous) (diff)

comment:16 Changed 2 years ago by simonmar

@bgamari: doesn't the DWARF stack unwinding support that you've been working on help with perf?

@michalt: I think we have to do (1) in order to do call/ret, because otherwise the stack would be split over two places, and the RTS would have a terrible time walking it. Or perhaps I've misunderstood what you mean?

I'm actually not all that enthusiastic about the proposal having just re-read https://github.com/ghc-proposals/ghc-proposals/pull/17. The benefits are small or are actually regressions (code size and the overhead for jmp after call) and it's a huge upheaval.

comment:17 Changed 2 years ago by carter

@simonmar the stack pointer experiment sans call and rest might still be worth measuring,yes?

comment:18 Changed 2 years ago by michalt

@simonmar: When you say "terrible time walking the stack", you mean GC, right?

I've did some more googling and also found [1] which describes some of the differences between the Haskell and C stacks (eg, the Haskell one is heap allocated and easy to grow, the C one is per capability, etc.) I agree - all of this makes it sound that if we want call/ret, we do need (1).

Also I still don't really see how we could use %rsp in LLVM backend without pretty large changes (eg, using its stackmaps/safepoints).

I have to admit that I'm also less enthusiastic about the whole idea after looking into it a bit.

[1] https://www.reddit.com/r/haskell/comments/1wm9n4/question_about_stacks_in_haskell_and_rust/cf3vfeq/

comment:19 Changed 2 years ago by simonmar

@simonmar the stack pointer experiment sans call and rest might still be worth measuring,yes?

Having actual data would be a good starting point, but I'm not hopeful that it will be a win.

@simonmar: When you say "terrible time walking the stack", you mean GC, right?

Yes, and exceptions and various other RTS routines that need to understand the stack.

comment:20 Changed 21 months ago by bollu

Cc: bollu added

CCing myself in light of some simplexhc experiments. On ackermann function, GHC and clang O3 don't scale linearly.

On teaching simplexhc to generate C-like code (using call/ret instead of custom stack frame management), I began getting C-like performance on the same examples. Previously, I was around GHC, or somewhat slower. Link to numbers from that experiment here.

I sent out an e-mail to LLVM-dev, asking if it is possible to fake GHC-like "custom call stack on the heap" within LLVM. If possible, I'd like to implement that and check for slowdowns in C code. That should provide some data of the benefits of choosing to use the native call stack.

Last edited 21 months ago by bollu (previous) (diff)
Note: See TracTickets for help on using tickets.