Opened 3 years ago

Last modified 3 years ago

#12231 new bug

Eliminate redundant heap allocations/deallocations

Reported by: harendra Owned by:
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 8.0.1
Keywords: Cc: michalt, tjakway
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by harendra)

The code generated by GHC sometimes places the Heap allocation and deallocation in a common path which has multiple sub-branches inside it. Some of those branches may not need heap at all and it just checks allocates and deallocates it without using. This adds unnecessary overhead and becomes significant when the loop is small and runs very frequently.

Instead of putting the heap allocation code at a higher level branch we can move it to more specific branches where it is really needed so that it gets out of the paths which do not need it.

I am attaching a specific example of such a case. The code used here is in a public github repository [2]. It should be quite easy to play with it, there are detailed instructions along with the code to reproduce and investigate the issue.

The files referenced in the following text are attached with this ticket. They are available in the github repo as well. Look at the CMM trace in decompose-loop-execution-trace.cmm and decompose-loop-full.cmm, start at label c4ic where we allocate space on heap (+48). There are multiple possible paths from this point on, some of those use the heap and some don't. Some interesting paths are:

1) c4ic (allocate) -> c4mw -> {c4pv} -> ...
2) c4ic (allocate) -> c4mw -> c4pw -> ((c4pr -> ({c4pe} -> ... | c4ph -> ...)) | cp4ps -> ...)

I have put those which use the heap inside curly braces, the rest do not use it at all. The paths which actually need the heap are rarely executed in this case. But because of the placement it is always allocated and deallocated in a the fast path which is executed almost all the time.

If we can place this allocation at both c4pv and c4pe instead of the common parent then we can save the fast path from this check. The same thing applies to the allocation at label c4jd as well.

The whole loop is composed on 51 instructions (ghc-8.0.1) and 8 of them are heap adjustment related, which comes out to almost 16%. If I can get this back then the performance of this loop will get quite close to the gcc compiled code which is currently around 30% faster.

This is a CMM level optimization which will benefit all architectures. I also noticed that llvm removed the heap adjustment though not the checks.

References

Attachments (5)

decompose-loop-execution-trace.cmm (1010 bytes) - added by harendra 3 years ago.
decompose-loop-full.cmm (18.7 KB) - added by harendra 3 years ago.
decompose-loop-execution-trace-ghc-7.10.3.asm (2.4 KB) - added by harendra 3 years ago.
decompose-loop-execution-trace-ghc-8.0.1.asm (1.7 KB) - added by harendra 3 years ago.
decompose-loop-execution-trace-llvm.asm (2.1 KB) - added by harendra 3 years ago.

Download all attachments as: .zip

Change History (9)

Changed 3 years ago by harendra

Changed 3 years ago by harendra

Attachment: decompose-loop-full.cmm added

Changed 3 years ago by harendra

Changed 3 years ago by harendra

Changed 3 years ago by harendra

comment:1 Changed 3 years ago by simonpj

I have not dug into the details here, but it's reminiscent of #8317, #8326, #9661, #10124, #10137, #1498, and #2289.

comment:2 Changed 3 years ago by harendra

Component: CompilerCompiler (CodeGen)
Description: modified (diff)

comment:3 Changed 3 years ago by michalt

Cc: michalt added

comment:4 Changed 3 years ago by tjakway

Cc: tjakway added
Note: See TracTickets for help on using tickets.