Opened 4 years ago

Closed 4 years ago

Last modified 2 years ago

#11372 closed bug (fixed)

Loopification does not trigger for IO even if it could

Reported by: jscholl Owned by:
Priority: normal Milestone: 8.0.1
Component: Compiler (CodeGen) Version: 7.10.3
Keywords: cmm, loopification, code generation, CodeGen Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D1767
Wiki Page:

Description (last modified by simonpj)

The loopification optimization, as I understand it, allows a self-recursive function to jump to a local label instead of the beginning of the function, thus skipping a potential stack check. However, I only observe it triggering for pure functions while IO functions do not get that benefit, even when it would be possible.

I discovered this in #8793 after looking into an unexpected speedup by removing the IO context from an otherwise pure loop and while such a loop can simply be changed to a pure version (the IOLoop examples), other functions can not, but the optimization could be applied to them (see MapM.hs).

I tried to benchmark the differences between a naive loop in IO and some horrible inlinePerformIO hacks to get the loopification to fire and the "optimized" version performs 3-5% faster on my machine.

See Commentary/Compiler/Loopification for details

Attachments (3)

MapM.hs (1.7 KB) - added by jscholl 4 years ago.
mapM_ implementation as example of a loop which would benefit from the optimization, but currently doesn't.
IOLoop.hs (327 bytes) - added by jscholl 4 years ago.
Loop which can be made pure by the user
nofib-results (189.6 KB) - added by jscholl 4 years ago.
nofib run comparing current ghc-8.0 branch and my changes applied to it

Download all attachments as: .zip

Change History (12)

Changed 4 years ago by jscholl

Attachment: MapM.hs added

mapM_ implementation as example of a loop which would benefit from the optimization, but currently doesn't.

Changed 4 years ago by jscholl

Attachment: IOLoop.hs added

Loop which can be made pure by the user

comment:1 Changed 4 years ago by simonpj

Description: modified (diff)

comment:2 Changed 4 years ago by jscholl

Differential Rev(s): Phab:D1767
Status: newpatch

The problem with loopification for loops in IO / any monad of the form State# s -> (# State s, a #) is indeed caused by the extra State# tokens. While they have no runtime representation and thus are not passed to a self-recursive call, they are still counted as arguments when considering a self-recursive call for loopification.

So I tried to fix this by adding an additional parameter to getCallMethod, namely the number of void arguments, and subtract that number before checking if the correct number of arguments is present (I could not just remove void arguments from the arguments as the number of arguments is used in other places as well and I think there these void arguments have to generate stg_ap_v calls or something similar). I also tried to add a note describing the situation with these void arguments and loopification.

I also did a nofib-run. Most things changed only a little bit, but most of the code seems to run a little bit faster, though in praxis it is even less than in my benchmark, but this is not suprising as most code does not (only) consist of such loops in IO.

Changed 4 years ago by jscholl

Attachment: nofib-results added

nofib run comparing current ghc-8.0 branch and my changes applied to it

comment:3 Changed 4 years ago by Ben Gamari <ben@…>

In 4d51bfc8/ghc:

Do not count void arguments when considering a function for loopification.

This fixes #11372 by omitting arguments with a void-type when checking
whether a self-recursive tail call can be optimized to a local jump.
Previously, a function taking a real argument and a State# token
would report an arity of 1 in the SelfLoopInfo in getCallMethod,
but a self-recursive call would apply it to 2 arguments, one of them
being the State# token, thus no local jump would be generated.
As the State# token is not represented by anything at runtime, we can
ignore it and thus trigger the loopification optimization.

Test Plan: ./validate

Reviewers: austin, bgamari, simonmar

Reviewed By: bgamari

Subscribers: simonmar, thomie

Differential Revision: https://phabricator.haskell.org/D1767

GHC Trac Issues: #11372

comment:4 Changed 4 years ago by bgamari

Status: patchmerge

comment:5 Changed 4 years ago by bgamari

Milestone: 8.0.1

comment:6 Changed 4 years ago by bgamari

comment:7 Changed 4 years ago by bgamari

Resolution: fixed
Status: mergeclosed

comment:8 Changed 4 years ago by thomie

GHC Speed results are in:

nofib/time/fannkuch-redux 	3.153	- 11.45%	2.792	seconds
nofib/time/n-body 	        0.886	- 3.61%	        0.854	seconds

Nice!

Also interesting is the stabilization of the n-body performance in this graph.

comment:9 Changed 2 years ago by simonpj

Keywords: CodeGen added
Note: See TracTickets for help on using tickets.