Opened 3 years ago

Closed 2 years ago

#13220 closed bug (fixed)

Performance regressions in testsuite from join points

Reported by: lukemaurer
Owned by: rwbarton
Priority: normal
Milestone: 8.2.1
Component: Compiler
Version: 8.1
Keywords: JoinPoints
Cc: dfeuer, lukemaurer
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s): Phab:3391
Wiki Page:

Description (last modified by rwbarton)

Need to look at some modest performance regressions for the main join-point patch (#12988, wiki SequentCore):

Change History (43)

comment:1 Changed 3 years ago by simonpj

Description: modified (diff)

comment:2 Changed 3 years ago by bgamari

I'm actually not sure that the effect on compiler performance is all that modest, unfortunately. Gipeda shows that the build time for GHC itself went up by 13%. While some increase is to be expected, this is quite significant. Are there any obvious knobs that could be adjusted to reduce the amount of work that the join point analysis/simplifier do under -O1? The full analysis is great for -O2, but I do fear that users will be up in arms if the GHC regression is representative of user code.

Also, strangely enough, while the join points work eliminated all allocations from fannkuch-redux, the runtime of this test may actually have gone up by 8%. I say "may have" since runtime measurements are notoriously unstable, but it does look likely that this is a real regression: the test's runtime is quite long (4 seconds), and 8% is a large change. Even if this measurement were off by a factor of two, we would want to understand it. Can you shed any light on this, Luke?

Last edited 3 years ago by bgamari (previous) (diff)

comment:3 Changed 3 years ago by nomeata

The fannkuch-redux weirdness is tracked at #13225. And it is stable: once a few more commits come in, one can clearly distinguish a real regression from an occasional one-time bump (which occurs with some tests), and this is one: https://perf.haskell.org/ghc/#graph/nofib/time/fannkuch-redux;hl=8d5cf8bf584fd4849917c29d82dcf46ee75dd035 (This link will be less useful in a short while, unfortunately, as the graphs only show the latest commits.)

Last edited 3 years ago by nomeata (previous) (diff)

comment:4 Changed 3 years ago by simonpj

Gipeda shows that the build time for GHC itself went up by 13%

Well that is new and very unwelcome information.

Some of the compiler/perf tests went down quite a bit, and a couple increased. So I was under the impression that the compile-time impact was slight, and that some of the runtime improvements were jolly big.

But a 13% increase across the entire build is massive.

Can we get any insight into what is increasing? It's hard to know where to start with a single number across the entire build. A good place to look might be binary sizes: if we generate more code, then that'll take longer to compile.

Bother.

Still, there is no good reason that any of this join point stuff should cost much more. Yes, the occurrence analyser is doing a bit more work, but not much. If occ-anal cost 13% more I'd be surprised. But it'd have to cost 100% more to make the entire build take 13% more. Something is fishy.
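To spell out that arithmetic: if a pass takes fraction f of the total build time and slows down by a factor s, the whole build slows by roughly f * (s - 1). A minimal Haskell sketch (illustrative only, not GHC code):

    -- For occ-anal alone to explain a 13% total increase, it would have
    -- to double in cost (s = 2) while already being 13% of the build.
    totalSlowdown :: Double -> Double -> Double
    totalSlowdown f s = f * (s - 1)

    -- totalSlowdown 0.13 2 == 0.13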

comment:5 Changed 3 years ago by simonpj

Description: modified (diff)

comment:6 Changed 3 years ago by simonmar

What about looking at compiler allocations for nofib, to try to get a smaller example?

I think this should be an 8.2 release blocker.

comment:7 Changed 3 years ago by simonmar

Milestone: 8.2.1
Priority: normal → highest

comment:8 Changed 3 years ago by dobenour

Is the bootstrap time regression due to a bug I introduced in Phab:D2879 that was fixed by rwbarton in Phab:D3070?

Edit: as per both Gipeda and mpickering on #ghc, this is not the case.

Last edited 3 years ago by dobenour (previous) (diff)

comment:9 Changed 3 years ago by mpickering

No, they are two different bad regressions. Reid has a spreadsheet which shows how the build time has changed over time.

comment:10 Changed 3 years ago by simonpj

Reid, Ben, David: might one of you be able to dig into where the 13% build time increase is coming from? The compiler/perf tests suggest that it's not an across-the-board increase in compile time, so perhaps some things are big outliers?

Without more insight we are stuck on this, and we can hardly release with such a big step change in build time -- at least not without understanding it.

Thanks

comment:11 Changed 3 years ago by rwbarton

Yes, I'll do some preliminary investigations.

comment:12 in reply to:  2 Changed 3 years ago by lukemaurer

Replying to bgamari:

I'm actually not sure that the effect on compiler performance is all that modest, unfortunately. Gipeda shows that the build time for GHC itself went up by 13%. While some increase is to be expected, this is quite significant. Are there any obvious knobs that could be adjusted to reduce the amount of work that the join point analysis/simplifier do under -O1? The full analysis is great for -O2, but I do fear that users will be up in arms if the GHC regression is representative of user code.

There isn't anything obvious—the occurrence analyser looks for join points as it goes, and that process should only be a small constant overhead. Nothing in the simplifier (or elsewhere that I can think of) has to work particularly hard to handle join points (or it shouldn't, anyway!).

Also, strangely enough, while the join points work eliminated all allocations from fannkuch-redux, the runtime of this test may actually have gone up by 8%. I say "may have" since runtime measurements are notoriously unstable, but it does look likely that this is a real regression: the test's runtime is quite long (4 seconds), and 8% is a large change. Even if this measurement were off by a factor of two, we would want to understand it. Can you shed any light on this, Luke?

I haven't seen that jump before, and it's not obvious why it would happen. A quick diff of the dumps shows that the two versions end up with wildly different code, so it may be hard to pin any particular culprit.

comment:13 Changed 3 years ago by rwbarton

The fannkuch-redux test is one that can be sensitive to mysterious code alignment effects; see #8279. However the reduction in allocations is so striking that it's worth checking whether something else could have gone wrong here.

I'm pretty familiar with fannkuch-redux, so I'd be happy to investigate that one as well.

comment:14 Changed 3 years ago by rwbarton

An initial report. I collected timing information from validation builds of the "Join points" commit and its parent commit, each built twice: once with ghc-8.0.2 as the bootstrapping compiler, and once with a perf build of the same commit (for reasons that will become clear below). I have only looked at the timing results for a few modules so far. Here are the times for everyone's favorite module, DynFlags.hs.

stage1 build: bootstrapped by          mutator CPU seconds
    ghc-8.0.2                             63s
    ghc-HEAD-before-join                  63s
    ghc-HEAD-after-join                   67s
stage2 build:
    ghc-HEAD-before-join                  150-154s
    ghc-HEAD-after-join                   209-213s

Observations:

  • Performance of the perf build (used as the bootstrapping compiler, for the stage1 build of DynFlags) may have regressed by a few percent with join points. For now this is looking at the time for just one module, so I wouldn't take that number very seriously.
  • The time it took the stage1 compiler to perform the stage2 build of DynFlags increased by around 40% with the join point change. But these numbers are both MUCH larger than the time it took to do the stage1 build of DynFlags. There are a few possible causes:
    • The stage1 compiler is built with only -O, unlike the perf build that produced the bootstrapping compiler, which built with -O2.
    • The stage1 compiler is built with -DDEBUG, so it contains extra sanity checks.
    • The stage2 compiler is built with -dcore-lint, so additional extra checks are being done during the stage2 build.
    • Something else about the build could be different between the two stages, like the values of CPP macros. I think this is unlikely to be the cause in this case.
    One of these points seems to have gotten much worse with join points.

The good news is that, according to preliminary results, the performance of the perf (i.e., release) build of the compiler seems to be minimally affected. And if the performance regression is in -dcore-lint, for example, it may be easy to track down and eliminate.

comment:15 Changed 3 years ago by rwbarton

Another data point: The total time it took to finish the stage1 build (which does include some time spent not in ghc, like running configure scripts) increased from 1208s to 1272s when moving from ghc-HEAD-before-join to ghc-HEAD-after-join as the bootstrapping compiler, an increase of 5.3%. That's roughly consistent with the 6% increase in DynFlags.hs build time mentioned in the previous comment. Still not wonderful, but also not nearly as bad as the total build time regression would seem to indicate.

comment:16 Changed 3 years ago by simonpj

When GHC takes longer to compile the same file, it's sometimes because it has generated bigger code. What does -dshow-passes show for ghc-before and ghc-after compiling the same DynFlags.hs?

DynFlags is a jolly big file. Are there smaller modules that show a big compile-time difference? nofib-analyse shows some per-module info, I think.

comment:17 Changed 3 years ago by rwbarton

There are two issues here.

  • Core Lint is expensive, and join points made it more expensive. For example, HsExpr took twice as long to compile with -O -dcore-lint as with plain -O before join points, and three times as long after.

I'm still waiting on some builds to finish, but I suspect most of the 13% build time regression is attributable to extra Core Lint checks added with join points. Just looking at the code, markAllJoinsBad looks somewhat expensive; and it is called on the right hand side of every binding, the body of every lambda, the scrutinee of every case, etc.

This doesn't affect normal users, who won't use -dcore-lint.

  • Normal compiles (without -dcore-lint) are also slightly slower after join points. I'm working on getting better data on this and looking for outlier modules where the build time went up a lot. At Simon's suggestion, I looked at program sizes but they went up only a tiny bit on average (< 1%). The number of simplifier iterations also increased a tiny bit (< 1%).

comment:18 in reply to:  17 Changed 3 years ago by lukemaurer

Thanks—glad to hear it was (so far) a simple fix …

comment:19 Changed 3 years ago by simonpj

markAllJoinsBad looks somewhat expensive

You are right. It's ridiculously expensive, and there is a vastly easier way to do the job: keep track of the valid join points, not the invalid ones!

Patch coming
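For illustration, here is a toy sketch of the two strategies; the types and field names are stand-ins invented for this example, not GHC's actual CoreLint definitions:

    import qualified Data.Map.Strict as Map
    import qualified Data.Set as Set

    type Var = String

    data LintEnv = LintEnv
      { leIds   :: Map.Map Var Bool  -- before: every id, flagged join or not
      , leJoins :: Set.Set Var       -- after: only the currently-valid joins
      }

    -- Before: demote every join id in the whole environment. O(n) work,
    -- performed at every lambda body, case scrutinee, etc.
    markAllJoinsBad :: LintEnv -> LintEnv
    markAllJoinsBad env = env { leIds = Map.map (const False) (leIds env) }

    -- After: forget the set of valid joins in a single O(1) step.
    clearValidJoins :: LintEnv -> LintEnv
    clearValidJoins env = env { leJoins = Set.empty }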

comment:20 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In 6bab649b/ghc:

Improve checking of joins in Core Lint

This patch addresses the rather expensive treatment of join points,
identified in Trac #13220 comment:17

Before we were tracking the "bad joins".  Now we track the good ones.
That is easier to think about, and much more efficient; see CoreLint
Note [Join points].

On the way I did some other modest refactoring, among other things
removing a duplicated call of lintIdBndr for let-bindings.


comment:21 Changed 3 years ago by simonpj

Reid, does this patch help?

comment:22 Changed 3 years ago by simonpj

Status on this: Reid/David are going to look into why so much time is going into Lint.

comment:23 Changed 3 years ago by simonpj

Status on this: Reid/David are going to look into why so much time is going into Lint.

Any progress on this?

comment:24 Changed 3 years ago by bgamari

Gipeda suggests that the patch didn't have much of an effect.

comment:25 Changed 3 years ago by rwbarton

I now believe the culprit is this added line in originally lintCoreVar, now lintVarOcc:

+        ; ensureEqTys ty ty' $ mkBndrOccTypeMismatchMsg var' var ty' ty

Reasoning: perf showed that when running with -dcore-lint, the single function that takes the most time after the garbage collector is nonDetCmpTypeX_go. Before the join points patch, it took 6.5% of the total time. After, it took 9.1% of the total time. And there seems to be nothing that corresponds to this ensureEqTys call before the join points patch.
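To see why such a check can dominate, here is a toy illustration (invented types, not GHC's; the real comparison is nonDetCmpTypeX_go over Type): type equality is a full structural traversal, and this check runs at every variable occurrence in the program.

    data Ty = TyVar String | FunTy Ty Ty | TyConApp String [Ty]
      deriving Eq  -- derived Eq walks the whole structure

    -- Sketch: the occurrence's type must equal the binder's type,
    -- once per occurrence of every variable.
    ensureEqTys :: Ty -> Ty -> String -> Either String ()
    ensureEqTys ty ty' msg
      | ty == ty' = Right ()
      | otherwise = Left msg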

comment:26 Changed 3 years ago by dfeuer

Cc: dfeuer added

comment:27 Changed 3 years ago by rwbarton

Hmm, removing that test (if it's even okay) also didn't have as much effect as I hoped.

comment:28 Changed 3 years ago by rwbarton

Okay, found the REAL culprit. It's embarrassingly simple, and not at all what I had been looking for.

Running ghc with -dcore-lint -v and grepping for "Core Linted result" revealed the issue: this new line in simplifyPgmIO:

            Err.dumpIfSet_dyn dflags Opt_D_dump_occur_anal "Occurrence analysis"
                      (pprCoreBindings tagged_binds);
+           lintPassResult hsc_env CoreOccurAnal tagged_binds;
 
                 -- Get any new rules, and extend the rule base
                 -- See Note [Overall plumbing for rules] in Rules.hs

We now run Core Lint twice per simplifier iteration, which in my sample program amounted to running Core Lint 50% more times overall. That's rather expensive.
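The shape of the problem, as a simplified sketch of the iteration loop (stand-in names and types; not GHC's actual simplifyPgmIO):

    type CoreProgram = [String]   -- stand-in for GHC's [CoreBind]

    occurAnalyse :: CoreProgram -> CoreProgram
    occurAnalyse = id             -- stand-in

    simplify :: CoreProgram -> IO CoreProgram
    simplify = return             -- stand-in

    lintPass :: String -> CoreProgram -> IO ()
    lintPass name _ = putStrLn ("Core Linted result of " ++ name)

    -- Each of the N iterations now lints twice (after occurrence
    -- analysis and after simplification) instead of once; counting
    -- the Lint runs from the other Core passes too, that came to
    -- roughly 50% more Lint runs overall in my sample program.
    simplLoop :: Int -> CoreProgram -> IO CoreProgram
    simplLoop 0 binds = return binds
    simplLoop n binds = do
      let tagged = occurAnalyse binds
      lintPass "Occurrence analysis" tagged   -- the newly added run
      binds' <- simplify tagged
      lintPass "Simplifier" binds'            -- the pre-existing run
      simplLoop (n - 1) binds'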

comment:29 Changed 3 years ago by simonpj

Cc: lukemaurer added

Ha! That's funny. I don't know why Luke added that, but I suggest we just remove it. Occurrence analysis doesn't really add anything new that Lint can catch, I think. Or add yet another flag. But taking a big slow-down for this just doesn't seem worth it.

Luke do you agree?

comment:30 Changed 3 years ago by simonpj

BTW, do you withdraw comment:25?

comment:31 Changed 3 years ago by simonpj

And does this address Joachim's comment on https://phabricator.haskell.org/rGHC2effe18ab51d?

This might be interesting for @lukemaurer: according to https://perf.haskell.org/ghc/#revision/2effe18ab51d66474724d38b20e49cc1b8738f60, this patch undid the effect of join points on fannkuch-redux (i.e. allocation goes up by 1318762.9%, but runtime goes down by 5%). Was this ever cleared up?

Also, integer runtime goes up by 13%, hidden by 6%, and code size generally by around 4%. Expected?

comment:32 in reply to:  30 Changed 3 years ago by rwbarton

Replying to simonpj:

BTW, do you withdraw comment:25?

Well, the addition of that ensureEqTys did contribute a couple of percent to the total compilation time (with -dcore-lint) in my test. If it's useful we can keep it; I don't understand what kind of substitution Core Lint is carrying around in le_subst, so it's unclear to me whether this is actually testing something about the program being compiled or testing the consistency of Core Lint itself.

comment:33 Changed 3 years ago by simonpj

OK, fine. Reid, will you remove the offending Lint call, and pick up the perf benefits (I hope)?

That may still leave comment:31

comment:34 Changed 3 years ago by rwbarton

Differential Rev(s): Phab:3391

Phab:3391 is for removing the added Core Lint pass after occurrence analysis.

comment:35 Changed 3 years ago by rwbarton

Description: modified (diff)

One of the performance test regressions was due to Core Lint running more often and is fixed by the above patch. The others remain to be investigated.

comment:36 Changed 3 years ago by Ben Gamari <ben@…>

In 71916e1c/ghc:

Remove Core Lint pass on occurrence analysis output (#13220)

It was expensive, as the simplifier runs for many iterations,
and probably not very useful.

Test Plan: harbormaster

Reviewers: austin, bgamari, dfeuer

Reviewed By: dfeuer

Subscribers: dfeuer, thomie

Differential Revision: https://phabricator.haskell.org/D3391

comment:37 Changed 3 years ago by bgamari

Resolution: fixed
Status: new → closed

comment:38 Changed 3 years ago by simonpj

Owner: lukemaurer deleted
Resolution: fixed deleted
Status: closed → new

Re-opening: comment:31, and four out of the five perf regressions in the Description, are not yet attended to.

Last edited 3 years ago by simonpj (previous) (diff)

comment:39 Changed 3 years ago by bgamari

Ahh, yes, fair point.

comment:40 Changed 2 years ago by bgamari

Owner: set to rwbarton

Reid, do you suppose you could have a look at the remaining regressions when you get a chance?

comment:41 Changed 2 years ago by rwbarton

Priority: highest → normal

The other regressions are small (~5%) and the join points commit is so complicated and so long ago that I doubt looking at these changes is the easiest way to gain a few percent on a few benchmarks.

Furthermore, the issue is clouded by indirect effects of the join points change that result from running a stage2 compiler. For example, parsing and typechecking allocate slightly less after join points, but these gains are roughly canceled out by increased allocations during desugaring. Since these phases were presumably not changed by the join points commit, these changes must be due to different code generation with join points. These changes are probably more interesting than slight changes in perf tests.

In all three perf/compiler tests where allocations changed, there were no significant changes to Core program size. It looks like the simplifier is just doing more work, which is no big surprise. I looked at T9020 (the simplest test case) with stage1 compilers built with ticky, but the code changes are extensive enough that I couldn't draw any conclusion from the results (many changes, new functions added, and old functions removed).

comment:42 Changed 2 years ago by simonpj

Fair enough. Shall we close this, then?

comment:43 Changed 2 years ago by bgamari

Resolution: fixed
Status: new → closed