Opened 6 years ago

Last modified 7 months ago

#8655 new task

Evaluate known-to-terminate-soon thunks

Reported by: nomeata
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 7.6.3
Keywords: CPRAnalysis
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

I guess I’d better put my interior monologue in a ticket rather than on ghc-dev...

In order to implement nested CPR (#1600), I had to enrich the CPR type to keep track of whether something, when entered, will converge for sure. My code does not solve the halting problem; it only handles simple cases, so if something is known to converge, it is very likely also cheap.

Simon suggested measuring the effect of evaluating a let-bound thunk that has the known-to-terminate property eagerly, as if the body used it strictly. This ticket tracks that suggestion and reports progress.
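For illustration, here is a minimal sketch of what the suggestion amounts to, written at the source level (the real transformation would happen on Core, where case is strict); the function f and both variants below are made up for this example and are not taken from any patch:

{-# LANGUAGE BangPatterns #-}

-- Stand-in for something the analysis has proven to converge (and which
-- is therefore very likely cheap); not a name from the patch.
f :: Int -> Int
f n = n + 1

-- Status quo: x is a lazily allocated thunk, evaluated only if the
-- 'then' branch is taken.
lazyVersion :: Int -> Int -> Int
lazyVersion y z =
  let x = f y
  in if z > 0 then x + z else z

-- With the suggested change: since f y surely converges, it can be
-- evaluated speculatively, as if the body used x strictly, and no thunk
-- needs to be allocated for it.
strictVersion :: Int -> Int -> Int
strictVersion y z =
  let !x = f y
  in if z > 0 then x + z else z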

Change History (13)

comment:1 Changed 6 years ago by nomeata

A first attempt looks promising:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           atom          +0.0%     -0.0%     +0.7%     +0.7%     +0.0%
   binary-trees          +0.0%     +0.0%     -0.4%     -0.4%     +0.0%
      cacheprof          +0.0%     +0.0%     -0.8%     +0.0%     +0.0%
        circsim          +0.0%     +0.0%     -2.7%     -2.7%     +0.0%
    constraints          +0.0%     +0.0%     -1.3%     -1.3%     +0.0%
   cryptarithm1          +0.0%     +0.0%     -1.5%     -1.0%     +0.0%
 fannkuch-redux          +0.0%     +0.0%     +0.1%     +0.2%     +0.0%
          fasta          +0.0%     +0.0%     -0.6%     +0.6%     +0.0%
         hidden          +0.0%     +0.0%     +0.0%     +0.7%     +0.0%
        integer          +0.0%     +0.0%     +0.2%     +0.2%     +0.0%
   k-nucleotide          +0.0%     +0.0%     -1.9%     -1.9%     +0.0%
           lcss          +0.0%     +0.0%     -1.6%     -0.8%     +0.0%
         n-body          +0.0%     +0.0%     -1.1%     -1.1%     +0.0%
           para          +0.0%     +0.0%     -1.9%     -2.8%     +0.0%
       pidigits          +0.0%     +0.0%     +0.0%     -1.1%     +0.0%
          power          +0.0%     +0.0%     -2.9%     -2.9%     +0.0%
            scs          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
  spectral-norm          +0.0%     +0.0%     +0.2%     +0.2%     +0.0%
      transform          +0.0%     +0.0%     -1.8%     -1.8%     +0.0%
      wave4main          +0.0%     +0.0%     -1.9%     -1.0%     +0.0%
   wheel-sieve1          +0.0%     +0.0%     +0.8%     +0.8%     +0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -0.1%     -2.9%     -2.9%    -17.4%
            Max          +0.0%     +0.0%     +0.8%     +0.8%     +0.0%
 Geometric Mean          -0.0%     -0.0%     -0.9%     -0.7%     -0.2%

A speculative optimization like this cannot be expected to be a win everywhere, but I think the numbers are encouraging: no large increases, and a detectable improvement in the mean.

I put my code in wip/cbv-conv-thunk, which shares some patches with wip/nested-cpr that are not in master yet. Let’s see how that goes for a rebasing branch...

comment:2 Changed 6 years ago by nomeata

(Hmm, on second thought: I guess for Runtime, these numbers are below nofib’s precision... So maybe the gain is simply zero? The unchanged allocations seem to support that.)

comment:3 Changed 6 years ago by nomeata

Indeed, after running the before and after builds again, this time making sure the machine was doing nothing else, I get

            Min          -0.0%     -0.1%     -2.2%     -1.6%     -1.3%
            Max          +0.0%     +0.0%     +3.6%     +3.6%     +9.7%
 Geometric Mean          -0.0%     -0.0%     +0.2%     +0.3%     +0.1%

So probably it’s not worth it.

comment:4 Changed 6 years ago by tibbe

Try running the shootout benchmarks in "slow" mode. They have been tuned to have meaningful runtimes.

comment:5 Changed 6 years ago by nomeata

Thanks for the suggestion. Two new runs, unloaded machine, slow mode. Slightly different numbers, but encouraging again:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +0.0%     +0.0%     +0.3%     +0.5%     +0.0%
           atom          +0.0%     -0.0%     +0.6%     +0.6%     +0.0%
   binary-trees          +0.0%     +0.0%     +0.0%     -0.0%     +0.0%
          boyer          +0.0%     +0.0%     -0.5%     -0.5%     +0.0%
      cacheprof          +0.0%     -0.2%     -1.6%     -1.6%     +0.9%
       calendar          +0.0%     +0.0%     -0.7%     -0.7%     +0.0%
        circsim          +0.0%     +0.0%     -0.9%     -1.2%     +0.0%
       clausify          +0.0%     +0.0%     +0.6%     +0.6%     +0.0%
  comp_lab_zift          +0.0%     +0.0%     +0.7%     +0.0%     +0.0%
    constraints          +0.0%     +0.0%     -0.3%     -0.4%     +0.0%
   cryptarithm1          +0.0%     +0.0%     -0.5%     -0.5%     +0.0%
          event          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
         exp3_8          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
 fannkuch-redux          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
          fasta          +0.0%     +0.0%     +0.2%     +0.2%     +0.0%
            fft          +0.0%     +0.0%     -0.8%     -0.8%     +0.0%
       fibheaps          +0.0%     +0.0%     -3.0%     -3.0%     +0.0%
          fluid          -0.0%     -0.1%      0.01      0.01     +0.0%
    gen_regexps          +0.0%     +0.0%     -1.2%     -0.4%     +0.0%
         genfft          +0.0%     +0.0%     -2.2%     +0.0%     +0.0%
         hidden          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
            ida          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
        integer          +0.0%     +0.0%     -0.2%     -0.2%     +0.0%
      integrate          +0.0%     +0.0%     -0.9%     -0.9%     +0.0%
   k-nucleotide          +0.0%     +0.0%     -0.4%     -0.4%     +0.0%
          kahan          +0.0%     +0.0%     +1.2%     +1.2%     +0.0%
        knights          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
           lcss          +0.0%     +0.0%     -1.9%     -1.9%     +0.0%
     multiplier          +0.0%     +0.0%     -0.4%     -0.4%     +0.0%
         n-body          +0.0%     +0.0%     -0.6%     -0.6%     +0.0%
           para          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
      paraffins          +0.0%     +0.0%     -3.3%     -2.9%     +0.0%
       pidigits          +0.0%     +0.0%     +0.2%     +0.4%     +0.0%
          power          +0.0%     +0.0%     -0.6%     -0.6%     +0.0%
         primes          +0.0%     +0.0%     -0.4%     -0.4%     +0.0%
         queens          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
reverse-complem          +0.0%     +0.0%     +0.0%     +0.3%     +0.0%
        rewrite          +0.0%     +0.0%     +1.7%     +2.5%     +0.0%
          sched          +0.0%     +0.0%     +0.3%     +0.3%     +0.0%
            scs          +0.0%     +0.0%     +0.5%     +0.5%     +0.0%
          solid          +0.0%     +0.0%     -0.6%     -0.6%     +0.0%
  spectral-norm          +0.0%     +0.0%     +0.0%     -0.1%     +0.0%
            tak          +0.0%     +0.0%     +0.7%     +0.7%     +0.0%
      transform          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
      typecheck          -0.0%     +0.0%     -0.7%     -0.7%     +0.0%
           wang          +0.0%     +0.0%     -1.1%     -1.1%     +0.0%
      wave4main          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
   wheel-sieve1          +0.0%     +0.0%     +0.0%     +0.0%     +0.0%
   wheel-sieve2          +0.0%     +0.0%     -3.3%     -4.1%     +0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -0.2%     -3.3%     -4.1%    -19.3%
            Max          +0.0%     +0.0%     +1.7%     +2.5%     +0.9%
 Geometric Mean          -0.0%     -0.0%     -0.4%     -0.3%     -0.2%

I’m still a bit surprised that allocations do not change – but at least they change for fluid, so maybe I am not measuring noise after all :-)

comment:6 Changed 6 years ago by tibbe

I don't know if slow mode has really been tuned for anything but the shootout benchmarks, though, so pay closest attention to those results above.

comment:7 Changed 6 years ago by nomeata

The first attempt was a bit naive; proper detection of surely-converging expressions is a bit more involved. There are some notes on the NestedCPR wiki page. Unfortunately, it makes things that were safe before (such as removing variables from the environment of a strictness signature) unsafe, so I expect it will take more work until the code is actually correct: currently, simple tests work, but ghc-stage2 diverges while eating up memory very quickly.
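To see why the convergence information makes such trimming unsafe, here is a toy model of the intuition (an assumed simplification for illustration only, not GHC's actual demand-analysis machinery; all names below are invented): a signature that claims convergence does so only under assumptions about its free variables, so silently dropping a variable from the environment can turn "might diverge" into "surely converges".

import qualified Data.Map as M

-- For each free variable: does forcing it surely converge?
type ConvEnv = M.Map String Bool

-- A toy signature: "the body converges, provided the recorded free
-- variables do".
data Sig = Sig
  { sigEnv       :: ConvEnv
  , sigConverges :: Bool
  }

-- An expression surely converges only if the signature claims so *and*
-- every free variable it depends on converges as well.
surelyConverges :: Sig -> Bool
surelyConverges (Sig env c) = c && and (M.elems env)

-- Dropping a variable from the environment used to be conservative for
-- plain strictness information, but here it discards a divergence
-- dependency.
trim :: String -> Sig -> Sig
trim x (Sig env c) = Sig (M.delete x env) c

-- demo evaluates to (False, True): after trimming we wrongly claim that
-- the expression surely converges.
demo :: (Bool, Bool)
demo =
  let sig = Sig (M.fromList [("x", False)]) True
  in (surelyConverges sig, surelyConverges (trim "x" sig))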

comment:8 Changed 6 years ago by nomeata

Ok, next try. This time I modified only exprOkForSpeculation: when checking an application, it looks at whether the function has a strictness signature carrying the Converges flag; if it does, and if all arguments are themselves exprOkForSpeculation (they might be unlifted, in which case speculation would evaluate them too), the whole expression is ok for speculation.
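For concreteness, a toy standalone model of the check just described (this is not the real exprOkForSpeculation from GHC's CoreUtils; the expression type and the convergence flag on identifiers are simplified assumptions):

-- Toy expression type; the real check works on CoreExpr.
data Expr
  = Lit Int
  | Var Id
  | App Expr Expr

-- In the branch the convergence information lives in the strictness
-- signature; here it is just a Bool on the identifier.
data Id = Id { idName :: String, sigConverges :: Bool }

-- Split an application chain into its head and arguments.
collectArgs :: Expr -> (Expr, [Expr])
collectArgs = go []
  where
    go args (App fun a) = go (a : args) fun
    go args e           = (e, args)

-- Toy exprOkForSpeculation: literals and variables are always fine; a
-- call is fine if its head is known to converge and all arguments are
-- themselves ok for speculation (they might be unlifted, so speculating
-- the call would evaluate them too).
exprOkForSpeculation :: Expr -> Bool
exprOkForSpeculation (Lit _)     = True
exprOkForSpeculation (Var _)     = True
exprOkForSpeculation e@(App _ _) =
  case collectArgs e of
    (Var f, args) -> sigConverges f && all exprOkForSpeculation args
    _             -> False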

The results are overwhelming....ly minuscule:

Allocations change exactly as before (columns as in the earlier tables: Size, Allocs, Runtime, Elapsed, TotalMem):

          fluid         -45.0%     -0.1%      0.00      0.00     +0.0%
--------------------------------------------------------------------------------
            Min         -49.5%     -0.1%    -15.4%    -15.4%     -5.7%
            Max         -41.3%     +0.0%    +97.9%    +97.9%     +0.0%
 Geometric Mean         -47.4%     -0.0%    +13.5%    +13.6%     -0.2%

(Ignore the code size column; one working copy had ticky-ticky enabled. The runtime numbers are unusable, since the machine was not unloaded.)

I looked at fluid, and what happens there is actually quite nice: there are thunks calling read_n_val, which returns a pair (.., ..). CPR turns that into (# .., .. #). So currently we allocate a thunk for the worker of read_n_val (which did not cancel with anything); when entered, this thunk allocates a (.., ..) box, which is later taken apart by yet another two thunks, representing fst .. and snd .. respectively. After using the Converges flag, we immediately call $wread_n_val, which returns quickly after allocating two thunks. We thereby save the allocation of the (.., ..) box and of the thunks for fst .. and snd ...
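A hypothetical sketch of the shape of code involved (read_n_val's real definition lives in fluid's source; the version below, and the names beforeChange/afterChange, only mimic the "function returning a pair" structure described above):

-- Stand-in for fluid's read_n_val: it returns a pair, so CPR gives it a
-- worker $wread_n_val that returns an unboxed pair (# .., .. #).
read_n_val :: Int -> [Int] -> ([Int], [Int])
read_n_val n xs = (take n xs, drop n xs)

-- Before: a thunk for the call is kept around; when entered it builds
-- the boxed pair, which is then taken apart by two selector thunks.
beforeChange :: Int -> [Int] -> ([Int], [Int])
beforeChange n xs =
  let p  = read_n_val n xs   -- thunk for the (worker) call
      as = fst p             -- thunk for the first component
      bs = snd p             -- thunk for the second component
  in (as, bs)

-- After: the call is known to converge, so it is ok for speculation and
-- is evaluated right away; with the worker's unboxed result, neither the
-- (..,..) box nor the two selector thunks are allocated.
afterChange :: Int -> [Int] -> ([Int], [Int])
afterChange n xs =
  case read_n_val n xs of
    (as, bs) -> (as, bs)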

What is interesting is that this change did not happen in CorePrep, but already in Float Out, directly after the demand analyser. The function calling exprOkForSpeculation is lvlCase.

I also measured the *static* effect of this change. When compiling GHC, the libraries and nofib,

  • master did 205788 calls to exprOkForSpeculation from lvlCase. Of these 7169 (3.48%) returned true, while
  • my branch did 206234 calls to exprOkForSpeculation from lvlCase, of these 7245 (3.51%) returned true.

I doubt that measuring the dynamic effect would give any more insight than what we already have, namely:

Using Converges in exprOkForSpeculation is a good idea that costs little in terms of implementation and does not make programs worse. When it kicks in, it does nice things, but in practice the payoff is negligible.

Last edited 6 years ago by nomeata

comment:9 Changed 5 years ago by nomeata

Resolution: wontfix
Status: new → closed

Since nested CPR was not merged, this ticket makes little sense any more. Or, put differently: it was investigated and deemed not worth pursuing on its own.

comment:10 Changed 5 years ago by simonpj

Owner: nomeata deleted
Resolution: wontfix
Status: closed → new

A simple analysis that discovers surely-converging expressions is surely a good thing. I would regret discarding the work you did on that, even if it's not very promising in terms of payoff.

Do you have that piece in a branch somewhere?

Simon

comment:11 Changed 5 years ago by nomeata

I think the branch wip/cbv-conv-thunk reflected the latest state of my work, which itself is branched off wip/nested-cpr (or an older version of that branch; git isn’t very good at keeping track of the history of branches, unfortunately).

comment:12 Changed 3 years ago by thomie

Type of failure: None/Unknown → Runtime performance bug

comment:13 Changed 7 months ago by simonpj

Keywords: CPRAnalysis added