Opened 9 years ago

Closed 7 years ago

#4065 closed bug (fixed)

Inconsistent loop performance

Reported by: rl Owned by:
Priority: normal Milestone: 7.6.1
Component: Compiler Version: 6.13
Keywords: Cc: ezyang, dterei
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Here are two small benchmarks:

import Criterion.Main

foo :: Int -> Int -> Int
foo n k | n <= 0    = k
        | otherwise = foo (n-1) (k+1)

bar :: Int -> Int -> Int
bar n k | n == 0    = k
        | otherwise = bar (n-1) (k+1)

main :: IO ()
main = defaultMain [ bench "foo" $ nf (uncurry foo) (20000000,0)
                   , bench "bar" $ nf (uncurry bar) (20000000,0)
                   ]

On my laptop, I consistently get a mean running time of about 36ms for foo and about 45ms for bar. This is a rather big difference. The Core looks about the same and going via C gives me about 42ms for both loops so this is certainly a code generator problem.

Change History (9)

comment:1 Changed 9 years ago by simonmar

Milestone: 6.14.1

The problem is this:

Test.$wbar_entry()
        { []
        }
    cmr:
        _slB::I64 = R2;
        if (_slB::I64 != 0) goto cmv;
        R1 = R3;
        jump (I64[I64[Sp]]) ();
    cmv:
        R2 = _slB::I64 - 1;
        R3 = R3 + 1;
        jump Test.$wbar_entry ();
}

The assignment of _slB should really be inlined (or forward-propagated, if you like). The current code generator isn't clever enough to do this transformation when the use of the variable is in a different basic block from its assignment, and it's not an easy extension to make. This very example appears in a comment in CmmOpt.hs, in fact.

The new code generator will be able to do this using the dataflow framework. I'm sure LLVM will do it too (and more). We can leave the ticket open on the 6.14 branch to make sure the new backend fixes it.

comment:2 Changed 9 years ago by igloo

Milestone: 7.0.17.0.2

comment:3 Changed 9 years ago by igloo

Milestone: 7.0.27.2.1

comment:4 Changed 8 years ago by daniel.is.fischer

I just ran the benchmark with 7.0.3, I get the reverse outcome with that, foo takes about 48ms and bar about 42ms. Via C (-O2 -fvia-C -optc-O3), both take 40-41ms (bar is typically a bit faster). So, unless it's platform-dependent, the new code generator seems to have turned things around.

The core looks fine for both, basically

case n of
  0# -> ...
  _  -> ...

case n <=# 0# of
  True  -> ...
  False -> ...

for the workers. I don't understand cmm, so I'm guessing, the relevant parts seem to be

    cnO:
        _sn9::I32 = I32[Sp + 0];
        if (_sn9::I32 != 0) goto cnS;
        R1 = I32[Sp + 4];
        Sp = Sp + 8;
        jump (I32[Sp + 0]) ();
    cnS:
        _snI::I32 = I32[Sp + 4] + 1;
        _snJ::I32 = _sn9::I32 - 1;
        I32[Sp + 4] = _snI::I32;
        I32[Sp + 0] = _snJ::I32;
        jump FooBar_zdwbar_info ();

for bar and

    coF:
        _coI::I32 = %MO_S_Le_W32(I32[Sp + 0], 0);
        ;
        if (_coI::I32 >= 1) goto coL;
        _soy::I32 = I32[Sp + 4] + 1;
        _soz::I32 = I32[Sp + 0] - 1;
        I32[Sp + 4] = _soy::I32;
        I32[Sp + 0] = _soz::I32;
        jump FooBar_zdwfoo_info ();
    coL:
        R1 = I32[Sp + 4];
        Sp = Sp + 8;
        jump (I32[Sp + 0]) ();

for foo. The asm:

FooBar_zdwbar_info:
.LcnT:
	movl 0(%ebp),%eax
	testl %eax,%eax
	jne .LcnX
	movl 4(%ebp),%esi
	addl $8,%ebp
	jmp *0(%ebp)
.LcnX:
	decl %eax
	incl 4(%ebp)
	movl %eax,0(%ebp)
	jmp FooBar_zdwbar_info

resp.

FooBar_zdwfoo_info:
.Lcpe:
	cmpl $0,0(%ebp)
	jle .Lcph
	movl 0(%ebp),%eax
	decl %eax
	incl 4(%ebp)
	movl %eax,0(%ebp)
	jmp FooBar_zdwfoo_info
.Lcph:
	movl 4(%ebp),%esi
	addl $8,%ebp
	jmp *0(%ebp)

doesn't tell me either why bar is faster than foo.

comment:5 Changed 8 years ago by simonmar

Cc: ezyang added

Edward - this might be a good example to try with your new optimisation pass.

comment:6 Changed 8 years ago by igloo

Blocked By: 4258 added
Milestone: 7.2.17.6.1

comment:7 Changed 8 years ago by dterei

Cc: dterei added

comment:8 Changed 7 years ago by simonmar

Blocked By: 4258 removed
difficulty: Unknown

The new code generator is giving essentially the same code for these two, except that the sense of the branch is different (but that shouldn't make a difference).

Test_zdwbar_info:
.LcrC:
	testq %r14,%r14
	jne .LcrI
.LcrJ:
	movq %rsi,%rbx
	jmp *(%rbp)
.LcrI:
	decq %r14
	incq %rsi
	jmp Test_zdwbar_info

Test_zdwfoo_info:
.Lcqj:
	testq %r14,%r14
	jle .Lcqr
.Lcqs:
	decq %r14
	incq %rsi
.Lcqr:
	movq %rsi,%rbx
	jmp *(%rbp)

The old code generator was ok on foo, but worse on bar:

Test_zdwbar_info:
.Lcra:
	movq %r14,%rax
	testq %r14,%r14
	jne .Lcrg
	movq %rsi,%rbx
	jmp *0(%rbp)
.Lcrg:
	leaq -1(%rax),%r14
	incq %rsi
	jmp Test_zdwbar_info

comment:9 Changed 7 years ago by simonmar

Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.