Opened 3 years ago

Last modified 3 years ago

#12028 new bug

Large let bindings are 6x slower to compile (from 6.12.x to 7.10.x)

Reported by: tommd
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 7.10.3
Keywords:
Cc: simonmar
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

See the attached, somewhat reduced version of the main (manually unrolled) portion of an MD5 implementation, taken from pureMD5. It used to compile in about 200 ms, which was already bad enough, but now takes 1200 ms (GHC 6.12.3 vs. GHC 7.10.3, both using -O2). I'd also note that in my light testing the UNPACK pragma accounts for 400 ms of this delta.
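For readers without the attachment, the code has roughly this shape: a strict, UNPACKed state type and a manually unrolled chain of let-bound Word32 rounds. This is a minimal sketch in the spirit of pureMD5; the names and the two-round chain are illustrative, not copied from the attachment. (The round constants below are MD5's first two, which also show up in the Cmm dumps later in this ticket.)

    {-# LANGUAGE BangPatterns #-}
    module MD5Sketch where

    import Data.Bits (complement, rotateL, (.&.), (.|.))
    import Data.Word (Word32)

    -- Strict, unpacked state, as in pureMD5; removing the UNPACK
    -- pragmas recovers part of the compile-time delta.
    data MD5Partial = MD5Par {-# UNPACK #-} !Word32
                             {-# UNPACK #-} !Word32
                             {-# UNPACK #-} !Word32
                             {-# UNPACK #-} !Word32

    -- One round of the first kind: f is MD5's F function.
    round1 :: Word32 -> Word32 -> Word32 -> Word32
           -> Word32 -> Word32 -> Int -> Word32
    round1 a b c d w k s = b + rotateL (a + f b c d + w + k) s
      where f x y z = (x .&. y) .|. (complement x .&. z)

    -- The unrolled body is a long chain of such bindings (64 rounds in
    -- the real code); compile time degrades badly as this chain grows.
    applyRounds :: MD5Partial -> Word32 -> Word32 -> MD5Partial
    applyRounds (MD5Par a b c d) w0 w1 =
      let !a1 = round1 a b c d w0 0xd76aa478 7
          !d1 = round1 d a1 b c w1 0xe8c7b756 12
          -- ... and so on for the remaining rounds ...
      in MD5Par a1 b c d1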

Attachments (1)

regression-md5.hs (5.6 KB) - added by tommd 3 years ago.
A slice of pureMD5 that shows performance issues in GHC


Change History (6)

Changed 3 years ago by tommd

Attachment: regression-md5.hs added

A slice of pureMD5 that shows performance issues in GHC

comment:1 Changed 3 years ago by simonpj

Type of failure: None/Unknown → Compile-time performance bug

comment:2 Changed 3 years ago by tommd

On 8.0 rc2:

$ time ghc -O2 -fforce-recomp regression-md5.hs
[1 of 1] Compiling Data.Digest.Pure.MD5 ( regression-md5.hs, regression-md5.o )

real    0m1.497s
user    0m1.460s
sys     0m0.028s

comment:3 Changed 3 years ago by bgamari

It looks like there was a rather significant regression from 7.6 (where compilation took around 700 ms) to 7.8 (1200 ms).

We appear to be spending a significant amount of time (500 ms out of a 1500 ms compilation) in CodeGen on 8.0.1, despite the fact that the simplified program is only 6000 terms and 2000 types (admittedly, "Codegen" timings are rather vague due to how Cmm data is streamed).

According to a profile, the codegen time is mostly spent in CmmSink. There has been a fair amount of churn in the code generator since 6.12.3, including reworking much of it to use hoopl, so perhaps it's not surprising that things regressed a bit.
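For intuition, the core of what sinking does to straight-line code is forward-substituting single-use assignments into their use sites. Here is a toy model of that idea only, not GHC's CmmSink (the real pass must also respect memory effects, register clobbers, and basic-block boundaries):

    module SinkToy where

    import qualified Data.Map.Strict as M

    data Expr = Var String | Lit Integer | Op String Expr Expr
    data Stmt = Assign String Expr

    -- Count how often each variable is read by the remaining statements.
    uses :: [Stmt] -> M.Map String Int
    uses ss = M.fromListWith (+) [ (v, 1) | Assign _ e <- ss, v <- varsOf e ]
      where varsOf (Var v)    = [v]
            varsOf (Lit _)    = []
            varsOf (Op _ a b) = varsOf a ++ varsOf b

    -- Substitute each single-use assignment forward into its use site,
    -- dropping the temporary; keep multi-use assignments in place.
    sink :: [Stmt] -> [Stmt]
    sink [] = []
    sink (Assign x e : rest)
      | M.lookup x (uses rest) == Just 1 = sink (map substitute rest)
      | otherwise                        = Assign x e : sink rest
      where
        substitute (Assign y e') = Assign y (go e')
        go (Var w) | w == x = e
        go (Op o a b)       = Op o (go a) (go b)
        go t                = t

Each substitution grows the surviving right-hand side, so with thousands of single-use temporaries the pass has ever-larger expressions to carry around; that is exactly the ballooning visible in the dumps below.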

Last edited 3 years ago by bgamari

comment:4 Changed 3 years ago by bgamari

Cc: simonmar added

Cc'ing Simon Marlow, who is quite familiar with the sinking pass.

Indeed, this example produces an extremely large amount of code (around 2700 lines of C--) in applyMD5Rounds, and the sinking pass does a significant amount of fiddling with it. For instance, prior to sinking, much of the code for this function is bit twiddling:

      _cbnY::I64 = _s9Sn::I64 ^ 4294967295;
      _s9WQ::I64 = _cbnY::I64;
      _cbo1::I64 = _s9WQ::I64 & _s9Sp::I64;
      _s9WR::I64 = _cbo1::I64;
      _cbo4::I64 = _s9Sn::I64 & _s9So::I64;
      _s9WP::I64 = _cbo4::I64;
      _cbo7::I64 = _s9WP::I64 | _s9WR::I64;
      _s9WS::I64 = _cbo7::I64;
      _cboa::I64 = _s9WS::I64 + _s9SG::I64;
      _s9WT::I64 = _cboa::I64;
      _cbod::I64 = %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(_s9WT::I64));
      _s9WU::I64 = _cbod::I64;
      _cbog::I64 = _s9WU::I64 + 3614090360;
      _s9WV::I64 = _cbog::I64;
      _cboj::I64 = %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(_s9WV::I64));
      _s9WW::I64 = _cboj::I64;
      _cbom::I64 = _s9WW::I64 + _s9Sm::I64;
      _s9WX::I64 = _cbom::I64;
      _cbop::I64 = %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(_s9WX::I64));
      _s9WO::I64 = _cbop::I64;
      _cbos::I64 = _s9WO::I64 >> 25;
      _s9X0::I64 = _cbos::I64;
      _cbov::I64 = _s9WO::I64 << 7;
      _s9WZ::I64 = _cbov::I64;
      _cboy::I64 = _s9WZ::I64 | _s9X0::I64;
      _s9X1::I64 = _cboy::I64;
      _cboB::I64 = %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(_s9X1::I64));
      _s9X2::I64 = _cboB::I64;
      _cboE::I64 = _s9X2::I64 + _s9Sn::I64;
      _s9X3::I64 = _cboE::I64;
      _cboH::I64 = %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(_s9X3::I64));
      _s9WY::I64 = _cboH::I64;
      _cboK::I64 = _s9WY::I64 ^ 4294967295;

Sinking turns this into:

      _s9WO::I64 = %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(_s9Sn::I64 & _s9So::I64 | _s9Sn::I64 ^ 4294967295 & _s9Sp::I64 + _s9SG::I64)) + 3614090360)) + I64[Sp + 152]));
      _s9WY::I64 = %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32((_s9WO::I64 << 7) | (_s9WO::I64 >> 25))) + _s9Sn::I64));
      _s9X4::I64 = %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(%MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32(_s9WY::I64 & _s9Sn::I64 | _s9WY::I64 ^ 4294967295 & _s9So::I64 + _s9SX::I64)) + 3905402710)) + _s9Sp::I64));

It would be interesting to know how the code produced by 6.12 differs from that produced by newer compilers.
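A note on the conversion noise above: each %MO_UU_Conv_W32_W64(%MO_UU_Conv_W64_W32 e) pair is a truncate-then-zero-extend round trip that keeps a Word32 result in a 64-bit register. In Haskell terms it is roughly this (a sketch of the semantics, not GHC's definition):

    module Conv where

    import Data.Word (Word32, Word64)

    -- One MO_UU_Conv_W64_W32 / MO_UU_Conv_W32_W64 pair: drop the high
    -- 32 bits, then zero-extend back to 64 bits. Every Word32 operation
    -- in the source inserts one of these on a 64-bit target, which is
    -- part of why the sunk expressions nest so deeply.
    truncate32 :: Word64 -> Word64
    truncate32 x = fromIntegral (fromIntegral x :: Word32)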

comment:5 Changed 3 years ago by simonmar

There were regressions due to the new code generator, but we felt that overall the architectural benefits were worth the hit. I actually spent a lot of time getting the overheads down to an acceptable level.

A common trick is to spot a pattern in the code generator and emit specialised code for it. Maybe it's possible to apply that here? See also #7198.
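As a concrete, hypothetical instance of that trick for this ticket: the dumps above contain (x << 7) | (x >> 25), i.e. a 32-bit rotate. A peephole that recognises the shift-or pair and collapses it to a single rotate node would shrink the expressions CmmSink has to carry. A sketch over an illustrative expression type, not GHC's actual Cmm AST:

    module RotatePeephole where

    -- Illustrative expression type; RotL is the specialised node we emit.
    data Expr
      = Var String
      | Lit Int
      | Shl Expr Int   -- x << k
      | Shr Expr Int   -- x >> k (unsigned shift)
      | Or  Expr Expr
      | RotL Expr Int  -- x `rotateL` k, on a w-bit word

    -- Rewrite (x << k) | (x >> (w - k)) to a rotate of width w.
    rotatePeephole :: Int -> Expr -> Expr
    rotatePeephole w (Or (Shl x k) (Shr y k'))
      | sameVar x y, k + k' == w = RotL x k
      where
        sameVar (Var a) (Var b) = a == b
        sameVar _       _       = False
    rotatePeephole _ e = e

On the dump above, rotatePeephole 32 applied to (_s9WO << 7) | (_s9WO >> 25) yields a single rotate-left by 7.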
