Opened 5 years ago

Last modified 4 years ago

#9041 new bug

NCG generates slow loop code

Reported by: nh2 Owned by:
Priority: normal Milestone:
Component: Compiler (NCG) Version: 7.6.3
Keywords: Cc: carter.schonwald@…, mail@…, simonmar, trommler
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:


In we compile the code

main :: IO ()
main = do
  loop (maxBound :: Word32) $ \i -> do
    when (i `rem` 100000000 == 0) $
      putStrLn "foo"

and compare the assembly generated by the NCG and -fllvm.

The LLVM code is 10 times faster.

While the LLVM code looks great, NCG one generated looks quite unfortunate and messy, with a lot of unnecessary moving around.

I am by no means an expert in this, but some of it looks like low hanging potential improvements.

This is the LLVM code:

    movl    $4294967295, %esi           /* load the loop bound */
    movabsq $-6067343680855748867, %rdi /* load a magic number for the modulus */
    jmp .LBB1_2
    incl    %ecx                        /* loop core start */
    cmpq    %rsi, %rcx
    je  .LBB1_6                         /* check loop bound */

    /* do the modulus with two multiplications, a shift and a magic number */
    /* note : gcc does the same reduction */ 
    movq    %rcx, %rax
    mulq    %rdi
    shrq    $26, %rdx
    imulq   $100000000, %rdx, %rax  
    cmpq    %rax, %rcx
    jne .LBB1_4                         /* loop core end */
    /* Code omitted: print, then return to loop beginning */

So the core of the loop is a nice and short:


movq [division replaced by magic multiplication]


Now what NCG produces:

Main_zdwa_info:           /* loop core start */
    leaq -16(%rbp),%rax   /* stack check */
    cmpq %r15,%rax
    jb .Lc3JO             /* jump not taken in the normal case */
    movl $4294967295,%eax /* issue: loading the bound on every iteration */
    cmpq %rax,%r14
    jne .Lc3JB
   /* Return from main. Code omitted */
.Lc3JB: /* test the index for modulus */
    movl $100000000,%eax  /* issue: unnecessary moves */
    movq %rax,%rbx        /* and loading the modulo arg on every iteration */
    movq %r14,%rax
    xorq %rdx,%rdx        /* shorter alternative to mov $0,%rdx */
    divq %rbx             /* issue: doing the division (llvm and gcc avoid this) */
    testq %rdx,%rdx
    jne .Lc3JU            /* This jump is usually taken. */
   /* do the printing. Code omitted. Not relevant */
   /* increment index and (I guess) restore registers messed up by the printing */
    movq 8(%rbp),%rax     /* This code is not very relevant because we don't almost never */
    incq %rax  
    movl %eax,%r14d
    addq $16,%rbp
    jmp Main_zdwa_info
    leaq 1(%r14),%rax   /*issue: why not just increment r14? */
    movl %eax,%r14d     
    jmp Main_zdwa_info

So the core of this loop is

movl [stack check]



Some issues here:

  • the stack check is inside the loop
  • the loop limit is loaded every time inside the loop
  • the modulo arg is loaded every time inside the loop
  • we are shuffling around registers
  • no strength reduction (div is not replaced by cheaper magic number imul)
  • weird increment using leaq

It would be great if somebody with codegen insight could comment on whether these are easy targets for improvement!

Change History (3)

comment:1 Changed 5 years ago by carter

Cc: carter.schonwald@… added

Some of these match what are known issues with the ncg.

comment:2 Changed 4 years ago by thomie

Type of failure: Compile-time performance bugRuntime performance bug

Grouping this with the other 165 runtime performance bugs.

comment:3 Changed 4 years ago by trommler

Cc: trommler added
Note: See TracTickets for help on using tickets.