Opened 3 years ago

Last modified 9 months ago

#12798 new bug

LLVM seemingly over-optimizing, producing inefficient assembly code...

Reported by: GordonBGood Owned by:
Priority: normal Milestone: 8.10.1
Component: Compiler (LLVM) Version: 8.0.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #12808 Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

Since in many cases the LLVM backend is the only way to avoid the NCG's poor register allocation (ticket #8971), it is a concern that "-fllvm" produces overly complex code through a (seemingly) failed attempt at optimization.

The following code reproduces the problem with a very simple "odds-only" implementation of the Sieve of Eratosthenes, whose very tight inner culling loop is limited to a 16-Kilobyte buffer (<= the L1 data cache size of most modern CPUs). It uses a "twos" Look-Up Table (LUT) rather than a variable shift-left operation to set the composite bits in the buffer array; this is faster because it (should) take the same number of registers, and the array look-up instruction is easier for the CPU to fuse than a variable shift left:

-- GHC_EfficiencyBug.hs
{-# LANGUAGE FlexibleContexts #-}
{-# OPTIONS_GHC -O3 -fllvm -rtsopts -keep-s-files #-} -- or -O2

import Control.Monad.ST
import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base

numLOOPS = 10000 :: Int

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soe :: () -> [Word32]
soe() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
 bufb = runSTUArray $ do
  let bfLmt = (256 * 1024) `div` 2 - 1 -- sieves odds to 2^18 + 1: 2^17 bits = 128 Kbits = 16 KBytes
  cmpstsb <- newArray (0, bfLmt) False :: ST s (STUArray s Int Bool)
  cmpstsw <- (castSTUArray :: STUArray s Int Bool -> ST s (STUArray s Int Word32)) cmpstsb
  let loop n = -- cull a number of times to test timing
        if n <= 0 then return cmpstsb else
        let cullp i =
              let p = i + i + 3 in
              let s = (p * p - 3) `div` 2 in
              if s > bfLmt then loop (n - 1) else do
                isCmpst <- unsafeRead cmpstsb i
                if isCmpst then cullp (i + 1) else -- is Prime
                  let cull j = -- very tight inner loop where all the time is spent
                        if j > bfLmt then cullp (i + 1) else do
                          let sh = unsafeAt twos (j .&. 31) -- (1 `shiftL` (j .&. 31)))
                          let w = j `shiftR` 5
                          ov <- unsafeRead cmpstsw w
                          unsafeWrite cmpstsw w (ov .|. sh)
                          cull (j + p)
                  in cull s
        in cullp 0
  loop numLOOPS

main = print $ length $ soe()

The main culling is repeated "numLOOPS" times to get a reasonable execution time for accurate timing and to make the time required to use the List comprehension to determine the number of found primes (the answer) a negligible part of the execution time. Timing results can be produced by running "./GHC_EfficiencyBug +RTS -s".

The desired assembly code result for the tight inner loop is as for the Rust/LLVM compiler, in this case for x86_64 64-bit code:

	.p2align	4, 0x90
.LBB10_27:
	movq	%rcx, %rdx
	shrq	$5, %rdx
	movl	%ecx, %esi
	andl	$31, %esi
	movl	(%rbp,%rsi,4), %esi
	orl	%esi, (%r14,%rdx,4)
	addq	%rax, %rcx
.LBB10_26:
	cmpq	%r13, %rcx
	jb	.LBB10_27

The above code is extremely efficient on a CPU that is not cache-bottlenecked (as the AMD Bulldozer series is) and takes just about three clock cycles per inner composite-culling loop on Intel Skylake; it is just as efficient for 32-bit x86 code, since only seven registers are used in this inner loop.

Due to this attempt at "over-optimization", the GHC/LLVM backend produces the following x86_64 64-bit code:

	.align	16, 0x90
.LBB34_2:                               # %c8T2
                                        # =>This Inner Loop Header: Depth=1
	movq	%rcx, %rsi
	sarq	$5, %rsi
	movl	%r8d, %edi
	andl	$124, %edi
	movl	16(%rax,%rdi), %edi
	orl	%edi, 16(%r11,%rsi,4)
	addq	%r14, %rcx
	addq	%rdx, %r8
	cmpq	%r10, %rcx
	jle	.LBB34_2

As can be seen, instead of simply masking the "twos" index register with 31 (0x1F), the code uses two extra registers: one holds a "(j * 4)" increment, and the other the accumulated index. The increment is added to the "twos" index register on each loop iteration and masked with 124 (0x7C, i.e. 0x1F times 4), requiring two extra registers and an extra instruction for the extra addition. This isn't a problem as to the number of registers for x86_64 code, which has more than enough, but the extra instruction adds about one third of a CPU clock cycle per loop (I know, only one ninth extra time).
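The transformation LLVM applied here is at least semantically sound for non-negative indices: masking with 31 and then scaling by the 4-byte element size yields the same byte offset as scaling an induction variable and then masking with 124. A standalone sketch (not from the ticket) checking this identity:

```haskell
import Data.Bits ((.&.))

-- LLVM's strength reduction replaces "(j .&. 31) * 4" (mask the index,
-- then scale by the 4-byte element size) with "(j * 4) .&. 124" (scale
-- an induction variable, then mask with 0x7C = 31 * 4). For non-negative
-- j the resulting byte offsets are identical; the cost is the extra
-- induction variable and add instruction discussed above.
maskThenScale, scaleThenMask :: Int -> Int
maskThenScale j = (j .&. 31) * 4
scaleThenMask j = (j * 4) .&. 124

main :: IO ()
main = print (all (\j -> maskThenScale j == scaleThenMask j) [0 .. 100000])
```

Running this prints True, confirming the two offset computations agree over the tested range.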

However, for 32-bit x86 code, which previously had barely enough registers, the extra registers trigger a chain of three register reloads, as can be seen in the following assembly code:

	.align	16, 0x90
LBB33_2:                                # %c8Wb
                                        # =>This Inner Loop Header: Depth=1
	movl	%ebx, %ebp
	sarl	$5, %ebp
	movl	%edi, %ecx
	andl	$124, %ecx
	movl	%esi, %edx
	movl	%eax, %esi
	movl	36(%esp), %eax          # 4-byte Reload
	movl	8(%eax,%ecx), %ecx
	movl	%esi, %eax
	movl	%edx, %esi
	orl	%ecx, 8(%esi,%ebp,4)
	addl	32(%esp), %ebx          # 4-byte Folded Reload
	addl	28(%esp), %edi          # 4-byte Folded Reload
	cmpl	%eax, %ebx
	jle	LBB33_2

The above 32-bit code runs about 25% slower than it should on Intel Skylake.

This was tested for GHC version 8.0.1 under both Windows and Linux for both 32-bit and 64-bit code with identical results for each native code width.

The code was also tested for 32- and 64-bit output from the NCG; for this specific problem, the NCG takes the simple approach and does not waste the extra registers. However, it is still slower (much slower for 32-bit code) than the LLVM-produced code, due to the inefficient register allocation of ticket #8971, to not moving the loop-completion check to the end of the loop (which requires an extra jump instruction), and to not combining the read/modify/write into a single instruction. As its problems are known, I have not documented the NCG code.

Conclusion: This may seem like a nitpicky bug, since in some use cases the execution-time cost is very small, but it may be an indication of problems that have more serious effects on execution speed in other use cases. My feeling is that for such low-level, somewhat imperative code, GHC should really produce code that is as fast as C/C++/Rust.

Change History (10)

comment:1 Changed 3 years ago by AlexET

On current HEAD with LLVM 3.9.0, the following code

import Data.Word
import Data.Bits
import Data.Array.ST (runSTUArray)
import Data.Array.Base

import Control.Monad.ST

numLOOPS = 10000 :: Int

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

soe :: () -> [Word32]
soe() = 2 : [fromIntegral i * 2 + 3 | (i, False) <- assocs bufb] where
 bufb = runSTUArray $ do
  let bfLmt = (256 * 1024) `div` 2 - 1 -- sieves odds to 2^18 + 1: 2^17 bits = 128 Kbits = 16 KBytes
  cmpstsb <- newArray (0, bfLmt) False :: ST s (STUArray s Int Bool)
  cmpstsw <- (castSTUArray :: STUArray s Int Bool -> ST s (STUArray s Int Word32)) cmpstsb
  return $! twos -- force evaluation of twos outside the loop.
  let loop n = -- cull a number of times to test timing                                                                                                                                                            
        if n <= 0 then return cmpstsb else
        let cullp i =
              let p = i + i + 3 in
              let s = (p * p - 3) `div` 2 in
              if s > bfLmt then loop (n - 1) else do
                isCmpst <- unsafeRead cmpstsb i
                if isCmpst then cullp (i + 1) else -- is Prime                                                                                                                                                     
                  let cull j = -- very tight inner loop where all the time is spent                                                                                                                                
                        if j > bfLmt then cullp (i + 1) else do
                          let sh = unsafeAt twos (j .&. 31) -- (1 `shiftL` (j .&. 31)))                                                                                                                            
                          let w = j `shiftR` 5
                          ov <- unsafeRead cmpstsw w
                          unsafeWrite cmpstsw w (ov .|. sh)
                          cull (j + p) in cull s in cullp 0
  loop numLOOPS

main = print $ length $ soe()

gives the following inner loop, which is almost the same as Rust's:

.LBB28_7:                              
        movq    %rcx, %rdx
        sarq    $5, %rdx
        movl    %ecx, %edi
        andl    $31, %edi
        movl    16(%r9,%rdi,4), %edi
        orl     %edi, 16(%rax,%rdx,4)
        addq    %rsi, %rcx
.LBB28_5:
        cmpq    $131071, %rcx           
        jle     .LBB28_7

The Cmm and initial LLVM code is the same as your code produces on 8.0.1, so it seems the difference is due to the fact that Rust ships with its own more recent LLVM than GHC 8.0.1 supports.

The difference between my version and your original is that we force twos early, which is needed to prevent its evaluation within the loop; this optimisation seems to be missed by HEAD.

comment:2 Changed 3 years ago by GordonBGood

So you are saying we need to both force the evaluation of "twos" and run a newer version of LLVM (which we can't do with GHC version 8.0.1) in order to get the desired output but that the forced evaluation of "twos" would not be necessary if HEAD did not have a regression as to this optimization?

EDIT_ADD: Rather than adding the "return $! twos" line, I think it is even simpler to force the evaluation of twos by adding twos `seq` just before "twos" is first used, for instance before the call to "loop numLOOPS", before "cullp 0", or before "cull s".

In fact, it might not be considered a regression that "twos" is not evaluated unless forced, as per your suggestion: although one might hope that the strictness analyzer would see that "twos" is only used by strict code and force it to be strict, there is no guarantee that it will do so across versions. It would therefore be good programming practice to force this evaluation so as to be sure.
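The two forcing variants mentioned above can be compared in a minimal toy sketch (hypothetical names, IO rather than the ticket's ST code): both "return $! twos" and ``twos `seq` ...`` force the array to weak head normal form before the loop body runs.

```haskell
import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Bits (shiftL)
import Data.Word (Word32)

twos :: UArray Int Word32
twos = listArray (0, 31) [1 `shiftL` i | i <- [0 .. 31]]

-- Variant 1 (comment 1): force inside the monad before entering the loop.
useForced1 :: IO Word32
useForced1 = do
  _ <- return $! twos          -- forces twos to WHNF here
  return (twos ! 5)

-- Variant 2 (this comment): force with `seq` just before the loop body.
useForced2 :: IO Word32
useForced2 = twos `seq` return (twos ! 5)

main :: IO ()
main = do
  a <- useForced1
  b <- useForced2
  print (a, b)                 -- both variants read the same table entry: (32,32)
```

Either way the table is fully built before the inner loop starts, which is what lets LLVM keep its base address in a register instead of re-entering the thunk.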

Last edited 3 years ago by GordonBGood (previous) (diff)

comment:3 in reply to:  1 Changed 3 years ago by GordonBGood

@AlexET, Replying to AlexET:

On current HEAD with llvm 3.9.0, the following code:

... code skipped for brevity...

The CMM and initial llvm code is the same as your code for 8.0.1, so it seems the difference is due to the fact that rust ships with its own more recent llvm than ghc 8.0.1 supports.

The difference in the code between my version and your original version is that we force twos early which is needed to prevent the evaluation of that within the loop, an optimisation which seems to have been missed by HEAD.

It's even worse with GHC version 8.0.1 than I thought: with the very minor change of the inner loop to the following,

                  let cull j = -- very tight inner loop where all the time is spent
                        if j > bfLmt then return () else do
                          let sh = unsafeAt twos (j .&. 31)
                          let w = j `shiftR` 5
                          ov <- unsafeRead cmpstsw w
                          unsafeWrite cmpstsw w (ov .|. sh) -- (1 `shiftL` (j .&. 31)))
                          cull (j + p) in do { cull s; cullp (i + 1) } in cullp 0

changed only so that the loop back for the next prime value is outside the cull loop, the assembly code is as follows:

	.align	16, 0x90
.LBB33_3:                               # %caLP.i.caLP.i_crit_edge
                                        #   in Loop: Header=BB33_2 Depth=1
	movq	(%r12), %rdx
.LBB33_2:                               # %caLP.i
                                        # =>This Inner Loop Header: Depth=1
	movq	%rsi, %rcx
	movl	%ecx, %edi
	movq	%rdx, %rsi
	addq	%rcx, %rsi
	sarq	$5, %rcx
	movq	-24(%r12), %rdx
	movq	-16(%r12), %rbx
	andl	$31, %edi
	movl	16(%rbx,%rdi,4), %edi
	orl	%edi, 16(%rdx,%rcx,4)
	cmpq	-8(%r12), %rsi
	jle	.LBB33_3

with three register reloads in the loop, while another version in the code also reloads the p increment value, for a total of four reloads. I can't see why the generated code should be any different from that for the original ticket code.

I've investigated this and believe I know the reason; it isn't related to this ticket's issue but rather is likely the root cause of the issue in ticket #12808.

In short, it is all a matter of strictness: the slight change to the original code, advancing the prime index outside the cull loop, makes the call to "cullp" non-strict, as can be seen from the much larger heap required. This does not create all that many boxed thunks relative to the amount of culling work to be done (the number of loops times the number of primes up to the root of the sieve range), so the GC time and thunk processing do not significantly impact the execution time; however, it triggers the root cause of ticket #12808: for boxed-thunk code the optimizer does not lift loop-invariant code out of the loop, whereas it does for strict code (perhaps because the order of optimizations depends on which type of code is found).

Thus, it produces this horribly inefficient assembly code.
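The two loop shapes being compared can be modelled with a small standalone sketch (hypothetical names; it only illustrates the structural difference between the two recursion patterns, not the Core that GHC actually generates for the ticket code):

```haskell
import Control.Monad.ST (runST)
import Data.STRef (newSTRef, readSTRef, modifySTRef')

-- Shape A (original ticket code): the inner loop itself calls the next
-- outer step, so the continuation is taken as a tail call inside inner.
shapeA :: Int -> Int
shapeA lmt = runST $ do
  acc <- newSTRef 0
  let outer i
        | i > 3 = readSTRef acc
        | otherwise =
            let inner j
                  | j > lmt   = outer (i + 1)   -- next step inside inner
                  | otherwise = modifySTRef' acc (+ 1) >> inner (j + i + 1)
            in inner 0
  outer 1

-- Shape B (this comment's variant): inner returns (), and the caller
-- sequences the next outer step afterwards; semantically identical, but
-- the continuation is no longer in tail position of the inner loop.
shapeB :: Int -> Int
shapeB lmt = runST $ do
  acc <- newSTRef 0
  let outer i
        | i > 3 = readSTRef acc
        | otherwise =
            let inner j
                  | j > lmt   = return ()
                  | otherwise = modifySTRef' acc (+ 1) >> inner (j + i + 1)
            in inner 0 >> outer (i + 1)         -- next step sequenced after
  outer 1

main :: IO ()
main = print (shapeA 100 == shapeB 100)         -- same result either way
```

Both shapes compute the same result; the point of the ticket discussion is that only shape A reliably keeps the loop strict and allows the loop-invariant loads to be hoisted.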

I'll be following up with more comments pushing the cause of ticket #12808 over on that ticket.

Last edited 3 years ago by GordonBGood (previous) (diff)

comment:4 Changed 3 years ago by bgamari

comment:5 Changed 3 years ago by bgamari

Description: modified (diff)

comment:6 Changed 3 years ago by bgamari

Description: modified (diff)

comment:7 Changed 3 years ago by bgamari

Milestone: 8.2.1 → 8.4.1

It doesn't look like anything will happen on this for 8.2.

comment:8 Changed 20 months ago by bgamari

Milestone: 8.4.1 → 8.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:9 Changed 15 months ago by bgamari

Milestone: 8.6.1 → 8.8.1

These won't be fixed for 8.6, bumping to 8.8.

comment:10 Changed 9 months ago by osa1

Milestone: 8.8.1 → 8.10.1

Bumping milestones of low-priority tickets.

Note: See TracTickets for help on using tickets.