Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#10491 closed bug (wontfix)

Regression, simplifier explosion with Accelerate, cannot compile, increasing tick factor is not a workaround

Reported by: robertce Owned by: bgamari
Priority: highest Milestone: 7.10.2
Component: Compiler Version: 7.10.1
Keywords: Cc: tmcdonell@…, george, gidyn
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by robertce)

In Accelerate we are encountering problems with the simplifier exploding the number of terms, which consequently cause huge compile times. This has only happened since the release of 7.10.1 and it appears to still be happening in the ghc-7.10 branch.

In the attached Slice.hs you can see what is about the worst case example. Trying to compile this with -O2 yields

[1 of 1] Compiling Data.Array.Accelerate.CUDA.Array.Slice ( Slice.hs, Slice.o )
ghc: panic! (the 'impossible' happened)
  (GHC version 7.10.1.20150602 for x86_64-apple-darwin):
	Simplifier ticks exhausted
  When trying UnfoldingDone $j_s1FRp
  To increase the limit, use -fsimpl-tick-factor=N (default 100)
  If you need to do this, let GHC HQ know, and what factor you needed
  To see detailed counts use -ddump-simpl-stats
  Total ticks: 1668604

Bumping the tick factor to 150 stops the panic from happening, but at that point the simplifier blows up instead. See output.txt for the output of ghc -v. After half an hour we end up with ~4 million terms. At this point ghc takes up about 9GB of memory and I have to kill it.

In order to reproduce this, you will need the accelerate package and all its various dependencies. It doesn't have to be the latest from github. The version on hackage also works with the example I have given. I will try to come up with a test case that is not so heavy on dependencies, but I wanted to ensure this was raised before 7.10.2 is released.

Steps to reproduce:

cabal install accelerate
ghc -O2 -fsimpl-tick-factor=150 Slice.hs

Attachments (2)

Slice.hs (12.3 KB) - added by robertce 4 years ago.
output.txt (6.1 KB) - added by robertce 4 years ago.

Download all attachments as: .zip

Change History (50)

Changed 4 years ago by robertce

Attachment: Slice.hs added

Changed 4 years ago by robertce

Attachment: output.txt added

comment:1 Changed 4 years ago by robertce

Description: modified (diff)

comment:2 Changed 4 years ago by George

This doesn't seem to be a problem on the 7.10.2 alpha Haskell Platform:

ghc --version

The Glorious Glasgow Haskell Compilation System, version 7.10.1.20150601

cabal install --ghc-options="-O2" accelerate --reinstall

cabal install --ghc-options="-O2" accelerate --reinstall
Resolving dependencies...
In order, the following will be installed:
accelerate-0.15.1.0 (reinstall)
Warning: Note that reinstalls are always dangerous. Continuing anyway...
Configuring accelerate-0.15.1.0...
Building accelerate-0.15.1.0...
Installed accelerate-0.15.1.0
Updating documentation index /Users/gcolpitts/Library/Haskell/share/doc/x86_64-osx-ghc-7.10.1.20150601/index.html

FWIW, the following didn't give an error, but I'm not sure it did the "right" thing; in any case, the above should be definitive:

cabal install -j3 -O2 accelerate

cabal install -j3 -O2 accelerate
Resolving dependencies...
Configuring fclabels-2.0.2.2...
Configuring hashtables-1.2.0.2...
Building hashtables-1.2.0.2...
Building fclabels-2.0.2.2...
Installed fclabels-2.0.2.2
Installed hashtables-1.2.0.2
Configuring accelerate-0.15.1.0...
Building accelerate-0.15.1.0...
Installed accelerate-0.15.1.0
Updating documentation index /Users/gcolpitts/Library/Haskell/share/doc/x86_64-osx-ghc-7.10.1.20150601/index.html

comment:3 Changed 4 years ago by robertce

Accelerate itself compiles without issue. The issue arises when compiling the attached Slice.hs, which is not part of the accelerate package. Sorry if that was unclear. It's actually part of an experimental version of the accelerate-cuda package, but I pulled it out to reduce the number of dependencies required. Are you able to compile Slice.hs with the Haskell Platform?

comment:4 Changed 4 years ago by robertce

Description: modified (diff)

comment:5 Changed 4 years ago by George

Nope, I cannot successfully compile; I reproduce the problem:

ghc -O2 Slice.hs
[1 of 1] Compiling Data.Array.Accelerate.CUDA.Array.Slice ( Slice.hs, Slice.o )
ghc: panic! (the 'impossible' happened)
  (GHC version 7.10.1.20150601 for x86_64-apple-darwin):
	Simplifier ticks exhausted
  When trying RuleFired Class op bound
  To increase the limit, use -fsimpl-tick-factor=N (default 100)
  If you need to do this, let GHC HQ know, and what factor you needed
  To see detailed counts use -ddump-simpl-stats
  Total ticks: 1668404

Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

when I do

 ghc -O2 -fsimpl-tick-factor=150 Slice.hs

it quickly uses over 11G of memory. I'm not sure what the peak memory use is or how long it took to compile, as I let it run overnight (I have 12G of RAM on my Mac). The resulting binary of the 321-line source file is 51 MB.

Last edited 4 years ago by simonpj (previous) (diff)

comment:6 Changed 4 years ago by George

Summary: Simplifier explosion with Accelerate → Simplifier explosion with Accelerate, increasing tick factor is not a workaround

I changed the summary to emphasize that, unlike most of these bugs, increasing the tick factor is not a workaround for the filer. In fact, there does not seem to be any workaround at all for the filer's problem: he cannot compile Slice.hs with -O2.

Last edited 4 years ago by George (previous) (diff)

comment:7 Changed 4 years ago by George

Summary: Simplifier explosion with Accelerate, increasing tick factor is not a workaround → Simplifier explosion with Accelerate, cannot compile, increasing tick factor is not a workaround

comment:8 Changed 4 years ago by tmcdonell

Cc: tmcdonell@… added

comment:9 Changed 4 years ago by George

Cc: george added

comment:10 Changed 4 years ago by George

Summary: Simplifier explosion with Accelerate, cannot compile, increasing tick factor is not a workaround → Regression: Simplifier explosion with Accelerate, cannot compile, increasing tick factor is not a workaround

comment:11 Changed 4 years ago by George

Summary: Regression: Simplifier explosion with Accelerate, cannot compile, increasing tick factor is not a workaround → Regression, simplifier explosion with Accelerate, cannot compile, increasing tick factor is not a workaround

comment:12 Changed 4 years ago by gidyn

Cc: gidyn added
Type of failure: None/Unknown → Compile-time performance bug

comment:13 Changed 4 years ago by thoughtpolice

Priority: normal → highest

Okay, I'm bumping this to highest priority; Ben and I can investigate and talk with Simon about it in our meeting tomorrow.

comment:14 Changed 4 years ago by bgamari

George, if I understand correctly you were able to compile Slice.hs with GHC 7.10.1 with reasonable memory and CPU usage if passed -fsimpl-tick-factor=150. Is this correct?

Last edited 4 years ago by bgamari (previous) (diff)

comment:15 Changed 4 years ago by bgamari

For the record, my notes on this bug can be found here, https://github.com/bgamari/ghc-T10491/blob/master/notes.mkd.

Interestingly enough, I have not observed reasonable compiler behavior even well before the 7.10.1 release (I'm currently looking at 029a296a770addbd096bbfd6de0936327ee620d4 which is from March 2015).

Last edited 4 years ago by bgamari (previous) (diff)

comment:16 in reply to:  14 ; Changed 4 years ago by George

Replying to bgamari:

I was able to compile it with -fsimpl-tick-factor=150; I'm not sure if memory use was reasonable, and CPU usage was close to 100%. As I wrote above:

when I do

ghc -O2 -fsimpl-tick-factor=150 Slice.hs

it quickly uses over 11G of memory. I'm not sure what the peak memory use is or how long it took to compile as I let it run overnight (I have 12G of RAM on my Mac). The resulting binary of the 321 line source file is 51 MB

comment:17 in reply to:  16 ; Changed 4 years ago by bgamari

Replying to George:

I was able to compile it with -fsimpl-tick-factor=150, not sure if memory use was reasonable, cpu usage was close to 100%. As I wrote above:

when I do

ghc -O2 -fsimpl-tick-factor=150 Slice.hs

it quickly uses over 11G of memory. I'm not sure what the peak memory use is or how long it took to compile as I let it run overnight (I have 12G of RAM on my Mac). The resulting binary of the 321 line source file is 51 MB

I am a bit confused by the fact that the bug description seems to suggest that this behavior began at some point since the release of 7.10.1,

This has only happened since the release of 7.10.1 and it appears to still be happening in the ghc-7.10 branch.

However I am finding that the commits prior to the 7.10.1 release also exhibit the behavior.

Last edited 4 years ago by bgamari (previous) (diff)

comment:18 in reply to:  17 ; Changed 4 years ago by robertce

Replying to bgamari:

What I'm a bit unclear on is that the bug description seems to suggest this behavior began at some point since the release of 7.10.1,

This has only happened since the release of 7.10.1 and it appears to still be happening in the ghc-7.10 branch.

However I am finding that the commits prior to the 7.10.1 release also exhibit the behavior.

I'm not surprised it's also happening with commits prior to 7.10.1. By saying it's only happened since the release of 7.10.1, I was speaking in release terms. It works with 7.8.4, but not 7.10.1 or the current 7.10 branch.

comment:19 in reply to:  18 Changed 4 years ago by bgamari

Replying to robertce:

I'm not surprised it's also happening with commits prior to 7.10.1. By saying it's only happened since the release of 7.10.1, I was speaking in release terms. It works with 7.8.4, but not 7.10.1 or the current 7.10 branch.

Ahh, I see. English can be a frustratingly ambiguous language at times.

In other news, it appears that 7.10's behavior begins to diverge from that of 7.8.3 at float out,

  • after the initial simplifier phase both 7.8 and 7.10 have roughly 3000 terms, 7000 types
  • after float out the program has 3000 terms, 8000 types in 7.8 whereas it has 12000 terms and 60000 types in 029a296a770addbd096bbfd6de0936327ee620d4

comment:20 Changed 4 years ago by simonpj

I've just compiled Slice.hs with HEAD quite easily

bash$ ghc-stage2 -dshow-passes Slice.hs -package accelerate -O2 -fforce-recomp
....
   4,168,610,960 bytes allocated in the heap
     973,775,616 bytes copied during GC
      67,207,400 bytes maximum residency (14 sample(s))
       2,684,880 bytes maximum slop
             185 MB total memory in use (0 MB lost due to fragmentation)

The program never gets vast:

[1 of 1] Compiling Data.Array.Accelerate.CUDA.Array.Slice ( Slice.hs, Slice.o )
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization)
  = {terms: 2,214, types: 3,698, coercions: 26}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 2,183, types: 3,841, coercions: 20}
Result size of Simplifier iteration=2
  = {terms: 2,182, types: 3,835, coercions: 20}
Result size of Simplifier
  = {terms: 2,182, types: 3,835, coercions: 20}
*** Specialise:
Result size of Specialise
  = {terms: 6,196, types: 26,056, coercions: 470}
*** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 7,738, types: 35,962, coercions: 470}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 9,756, types: 45,944, coercions: 10,471}
Result size of Simplifier iteration=2
  = {terms: 10,295, types: 49,489, coercions: 21,914}
Result size of Simplifier iteration=3
  = {terms: 9,724, types: 46,649, coercions: 14,825}
Result size of Simplifier iteration=4
  = {terms: 9,694, types: 46,444, coercions: 14,376}
Result size of Simplifier
  = {terms: 9,694, types: 46,444, coercions: 14,376}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 9,301, types: 45,776, coercions: 13,996}
Result size of Simplifier iteration=2
  = {terms: 9,267, types: 45,696, coercions: 13,846}
Result size of Simplifier iteration=3
  = {terms: 9,268, types: 45,686, coercions: 13,744}
Result size of Simplifier iteration=4
  = {terms: 9,268, types: 45,681, coercions: 13,682}
Result size of Simplifier
  = {terms: 9,268, types: 45,681, coercions: 13,682}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 21,154, types: 76,735, coercions: 30,042}
Result size of Simplifier iteration=2
  = {terms: 18,674, types: 69,540, coercions: 22,805}
Result size of Simplifier iteration=3
  = {terms: 18,479, types: 67,598, coercions: 19,578}
Result size of Simplifier iteration=4
  = {terms: 17,773, types: 64,887, coercions: 16,920}
Result size of Simplifier
  = {terms: 17,773, types: 64,887, coercions: 16,920}
*** Float inwards:
Result size of Float inwards
  = {terms: 17,773, types: 64,887, coercions: 16,920}
*** Called arity analysis:
Result size of Called arity analysis
  = {terms: 17,773, types: 64,887, coercions: 16,920}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 17,458, types: 63,723, coercions: 15,764}
Result size of Simplifier iteration=2
  = {terms: 17,528, types: 63,751, coercions: 15,458}
Result size of Simplifier iteration=3
  = {terms: 17,560, types: 63,758, coercions: 15,272}
Result size of Simplifier iteration=4
  = {terms: 17,551, types: 63,748, coercions: 15,182}
Result size of Simplifier
  = {terms: 17,551, types: 63,748, coercions: 15,182}
*** Demand analysis:
Result size of Demand analysis
  = {terms: 17,551, types: 63,748, coercions: 15,182}
*** Worker Wrapper binds:
Result size of Worker Wrapper binds
  = {terms: 21,047, types: 77,338, coercions: 15,182}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 19,533, types: 69,686, coercions: 15,210}
Result size of Simplifier iteration=2
  = {terms: 18,528, types: 67,708, coercions: 15,346}
Result size of Simplifier iteration=3
  = {terms: 18,293, types: 67,029, coercions: 15,164}
Result size of Simplifier
  = {terms: 18,293, types: 67,029, coercions: 15,164}
*** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = True})
  = {terms: 19,134, types: 72,375, coercions: 15,164}
*** Common sub-expression:
Result size of Common sub-expression
  = {terms: 17,353, types: 61,395, coercions: 5,875}
*** Float inwards:
Result size of Float inwards
  = {terms: 17,353, types: 61,395, coercions: 5,875}
*** Liberate case:
Result size of Liberate case
  = {terms: 17,353, types: 61,395, coercions: 5,875}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 15,872, types: 53,291, coercions: 6,049}
Result size of Simplifier
  = {terms: 15,745, types: 52,388, coercions: 5,875}
*** SpecConstr:
Result size of SpecConstr
  = {terms: 16,083, types: 52,968, coercions: 5,875}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 16,074, types: 52,968, coercions: 5,875}
Result size of Simplifier iteration=2
  = {terms: 16,075, types: 53,003, coercions: 5,875}
Result size of Simplifier
  = {terms: 16,075, types: 53,003, coercions: 5,875}
*** Tidy Core:
Result size of Tidy Core
  = {terms: 10,807, types: 36,367, coercions: 5,684}
*** CorePrep:
Result size of CorePrep
  = {terms: 13,102, types: 39,418, coercions: 5,688}
*** Stg2Stg:
*** CodeGen:
*** Assembler:


comment:21 Changed 4 years ago by bgamari

The blow-up in 7.10 appears to be due to the introduction of silent superclass parameters, a notion which has been replaced in master (namely a6f0f5ab45b2643b561e0a0a54a4f14745ab2152 and surrounding commits). I'm working on backporting these to the 7.10 branch.

comment:22 Changed 4 years ago by bgamari

Unfortunately, bringing a6f0f5ab45b2643b561e0a0a54a4f14745ab2152 to ghc-7.10 is a bit trickier than I was hoping it would be.

I haven't yet taken a careful look at the patches fixing superclass generation.

Last edited 4 years ago by bgamari (previous) (diff)

comment:23 Changed 4 years ago by simonpj

I looked at this a bit too, but I'm now in an all-day meeting.

One idea. I think that 95% of the pain comes from cross-module specialisation, dramatically exacerbated by the silent superclasses thing. If I'm right, a much quicker fix would be -fno-specialise.

A less draconian fix would be an extra flag -fno-cross-module-specialise, which turns Specialise.specImports into a no-op. GHC never used to do cross-module specialisation. Maybe we should even disable it by default in 7.10? Anyway, implementing such a flag would take only a minute or two, and would be a Good Thing anyway.

Moving HEAD's fix for silent superclasses is a pretty complicated solution.

Ben, suggestions:

  • See if -fno-specialise solves the problem
  • Add the flag and see if that too solves the problem
  • Add some stats gathering in Specialise that reports how many imported functions are specialised, at how many different types, and how big the specialised code is. The idea is so we can more easily see when there's a specialisation blow-up.
  • If so we can switch off cross-module specialisation by default (although I'd prefer it to be on in 7.12)

Simon

comment:24 Changed 4 years ago by George

-fno-specialise does not seem to solve the problem:

( time ghc -v -fsimpl-tick-factor=150 -fno-specialise -O2 Slice.hs ) > t.out 2>&1 &

tail t.out
  = {terms: 4,218,587, types: 4,087,814, coercions: 113,719}
Result size of Simplifier
  = {terms: 4,218,587, types: 4,087,814, coercions: 113,719}
*** Float inwards:
Result size of Float inwards
  = {terms: 4,218,587, types: 4,087,814, coercions: 113,719}
*** Called arity analysis:
Result size of Called arity analysis
  = {terms: 4,218,587, types: 4,087,814, coercions: 113,719}
*** Simplifier:

Sorry for the spam caused by multiple updates; I will be more cautious going forward.

Last edited 4 years ago by George (previous) (diff)

comment:25 Changed 4 years ago by bgamari

George, don't worry about the edits; I am quite guilty of long strings of edits myself and as far as I can tell Trac only produces a message on the initial submission.

Simon, I am sad to say that I can confirm that -fno-specialise does not help this issue. That being said, I'll see to it that we have an enabled-by-default -fcross-module-specialise option and some diagnostics output going forward.

comment:26 Changed 4 years ago by simonpj

Well I suggest you use -fno-specialise in any further investigations. It doesn't fix it, but it'll remove distracting clutter from what you see.

The silent-superclass thing is not really the cause. (And hence I don't think it's worth a heroic effort to port HEAD's fix to 7.10.) It certainly generates a lot more dictionaries, so perhaps there might be twice as much code as without. But you are seeing MUCH more than that! So maybe there is a bug lurking in HEAD too; it just happens that the code generated by silent superclasses in 7.10 tickles it.

So the hunt goes on. Monday is a washout for me.

Simon

comment:27 Changed 4 years ago by bgamari

George, what GHC version did you test -fno-specialise on? While yesterday I was able to confirm that -fno-specialise seemed to make no difference on a test machine running what should have been 7.10.1, I am now having trouble replicating this on my laptop. Unfortunately I no longer have access to the test environment on which I tested this yesterday, but clearly something was inconsistent.

I am now seeing on multiple machines that -fno-specialise indeed allows things to compile,

$ ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.10.1
$ time ghc Slice.hs -fforce-recomp -O2 -fno-specialise
[1 of 1] Compiling Slice            ( Slice.hs, Slice.o )

real    0m3.759s
user    0m1.688s
sys     0m0.044s
$ time ghc Slice.hs -fforce-recomp -O2 
[1 of 1] Compiling Slice            ( Slice.hs, Slice.o )
^C

real    0m51.103s
user    0m44.336s
sys     0m0.948s

I am now looking at whether disabling only cross-module specialisation is enough to eliminate the blow-up.

comment:28 Changed 4 years ago by bgamari

Phab:D999 adds a -fno-cross-module-specialisation option that works around the blow-up; the blow-up itself will likely still be present in 7.10.2. I'm still trying to work out exactly what is happening.

comment:29 Changed 4 years ago by George

ghc -V
The Glorious Glasgow Haskell Compilation System, version 7.10.1.20150612

It seems that the order of the args determines if it succeeds or not.

With -fno-specialise before -O2 it fails; it seems the flag is being ignored:

bash-3.2$ time ghc -fno-specialise -fforce-recomp -O2 Slice.hs
[1 of 1] Compiling Data.Array.Accelerate.CUDA.Array.Slice ( Slice.hs, Slice.o )
ghc: panic! (the 'impossible' happened)
  (GHC version 7.10.1.20150612 for x86_64-apple-darwin):
	Simplifier ticks exhausted
  When trying UnfoldingDone $j_s1FS5
  To increase the limit, use -fsimpl-tick-factor=N (default 100)
  If you need to do this, let GHC HQ know, and what factor you needed
  To see detailed counts use -ddump-simpl-stats
  Total ticks: 1668604

Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug


real	0m19.542s
user	0m18.451s
sys	0m0.901s

bash-3.2$ time ghc -fno-specialise -fforce-recomp -O2 -v4 Slice.hs 2>&1 | fgrep Specialise
*** Specialise:
==================== Specialise ====================
Result size of Specialise

real	1m49.604s
user	1m42.560s
sys	0m14.477s

With -fno-specialise after -O2 it succeeds as the flag works as expected:

bash-3.2$ time ghc -O2 -fno-specialise -fforce-recomp Slice.hs
[1 of 1] Compiling Data.Array.Accelerate.CUDA.Array.Slice ( Slice.hs, Slice.o )

real	0m3.299s
user	0m3.085s
sys	0m0.158s

$ time ghc -fforce-recomp -O2 -fno-specialise -v4 Slice.hs 2>&1 | fgrep Specialise

real	1m41.641s
user	1m32.874s
sys	0m21.100s

I assume I should file a bug about GHC's processing of command-line options.

Last edited 4 years ago by George (previous) (diff)

comment:30 Changed 4 years ago by bgamari

Having recently seen how the -O option is implemented, it's actually not terribly surprising that it's order dependent. Essentially -O2 is equivalent to passing a bunch of individual -f options explicitly. -f options themselves just flip flags in the parser state and are therefore order dependent.

I'll admit that this behavior is a bit surprising, but then again I'm not entirely sure how to improve the situation without sacrificing expressive power. I wonder how other compilers manage this.
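A minimal sketch of the mechanism described above (hypothetical code, not GHC's actual parser): flags fold left to right over a settings record, and -O2 expands to a bundle of -f flags at its position on the command line, so the last mention of a flag wins.

```haskell
-- Hypothetical model of order-dependent flag processing; the field,
-- function names, and the one-flag -O2 bundle are all made up.
data Settings = Settings { specialise :: Bool } deriving Show

-- -O2 expands to a bundle of -f flags (the real bundle is much larger).
expand :: String -> [String]
expand "-O2" = ["-fspecialise"]
expand f     = [f]

-- Each -f flag flips a field in the settings record.
applyFlag :: Settings -> String -> Settings
applyFlag s "-fspecialise"    = s { specialise = True }
applyFlag s "-fno-specialise" = s { specialise = False }
applyFlag s _                 = s

-- Left-to-right fold: later flags override earlier ones.
parse :: [String] -> Settings
parse = foldl applyFlag (Settings False) . concatMap expand
```

In this model, parse ["-fno-specialise", "-O2"] ends with specialisation enabled (the -O2 bundle re-enables it), while parse ["-O2", "-fno-specialise"] ends with it disabled, matching the behaviour George observed.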

comment:31 Changed 4 years ago by George

To a naive user who doesn't know how the -O2 option is implemented, it is very surprising. I think something needs to be changed. AFAIK, there is no documentation that it is order dependent. A language that is not order dependent should not have command line options that are!

At a minimum, if -O2 combined with -fno<something> means -f<something>, -fno<something>, and some remaining -f<options>, then the user should get a message about which of the <something> or no<something> options has higher precedence due to their order, and perhaps additional text about how to reverse that precedence with a different order.

Ideally, in my opinion, -fno<something> should have higher precedence than the -f<something> option implicit in -O2, and a message should be emitted saying that. This would remove order dependence. At least, this is what I think off the top of my head from my naive viewpoint.

I believe there is a ticket for an ER or task to improve command line options. I'll try to find it.

In the simpler case of no -O2, the following work differently, and there is no message telling which option is taking precedence. They should probably both give an error:

bash-3.2$ ghc -fforce-recomp -fspecialise -fno-specialise Slice.hs
[1 of 1] Compiling Data.Array.Accelerate.CUDA.Array.Slice ( Slice.hs, Slice.o )
bash-3.2$ ghc -fforce-recomp -fno-specialise -fspecialise Slice.hs
[1 of 1] Compiling Data.Array.Accelerate.CUDA.Array.Slice ( Slice.hs, Slice.o )

Last edited 4 years ago by George (previous) (diff)

comment:32 Changed 4 years ago by bgamari

George, sure, I agree that the current behavior is quite surprising. Let's continue the command line parsing discussion on another issue.

On the issue of the specialiser, it seems that cross-module specialisation generates a fairly large number of dictionaries (along with superclasses). Most of these are for various index types.

For instance, one then finds Show, Elt, and Shape implementations for types of the form (((Z :. Int) :. Int) :. Int), for all types up to (((((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. A). (N.B. data tail :. head = tail :. head is one of the principal types in accelerate and is used to encode array indices.) Along with the dictionaries, the specialiser also produces specialised implementations, which in the case of Shape constitute quite a bit of code,

class (Elt sh, Elt (Any sh), Repr.Shape (EltRepr sh)) => Shape sh where
  dim    :: sh -> Int
  size   :: sh -> Int
  ignore :: sh
  intersect :: sh -> sh -> sh
  toIndex   :: sh -> sh -> Int
  fromIndex :: sh -> Int -> sh
  bound  :: sh -> sh -> Boundary a -> Either a sh
  iter  :: sh -> (sh -> a) -> (a -> a -> a) -> a -> a
  iter1 :: sh -> (sh -> a) -> (a -> a -> a) -> a
  rangeToShape ::  (sh, sh) -> sh
  shapeToRange ::  sh -> (sh, sh)
  shapeToList :: sh -> [Int]
  listToShape :: [Int] -> sh
  sliceAnyIndex :: sh -> Repr.SliceIndex (EltRepr (Any sh)) (EltRepr sh) () (EltRepr sh)

Here is a representative Shape dictionary,

$s$fShape:._sbUF :: Shape ((Z :. Int) :. Int)
$s$fShape:._sbUF =
  Data.Array.Accelerate.Array.Sugar.D:Shape
    @ ((Z :. Int) :. Int)
    $dElt_a9Ww
    $dElt_a9Yh
    ($dShape_a9Zc `cast` ...)
    (Data.Array.Accelerate.Array.Sugar.$fShape:._$cdim
       @ (Z :. Int)
       $dElt_a9Ww
       $dElt_a9Yh
       ($dShape_a9Zc `cast` ...)
       $dShape_a9Yq)
    (Data.Array.Accelerate.Array.Sugar.$fShape:._$csize
       @ (Z :. Int)
       $dElt_a9Ww
       $dElt_a9Yh
       ($dShape_a9Zc `cast` ...)
       $dShape_a9Yq)
    ...
    (Data.Array.Accelerate.Array.Sugar.$fShape:._$csliceAnyIndex
       @ (Z :. Int)
       $dElt_a9Ww
       $dElt_a9Yh
       ($dShape_a9Zc `cast` ...)
       $dShape_a9Yq)

$s$fShape:._$csize_sbT7 :: Z :. Int -> Int
$s$fShape:._$csize_sbT7 =
  \ (eta_abqn :: Z :. Int) ->
    size
      @ (Data.Array.Accelerate.Array.Sugar.EltRepr (Z :. Int))
      ($dShape_a9YB `cast` ...)
      (Data.Array.Accelerate.Array.Sugar.fromElt
         @ (Z :. Int) $dElt_a9XT eta_abqn)

-- other functions elided for conciseness

In addition, one also finds Eq, ArrayElt, and Shape dictionaries and implementations for tuple types from (((), Int), Int) to ((((((((((), Int), Int), Int), Int), Int), Int), Int), Int), Int). I believe that these tuple instances arise as a result of the :. instances, due to the Shape (EltRepr sh) constraint on Shape.

In addition, one also finds ArrayElt dictionaries for tuples of the same shape but with Int replaced by (), e.g. (((), ()), ()).
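To illustrate the correspondence between the (:.) index types and the tuple types above, here is a cut-down sketch: the names mirror accelerate's, but these definitions are simplified assumptions for this example, not the library's real ones.

```haskell
{-# LANGUAGE TypeFamilies, TypeOperators #-}

-- Simplified sketch (assumed definitions, not accelerate's actual code).
data Z = Z
data tail :. head = tail :. head
infixl 3 :.

-- Each (:. Int) dimension adds one tuple layer to the representation,
-- which is why the tuple dictionaries track the (:.) dictionaries
-- one for one.
type family EltRepr sh
type instance EltRepr Z          = ()
type instance EltRepr (t :. Int) = (EltRepr t, Int)

-- ((Z :. Int) :. Int) :. Int is represented as ((((), Int), Int), Int):
exampleRepr :: EltRepr (((Z :. Int) :. Int) :. Int)
exampleRepr = ((((), 1), 2), 3)
```

Under this model, a dictionary specialised at a three-dimensional index drags along representation dictionaries for each of the three nested tuple layers.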

I have yet to look carefully at what the simplifier does to these definitions that causes such pain. Something clearly goes quite awry as the program goes from,

Result size of Simplifier iteration=1
  = {terms: 57,908, types: 179,084, coercions: 66,346}

to,

Result size of Simplifier iteration=2
  = {terms: 4,975,110, types: 5,662,488, coercions: 68,643}

I did notice that the PreInlineUnconditionally count tends to be quite high in the culpable passes.

Regardless, all of these dictionaries do explain the rather large programs seen post-specialisation.

Last edited 4 years ago by bgamari (previous) (diff)

comment:33 Changed 4 years ago by simonpj

  • There is, so far as I can see, no documentation of the fact that command-line arguments are applied in order (so -fspecialise -fno-specialise has the same effect as -fno-specialise). Ben: could you add documentation of this to the preamble of section 4.10 in the user manual? In section 4.10.1, when discussing -O, stress the order-dependence with individual flags.
  • Let's debate whether or not this order dependence is a good design on another ticket.
  • I think we have a workaround for this particular problem (-fno-specialise)
  • But as Ben mentions, we still don't know why it leads to such a stunning blow-up, so let's keep looking for that.

comment:34 Changed 4 years ago by bgamari

I've opened #10560 to discuss the command-line parser behavior. George, perhaps you want to write down some thoughts there (or just paste Comment 31 from above for reference)?

Simon, documentation is on its way.

I'll be looking at the simplifier blow-up shortly.

comment:35 Changed 4 years ago by bgamari

One thing in the post-explosion Core that I've noticed is some extremely large bindings of type,

lvl_scb5
  :: forall a_aaJ9.
     (((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. Int
     -> (((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. Int
     -> Data.Array.Accelerate.Type.Boundary a_aaJ9
     -> Either
          a_aaJ9
          ((((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. Int)

with variously sized indices. The size of the binding (determined very roughly by the number of lines in the textual Core representation) scales very strongly with the size of the index,

  • (((Z :. Int) :. Int) :. A) :. Int: 4,383 lines
  • ((((Z :. Int) :. Int) :. A) :. A) :. Int: 11,777 lines
  • (((((Z :. Int) :. Int) :. A) :. A) :. A) :. Int: 34,246 lines
  • ((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. Int: 108,745 lines
  • (((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. Int: 351,252 lines

The right-hand-sides appear to be deeply nested case analyses on Bools. I'm still looking for where these bindings originate. All in all, there is a clear pattern here: each additional index dimension grows the binding by roughly a factor of three.

There is also this 1,116,847-line-long binding,

$s$fShape:._$cbound_sbQ3
  :: forall a_aaJ9.
     ((((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. A)
     :. Int
     -> ((((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. A)
        :. Int
     -> Data.Array.Accelerate.Type.Boundary a_aaJ9
     -> Either
          a_aaJ9
          (((((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. A)
           :. Int)
Last edited 4 years ago by bgamari (previous) (diff)

comment:36 in reply to:  34 Changed 4 years ago by George

Replying to bgamari: Ben and Simon, thanks for explaining this. I've added my comments to #10560. Sorry to distract you from the bug here; I was just surprised by the command-line option issue. Adding the documentation will definitely be a big help.

I've opened #10560 to discuss the command-line parser behavior. George, perhaps you want to write down some thoughts there (or just paste Comment 31 from above for reference)?

Simon, documentation is on its way.

I'll be looking at the simplifier blow-up shortly.

comment:37 Changed 4 years ago by bgamari

At this point I'm pretty certain that bound for the deep (:.) instances is the culprit here. For instance, the implementation for ((((((((Z :. Int) :. Int) :. A) :. A) :. A) :. A) :. A) :. A) grows from {terms: 10, types: 59, coercions: 147} to {terms: 1,375,016, types: 1,282,534, coercions: 14,151} in one simplifier iteration (phase 0).

Let's follow the specialized implementation of bound for ((Z :. Int) :. Int) :. Int, which clearly exhibits this explosion, but on a much smaller scale (from {terms: 10, types: 23, coercions: 33} to {terms: 2,789, types: 2,743, coercions: 78} over the same simplifier iteration; it's not clear that the smaller types are exploding quite as much, notably they still have 0 coercions after the simplifier does its work).

It starts simple enough,

$s$fShape:._$cbound_sbQv
  :: forall a_aaK6.
     ((Z :. Int) :. Int) :. Int
     -> ((Z :. Int) :. Int) :. Int
     -> Data.Array.Accelerate.Type.Boundary a_aaK6
     -> Either a_aaK6 (((Z :. Int) :. Int) :. Int)
$s$fShape:._$cbound_sbQv =
  \ (@ a142_abqH)
    (w4_abqI :: ((Z :. Int) :. Int) :. Int)
    (w5_abqJ :: ((Z :. Int) :. Int) :. Int)
    (w6_abqK :: Data.Array.Accelerate.Type.Boundary a142_abqH) ->
    Data.Array.Accelerate.Array.Sugar.$w$cbound
      @ ((Z :. Int) :. Int)
      $s$fElt:._sbKS
      ($s$fShape(,)_sbz9 `cast` ...)
      @ a142_abqH
      w4_abqI
      w5_abqJ
      w6_abqK

This gets simplified to the code in this gist: https://gist.github.com/bgamari/c6e360478b0e98df62e3.

More analysis shortly.

comment:38 Changed 4 years ago by bgamari

Let's look at bound more closely,

  -- |Apply a boundary condition to an index.
  bound  :: sh -> sh -> Boundary a -> Either a sh

It takes the size of an array, an index, and a Boundary a value describing how lookups falling outside of the array should be handled. The result is either the index that should be looked up or a fixed value.
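To make those semantics concrete, here is a minimal, self-contained sketch of the boundary handling for a single dimension. bound1 is a hypothetical helper for illustration only, not the real Accelerate API; the clamp/mirror/wrap arithmetic follows the instance quoted later in this thread.

```haskell
-- Minimal sketch of 'bound' semantics for one dimension ('bound1' is an
-- illustrative name, not part of accelerate).
data Boundary a = Clamp | Mirror | Wrap | Constant a

bound1 :: Int -> Int -> Boundary a -> Either a Int
bound1 sz i bndy
  | i < 0     = case bndy of            -- index falls before the array
      Clamp      -> Right 0
      Mirror     -> Right (-i)
      Wrap       -> Right (sz + i)
      Constant e -> Left e              -- lookup replaced by a fixed value
  | i >= sz   = case bndy of            -- index falls past the array
      Clamp      -> Right (sz - 1)
      Mirror     -> Right (sz - (i - sz + 2))
      Wrap       -> Right (i - sz)
      Constant e -> Left e
  | otherwise = Right i                 -- in bounds: use the index as-is
```

For example, bound1 5 (-2) Clamp gives Right 0 and bound1 5 7 Wrap gives Right 2.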

In light of this it seems clear where the deep case analyses appearing in the simplified $s$fShape:._$cbound_sbQv are coming from: the compiler is expanding tests for whether each index falls within the array. The test being replicated can be found at https://github.com/AccelerateHS/accelerate/blob/release/0.15/Data/Array/Accelerate/Array/Representation.hs#L110.

Interestingly, the master branch (5803c5e566a0647139523f2aad06081dc2818ef2) decides not to inline any of the Data.Array.Accelerate.Array.Sugar.$w$cbound occurrences, which is why it is able to finish.

Also, the default implementation Data.Array.Accelerate.Array.Sugar.bound is not marked as INLINEABLE, and marking it as NOINLINE allows GHC 7.10.2 to compile. Likewise, marking bound as INLINE on master produces a blow-up similar to what we see with 7.10 (actually even worse, running out of simplifier ticks even with -fsimpl-tick-factor=150). I suspect that GHC 7.10 decides on its own that it should be INLINEABLE but master does not, resulting in the observed difference.

Regardless, it does seem as though the Accelerate folks would want a function like bound to be inlined (robertce, perhaps you could comment here?). Perhaps the real bug here is our inability to produce a reasonable amount of optimized code for this function? That being said, the definition of bound isn't exactly conducive to efficient implementation.

Last edited 4 years ago by bgamari

comment:39 in reply to:  38 ; Changed 4 years ago by robertce

Replying to bgamari:

Regardless, it does seem as though the Accelerate folks would want a function like bound to be inlined (robertce, perhaps you could comment here?). Perhaps the real bug here is our inability to produce a reasonable amount of optimized code for this function? That being said, the definition of bound isn't exactly conducive to efficient implementation.

It's actually not terribly important that bound is inlined for Accelerate. We achieve high performance by generating CUDA C or LLVM IR from our DSL rather than by producing fast Haskell code. The functions in Shape, like bound, are primarily used in our (slow) reference interpreter. If necessary, we could just put a NOINLINE pragma on bound or add -fno-specialise to the top of any affected files. I guess what we really would like is just not to have such an explosion when something like bound is defined the way it is. Having GHC figure out optimised code for it is not something I see us having any immediate need for.

By the way, thanks for looking at this and tracking it down so quickly. Sorry I couldn't have been more help.

comment:40 in reply to:  39 Changed 4 years ago by bgamari

Replying to robertce:

It's actually not terribly important that bound is inlined for Accelerate. We achieve high performance by generating CUDA C or LLVM IR from our DSL rather than by producing fast Haskell code. The functions in Shape, like bound, are primarily used in our (slow) reference interpreter. If necessary, we could just put a NOINLINE pragma on bound or add -fno-specialise to the top of any affected files.

Adding a NOINLINE would be the least intrusive solution. We will also have -fno-cross-module-specialise in 7.10.2, which likewise works around the issue, but it obviously won't be available in earlier releases.

I guess what we really would like is to just not have such an explosion when something like bound is defined in the way it is. Having GHC figure out optimised code for it is not something I see us having any immediate need for.

There are "simple" steps I can imagine we could take to avoid these explosions (e.g. stop inlining into an expression after it has reached some size) but there are likely Good Reasons (TM) why we don't currently do this.

Simon, perhaps you could comment here?

By the way, thanks for looking at this and tracking it down so quickly. Sorry I couldn't have been more help.

No worries! I'm glad I could help.
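For reference, once 7.10.2 is out, the per-file form of the workaround mentioned above could be spelled as a module pragma. This is only a sketch, assuming the flag lands under the name given in this comment and using the module from the original report:

```haskell
{-# OPTIONS_GHC -fno-cross-module-specialise #-}
module Data.Array.Accelerate.CUDA.Array.Slice where
-- ... rest of the affected module unchanged ...
```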

comment:41 Changed 4 years ago by simonpj

Interesting! Looking at the code linked in comment:38 (https://github.com/AccelerateHS/accelerate/blob/release/0.15/Data/Array/Accelerate/Array/Representation.hs#L110), we see that if bound is inlined, we get seven new calls to bound. And each of those seven will yield seven new ones, and so on. Very exponential!

Some thoughts:

  • One could rewrite the linked code to say
    ... where bound1 = bound sh ix bndy
    
    and use bound1 for all seven occurrences of bound sh ix bndy. Does that help?
  • If it does, one might have hoped that CSE would catch it. But GHC's current CSE is not very clever. I have an idea for how to improve it so that it would catch it, if anyone interested in optimisation would like to try it out. RSVP.
  • Use -ddump-inlinings to see why HEAD is choosing not to inline it.
  • In general this kind of thing is quite hard to prevent. The classic case is
     f x = f (x-1) + f (x-2)
    
    but GHC doesn't inline recursive functions, so we are ok. This case is more like
     f1 x = f2 (x-1) + f2 (x-2)
     f2 x = f3 (x-1) + f3 (x-2)
     f3 x = f4 (x-1) + f4 (x-2)
     ...etc...
    
    But the successive f's are generated by the nested instance declarations. And I tried hard NOT to prevent inlining in these cases because it can be dramatically better to allow it (e.g. look at the definition of fromIndex at the same link).

I don't know a good way to trim off unproductive exponential inlinings without killing off good inlinings too.

In short, I don't know a general solution. But any definition that has a seven-way (or even two-way) expansion over a recursive type index is going to cause trouble. It's worth looking out for these.
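To put a rough number on "very exponential": if every call inlines into seven more at the next-shallower shape, a rank-n shape yields on the order of 7^n leaf calls. This is only a back-of-the-envelope bound that ignores whatever sharing the simplifier does recover:

```haskell
-- Back-of-the-envelope: each 'bound' call inlines into seven more at the
-- next-shallower shape, so full inlining through a rank-n shape yields
-- about 7^n leaf calls.
inlinedCalls :: Int -> Integer
inlinedCalls n = 7 ^ n
-- e.g. inlinedCalls 9 == 40353607, comfortably enough to account for the
-- million-term bindings observed in comment:37.
```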

comment:42 Changed 4 years ago by bgamari

Simonpj, indeed, sharing bound sh ix bndy between the alternatives does help, but only if the binding is marked as NOINLINE; otherwise the compiler seems eager to inline it right back in.
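Spelled out, the shared-binding rewrite (with the NOINLINE that turned out to be necessary) would look something like the sketch below. The class and base instance are reproduced in miniature here, following Data.Array.Accelerate.Array.Representation rather than the real package, and "inner" is an illustrative name:

```haskell
-- Sketch of the sharing rewrite: bind 'bound sh ix bndy' once and reuse
-- it across all seven alternatives, with NOINLINE to stop GHC from
-- inlining the thunk straight back in.
data Boundary a = Clamp | Mirror | Wrap | Constant a

class Shape sh where
  bound :: sh -> sh -> Boundary a -> Either a sh

instance Shape () where
  bound () () _ = Right ()

instance Shape sh => Shape (sh, Int) where
  bound (sh, sz) (ix, i) bndy
    | i < 0     = case bndy of
                    Clamp      -> inner `addDim` 0
                    Mirror     -> inner `addDim` (-i)
                    Wrap       -> inner `addDim` (sz+i)
                    Constant e -> Left e
    | i >= sz   = case bndy of
                    Clamp      -> inner `addDim` (sz-1)
                    Mirror     -> inner `addDim` (sz-(i-sz+2))
                    Wrap       -> inner `addDim` (i-sz)
                    Constant e -> Left e
    | otherwise = inner `addDim` i
    where
      {-# NOINLINE inner #-}
      inner = bound sh ix bndy          -- shared recursive call
      Right ds `addDim` d = Right (ds, d)
      Left e   `addDim` _ = Left e
```

With the recursive call shared, inlining the instance one level deep introduces one new call rather than seven.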

comment:43 Changed 4 years ago by simonpj

Aha, yes you are right. If we have

let x = bound sh ix bndy
in case v of
   A -> case x of ...
   B -> case x of ...
   etc

then GHC will inline x because

  • Doing so eliminates a thunk
  • Doing so does no more work

Hmm. I really don't see a good general solution here.

As an ad-hoc solution, using NOINLINE will be fine. It's probably simpler to use it on the entire bound method (with a prominent Note to explain) rather than creating an auxiliary NOINLINE binding. That would be an upstream change in accelerate.

For posterity, here is the offending instance in Data.Array.Accelerate.Array.Representation:

instance Shape sh => Shape (sh, Int) where
  ...
  bound (sh, sz) (ix, i) bndy
    | i < 0     = case bndy of
                    Clamp      -> bound sh ix bndy `addDim` 0
                    Mirror     -> bound sh ix bndy `addDim` (-i)
                    Wrap       -> bound sh ix bndy `addDim` (sz+i)
                    Constant e -> Left e
    | i >= sz   = case bndy of
                    Clamp      -> bound sh ix bndy `addDim` (sz-1)
                    Mirror     -> bound sh ix bndy `addDim` (sz-(i-sz+2))
                    Wrap       -> bound sh ix bndy `addDim` (i-sz)
                    Constant e -> Left e
    | otherwise = bound sh ix bndy `addDim` i
    where
      Right ds `addDim` d = Right (ds, d)
      Left e   `addDim` _ = Left e

Note the seven recursive calls, plus the fact that we use instances at deeply-nested tuple types.

Simon

comment:44 Changed 4 years ago by bgamari

I've opened a pull request addressing this issue upstream:

https://github.com/AccelerateHS/accelerate/pull/276

comment:45 Changed 4 years ago by bgamari

Owner: set to bgamari

comment:46 Changed 4 years ago by bgamari

Resolution: wontfix
Status: new → closed

As Simon described above, there isn't much we can do to safely optimize this sort of program in general. Moreover, it seems that 7.11 already decides on its own not to inline the offending function. Closing as WONTFIX.

comment:47 Changed 4 years ago by Ben Gamari <ben@…>

In 89834d6d99da564aa14e63f2f801f50a615ce322/ghc:

Add -fcross-module-specialise flag

Summary:
As of 7.10.1 we specialize INLINEABLE identifiers defined in other
modules. This can expose issues (compiler bugs or otherwise) in some cases
(e.g. Trac #10491) and therefore we now provide a way for the user to disable
this optimization.

Test Plan: Successfully compile Splice.hs from Trac #10491.

Reviewers: simonpj, austin

Reviewed By: simonpj

Subscribers: simonpj, thomie, bgamari

Differential Revision: https://phabricator.haskell.org/D999

GHC Trac Issues: #10491

comment:48 Changed 4 years ago by Ben Gamari <ben@…>

In 302d937782ccb3068244e948d49daff3435e05c0/ghc:

Add -fcross-module-specialise flag

Summary:
As of 7.10.1 we specialize INLINEABLE identifiers defined in other
modules. This can expose issues (compiler bugs or otherwise) in some cases
(e.g. Trac #10491) and therefore we now provide a way for the user to disable
this optimization.

Test Plan: Successfully compile Splice.hs from Trac #10491.

Reviewers: simonpj, austin

Reviewed By: simonpj

Subscribers: simonpj, thomie, bgamari

Differential Revision: https://phabricator.haskell.org/D999

GHC Trac Issues: #10491