Opened 12 years ago

Closed 10 years ago

Last modified 3 years ago

#1969 closed bug (fixed)

enormous compile times

Reported by: duncan Owned by: igloo
Priority: normal Milestone: 6.12 branch
Component: Compiler Version: 6.8.1
Keywords: performance Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case: T1969
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:


Some modules cause ghc to take a very very long time (and a lot of memory) to compile, even without optimisations.

Here is an example of a module that takes almost forever to compile. The WASH/HTML/HTMLMonad98.hs module from WASH-2.12

It is a 185k, 5,800 line module consisting almost entirely of data, class and instance declarations.

It might be interesting to use this module as a test case to profile ghc's front end to see if there are any obvious inefficiencies or unecessary non-linear algorithms.

WASH/HTML/HTMLPrelude.hs is almost as bad. Between the two of them they push the overall compile time for WASH to several minutes with -O0 and nearly half an hour with -O1. GHC's memory use while compiling WASH also grows to over 300Mb with -O0 and over 600Mb with -O1 (on a 64bit box).

All in all, WASH is an excellent stress test for GHC.

Change History (26)

comment:1 Changed 12 years ago by igloo

Keywords: performance added
Milestone: 6.8.3
Priority: lownormal
Type: taskbug

Thanks for the report. We have a few similar bugs; I've just keyworded them all "performance".

comment:2 Changed 12 years ago by simonmar

Type: bugcompile-time performance bug

comment:3 Changed 12 years ago by igloo

This program:

main :: IO ()
main = writeFile "W.hs" $ unlines $ map unlines foo

foo :: [[String]]
foo = [ "module J where",
        "class C a where",
        "    c :: a -> String",
        "    d :: a -> String",
        "    d x = c x",
        "    e :: a -> String",
        "    e x = c x"
      ] :
      [ ["data " ++ d ++ " = " ++ d,
         "instance C " ++ d ++ " where",
         "    c " ++ d ++ " = \"" ++ d ++ "\""]
        | i <- [1..1000],
          let d = 'A' : show i

generates a W.hs with lots of instances like

data A24 = A24
instance C A24 where
    c A24 = "A24"

Compiling with

ghc -fforce-recomp -O -c W.hs +RTS -p

this is the interesting bit of the profile:

COST CENTRE              MODULE              no.    entries  %time %alloc
       specProgram       Specialise         8794        2006   0.0    0.0
        specBind         Specialise         8804        6006   0.0    0.0
         splitUDs        Specialise         8883        2002   3.1    4.0
          elemVarSet     VarSet             9174     1998000   0.1    0.2
          listToCallDetails Specialise      8985     2000002   1.0    2.3
           unionCalls    Specialise         9182     1998000  31.1   37.6
            tcCmpType    Type               9186    57864047   9.3    0.0
          intersectsVarSet VarSet           8885     4006001   0.2    0.0

The number of entries for the last 5 grows quadratically with the number of instances.

comment:4 Changed 12 years ago by simonpj

Very interesting, and entirely unexpected (to me). No time to look now, but good progress.


comment:5 Changed 11 years ago by simonpj

Owner: set to igloo
Type: compile-time performance bugmerge

Ian: you absolutely nailed the problem, thank you. I refactored a bit, and behold the specialiser pass runs really fast on that code now. Here's the patch, which might be worth merging to the branch:

Mon Apr 28 16:57:11 BST 2008
  * Fix Trac #1969: perfomance bug in the specialiser

You might want to check that it cures the problem on the original code?

I don't quite know how to add a test; maybe not worth the trouble.


comment:6 Changed 11 years ago by simonpj

PS: I've just noticed that the original report described a problem even without -O, so that part at least can't be cured by my patch, since the specialiser only runs with -O.

So, perhaps you can see if that is the case, and re-open if so?


comment:7 Changed 11 years ago by igloo

Type: mergecompile-time performance bug

I've merged the patch, but I think there's more work to be done here.

comment:8 Changed 11 years ago by igloo


comment:9 Changed 11 years ago by simonmar

Architecture: MultipleUnknown/Multiple

comment:10 Changed 11 years ago by simonmar

Operating System: MultipleUnknown/Multiple

comment:11 Changed 11 years ago by igloo


comment:12 Changed 11 years ago by simonpj

Ian diagnosed the problem. Basically we're compiling lots of things like:

    data A24 = A24
    instance C A24 where
        c A24 = "A24"


    class C a where
        c :: a -> String
        d :: a -> String
        d x = c x
        e :: a -> String
        e x = c x

In the ticket 1000 instances are generated, but when I tried that the compiler was using 1.5G of RAM before I killed it. 100 instances is plenty to show a problem:

 doPassM        SimplCore                           36599     6   0.0    0.0
  simplifyPgmIO SimplCore                           36605    12   0.0    0.0
   OccAnal      SimplCore                           37381    10   0.0    0.0
    occurAnalysePgm  OccurAnal                      37399    10   0.0    0.0
     occAnalBind     OccurAnal                      37400  3018   0.0    0.0
      occAnalRec     OccurAnal                      37680  2717   0.0    0.0
       reOrderRec    OccurAnal                      43292   502   0.0    0.0
        reOrderCycle OccurAnal                      43310   200   1.0    0.3
         stronglyConnCompFromEdgedVerticesR Digraph 43342   200   0.0    0.0
          graphFromEdgedVertices            Digraph 43343   200  32.3   42.0

It looks to me like the problem is:

There's a binder

      \ (@ a_aue) ($dC_azi :: J.C a_aue) (x_auh :: a_aue) ->
        J.c @ a_aue $dC_azi x_auh)

in the (Rec pairs) being passed to occAnalBind, which, via its idRuleVars, ties everything in a big knot. I assume that what's going on here is that this is the "normal" c, and there are specialisation rules for uses of c at the various A* types? And to make things worse, I guess J.$dmd is not allowed to be a loop breaker, as then the interaction of rules and inlining could lead to an infinite loop?

So with 100 class instances, we end up with a 500 node SCC being given to reOrderRec/reOrderCycle. This finds a loop breaker, which presumably removes at most one instance's worth of definitions from the SCC. Then it calls stronglyConnCompFromEdgedVerticesR to recompute the scc, and we go round again. reOrderCycle gets called 200 times, building a new SCC each time round, with size decreasing roughly linearly from 500 down to 3.

I'm not sure why the space usage is so high; I haven't really looked at that. I guess we're probably holding on to the old SCCs or something for some reason. Biographical profiling says that the space is all VOID, but I'm not convinced that's trustworthy.

comment:13 Changed 10 years ago by simonpj

Owner: changed from igloo to simonpj

I'm fixing this.

comment:14 Changed 10 years ago by simonpj

Owner: changed from simonpj to igloo
Type: compile-time performance bugmerge

OK the patch is this

Mon Mar 23 10:38:26 GMT 2009
  * Avoid quadratic complexity in occurrence analysis (fix Trac #1969)
    The occurrence analyser could go out to lunch in bad cases, because
    of its clever loop-breaking algorithm. This patch makes it bale out
    in bad cases.  Somewhat ad-hoc: a nicer solution would be welcome.
    See Note [Complexity of loop breaking] for the details.
    M ./compiler/simplCore/OccurAnal.lhs -22 +71

The fix is still not perfect, because it simply bales out in the tricky case. My feeling is that there is probably a good algorithm somewhere for "find the minimal set of nodes whose removal will make the graph acyclic", but I don't know of it. Furthermore, the problem is made tricker by the presence of RULES etc; see extensive comments in OccurAnal.

Anyway, baling out certainly fixes the bad complexity.

I think this could merge into 6.10.

Ian: what about a test-suite example? Maybe just a big enough instance to trip the timeout?


comment:15 Changed 10 years ago by augustss

It's the Minimum Feedback Vertex Set problem; I think it's NP-hard. But there are good approximations that are cheap.

comment:16 Changed 10 years ago by simonpj

Do you know any such approximations? A quick google search shows lots of special cases, but nothing that jumped out at me.

Relevant points:

  • The vertices each have a "score", and I want to minimise the total score of zapped vertices, not just their number.
  • It's a directed graph.

I suppose it's possible that dualising the problem (so the vertices become edges and vice versa) might lead to other relevant material.


comment:17 Changed 10 years ago by igloo

Test Case: T1969

I've added a compiler space usage test, T1969.

comment:18 in reply to:  17 Changed 10 years ago by chak

Replying to igloo:

I've added a compiler space usage test, T1969.

Did you test it on OS X? I get

=====> T1969(normal)
cd ./space_leaks && '/Users/chak/Code/ghc-test/ghc/stage2-inplace/ghc' -fforce-r
ecomp -dcore-lint -dcmm-lint  -dno-debug-output -c T1969.hs   +RTS -tT1969.comp.
stats --machine-readable -RTS  >T1969.comp.stderr 2>&1
peak_megabytes_allocated 19 is more than maximum allowed 15
*** unexpected failure for T1969(normal)

in a validate run with the latest patches.

comment:19 Changed 10 years ago by igloo

Yes, it consistently gave me 15 on OS X. I'll relax it to allow 19 too. Do let me know if you see any more of these failures; I'm not sure how loose the bounds should be.

comment:20 Changed 10 years ago by igloo


comment:21 Changed 10 years ago by igloo

Type: mergecompile-time performance bug

comment:22 Changed 10 years ago by simonmar

difficulty: Difficult (1 week)Difficult (2-5 days)

comment:23 Changed 10 years ago by simonmar

Type of failure: Compile-time performance bug

comment:24 Changed 10 years ago by igloo

Milestone: branch

comment:25 Changed 10 years ago by igloo

Resolution: fixed
Status: newclosed

We think the main performance issue is fixed now.

comment:26 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In cc5ca21/ghc:

Improved stats for Trac #1969

With my latest commits

  76a5477 Move zonking out of tcFamTyPats
  b255ae7 Orient improvement constraints better

perf has improved slightly for T1969:

allocs:    733M -> 26M
residency: 43M  -> 41M

I don't know exactly why, but hey, it's good
Note: See TracTickets for help on using tickets.