Opened 9 years ago

Closed 8 years ago

Last modified 7 years ago

#5085 closed bug (wontfix)

internal error: evacuate: strange closure type

Reported by: mitar
Owned by: simonmar
Priority: highest
Milestone: 7.4.1
Component: Runtime System
Version: 7.1
Keywords:
Cc: mmitar@…, mk.fraggod@…, pho@…
Operating System: Linux
Architecture: x86_64 (amd64)
Type of failure: Runtime crash
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

While running my test and benchmarking program for the Etage-Graph package, I sometimes (in around 1% of runs) get the following error (the closure type number varies):

test: internal error: evacuate: strange closure type 4869608
    (GHC version 7.1.20101124 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
Aborted

As it does not happen very often, it is hard to debug. I am running the program as:

./test -s 400 +RTS -N4

where 400 is the number of nodes in the graph. It may also happen with a smaller number of nodes.

Change History (18)

comment:1 Changed 9 years ago by mitar

OK. I also got one segmentation fault.

comment:2 Changed 9 years ago by simonmar

Milestone: 7.2.1
Owner: set to simonmar
Priority: normal → highest
Type of failure: None/Unknown → Runtime crash

Thanks for the report.

comment:3 Changed 9 years ago by simonmar

How long should the test take to run? Seems to be taking a very long time here.

comment:4 Changed 9 years ago by simonmar

Ok, I left the -s 400 command running for 1.5 hours or so, and it didn't complete or crash (I had -debug turned on which might slow things down). I've also been trying -s 100 repeatedly; each one takes 20s or so, but no crashes so far.

Any more hints as to how I might reproduce this?

comment:5 Changed 9 years ago by mitar

You have to let the test finish. This is one run. It takes some time, yes. ;-) And in around 1 of 10 (program/test) runs at 400 nodes I get this error.

comment:6 Changed 9 years ago by mitar

I have just uploaded a slightly improved version, which still has this problem.

comment:7 Changed 9 years ago by mk.fg

Cc: mk.fraggod@… added

I'm having the same issue with git-annex and ghc 7.0.2, and can reproduce the bug reliably using git-annex. Several git-bisect runs showed that fairly trivial (and seemingly unrelated) changes introduce the issue, so the maintainer advised me to report the bug here.

Detailed info is available at the git-annex tracker. I'm unsure whether it's the same issue though, since the ghc version seems to be different here and it's i686, not x86_64. Should I open a separate ticket?

I'm not really familiar with the Haskell language, but if there's any more helpful info or test data I can provide on the issue, I'd be happy to do so.

comment:8 in reply to:  7 ; Changed 9 years ago by simonmar

Replying to mk.fg:

I'm having the same issue with git-annex... Should I open a separate ticket?

Yes, please make a separate ticket. I looked at the link you gave, but couldn't immediately see how to reproduce the problem. Can you give me enough information to be able to reproduce the problem here? I'll need the exact version of git-annex, how to build it, the input data (repo?), and the commands that provoke the error.

comment:9 Changed 9 years ago by PHO

Cc: pho@… added

comment:10 in reply to:  8 Changed 9 years ago by mk.fg

Replying to simonmar:

Replying to mk.fg:

I'm having the same issue with git-annex... Should I open a separate ticket?

Yes, please make a separate ticket. I looked at the link you gave, but couldn't immediately see how to reproduce the problem. Can you give me enough information to be able to reproduce the problem here? I'll need the exact version of git-annex, how to build it, the input data (repo?), and the commands that provoke the error.

I tried to reproduce this on a separate, clean x86_64 machine with the same Exherbo Linux, and on an i386 Debian Linux VM, without any luck.

Since then I've updated ghc (to 7.0.3), git-annex and the configuration of the repository in question, and the issue seems to be gone. Reverting git-annex doesn't bring it back either. I guess I'll try to roll back the ghc update, but failing that I probably won't be able to trigger it again, alas.

comment:11 Changed 8 years ago by igloo

Milestone: 7.2.1 → 7.4.1

comment:12 Changed 8 years ago by simonmar

I managed to get a segfault with this example and an up-to-date GHC built yesterday, using the suggested options (-s 400 +RTS -N4). I've rebuilt the binary with -debug and I'm trying to provoke a segfault again, but two runs so far have been successful:

Generating a random graph of size 400.
Graph contains 400 nodes and 59999 edges.
Dijkstra search time for shortest paths: 993.855562s
Etage search time for shortest paths: 0.172937s (5.0s timeout)
Etage graph (external structure) growing time: 7.674317s
Found 0.75 % shortest paths.
etage-graph-test: DissolvingException "()"
[1]    31310 exit 1    

at least, I assume that's a successful run.

I'm not hopeful about finding this bug, because the program takes so long to run and ties up 4 cores. I'll keep trying though.

comment:13 Changed 8 years ago by simonmar

I should have mentioned: if you know of a way to trigger the crash more often or more quickly, that would help a lot. Do certain heap settings make it more likely to fail?

comment:14 Changed 8 years ago by mitar

Hm, this does not look like a successful run. It seems your computer is slower than mine (or the run is slower because of the debugging), so the timeout is too low and not all paths have been found. ;-) It should find 100 % of shortest paths. Maybe a little explanation of the program:

  • it generates a random graph of some size
  • it runs Dijkstra among all graph nodes
  • it generates a data-flow structure for search for shortest paths among all graph nodes
    • this structure is a structure of spark-based IO computations and connections between them
  • it runs the search for all shortest paths; this is a message-passing algorithm among all nodes, with a lot of sparks and inter-spark communication (this is where the segfault occurs, because it uses Haskell sparks really extensively)
    • stopping condition is that for some time (5 s timeout by default) no path has been improved, assuming all shortest paths have been found
  • it compares found shortest paths with known shortest paths (found with Dijkstra)
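The steps above can be sketched in a few lines. This is only a minimal, sequential illustration and not the Etage-Graph code: the names `Graph`, `dijkstra`, `messagePassing`, `agreement` and `example` are made up for this sketch, the "messages" are kept in a plain list instead of being exchanged between concurrent computations, and the tiny graph is hand-written rather than random.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Node = Int
-- Assumption: adjacency lists with non-negative weights; every node has a key.
type Graph = M.Map Node [(Node, Int)]

neighbours :: Graph -> Node -> [(Node, Int)]
neighbours g u = M.findWithDefault [] u g

-- Reference answer: Dijkstra from one source, with a Set used as priority queue.
dijkstra :: Graph -> Node -> M.Map Node Int
dijkstra g src = go (S.singleton (0, src)) (M.singleton src 0)
  where
    go frontier dist = case S.minView frontier of
      Nothing -> dist
      Just ((d, u), rest)
        | d > M.findWithDefault maxBound u dist -> go rest dist -- stale entry
        | otherwise -> uncurry go (foldl (relax d) (rest, dist) (neighbours g u))
    relax d (frontier, dist) (v, w)
      | d + w < M.findWithDefault maxBound v dist =
          (S.insert (d + w, v) frontier, M.insert v (d + w) dist)
      | otherwise = (frontier, dist)

-- Message passing: a message (v, d) says "a path of length d reaches v".
-- A node that improves its best distance notifies all of its neighbours.
-- The search stops when no messages are left.
messagePassing :: Graph -> Node -> M.Map Node Int
messagePassing g src = go [(src, 0)] M.empty
  where
    go [] best = best
    go ((v, d) : queue) best
      | d < M.findWithDefault maxBound v best =
          go (queue ++ [(n, d + w) | (n, w) <- neighbours g v]) (M.insert v d best)
      | otherwise = go queue best

-- Percentage of (source, target) pairs on which both searches agree.
agreement :: Graph -> Double
agreement g = 100 * fromIntegral (length (filter id matches)) / fromIntegral (length matches)
  where
    nodes = M.keys g
    matches =
      [ M.lookup t ref == M.lookup t found
      | s <- nodes
      , let ref = dijkstra g s
      , let found = messagePassing g s
      , t <- nodes
      ]

example :: Graph
example = M.fromList [(1, [(2, 4), (3, 1)]), (2, [(4, 1)]), (3, [(2, 2), (4, 5)]), (4, [])]

main :: IO ()
main = do
  print (M.toList (messagePassing example 1))
  putStrLn ("Found " ++ show (agreement example) ++ " % shortest paths.")
```

In this sequential form the message queue always drains completely, so the two searches agree on every pair. The real program instead stops on a timeout, which is why it can report less than 100 %.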

That only 0.75 % of paths were found means that the problematic part of the code in fact did not run long enough. This is probably why the run succeeded.

So when running in debug mode, the timeout should be increased. Please increase minCollectTimeout and initialCollectTimeout in src/Test.hs.

comment:15 Changed 8 years ago by mitar

I am sorry, but I do not know how to trigger the crash more quickly. The program makes really heavy use of Haskell sparks, and the problem seems to be rare, so it takes time to hit it. I have not tested different heap settings.

comment:16 Changed 8 years ago by simonmar

Resolution: wontfix
Status: new → closed

I'm giving up on this one, sadly. Let's hope the bug surfaces in another setting that is easier to reproduce.

comment:17 Changed 7 years ago by mitar

That's interesting. I tried to reproduce this on a virtual machine running Linux and I tried 7.0.4, 7.2.2 and 7.4.1 and I cannot reproduce it anymore. It is true that I kept the same Haskell platform (2012.2) for all tests. That is probably a good thing.

But what is even more interesting is that my algorithm does not work correctly on 7.2.2 and 7.4.1 anymore! It is a message-passing shortest-path searching algorithm which incrementally updates the state of each node as it discovers better and better paths. When it finds a better node, it informs all the neighbors about that, which might also improve their lists of best paths. And this is repeated. Every node is a Haskell spark, and so is every edge. So I really create a lot of sparks. At least around 60000 of them for -s 400. :-)

On 7.0.4 the algorithm works: when messages stop being passed around, all shortest paths have been found. But on 7.2.2 and 7.4.1 this is not so. Again and again only around 91% of paths are found; then messages stop and the program finishes because of this, without all paths having been found.
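For readers who want a feel for this kind of termination rule, here is a rough concurrent sketch. It is an assumption-laden stand-in, not the Etage API: it uses one plain `forkIO` thread and one `Chan` per node (edges are not separate threads here), and the names `nodeLoop`, `shortestFrom` and `example` are hypothetical. The search is declared finished once a whole quiet period passes without any node improving its distance.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forM_, forever, when)
import Data.IORef (IORef, atomicModifyIORef', atomicWriteIORef, newIORef, readIORef)
import qualified Data.Map.Strict as M

type Node = Int
-- Assumption: every node, including pure targets, appears as a key.
type Graph = M.Map Node [(Node, Int)]

-- One thread per node: wait for a candidate distance, keep it if it is an
-- improvement, count the improvement, and forward the news to the neighbours.
nodeLoop :: M.Map Node (Chan Int) -> IORef Int -> IORef Int -> [(Node, Int)] -> Chan Int -> IO ()
nodeLoop inboxes improvements best edges inbox = forever $ do
  d <- readChan inbox
  old <- readIORef best
  when (d < old) $ do
    atomicWriteIORef best d
    atomicModifyIORef' improvements (\n -> (n + 1, ()))
    forM_ edges $ \(v, w) -> writeChan (inboxes M.! v) (d + w)

-- Run the search from one source. It stops when a full quiet period (in
-- microseconds) passes without any improvement, which is a heuristic only.
shortestFrom :: Int -> Graph -> Node -> IO (M.Map Node Int)
shortestFrom quietMicros g src = do
  improvements <- newIORef 0
  inboxes <- traverse (const newChan) g
  bests <- traverse (const (newIORef maxBound)) g
  forM_ (M.toList g) $ \(u, edges) ->
    forkIO (nodeLoop inboxes improvements (bests M.! u) edges (inboxes M.! u))
  writeChan (inboxes M.! src) 0
  let waitQuiet seen = do
        threadDelay quietMicros
        now <- readIORef improvements
        when (now /= seen) (waitQuiet now)
  waitQuiet (-1)
  -- Nodes that never received a message keep maxBound and are dropped.
  M.filter (/= maxBound) <$> traverse readIORef bests

example :: Graph
example = M.fromList [(1, [(2, 4), (3, 1)]), (2, [(4, 1)]), (3, [(2, 2), (4, 5)]), (4, [])]

main :: IO ()
main = do
  found <- shortestFrom 100000 example 1 -- 0.1 s quiet period
  print (M.toList found)
```

Because the stopping rule only watches for silence, a quiet period that is too short for the machine (or for a slower debug build) ends the search early with an incomplete result. Any change in scheduling or message delivery between runtime versions could in principle show up the same way, though nothing in this sketch establishes that this is what happened here.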

I am not working on this project anymore, so I also don't have the time or motivation to really debug it. I just wanted to publish my findings. So something is different between 7.0 and 7.2+ versions. Maybe there is a bug in my code which was not visible before. Maybe some API semantics changed just slightly, so that the code still compiles but its behavior has changed. I don't know. And it is too complex and stochastic to debug easily. But of course, this is also why it is a good example of a complex program which really pushes GHC and its runtime to their limits.

comment:18 Changed 7 years ago by simonmar

difficulty: Unknown

Thanks for the update. It sounds like perhaps there's an underlying bug that is manifesting in a different way now.
