Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#6167 closed bug (worksforme)

Compile stalls with pause returning ERESTARTNOHAND

Reported by: erikd
Owned by:
Priority: normal
Milestone: 7.6.1
Component: Compiler
Version: 7.4.1
Keywords:
Cc: pho@…
Operating System: Linux
Architecture: Unknown/Multiple
Type of failure: Building GHC failed
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

I'm using ghc 7.4.1 from the debian package to compile GHC from git HEAD on linux powerpc.

The build all seems to go fine until it reaches this:

  HC [stage 1] libraries/containers/dist-install/build/Data/Sequence.o

At which stage the compile stalls. By stalls, I mean the compiler seems to make no further progress and consumes less than 1% CPU and less than 1% memory.

If I kill the compile with Ctrl-C and run 'make' again, it stalls in the same place. If I do a 'make clean' and start again, it stalls in the same place once more.

The command that is being run at the stall is:

/home/erikd/Git/ghc-upstream-git/inplace/lib/ghc-stage1 \
    -B/home/erikd/Git/ghc-upstream-git/inplace/lib -H64m -O0 -fasm -package-name \
    containers-0.5.0.0 -hide-all-packages -i -ilibraries/containers/. \
    -ilibraries/containers/dist-install/build -ilibraries/containers/dist-install/build/autogen \
    -Ilibraries/containers/dist-install/build -Ilibraries/containers/dist-install/build/autogen \
    -Ilibraries/containers/include -optP-include \
    -optPlibraries/containers/dist-install/build/autogen/cabal_macros.h -package array-0.3.0.3 \
    -package base-4.5.0.0 -package deepseq-1.2.0.1 -package ghc-prim-0.2.0.0 -O2 -Wall -XHaskell98 \
    -O0 -dcore-lint -no-user-package-db -rtsopts -odir libraries/containers/dist-install/build \
    -hidir libraries/containers/dist-install/build -stubdir libraries/containers/dist-install/build \
    -hisuf hi -osuf o -hcsuf hc -c libraries/containers/./Data/Sequence.hs -o \
     libraries/containers/dist-install/build/Data/Sequence.o

If I run that under strace I find that at the stall it's doing the following:

rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
pause()                                 = ? ERESTARTNOHAND (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
pause()                                 = ? ERESTARTNOHAND (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
pause()                                 = ? ERESTARTNOHAND (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
pause()                                 = ? ERESTARTNOHAND (To be restarted)
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---

A bit of googling tells me that ERESTARTNOHAND is supposedly a kernel level errno that is not supposed to escape into userland. See

https://lkml.org/lkml/2011/12/23/117
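To make the strace pattern above easier to read, here is a minimal sketch of a process that sleeps in pause() and is woken by a periodic timer signal. It is an illustration only, not the GHC RTS code: the function name count_pause_wakeups is hypothetical, and it substitutes SIGALRM with a real-time interval timer for the SIGVTALRM seen in the trace, so that the timer keeps firing while the process sleeps. It must run in the main thread on a Unix system.

```python
import signal


def count_pause_wakeups(n=3, interval=0.01):
    """Sleep in pause() n times, woken each time by a periodic SIGALRM.

    Returns (wakeups, ticks): how often pause() returned and how often
    the handler ran. SIGALRM/ITIMER_REAL is an assumption standing in
    for the SIGVTALRM timer shown in the strace output.
    """
    ticks = []

    def on_alarm(signum, frame):
        # Record each delivered timer signal.
        ticks.append(signum)

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    wakeups = 0
    try:
        while wakeups < n:
            signal.pause()  # returns only after a handled signal arrives
            wakeups += 1
    finally:
        # Disarm the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
    return wakeups, len(ticks)
```

Running such a loop under strace should show a similar sequence of pause(), signal delivery and sigreturn(). That would suggest the repeated pause() lines indicate a process with nothing to do rather than one making progress.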

Will continue the investigation.

Change History (12)

comment:1 Changed 7 years ago by erikd

I raised a bug against the Debian kernel:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=677690

comment:2 Changed 7 years ago by erikd

From the debian bug (now closed):

The task called pause() with an empty signal set, which means: wait forever. Perhaps it meant to enable SIGVTALRM, but it hasn't.

comment:3 Changed 7 years ago by erikd

Running the compile command that hangs under gdb, hitting control-C and getting a backtrace shows the following:

Program received signal SIGINT, Interrupt.
0x0fd2d3e0 in pause () from /lib/powerpc-linux-gnu/libc.so.6
(gdb) bt
#0  0x0fd2d3e0 in pause () from /lib/powerpc-linux-gnu/libc.so.6
#1  0x114dd0f4 in awaitUserSignals ()
#2  0x114baac8 in scheduleDetectDeadlock ()
#3  0x114b9f60 in schedule ()
#4  0x114bc1c4 in scheduleWaitThread ()
#5  0x114fdb24 in rts_evalLazyIO ()
#6  0x114b32bc in real_main ()
#7  0x114b33f8 in hs_main ()
#8  0x10015a68 in main ()

comment:4 Changed 7 years ago by erikd

I'm no longer as convinced that this is a Linux kernel bug as I was in my initial bug report.

I've tried a number of things, like commenting out all the code in awaitUserSignals () so that it's an empty function. When I run it under gdb, I still see the same stack trace, which doesn't make sense, because the call to the pause function is commented out!!!

I need to find time to get to the bottom of this.

comment:5 Changed 7 years ago by simonmar

Architecture: Unknown/Multiple → powerpc
difficulty: Unknown
Milestone: 7.6.1

It sounds like GHC itself has hit an infinite loop, which will result in a deadlock and the behaviour you're seeing. It could be a bug that only happens on powerpc. Is the GHC build on Debian Linux/PPC registerised or unregisterised? That is, is it using the native backend or compiling via C?

Adding -v to the command that hangs might help.

comment:6 Changed 7 years ago by erikd

Architecture: powerpc → Unknown/Multiple

I am a member of the Debian Haskell Group and we have not seen this problem with any other arch so it does look like a PowerPC specific issue.

Debian's GHC is built without explicitly turning off the registerised build. I therefore assume that it is registerised and using the native backend. When I compile GHC from git, I am setting -fasm (BuildFlavour = devel2, "stage = 2" still commented out).

If I add -v to the command that hangs I get:

Glasgow Haskell Compiler, Version 7.5.20120630, stage 1 booted by GHC version 7.4.1
Using binary package database: /home/erikd/Git/ghc-upstream-git/inplace/lib/package.conf.d/package.cache
wired-in package ghc-prim mapped to ghc-prim-0.2.0.0-inplace
wired-in package integer-gmp mapped to integer-gmp-0.3.0.0-inplace
wired-in package base mapped to base-4.6.0.0-inplace
wired-in package rts mapped to builtin_rts
wired-in package template-haskell mapped to template-haskell-2.6.0.0-inplace
wired-in package dph-seq not found.
wired-in package dph-par not found.
Hsc static flags: -static
Created temporary directory: /tmp/ghc30901_0
*** C pre-processor:
'/usr/bin/gcc' '-E' '-undef' '-traditional' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-I' 'libraries/containers/dist-install/build' '-I' 'libraries/containers/dist-install/build' '-I' 'libraries/containers/dist-install/build/autogen' '-I' 'libraries/containers/include' '-I' '/home/erikd/Git/ghc-upstream-git/libraries/array/include' '-I' '/home/erikd/Git/ghc-upstream-git/libraries/base/include' '-I' '/home/erikd/Git/ghc-upstream-git/rts/dist/build' '-I' '/home/erikd/Git/ghc-upstream-git/includes' '-I' '/home/erikd/Git/ghc-upstream-git/includes/dist-ghcconstants/header' '-I' '/home/erikd/Git/ghc-upstream-git/includes/dist-derivedconstants/header' '-D__GLASGOW_HASKELL__=705' '-Dlinux_BUILD_OS=1' '-Dpowerpc_BUILD_ARCH=1' '-Dlinux_HOST_OS=1' '-Dpowerpc_HOST_ARCH=1' '-include' 'libraries/containers/dist-install/build/autogen/cabal_macros.h' '-x' 'c' 'libraries/containers/Data/Sequence.hs' '-o' '/tmp/ghc30901_0/ghc30901_0.hscpp'
*** Checking old interface for containers-0.5.0.0:Data.Sequence:
*** Parser:
*** Renamer/typechecker:
<hang>
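When comparing several verbose logs like the one above, it can help to pull out the last phase marker automatically. The following is a small sketch that assumes phase markers are the lines starting with "*** ", as in this log; the helper name last_phase is hypothetical.

```python
def last_phase(verbose_output):
    """Return the last '*** <phase>:' marker in `ghc -v` output, or None.

    Assumes phase markers are lines beginning with '*** ', as in the
    log quoted in this ticket.
    """
    phase = None
    for line in verbose_output.splitlines():
        if line.startswith("*** "):
            # Drop the '*** ' prefix and the trailing colon.
            phase = line[4:].rstrip().rstrip(":")
    return phase


if __name__ == "__main__":
    sample = (
        "*** C pre-processor:\n"
        "*** Parser:\n"
        "*** Renamer/typechecker:\n"
    )
    print(last_phase(sample))
```

For the log in this comment the helper would report the renamer/typechecker phase as the last one started before the hang.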

comment:7 Changed 7 years ago by simonmar

We lack a way to reproduce the problem here, so it is hard for us to track it down. There are basically three ways to track this down:

  • Compile GHC with profiling and -fprof-auto, and run it with +RTS -xc to get a stack trace. You will probably need to hit Control-C to interrupt the hanging GHC; the RTS should then print the stack trace.
  • Insert lots of traces in the compiler to try to find out where it has hung.
  • Use gdb and poke around in the stack of the blocked thread to try to find out where it is stuck. See Debugging/CompiledCode.
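For the first option, the interrupt step can be scripted so the hang does not need to be watched by hand. The sketch below runs a command, sends it SIGINT if it is still running after a timeout, and collects whatever it wrote to stderr. The function name run_with_interrupt and the timeout values are assumptions for illustration. Whether a stack trace appears depends on how the compiler was built and on the RTS flags passed in the command.

```python
import signal
import subprocess


def run_with_interrupt(cmd, hang_after=60.0, grace=10.0):
    """Run cmd; if it is still running after hang_after seconds, send SIGINT.

    Returns (hung, stderr_text). hung is True when the interrupt was needed.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    try:
        _, err = proc.communicate(timeout=hang_after)
        return False, err
    except subprocess.TimeoutExpired:
        # Roughly what Control-C does, but for this one process only.
        proc.send_signal(signal.SIGINT)
        try:
            _, err = proc.communicate(timeout=grace)
        except subprocess.TimeoutExpired:
            # The process ignored SIGINT; give up and kill it.
            proc.kill()
            _, err = proc.communicate()
        return True, err
```

The command list passed in would be the hanging compile command from the description, with +RTS -xc appended. That only makes sense if the stage 1 compiler was built with profiling as described above.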

comment:8 Changed 7 years ago by erikd

I switched from the "devel2" way of building GHC (defined in mk/build.mk.sample) to the "prof" way and suddenly everything built without any hang.

Experimenting a little further, it turns out that the important difference was that for "devel2" the GhcStage2HcOpts variable does not include -fasm, so the second stage was built unregisterised.

This raises two questions:

  • Should the unregisterised build work correctly on PowerPC?
  • Should GhcStage2HcOpts use -fasm?

comment:9 Changed 7 years ago by simonmar

-fasm is a no-op; to get an unregisterised build you have to add GhcUnregisterised=YES to mk/build.mk, so I think your build is still registerised.

What probably made the difference is adding -DDEBUG. I can test that here.
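A quick way to check what a given mk/build.mk actually sets is to parse its plain assignments. This is a rough sketch, not a make interpreter: it only understands simple `NAME = value` lines, ignores `+=`, `:=`, conditionals and includes, and assumes that a missing GhcUnregisterised means a registerised build. The helper names read_build_mk and is_unregisterised are hypothetical.

```python
import re

# Matches plain `NAME = value` assignments only (not `+=` or `:=`).
ASSIGN = re.compile(r"^\s*([A-Za-z_]\w*)\s*=\s*(.*?)\s*$")


def read_build_mk(text):
    """Parse simple assignments; later ones override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0]  # strip comments
        m = ASSIGN.match(line)
        if m:
            settings[m.group(1)] = m.group(2)
    return settings


def is_unregisterised(settings):
    # Assumption: an absent variable means a registerised build.
    return settings.get("GhcUnregisterised", "NO").strip().upper() == "YES"


if __name__ == "__main__":
    sample = "GhcStage2HcOpts = -Rghc-timing -O0 -fasm -DDEBUG\n"
    settings = read_build_mk(sample)
    print(settings["GhcStage2HcOpts"], is_unregisterised(settings))
```

With a file that only sets flags such as -fasm in GhcStage2HcOpts, the check reports a registerised build, which matches the point made in this comment.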

comment:10 Changed 7 years ago by erikd

I'm using the following mk/build.mk:

SRC_HC_OPTS        = -H64m -O -fasm
GhcLibHcOpts       = -O -dcore-lint
GhcStage1HcOpts    = -Rghc-timing -O -fasm
GhcStage2HcOpts    = -Rghc-timing -O0 -fasm -DDEBUG
SplitObjs          = NO
HADDOCK_DOCS       = NO
BUILD_DOCBOOK_HTML = NO
BUILD_DOCBOOK_PS   = NO
BUILD_DOCBOOK_PDF  = NO
LAX_DEPENDENCIES   = NO

NoFibWays   =
STRIP_CMD   = :
GhcDebugged = Yes

and now something has changed. Regardless of whether -fasm is included in GhcStage2HcOpts the build runs to completion.

While I'm not seeing this anymore, I'm not sure if the problem is fixed.

comment:11 Changed 7 years ago by simonmar

Resolution: worksforme
Status: new → closed

I tried to reproduce it by validating with DEBUG turned on, but didn't see the problem either. Since the symptom has disappeared, I'll close the ticket and we can re-open if it occurs again.

comment:12 Changed 7 years ago by PHO

Cc: pho@… added