Opened 2 years ago

Closed 23 months ago

#14669 closed bug (fixed)

Windows binaries sometimes throw a stack overflow.

Reported by: sergv
Owned by:
Priority: highest
Milestone: 8.4.1
Component: Runtime System
Version: 8.2.1
Keywords:
Cc: Phyx-
Operating System: Windows
Architecture: Unknown/Multiple
Type of failure: Runtime crash
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s): Phab:D4343
Wiki Page:

Description

It seems that 32-bit Windows GHC from version 8.2.1 onwards (I can reproduce it with 8.2.1, 8.2.2 and 8.4.1-alpha, but not with 8.0.2) has some issues in the Runtime System related to exception handling. The symptom is that an executable exiting via an unhandled exception ends with a segmentation fault and a non-zero exit code. It would be okay-ish if not for the fact that this bug causes tools like ghc-pkg or hsc2hs to exit with a non-zero exit code when asked for their --version. In turn, this breaks cabal-the-executable, which interprets the non-zero exit code of the --version call as a failure and refuses to configure further.

Why do those executables segfault when invoked with --version? It's because they call exitSuccess after printing a version, which throws an ExitSuccess exception. Thus the executables exit with an unhandled exception, and this bit seems to be faulty. Please see the minimal example below:

$ cat HW.hs
import System.Exit

main :: IO ()
main = do
  putStrLn "Situation normal"
  exitWith ExitSuccess
$ ghc HW.hs
[1 of 1] Compiling Main             ( HW.hs, HW.o )
Linking HW.exe ...
$ ./HW.exe
Situation normal
Segmentation fault
$ echo $?
139

If the exception is caught then everything's ok. But then it's hard to signal a non-zero exit code:

$ cat HWCatch.hs
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Exception
import System.Exit

main :: IO ()
main = do
  (res :: Either SomeException ()) <- try $ do
    putStrLn "Situation normal"
    exitWith ExitSuccess
  print res
$ ghc HWCatch.hs
[1 of 1] Compiling Main             ( HWCatch.hs, HWCatch.o )
Linking HWCatch.exe ...
$ ./HWCatch.exe
Situation normal
Left ExitSuccess
$ echo $?
0

System info:

$ uname -a
MINGW64_NT-6.1 box 2.9.0(0.318/5/3) 2017-09-13 23:16 x86_64 Msys
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.2.2
$ ghc --info
 [("Project name","The Glorious Glasgow Haskell Compilation System")
 ,("GCC extra via C opts"," -fwrapv -fno-builtin")
 ,("C compiler command","$topdir/../mingw/bin/gcc.exe")
 ,("C compiler flags"," -U__i686 -march=i686 -fno-stack-protector")
 ,("C compiler link flags"," ")
 ,("C compiler supports -no-pie","YES")
 ,("Haskell CPP command","$topdir/../mingw/bin/gcc.exe")
 ,("Haskell CPP flags","-E -undef -traditional")
 ,("ld command","$topdir/../mingw/bin/ld.exe")
 ,("ld flags","")
 ,("ld supports compact unwind","YES")
 ,("ld supports build-id","YES")
 ,("ld supports filelist","NO")
 ,("ld is GNU ld","YES")
 ,("ar command","$topdir/../mingw/bin/ar.exe")
 ,("ar flags","q")
 ,("ar supports at file","YES")
 ,("touch command","$topdir/bin/touchy.exe")
 ,("dllwrap command","$topdir/../mingw/bin/dllwrap.exe")
 ,("windres command","$topdir/../mingw/bin/windres.exe")
 ,("libtool command","")
 ,("perl command","$topdir/../perl/perl.exe")
 ,("cross compiling","NO")
 ,("target os","OSMinGW32")
 ,("target arch","ArchX86")
 ,("target word size","4")
 ,("target has GNU nonexec stack","False")
 ,("target has .ident directive","True")
 ,("target has subsections via symbols","False")
 ,("target has RTS linker","YES")
 ,("Unregisterised","NO")
 ,("LLVM llc command","llc")
 ,("LLVM opt command","opt")
 ,("Project version","8.2.2")
 ,("Project Git commit id","0156a3d815b784510a980621fdcb9c5b23826f1e")
 ,("Booter version","8.2.1")
 ,("Stage","2")
 ,("Build platform","i386-unknown-mingw32")
 ,("Host platform","i386-unknown-mingw32")
 ,("Target platform","i386-unknown-mingw32")
 ,("Have interpreter","YES")
 ,("Object splitting supported","YES")
 ,("Have native code generator","YES")
 ,("Support SMP","YES")
 ,("Tables next to code","YES")
 ,("RTS ways","l debug thr thr_debug thr_l thr_p ")
 ,("RTS expects libdw","NO")
 ,("Support dynamic-too","NO")
 ,("Support parallel --make","YES")
 ,("Support reexported-modules","YES")
 ,("Support thinning and renaming package flags","YES")
 ,("Support Backpack","YES")
 ,("Requires unified installed package IDs","YES")
 ,("Uses package keys","YES")
 ,("Uses unit IDs","YES")
 ,("Dynamic by default","NO")
 ,("GHC Dynamic","NO")
 ,("GHC Profiled","NO")
 ,("Leading underscore","YES")
 ,("Debug on","False")
 ,("LibDir","C:\\home\\ghc\\ghc-8.2.2-x32\\lib")
 ,("Global Package DB","C:\\home\\ghc\\ghc-8.2.2-x32\\lib\\package.conf.d")
 ]
$ ghc-pkg --version # This is a sign of the problem
GHC package manager version 8.2.2
Segmentation fault
$ echo $?
139

Attachments (2)

sources.tar.xzaa (205.9 KB) - added by sergv 23 months ago.
First part of preprocessed output, and -O0/-O1 assembly
sources.tar.xzab (205.9 KB) - added by sergv 23 months ago.
Second and final part of preprocessed output, and -O0/-O1 assembly


Change History (34)

comment:1 Changed 2 years ago by Phyx-

Status: new → infoneeded

I can't reproduce this

Tamar@Rage ~/ghc2> /r/x86/ghc-8.2.2/bin/ghc-pkg.exe --version; echo $status
GHC package manager version 8.2.2
0

Tamar@Rage ~/ghc2> /r/x86/ghc-8.2.2/bin/ghc.exe HW.hs; ./HW.exe; echo $status
[1 of 1] Compiling Main             ( HW.hs, HW.o )
Linking HW.exe ...
Situation normal
0

Haskell exceptions are also language level: they don't throw an actual signal, so they shouldn't be triggering the exception handlers. More likely than not, since it happens across compilers, you have an external process interrupting your Haskell program.

Using the 8.4.1 alpha, compile your program with -debug, run it with +RTS --generate-crash-dumps, upload the dump somewhere and link it back here.

comment:2 Changed 2 years ago by sergv

Oh, I forgot to mention that debug RTS does not have this problem :)

sergey@box /c/home/ghc/bugs$ ../ghc-8.2.2-x32/bin/ghc HW.hs -fforce-recomp
[1 of 1] Compiling Main             ( HW.hs, HW.o )
Linking HW.exe ...
sergey@box /c/home/ghc/bugs$ ./HW.exe
Situation normal
Segmentation fault
sergey@box /c/home/ghc/bugs$ ../ghc-8.2.2-x32/bin/ghc HW.hs -fforce-recomp -debug
[1 of 1] Compiling Main             ( HW.hs, HW.o )
Linking HW.exe ...
sergey@box /c/home/ghc/bugs$ ./HW.exe
Situation normal
sergey@box /c/home/ghc/bugs$
sergey@box /c/home/ghc/bugs$ ../ghc-8.4.1-alpha/bin/ghc HW.hs -fforce-recomp
[1 of 1] Compiling Main             ( HW.hs, HW.o )
Linking HW.exe ...
sergey@box /c/home/ghc/bugs$ ./HW.exe
Situation normal
Segmentation fault
sergey@box /c/home/ghc/bugs$ ../ghc-8.4.1-alpha/bin/ghc HW.hs -fforce-recomp -debug
[1 of 1] Compiling Main             ( HW.hs, HW.o )
Linking HW.exe ...
sergey@box /c/home/ghc/bugs$ ./HW.exe
Situation normal
sergey@box /c/home/ghc/bugs$ echo $?
0

I believe this issue has reproducibility difficulties similar to those of https://ghc.haskell.org/trac/ghc/ticket/14081.

comment:3 Changed 2 years ago by Phyx-

Can you run it with +RTS --generate-crash-dumps? Also, 8.4 should have produced a stack trace.

comment:4 Changed 2 years ago by sergv

It seems that without exception nothing is generated. And with -debug there are no exceptions...

sergey@box /c/home/ghc/bugs$ ../ghc-8.4.1-alpha/bin/ghc HW.hs -fforce-recomp -debug -rtsopts
[1 of 1] Compiling Main             ( HW.hs, HW.o )
Linking HW.exe ...
sergey@box /c/home/ghc/bugs$ ./HW.exe +RTS --generate-crash-dumps
Situation normal
sergey@box /c/home/ghc/bugs$ echo $?
0
sergey@box /c/home/ghc/bugs$ ls
HW.exe*  HW.hi  HW.hs  HW.o  HWCatch.hs
Last edited 2 years ago by sergv

comment:5 Changed 2 years ago by Phyx-

Sorry, forgot to clarify, you don't need -debug for +RTS --generate-crash-dumps.

comment:6 Changed 2 years ago by sergv

It seems nothing's generated anyway:

sergey@box /c/home/ghc/bugs$ ../ghc-8.4.1-alpha-x32/bin/ghc HW.hs -rtsopts -fforce-recomp
[1 of 1] Compiling Main             ( HW.hs, HW.o )
Linking HW.exe ...
sergey@box /c/home/ghc/bugs$ ./HW.exe +RTS --generate-crash-dumps --generate-stack-traces=yes
Situation normal
Segmentation fault
sergey@box /c/home/ghc/bugs$ ls
HW.exe*  HW.hi  HW.hs  HW.o  HWCatch.hs

comment:7 Changed 2 years ago by Phyx-

That's very peculiar. Use procdump then (https://docs.microsoft.com/en-us/sysinternals/downloads/procdump):

procdump.exe -t -ma -e 1 -x . Hw.exe

comment:8 Changed 2 years ago by sergv

Okay, procdump seems to fare better:

sergey@box /c/home/ghc/bugs$ procdump.exe -accepteula -t -ma -e 1 -x . Hw.exe

ProcDump v9.0 - Sysinternals process dump utility
Copyright (C) 2009-2017 Mark Russinovich and Andrew Richards
Sysinternals - www.sysinternals.com

Process:               HW.exe (684)
CPU threshold:         n/a
Performance counter:   n/a
Commit threshold:      n/a
Threshold seconds:     n/a
Hung window check:     Disabled
Log debug strings:     Disabled
Exception monitor:     First Chance+Unhandled
Exception filter:      [Includes]
                       *
                       [Excludes]
Terminate monitor:     Enabled
Cloning type:          Disabled
Concurrent limit:      n/a
Avoid outage:          n/a
Number of dumps:       1
Dump folder:           .\
Dump filename/mask:    PROCESSNAME_YYMMDD_HHMMSS
Queue to WER:          Disabled
Kill after dump:       Disabled


Press Ctrl-C to end monitoring without terminating the process.

Situation normal
[19:51:59] Exception: C0000005.ACCESS_VIOLATION
[19:51:59] Dump 1 initiated: .\HW.exe_180114_195159.dmp
[19:52:00] Dump 1 writing: Estimated dump file size is 35 MB.
[19:52:01] Dump 1 complete: 35 MB written in 1.4 seconds
[19:52:01] Dump count reached.

I've uploaded the dump to https://fex.net/get/403580615683/210526178.

comment:9 Changed 2 years ago by Phyx-

This seems like a genuine segfault to me.

CONTEXT:  (.ecxr)
eax=02703818 ebx=00527150 ecx=02700040 edx=02702f04 esi=00000001 edi=02702f00
eip=004c4343 esp=0028bcac ebp=02703804 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
HW+0xc4343:
004c4343 89442440        mov     dword ptr [esp+40h],eax ss:002b:0028bcec=????????
Resetting default scope
WARNING: Stack overflow detected. The unwound frames are extracted from outside normal stack bounds.

FAULTING_IP: 
HW+c4343
004c4343 89442440        mov     dword ptr [esp+40h],eax

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 004c4343 (HW+0x000c4343)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000001
   Parameter[1]: 0028bcec
Attempt to write to address 0028bcec

DEFAULT_BUCKET_ID:  INVALID_STACK_ACCESS

Something is writing to $sp+0x40 which seems to be invalid.

Actually the address sp itself is pointing to seems to be invalid. Attach the broken binary too please.

comment:10 Changed 2 years ago by sergv

Please find the binary and a copy of the same dump log at https://fex.net/#!034044549103.

comment:11 Changed 2 years ago by Phyx-

Summary: Executable finishing via unhandled exception results in segmentation fault on 32 bit Windows → 32-bit binaries sometimes throw a stack overflow on shutdown.

Hmm that binary works fine for me...

Tamar@Rage ~/ghc2> /r/HW.exe; echo $status
Situation normal
0

I'll try to find a Windows 7 machine to test on. Have you already tried on a different machine?

comment:12 Changed 2 years ago by sergv

This issue is really hard to reproduce. I've tried it on a few Windows 7 machines and it worked correctly. The logs I sent are from a vanilla Windows 7 virtual machine I have at home, and it is reproducible there. I recall that https://ghc.haskell.org/trac/ghc/ticket/14081 had similar issues. Perhaps it's some kind of Windows preference that enables more runtime checks and causes failures, but unfortunately I have absolutely no clue what that might be.

comment:13 Changed 2 years ago by sergv

After quite some trial and error I figured out why the debug RTS works and the release one doesn't. It boils down to the gcc optimization level the RTS was compiled with: debug is built with -O0 and release with -O2. If I build the debug RTS with -O2, -O1 or -Og then I can reproduce the crash as well. However, the exact cause of the problem is not known yet.

comment:14 Changed 23 months ago by Phyx-

Ok, compile the rts with -Og -g, recompile your example and run it in gdb: gdb --args HW.exe.

Once it crashes get a backtrace using bt. It's entirely possible the 32 bit version of GCC has issues. It's not tested or used as much as the 64 bit one unfortunately.

comment:15 Changed 23 months ago by sergv

Unfortunately, there's no informative backtrace to speak of:

sergey@box /c/home/ghc/bugs/rts-investigations$ gdb --quiet -ex run --args ./HW.exe
Reading symbols from ./HW.exe...done.
Starting program: C:\home\ghc\bugs\rts-investigations\HW.exe
[New Thread 2916.0x360]
warning: `C:\Windows\SYSTEM32\ntdll.dll': Shared library architecture i386:x86-64 is not compatible with target architecture i386.
warning: `C:\Windows\SYSTEM32\wow64.dll': Shared library architecture i386:x86-64 is not compatible with target architecture i386.
warning: `C:\Windows\SYSTEM32\wow64win.dll': Shared library architecture i386:x86-64 is not compatible with target architecture i386.
warning: `C:\Windows\SYSTEM32\wow64cpu.dll': Shared library architecture i386:x86-64 is not compatible with target architecture i386.
warning: Could not load shared library symbols for WOW64_IMAGE_SECTION.
Do you need "set solib-search-path" or "set sysroot"?
warning: Could not load shared library symbols for WOW64_IMAGE_SECTION.
Do you need "set solib-search-path" or "set sysroot"?
warning: Could not load shared library symbols for NOT_AN_IMAGE.
Do you need "set solib-search-path" or "set sysroot"?
warning: Could not load shared library symbols for NOT_AN_IMAGE.
Do you need "set solib-search-path" or "set sysroot"?
[New Thread 2916.0x100]
[New Thread 2916.0xe44]
[New Thread 2916.0xdb0]
Situation normal

Thread 1 received signal SIGSEGV, Segmentation fault.
0x0000002b in ?? ()
(gdb) where
#0  0x0000002b in ?? ()
#1  0x00788140 in n_capabilities ()
#2  0x01bde848 in ?? ()
#3  0x00000000 in ?? ()
(gdb)

However, I did narrow down the problem. Compiling rts/StgCRun.c with -O0 fixes the issue, while compiling it with -O1 reproduces the crash. Is there anything I can do to StgCRun.c to make locating the crash easier?

comment:16 Changed 23 months ago by Phyx-

That's great work @sergv!

It does make some sense, since StgCRun.c controls part of the stack allocations. Could you provide the preprocessed file:

inplace/mingw/bin/gcc -E StgCRun.c -o StgCRun.i

along with the assembly output of the file compiled at -O1 and -O0:

inplace/mingw/bin/gcc -O0 -S StgCRun.c -o StgCRun-O0.s

and similar for -O1.

comment:17 Changed 23 months ago by Phyx-

Cc: Phyx- added

Changed 23 months ago by sergv

Attachment: sources.tar.xzaa added

First part of preprocessed output, and -O0/-O1 assembly

Changed 23 months ago by sergv

Attachment: sources.tar.xzab added

Second and final part of preprocessed output, and -O0/-O1 assembly

comment:18 Changed 23 months ago by sergv

Okay, I did just that with a command shown below (taken from make output). Please find all three files packed in the sources.tar.xz* attachments and combine them via cat: cat sources.tar.xz* | xz -d | tar xvf -.

sergey@box /c/home/projects/ghc/rts$ (cd ..; "C:\home\projects\ghc\inplace\lib/../mingw/bin/gcc.exe" "-U__i686" "-march=i686" "-fno-stack-protector" "-DTABLES_NEXT_TO_CODE" "-U__i686" "-march=i686" "-fno-stack-protector" "-Iincludes" "-Iincludes/dist" "-Iincludes/dist-derivedconstants/header" "-Iincludes/dist-ghcconstants/header" "-Irts" "-Irts/dist/build" "-DCOMPILING_RTS" "-fno-strict-aliasing" "-fno-common" "-Irts/dist/build/./autogen" "-Wno-error=inline" "-fno-omit-frame-pointer" "-g3" "-DRtsWay=\"rts_debug\"" "-DWINVER=0x06000100" "-w" "-DDEBUG" "-g2" "-DTRACING" "-x" "c" "rts\StgCRun.c" "-no-pie" "-Wimplicit" "-include" "C:/home/projects/ghc/includes\ghcversion.h" "-Iincludes" "-Iincludes/dist" "-Iincludes/dist-derivedconstants/header" "-Iincludes/dist-ghcconstants/header" "-Irts" "-Irts/dist/build" "-Irts/dist/build" "-Irts/dist/build/./autogen" "-IC:\home\projects\ghc\libraries\base\include" "-IC:\home\projects\ghc\libraries\integer-gmp\include" "-IC:/home/projects/ghc/rts/dist/build" "-IC:/home/projects/ghc/includes" "-IC:/home/projects/ghc/includes/dist-derivedconstants/header" -S -O0 -o e:/StgCRun.O0.s)

comment:19 Changed 23 months ago by Phyx-

Thanks! I'll take a look at them once I get home this evening.

comment:20 Changed 23 months ago by Phyx-

Alright, well that's annoying. At -O1 or higher the optimizers optimize away the calls to ___chkstk_ms, which means the stack is not probed, so it doesn't grow.

This explains why the resulting stack access is nonsense. The optimizers need to be taught to leave this call alone.
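To illustrate why the missing probe matters: Windows commits stack memory lazily behind a guard page, so a large frame has to be touched one page at a time before it is used. The sketch below mimics that idea on an ordinary buffer rather than the real stack. The names `probe_buffer` and `PAGE_BYTES` are hypothetical, the 4096-byte page size is an assumption based on the 4k blocks mentioned in this ticket, and this is not the actual implementation of ___chkstk_ms.

```c
#include <stddef.h>

/* Assumed page size; 4096 matches the 4k blocks discussed in this ticket. */
#define PAGE_BYTES 4096u

/*
 * Touch one byte in each page-sized step of buf, walking from the high end
 * down to the start (the direction a downward-growing stack is probed in).
 * Returns the number of touches performed. The buffer contents are not
 * changed: `|= 0` is a read-modify-write that keeps the value, and the
 * volatile qualifier stops the compiler from deleting the access.
 */
size_t probe_buffer(volatile unsigned char *buf, size_t len)
{
    size_t touched = 0;
    size_t off = len;

    /* Step down one page at a time while more than a page remains. */
    while (off > PAGE_BYTES) {
        off -= PAGE_BYTES;
        buf[off] |= 0;
        touched++;
    }

    /* Final touch at the lowest address, if there is anything to touch. */
    if (len > 0) {
        buf[0] |= 0;
        touched++;
    }
    return touched;
}
```

For a length of 8224 bytes this performs three touches, one per 4096-byte step plus the final partial page. If the probes are skipped and the first access lands beyond the guard page, the result is an access violation like the one in the crash dump.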

comment:21 Changed 23 months ago by Phyx-

Hmm, well, actually maybe we just have to request it. Can you try again at -O1, but this time also add -fstack-check when compiling that file?

comment:22 Changed 23 months ago by sergv

I have tried -fstack-check but that doesn't help at all. It even breaks compilation with -O0! Just compare the outputs below:

-O0 -fno-stack-check:

	.file	"StgCRun.c"
	.text
Ltext0:
	.globl	_win32AllocStack
	.def	_win32AllocStack;	.scl	2;	.type	32;	.endef
_win32AllocStack:
LFB210:
	.file 1 "rts//StgCRun.c"
	.loc 1 111 0
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	movl	$8224, %eax
	call	___chkstk_ms
	subl	%eax, %esp
	.loc 1 113 0
	movl	$0, %eax
	.loc 1 114 0
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc

-O0 -fstack-check:

	.file	"StgCRun.c"
	.text
Ltext0:
	.globl	_win32AllocStack
	.def	_win32AllocStack;	.scl	2;	.type	32;	.endef
_win32AllocStack:
LFB210:
	.file 1 "rts//StgCRun.c"
	.loc 1 111 0
	.cfi_startproc
	pushl	%ebp
	.cfi_def_cfa_offset 8
	.cfi_offset 5, -8
	movl	%esp, %ebp
	.cfi_def_cfa_register 5
	orl	$0, -4096(%esp)
	orl	$0, -8192(%esp)
	orl	$0, -8224(%esp)
	subl	$8224, %esp
	.loc 1 113 0
	movl	$0, %eax
	.loc 1 114 0
	leave
	.cfi_restore 5
	.cfi_def_cfa 4, 4
	ret
	.cfi_endproc

comment:23 Changed 23 months ago by Phyx-

Hmm, ok. Looking closer, the issue is that at -O1 and higher the compiler notices the space isn't used. Because it's stack allocated, it won't be valid outside the frame anyway, so it correctly optimizes away the allocation.

I have to wait till the weekend to look at it, but things to try if you wish:

  • try marking the function volatile;
  • try adding __attribute__((optimize("O0"))) to the function to disable optimizations.

The second one will probably work, but isn't very portable, so I'll likely go with inline assembly.
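As a rough sketch of the attribute approach, assuming GCC: the function below reserves a large, otherwise unused stack frame and is marked so that the optimizer leaves it alone. The names `reserve_stack`, `RESERVE_BYTES` and `KEEP_UNOPTIMIZED` are hypothetical, and the 8224-byte size is only taken from the assembly listings in this ticket; the real definition in StgCRun.c may look different.

```c
/* Hypothetical size: 8224 bytes matches the frame in the assembly listings
   in this ticket; the real constant in StgCRun.c may be defined differently. */
#define RESERVE_BYTES 8224

/* The optimize attribute is GCC-specific. Other compilers get an empty macro
   here, and therefore no protection against the allocation being removed. */
#if defined(__GNUC__) && !defined(__clang__)
#define KEEP_UNOPTIMIZED __attribute__((optimize("O0")))
#else
#define KEEP_UNOPTIMIZED
#endif

/*
 * Reserve a large stack frame and return immediately. By C semantics the
 * array is dead, so an optimizing build is free to drop it; the attribute
 * asks GCC to compile just this function as if at -O0, which keeps the
 * frame (and, per the -O0 listing in this ticket, the stack probe call on
 * MinGW) in place.
 */
KEEP_UNOPTIMIZED int reserve_stack(void)
{
    unsigned char scratch[RESERVE_BYTES];
    (void)scratch; /* silence unused-variable warnings */
    return 0;
}
```

Whether the probe is really kept can be checked by compiling with `-O1 -S` and looking for the `___chkstk_ms` call in the generated assembly.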

comment:24 Changed 23 months ago by Phyx-

Hmm, it seems volatile doesn't work; it was a long shot anyway. But using an attribute does. I'll submit a patch tonight.

comment:25 Changed 23 months ago by Phyx-

Architecture: x86 → Unknown/Multiple
Differential Rev(s): Phab:D4343
Milestone: 8.4.1
Priority: normal → highest
Status: infoneeded → patch

comment:26 Changed 23 months ago by Phyx-

Summary: 32-bit binaries sometimes throw a stack overflow on shutdown. → Windows binaries sometimes throw a stack overflow.

comment:27 Changed 23 months ago by Phyx-

@sergv Can you try out the patch attached? It should fix the issue.

comment:28 Changed 23 months ago by sergv

@Phyx- The patch works - freshly built release RTS does not crash any more.

comment:29 Changed 23 months ago by Phyx-

@sergv thanks for the help and testing. I'll make sure it gets into 8.4.

comment:30 Changed 23 months ago by sergv

@Phyx- Thanks for your help!

comment:31 Changed 23 months ago by Ben Gamari <ben@…>

In a55d581f/ghc:

Fix Windows stack allocations.

On Windows we use the function `win32AllocStack` to do stack
allocations in 4k blocks and insert a stack check afterwards
to ensure the allocation returned a valid block.

The problem is this function does something that by C semantics
is pointless. The stack allocated value can never escape the
function, and the stack isn't used so the compiler just optimizes
away the entire function body.

After considering a bunch of other possibilities I think the simplest
fix is to just disable optimizations for the function.

Alternatively inline assembly is an option but the stack check function
doesn't have a very portable name as it relies on e.g. `libgcc`.

Thanks to Sergey Vinokurov for helping diagnose and test.

Test Plan: ./validate

Reviewers: bgamari, erikd, simonmar

Reviewed By: bgamari

Subscribers: rwbarton, thomie, carter

GHC Trac Issues: #14669

Differential Revision: https://phabricator.haskell.org/D4343

comment:32 Changed 23 months ago by bgamari

Resolution: fixed
Status: patch → closed