Opened 8 years ago

Closed 8 years ago

#5748 closed bug (fixed)

ghci segfault on OS X after dlsym failed lookup

Reported by: gwright
Owned by: igloo
Priority: highest
Milestone: 7.4.1
Component: GHCi
Version: 7.2.1
Keywords:
Cc: pho@…
Operating System: MacOS X
Architecture: Unknown/Multiple
Type of failure: GHCi crash
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

I've had repeatable segfaults with ghci 7.2.2 (OS X 10.6) and 7.0.4 (OS X 10.7 and 10.6). The immediate cause is a failed lookup of an external symbol in rts/Linker.c. The failure is not detected and the NULL value returned is eventually dereferenced, leading to a segfault. The underlying bug is still present in HEAD.

This is what happens:

gwright-macbook> ghci -v Test.hs
GHCi, version 7.2.2: http://www.haskell.org/ghc/  :? for help
Glasgow Haskell Compiler, Version 7.2.2, stage 2 booted by GHC version 7.0.4
Using binary package database: /usr/local/lib/ghc-7.2.2/package.conf.d/package.cache
Using binary package database: /Users/gwright/.ghc/x86_64-darwin-7.2.2/package.conf.d/package.cache
hiding package Cabal-1.12.0 to avoid conflict with later version Cabal-1.13.3
wired-in package ghc-prim mapped to ghc-prim-0.2.0.0-14e0c022e5d4efa3a40ab5991f2b2a1b
wired-in package integer-gmp mapped to integer-gmp-0.3.0.0-2e2b0fd56be1a5f60c50913e615691d9
wired-in package base mapped to base-4.4.1.0-5ca60b2acbb66fd59e5f81685cb72740
wired-in package rts mapped to builtin_rts
wired-in package template-haskell mapped to template-haskell-2.6.0.0-e7db5d1205f362bb792ab7bd5c7bbfae
wired-in package dph-seq not found.
wired-in package dph-par not found.
Hsc static flags: -static
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
*** Chasing dependencies:
Chasing modules from: 
Stable obj: []
Stable BCO: []
unload: retaining objs []
unload: retaining bcos []
Ready for upsweep []
Upsweep completely successful.
*** Deleting temp files:
Deleting: 
*** Chasing dependencies:
Chasing modules from: *Test.hs
Stable obj: []
Stable BCO: []
unload: retaining objs []
unload: retaining bcos []
Ready for upsweep
  [NONREC
      ModSummary {
         ms_hs_date = Sun Jan  1 18:20:14 EST 2012
         ms_mod = main:Main,
         ms_textual_imps = [import Prelude,
                            import Math.Symbolic.Wheeler.TensorUtilities,
                            import Math.Symbolic.Wheeler.TensorBasics,
                            import Math.Symbolic.Wheeler.Tensor,
                            import Math.Symbolic.Wheeler.IO,
                            import Math.Symbolic.Wheeler.Symbol,
                            import Math.Symbolic.Wheeler.Numeric,
                            import Math.Symbolic.Wheeler.Expr,
                            import Math.Symbolic.Wheeler.Commutativity,
                            import Math.Symbolic.Wheeler.Canonicalize,
                            import Math.Symbolic.Wheeler.Basic, import Data.Ratio,
                            import Data.Maybe]
         ms_srcimps = []
      }]
*** Deleting temp files:
Deleting: 
compile: input file Test.hs
Created temporary directory: /var/folders/4j/4jmo0VgVHgu2WNrlXKFTB++++TI/-Tmp-/ghc61560_1
*** Checking old interface for main:Main:
[1 of 1] Compiling Main             ( Test.hs, interpreted )
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar = 788
*** Simplifier:
Result size of Simplifier iteration=1 = 800
Result size of Simplifier = 788
*** Tidy Core:
Result size of Tidy Core = 788
*** CorePrep:
Result size of CorePrep = 1010
*** ByteCodeGen:
Upsweep completely successful.
*** Deleting temp files:
Deleting: /var/folders/4j/4jmo0VgVHgu2WNrlXKFTB++++TI/-Tmp-/ghc61560_1/ghc61560_0.c /var/folders/4j/4jmo0VgVHgu2WNrlXKFTB++++TI/-Tmp-/ghc61560_1/ghc61560_0.o
Warning: deleting non-existent /var/folders/4j/4jmo0VgVHgu2WNrlXKFTB++++TI/-Tmp-/ghc61560_1/ghc61560_0.c
Warning: deleting non-existent /var/folders/4j/4jmo0VgVHgu2WNrlXKFTB++++TI/-Tmp-/ghc61560_1/ghc61560_0.o
Ok, modules loaded: Main.
*Main> x
*** Parser:
*** Desugar:
*** Simplify:
*** CorePrep:
*** ByteCodeGen:
Loading package bytestring-0.9.2.0 ... linking ... done.
Loading package transformers-0.2.2.0 ... linking ... done.
Loading package mtl-2.0.1.0 ... linking ... done.
Loading package array-0.3.0.3 ... linking ... done.
Loading package deepseq-1.2.0.1 ... linking ... done.
Loading package text-0.11.1.12 ... linking ... done.
Loading package parsec-3.1.2 ... linking ... done.
Loading package uniqueid-0.1.1 ... linking ... done.
Loading package Wheeler-0.1 ... linking ... done.
Segmentation fault
gwright-macbook> 

The program is trying to display a record type (the variable "x"). The record type has a Show instance associated with it, and that is apparently not being resolved correctly, leading to a NULL dereference and a segfault.

The problem is isolated to OS X as far as I can tell. The code responsible for the error is in the function relocateSection:

        else if(reloc->r_extern)
        {
            struct nlist *symbol = &nlist[reloc->r_symbolnum];
            char *nm = image + symLC->stroff + symbol->n_un.n_strx;

            IF_DEBUG(linker, debugBelch("relocateSection: looking up external symbol %s\n", nm));
            IF_DEBUG(linker, debugBelch("               : type  = %d\n", symbol->n_type));
            IF_DEBUG(linker, debugBelch("               : sect  = %d\n", symbol->n_sect));
            IF_DEBUG(linker, debugBelch("               : desc  = %d\n", symbol->n_desc));
            IF_DEBUG(linker, debugBelch("               : value = %p\n", (void *)symbol->n_value));
            if ((symbol->n_type & N_TYPE) == N_SECT) {
                value = relocateAddress(oc, nSections, sections,
                                        symbol->n_value);
                IF_DEBUG(linker, debugBelch("relocateSection, defined external symbol %s, relocated address %p\n", nm, (void *)value));
            }
            else {
                value = (uint64_t) lookupSymbol(nm);
                IF_DEBUG(linker, debugBelch("relocateSection: external symbol %s, address %p\n", nm, (void *)value));
            }
        }

The returned value from lookupSymbol is not checked for failure. A simple check for NULL and a call to errorBelch is all that's needed to fix this. There's another place where the return value from lookupSymbol is not checked and it should be fixed similarly.
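
A minimal sketch of the kind of check described above. The names `lookup_symbol_stub`, `report_error` and `resolve_external` are hypothetical stand-ins that only mimic the roles of lookupSymbol, errorBelch and the relocation branch; this is not the code from the attached patch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the RTS lookupSymbol(): it knows one symbol. */
static int known_target;

static void *lookup_symbol_stub(const char *nm)
{
    return strcmp(nm, "_known") == 0 ? (void *)&known_target : NULL;
}

/* Hypothetical stand-in for errorBelch(): print a message to stderr. */
static void report_error(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
}

/* Resolve an external symbol. Returns 1 on success and 0 on failure.
 * On failure *value is left untouched, so the caller cannot pick up a
 * NULL address by accident. */
static int resolve_external(const char *nm, uint64_t *value)
{
    assert(nm != NULL && value != NULL);

    void *addr = lookup_symbol_stub(nm);
    if (addr == NULL) {
        report_error("lookup failed (relocate external): unknown symbol `%s'", nm);
        return 0;
    }
    *value = (uint64_t)(uintptr_t)addr;
    return 1;
}
```

The point of returning a status rather than the address itself is that the caller has to decide what to do on failure (abort loading the object, for instance) instead of carrying on with a zero value.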

In a bit more detail, the program "Test.hs" I was loading does some simple tests on a library. I admit to having been sloppy while sorting out the module exports, but the library compiles without warnings when -Wall is set. The library has a top-level module that re-exports the most commonly used symbols, but the test program doesn't use it, importing all of the modules it needs explicitly. Am I doing something that is known to be dangerous?

The failed symbol lookup is

lookupSymbol: looking up _Wheelerzm0zi1_MathziSymbolicziWheelerziSimpleSymbol_zdfShowSzuzdcshowsPrec_closure

which is the Show instance for a SimpleSymbol data type. (The library implements a symbolic algebra DSL.) The symbol is undefined in the object module:

gwright-macbook> nm HSWheeler-0.1.o  | grep "_Wheelerzm0zi1_MathziSymbolicziWheelerziSimpleSymbol_zdfShowSzuzdcshowsPrec_closure"
                 U _Wheelerzm0zi1_MathziSymbolicziWheelerziSimpleSymbol_zdfShowSzuzdcshowsPrec_closure
gwright-macbook> 

so some sort of failure is expected when I try to show something of the SimpleSymbol type.

I'm puzzled that the first indication of failure is a segfault, or, after I patch rts/Linker.c, an error from deep inside the linker. It seems that there is something else going wrong which ought to generate a warning at least.

I will generate a patch against HEAD to check for the failed symbol lookups; it would be good if it were included in the final 7.4.1.

Attachments (1)

0001-check-for-failed-external-symbol-lookups-partial-fix.patch (7.9 KB) - added by gwright 8 years ago.

Change History (15)

comment:1 Changed 8 years ago by gwright

I forgot to mention that the original segfaults occurred on clean, from-scratch builds of the Haskell Platform 2011.4.0.0. So I don't think that the underlying undefined symbol is caused by a global/user interface mismatch.

comment:2 Changed 8 years ago by simonmar

difficulty: Unknown

Clearly the segfault ought to produce an informative error message instead.

However, the fact that the symbol is missing entirely is suspicious. Is it possible that the .hi files are out of sync with the library object file? If everything is in sync, we should investigate further to find out why the symbol is missing.

comment:3 Changed 8 years ago by gwright

Owner: set to gwright

I have a patch that checks for lookup failure in the two places in rts/Linker.c where an invalid return from dlsym could slip by. I'll validate it and send it in tomorrow.

Afterward, I'll try to understand why the Show instance isn't being found. I agree that it's odd, and I would have expected an error earlier if the instance were missing. I will start over from scratch, looking closely at the .hi files.

comment:4 Changed 8 years ago by PHO

Cc: pho@… added

comment:5 Changed 8 years ago by gwright

Well, this just gets more interesting. I patched rts/Linker.c to error out if lookupSymbol returned NULL and it broke the build.

The problem is a single failed symbol lookup, approx_tab defined in HSinteger-gmp-0.3.0.0.o. The symbol is defined:

gwright-macbook> nm  HSinteger-gmp-0.3.0.0.o  | grep approx_tab
00000000000672c0 d approx_tab

but is missing a leading underscore. On OS X, the lookupSymbol function always strips off the first character, without first checking if it is an underscore. (Amusingly, there is an assertion to test that the leading character of a symbol is an underscore. Evidently assertions are seldom turned on during a compiler build, otherwise this would have shown up earlier.)

I changed the linker to produce a debugBelch instead of an error after a failed lookup. The bug occurs a number of times while building ghc, probably every time TH invokes ghci.

Instead of mucking about with gmp, I'm going to try patching lookupSymbol to check for the leading underscore and not strip the first character if it is not there.

I also poked around the gmp mailing list and saw ominous mention of approx_tab being treated specially, to work around a Darwin "linker bug". Verily, a maze of twisty passages, all alike.

comment:6 Changed 8 years ago by gwright

The failure to look up the "approx_tab" symbol is a real, and long-standing, linker bug.

First, my little experiment to not strip off the first character of a symbol if it wasn't "_" did not work. Checking the source of dlsym showed why: dlsym always prefixes its symbol argument with an underscore before it searches the symbol tables. It can never find a symbol that does not start with a leading underscore.
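
A toy model of why neither variant can work, assuming only the two behaviours described in this ticket (lookupSymbol dropping the first character, dlsym prepending an underscore). The table, `model_dlsym` and both lookup functions are hypothetical; nothing here calls the real dlsym.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy symbol table holding names as nm would print them. */
static const char *const toy_table[] = { "_exported_fn", "approx_tab" };

/* Model of dlsym as described: prepend '_' to the argument, then search. */
static int model_dlsym(const char *name)
{
    char buf[128];
    assert(strlen(name) < sizeof buf - 1);
    snprintf(buf, sizeof buf, "_%s", name);
    for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++) {
        if (strcmp(toy_table[i], buf) == 0)
            return 1;   /* found */
    }
    return 0;           /* not found */
}

/* Original behaviour: always drop the first character. */
static int lookup_strip_always(const char *nm)
{
    return model_dlsym(nm + 1);
}

/* The experiment: drop the first character only if it is '_'. */
static int lookup_strip_checked(const char *nm)
{
    return model_dlsym(nm[0] == '_' ? nm + 1 : nm);
}
```

In this model "_exported_fn" is found either way, while "approx_tab" is searched for as "_pprox_tab" by the first variant and as "_approx_tab" by the second, and neither name is in the table.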

The bug is that dlsym is ever being asked to resolve "approx_tab" at all. The Apple documentation is a bit confusing on this point, but here's my understanding of it:

The documentation uses "external" in two different senses:

1) "external" relocations, indicated by reloc->r_extern == 1, mean that a symbol must be looked up in the symbol table at offset reloc->r_symbolnum. (Non-external relocations simply involve computing offsets within the current section of the object module.)

2) "External" symbols, indicated by the N_EXT flag in the symbol table entry (an nlist_64 struct). This sense of "external" means that the symbol is either a) defined within the current object module and available for import by other object modules, or b) that the symbol is undefined in the current object module and must be found elsewhere.

The symbol "approx_tab" is external but not External. It should be looked up in the symbol table ("external" in the first sense) but not resolved by lookupSymbol, since it is not "External" in the second sense. Instead, the symbol table entry contains the section number (nlist->n_sect) and offset within the section (nlist->n_value) of the symbol.

This is straightforward to fix, though it's puzzling why it ever worked. The code in rts/Linker.c assumes that relocations of type X86_64_RELOC_GOT and X86_64_RELOC_GOT_LOAD are always "External" in the second sense, which is wrong. It's also strange that a symbol reference internal to an object module goes through the global offset table (at least I think it's strange; maybe there is a good but not obvious reason for it).

I'll get this fixed up and tested today.

comment:7 Changed 8 years ago by gwright

Status: changed from new to patch

The attached patch detects failed symbol lookups. It also corrects the bad relocation of a certain type of non-exported symbol (the "approx_tab" issue mentioned above).

The patch does not address the original question of why I had an undefined symbol reference crash ghci. However, at least ghci will give a proper error rather than a segfault.

The patch was validated against HEAD. There were 7 unexpected failures, the same unexpected failures seen in a validate run in which rts/Linker.c was not patched.

comment:8 Changed 8 years ago by gwright

I backported my patch to 7.2.2, rebuilt the dependencies for my library and tried my test program again. As expected, ghci now gives an error, indicating that it can't find a symbol:

gwright-macbook> ghci Test.hs
GHCi, version 7.2.2: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( Test.hs, interpreted )
Ok, modules loaded: Main.
*Main> x
Loading package bytestring-0.9.2.0 ... linking ... done.
Loading package transformers-0.2.2.0 ... linking ... done.
Loading package mtl-2.0.1.0 ... linking ... done.
Loading package array-0.3.0.3 ... linking ... done.
Loading package deepseq-1.2.0.1 ... linking ... done.
Loading package text-0.11.1.12 ... linking ... done.
Loading package parsec-3.1.2 ... linking ... done.
Loading package uniqueid-0.1.1 ... linking ... done.
Loading package Wheeler-0.1 ... linking ... <interactive>: 
lookupSymbol failed in relocateSection (relocate external)
/Users/gwright/.cabal/lib/Wheeler-0.1/ghc-7.2.2/HSWheeler-0.1.o: unknown symbol `_Wheelerzm0zi1_MathziSymbolicziWheelerziComplexity_Real_closure'
ghc: unable to load package `Wheeler-0.1'
*Main> 

The unresolved symbol above is an algebraic data type used as a record field to indicate if a symbol is real or complex. The original segfault I saw was a failure to resolve the Show instance for the whole record.

So we're back to the original problem, without the embarrassing segfault. One possibly relevant observation is that I had to write a number of .hs-boot files to break circular module dependencies. I needed to do this because my expression data type can, for example, contain tensor symbols, and those tensor symbols can in turn have components that are expressions. Could my .hs-boot files be at the root of this?

comment:9 Changed 8 years ago by simonmar

Thanks, I'll look at your patch.

It's hard to know whether .hs-boot files are causing your missing symbol, but if all you're doing is compiling a library and loading it into GHCi, a missing symbol definitely indicates a bug in GHC. Can you reduce the example as much as possible and attach it?

comment:10 in reply to:  9 Changed 8 years ago by gwright

Replying to simonmar:

Thanks, I'll look at your patch.

It's hard to know whether .hs-boot files are causing your missing symbol, but if all you're doing is compiling a library and loading it into GHCi, a missing symbol definitely indicates a bug in GHC. Can you reduce the example as much as possible and attach it?

It seems that the original bug is not a bug, but an infelicity in cabal. The library is built using cabal. Cabal compiles all of the modules, but if a module is left off the Other-modules list, it isn't linked. This causes the unresolved symbol. I had been refactoring the modules in my library and hadn't noticed that one was left off the Other-modules list.

It's annoying that cabal doesn't produce a warning when it fails to link a module that has been compiled. However, the only real bug here --- the linker segfault --- is fixed by the patch. When the patch is applied this ticket can be closed.

comment:11 Changed 8 years ago by simonmar

Milestone: 7.4.1
Owner: changed from gwright to igloo
Priority: changed from normal to highest

Ian, could you commit this patch please?

comment:12 Changed 8 years ago by gwright@…

commit b56e7b20605d742536441ed721a4fa21598782d5

Author: Gregory Wright <gwright@antiope.com>
Date:   Sat Jan 7 09:58:00 2012 -0500

    check for failed external symbol lookups (partial fix for #5748)

 rts/Linker.c |   96 +++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 81 insertions(+), 15 deletions(-)

comment:13 Changed 8 years ago by igloo

Status: changed from patch to merge

comment:14 Changed 8 years ago by igloo

Resolution: fixed
Status: changed from merge to closed