Opened 9 years ago

Closed 9 years ago

#4970 closed bug (fixed)

time002 and time004 (ghci) test failures on OS X 64 bit

Reported by: gwright Owned by: igloo
Priority: high Milestone: 7.2.1
Component: GHCi Version: 7.0.1
Keywords: Cc: AntoineLatter, william.knop.nospam@…, warrensomebody@…, p.giarrusso@…
Operating System: MacOS X Architecture: x86_64 (amd64)
Type of failure: Incorrect result at runtime Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

The time002 and time004 tests fail for ghci on OS X 64 bit. The failure doesn't happen every time, but when it does, the failure is the same:

plumbbob-franklin> inplace/bin/ghc-stage2 --interactive time002.hs
GHCi, version 7.0.1.20101221: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( time002.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
Loading package array-0.3.0.2 ... linking ... done.
Loading package filepath-1.2.0.0 ... linking ... done.
Loading package old-locale-1.0.0.2 ... linking ... done.
Loading package old-time-1.0.0.6 ... linking ... done.
Loading package unix-2.4.1.0 ... linking ... done.
Loading package directory-1.1.0.0 ... linking ... done.
Loading package process-1.0.1.5 ... linking ... done.
Loading package time-1.2.0.3 ... linking ... done.
Loading package random-1.0.0.3 ... linking ... done.
Loading package haskell98-1.1.0.1 ... linking ... done.
Ok.
*Main> main
Ok.
*Main> main
Ok.
*Main> main
Ok.
*Main> main
Ok.
*Main> main
Ok.
*Main> main
Ok.
*Main> main
Ok.
*Main> main
Ok.
*Main> main
*** Exception: Time.toClockTime: picoseconds out of range
*Main> main
Ok.
*Main> 

If I modify the test to print the results of getClockTime and toCalendarTime, I see

plumbbob-franklin> inplace/bin/ghc-stage2 --interactive time002.hs
GHCi, version 7.0.1.20101221: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( time002.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
Loading package array-0.3.0.2 ... linking ... done.
Loading package filepath-1.2.0.0 ... linking ... done.
Loading package old-locale-1.0.0.2 ... linking ... done.
Loading package old-time-1.0.0.6 ... linking ... done.
Loading package unix-2.4.1.0 ... linking ... done.
Loading package directory-1.1.0.0 ... linking ... done.
Loading package process-1.0.1.5 ... linking ... done.
Loading package time-1.2.0.3 ... linking ... done.
Loading package random-1.0.0.3 ... linking ... done.
Loading package haskell98-1.1.0.1 ... linking ... done.
Sat Feb 19 10:12:35 EST 2011
CalendarTime {ctYear = 2011, ctMonth = February, ctDay = 19, ctHour = 10, ctMin = 12, ctSec = 35, ctPicosec = 112417000000, ctWDay = Saturday, ctYDay = 49, ctTZName = "EST", ctTZ = -18000, ctIsDST = False}
Ok.
*Main> main
Sat Feb 19 10:12:37 EST 2011
CalendarTime {ctYear = 2011, ctMonth = February, ctDay = 19, ctHour = 10, ctMin = 12, ctSec = 37, ctPicosec = 4295408225000000, ctWDay = Saturday, ctYDay = 49, ctTZName = "EST", ctTZ = -18000, ctIsDST = False}
*** Exception: Time.toClockTime: picoseconds out of range
*Main> 

The ctPicosec field is too big.

The interesting thing is that if I dump the relocations for HSold-time-1.0.0.6.o, and grep for ctPicosec, I see

plumbbob-franklin> otool -rv HSold-time-1.0.0.6.o  | grep ctPicosec
00023c5d False long   True   SUB     False     _oldzmtimezm1zi0zi0zi6_SystemziTime_ctPicosec_info_dsp
00023c5d False long   True   UNSIGND False     _oldzmtimezm1zi0zi0zi6_SystemziTime_ctPicosec_info
00023c55 True  long   True   SIGNED  False     _oldzmtimezm1zi0zi0zi6_SystemziTime_ctPicosec_closure
00001cf8 False quad   True   UNSIGND False     _oldzmtimezm1zi0zi0zi6_SystemziTime_ctPicosec_info

The first relocation is an X86_64_RELOC_SUBTRACTOR. This is probably being done incorrectly. (It may be distantly related to #4013, which also seems to be a subtractor relocation gone awry, but in that case of x86. The details of the x86 and x86_64 cases are quite different, but the original code for both cases is from the same author.)

After the patch that fixed #4867, there between five and seven test failures that only involve ghci. I'm uncertain about two because I've been building unthreaded to simplify debugging and two tests require threads. I suspect linker bugs in all these cases.

I've found a convenient list of x86_64 relocation examples at http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/MachOTopics/1-Articles/dynamic_code.html#//apple_ref/doc/uid/TP40002528-SW1. I'll check the OS X 64 bit linker in its current state against these. Unfortunately, the list does not seem to be exhaustive. For instance, it doesn't contain the inter-section relocation that was at issue in #4867.

Attachments (9)

Time.hsc.dpatch (2.1 KB) - added by gwright 9 years ago.
CUSeconds_CSUSeconds.dpatch (72.6 KB) - added by altaic 9 years ago.
whitespace.dpatch (71.6 KB) - added by altaic 9 years ago.
getCPUTime_getCurrentTime.dpatch (72.2 KB) - added by altaic 9 years ago.
getClockTime.dpatch (3.1 KB) - added by altaic 9 years ago.
0001-Add-Haskell-types-for-C-types-useconds_t-and-susecon.patch (3.0 KB) - added by altaic 9 years ago.
git patch #1
0002-Whitespace-only.-This-patch-is-part-2-of-4-for-trac-.patch (2.9 KB) - added by altaic 9 years ago.
git patch #2
0003-Use-CSUSeconds-in-getCPUTime-and-getCurrentTime.-Fix.patch (2.6 KB) - added by altaic 9 years ago.
git patch #3
0001-Use-CSUSeconds-rather-than-CTime-in-getClockTime.-Fi.patch (1.2 KB) - added by altaic 9 years ago.
git patch #4

Download all attachments as: .zip

Change History (29)

comment:1 Changed 9 years ago by gwright

Owner: set to gwright

I've looked at the code in relocateSection in rts/Linker.c more carefully and there's just no way to tell if it handles all of the valid relocation cases correctly. Checking Apple's ld64 source code, there are about 20 different combinations of relocation type, external/non-external symbol and address length that are supported.

My plan is to rip out the current X86_64 code in relocateSection and replace it. The first benefit is that all invalid cases will be caught. The second is that the code path for each type of relocation will be distinct, so fixing one relocation type won't break another (which may be the happening with the linker as it is now).

comment:2 Changed 9 years ago by gwright

I've replaced the old X86_64 code in relocateSection with new code that provides more debug info. There is indeed an error in some of the X86_64_RELOC_SUBTRACTOR cases.

The below relocation is for the symbol ctPicosec which is read incorrectly by test002.

Here's what goes wrong:

relocateSection: relocation 4534 has
relocateSection:        r_address   = 146525 (0x23c5d)
relocateSection:        r_symbolnum = 4307 (0x10d3)
relocateSection:        r_pcrel     = 0
relocateSection:        r_length    = 2
relocateSection:        r_extern    = 1
relocateSection:        r_type      = 0 (0)
relocateSection:        fixupPtr = 0x10611fe8d, fixup = 0xf9ee01c0, pc = 0x10611fe91
relocateSection: type is X86_64_RELOC_UNSIGNED
findSymbolAddress: looking for symbol _oldzmtimezm1zi0zi0zi6_SystemziTime_ctPicosec_info
findSymbolAddress:      _oldzmtimezm1zi0zi0zi6_SystemziTime_ctPicosec_info is defined in this object file
findSymbolAddress:      oc->image = 0x1060fc000
findSymbolAddress:      section[0].offset = 0x230
findSymbolAddress:      section[0].addr = 0
findSymbolAddress:      symbol->n_value = 0x23c28
relocateSection: external symbol _oldzmtimezm1zi0zi0zi6_SystemziTime_ctPicosec_info found at 10611fe58
relocateSection: FINAL: reloc = 4534, addr = 23c5d, fixup = 200000018
relocateSection: 32 bit reloc, fixup 0x18 written to 0x10611fe8d

Note the last two lines. The one beginning with FINAL: has fixup 0x200000018, but the next line shows that 0x18 was written back to the fixup location. So the fixup has been truncated.

It's not clear from the Apple documentation what's supposed to happen here. In my code now and in Wolfgang Thaller's earlier code, the subtractor relocation subtracted the symbol address from the fixup, then wrote it back to the fixup, truncating to 32 bits (reloc->r_length == 2, which means the fixup is 32 bits long). The following X86_64_RELOC_UNSIGNED read the saved fixup again, added the next symbol value and truncated the result to 32 bits again. This is probably wrong. What I think should happen is that the first X86_64_RELOC_SUBTRACTOR should look up the target symbol, then save its full 64 bit value. The following X86_64_RELOC_UNSIGNED should look up its symbol, then do a 64 bit subtraction of the previous symbol's address from the current one. This difference should be added to the fixup, and the entire value truncated to 32 bits.

This is not hard to test, and I will also add checks that will barf if a relocation is truncated.

comment:3 Changed 9 years ago by gwright

I've fixed the truncated relocations, but this hasn't fixed the problem with the time002 test. Relocations are now checked for truncation in my linker code. Using signed ints for quantities that are signed (just as one should) cleaned things up.

This means that there is still a linker bug somewhere. I will go back to gdb to try to find out where time002 goes wrong.

comment:4 Changed 9 years ago by gwright

Hmm. Hmm. Getting rid of the truncated relocations increased the number of ghci test failures dramatically (from 12 to 150). So the arithmetic really should be done unsigned, even for quantities that are intrinsically signed. I backed my changes out and was able to reproduce my earlier result.

This does show that the details of the relocation calculations are fragile and easy to break, I suppose.

I will go back to poking around with gdb. Perhaps a clue is the fact that the ctPicosec field is the only Integer in the CalendarTime record. The other numeric fields are Ints.

comment:5 Changed 9 years ago by gwright

Some notes to self:

The error occurs before the microseconds values from gettimeofday is multiplied by 1000000 to convert to picoseconds. Here's the source in Time.hs (the preprocessed Time.hsc):

getClockTime :: IO ClockTime

{-# LINE 249 "libraries/old-time/./System/Time.hsc" #-}
getClockTime = do
  allocaBytes (16) $ \ p_timeval -> do
{-# LINE 251 "libraries/old-time/./System/Time.hsc" #-}
    throwErrnoIfMinus1_ "getClockTime" $ gettimeofday p_timeval nullPtr
    sec  <- ((\hsc_ptr -> peekByteOff hsc_ptr 0))  p_timeval :: IO CTime
{-# LINE 253 "libraries/old-time/./System/Time.hsc" #-}
    usec <- ((\hsc_ptr -> peekByteOff hsc_ptr 8)) p_timeval :: IO CTime
{-# LINE 254 "libraries/old-time/./System/Time.hsc" #-}
    return (TOD (realToInteger sec) ((realToInteger usec) * 1000000))

Either usec is not getting the right value or something is wrong with realToInteger. I'm guessing it's the former.

I've located where the multiplication by 1000000 is in the assembly code:

.text
.align 3
_s7ja_info_dsp:
        .quad   0
        .quad   32
_s7ja_info:
LcfET:
        movq %rbx,%r14
        leaq _r4N2_closure(%rip),%rax
        leaq 1(%rax),%rsi
        addq $8,%rbp
        jmp _integerzmgmp_GHCziInteger_timesInteger_info
        .long  _s7ja_info - _s7ja_info_dsp

and where gettimeofday is called:

.text
.align 3
_oldzmtimezm1zi0zi0zi6_SystemziTime_zdwa3_info_dsp:
        .quad   _oldzmtimezm1zi0zi0zi6_SystemziTime_zdwa3_srt-(_oldzmtimezm1zi0zi0zi6_SystemziTime_zdwa3_info)+0
        .quad   8589934596
        .quad   0
        .quad   133143986191
.globl _oldzmtimezm1zi0zi0zi6_SystemziTime_zdwa3_info
_oldzmtimezm1zi0zi0zi6_SystemziTime_zdwa3_info:
LcfGM:
        leaq -16(%rbp),%rax
        cmpq %r15,%rax
        jb LcfGO
        movq %r14,%rax
        movq %rax,%rdi
        xorl %esi,%esi
        movl $0,%eax
        call ___hscore_gettimeofday
        movq %r14,-8(%rbp)
        movslq %eax,%rax
        cmpq $-1,%rax
        jne LcfGQ
        leaq _r4S3_closure(%rip),%r14
        leaq _s7jh_info(%rip),%rax
        movq %rax,-16(%rbp)
        addq $-16,%rbp
        jmp _base_ForeignziCziError_throwErrno1_info
LcfGO:
        leaq _oldzmtimezm1zi0zi0zi6_SystemziTime_zdwa3_closure(%rip),%rbx
        jmp *-8(%r13)
LcfGQ:
        addq $-16,%rbp
        jmp _s69Q_info
        .long  _oldzmtimezm1zi0zi0zi6_SystemziTime_zdwa3_info - _oldzmtimezm1zi0zi0zi6_SystemziTime_zdwa3_info_dsp

Now I need to do a bisection search to locate the bad relocation.

comment:6 Changed 9 years ago by gwright

Surprise! Not a relocation failure at all! I don't understand why it only occurs for the ghci tests, but I am able to fix the bug.

Here's what's going on:

I used the linker debug output to get the memory address of the call to __hscore_gettimeofday. Then I poked around with gdb's x/i command to get the memory addresses of the instructions just before and after the call. I set breakpoints at these instructions. (This may seem roundabout, but more on that later.)

I ran the test002 program a number of times, looking for where the time of day was returned from __hscore_gettimeofday. I guessed that the return values might be somewhere pointed to by the STG machine's R1 ($rbx on amd64), and it is:

*Main> main

Breakpoint 3, 0x00000001061208d1 in ?? ()
1: x/i $rip  0x1061208d1:       mov    $0x0,%eax
(gdb) p8 $rbx
0x105d28178:    0x302e302e322e302d
0x105d28170:    0x6d6972702d636867
0x105d28168:    0x10
0x105d28160:    0x101f34340 <stg_ARR_WORDS_info>
0x105d28158:    0x8
0x105d28150:    0x0
0x105d28148:    0x10
0x105d28140:    0x101f34340 <stg_ARR_WORDS_info>
(gdb) c
Continuing.

Breakpoint 1, 0x00000001061208db in ?? ()
1: x/i $rip  0x1061208db:       mov    %r14,-0x8(%rbp)
(gdb) p8 $rbx
0x105d28178:    0x302e302e322e302d
0x105d28170:    0x6d6972702d636867
0x105d28168:    0x10
0x105d28160:    0x101f34340 <stg_ARR_WORDS_info>
0x105d28158:    0xe300b
0x105d28150:    0x4d7118d0
0x105d28148:    0x10
0x105d28140:    0x101f34340 <stg_ARR_WORDS_info>
(gdb) 

The lines at $rbx + 16 (0x105d28150) and $rbx + 24 (0x105d28158) hold the sec and usec values, respectively. These are the seconds and microseconds since the Unix epoch on 1 Jan 1970. The current seconds count isn't obvious; I had to run date +%s in a shell, converted the value to hex then looked at the memory pointed to by various registers to find something starting with 0x4d7....

I tried to step through __hscore_gettimeofday to see the time values being written, but it doesn't work. The _gettimeofday system call loops endlessly calling _nanotime. I suspect that there is a consistency check to make sure the time returned is current, even if the process is interrupted or put to sleep. Single stepping interrupts each instruction, causing _gettimeofday to chase its tail.

However, on OS X we have a lucky break. gdb on this platform has hardware watchpoints, so I can set a watchpoint on the memory location that the usec value is written to, and it won't cause significant overhead. Here's a transcript with the watchpoint set:

*Main> main

Breakpoint 3, 0x00000001061208d1 in ?? ()
1: x/i $rip  0x1061208d1:       mov    $0x0,%eax
(gdb) p8 $rbx
0x105d283f8:    0xc
0x105d283f0:    0x101f34340 <stg_ARR_WORDS_info>
0x105d283e8:    0x8
0x105d283e0:    0x101f34340 <stg_ARR_WORDS_info>
0x105d283d8:    0x322e302e33
0x105d283d0:    0x2e302d7961727261
0x105d283c8:    0x10
0x105d283c0:    0x101f34340 <stg_ARR_WORDS_info>
(gdb) watch * 0x105d283d8
Hardware watchpoint 13: *4392649688
(gdb) c
Continuing.
Hardware watchpoint 13: *4392649688

Old value = 774909491
New value = 735977
0x00007fffffe00337 in __gettimeofday ()
1: x/i $rip  0x7fffffe00337 <__gettimeofday+87>:        xor    %eax,%eax
(gdb) c
Continuing.

Breakpoint 1, 0x00000001061208db in ?? ()
1: x/i $rip  0x1061208db:       mov    %r14,-0x8(%rbp)
(gdb) p8 $rbx
0x105d283f8:    0xc
0x105d283f0:    0x101f34340 <stg_ARR_WORDS_info>
0x105d283e8:    0x8
0x105d283e0:    0x101f34340 <stg_ARR_WORDS_info>
0x105d283d8:    0x32000b3ae9
0x105d283d0:    0x4d700913
0x105d283c8:    0x10
0x105d283c0:    0x101f34340 <stg_ARR_WORDS_info>
(gdb)

Here we see that a sensible value for microseconds is stored to 0x105d283d8, 735977 (0xb3ae9), or about three quarters of a second (look at the "New value =" line). But after __hscore_gettimeofday returns, memory location 0x105d283d8 contains 0x32000b3ae9. The low order 32 bits are correct, but the high order 32 bits contain junk. Continuing from this point gives the "picoseconds out of range" error.

The cause is simple, and not at all related to the linker. In the getClockTime function in libraries/old-time/System/Time.hsc,

#elif HAVE_GETTIMEOFDAY
getClockTime = do
  allocaBytes (#const sizeof(struct timeval)) $ \ p_timeval -> do
    throwErrnoIfMinus1_ "getClockTime" $ gettimeofday p_timeval nullPtr
    sec  <- (#peek struct timeval,tv_sec)  p_timeval :: IO CTime
    usec <- (#peek struct timeval,tv_usec) p_timeval :: IO CTime
    return (TOD (realToInteger sec) ((realToInteger usec) * 1000000))

the usec value is forced to be a CTime, which has 64 bits. But the tv_usec field in struct timeval is an __int32_t (see /usr/include/sys/_types.h). The watchpoint showed that only 32 bits were transferred. The fix is to change (on darwin) the type of usec from CTime to CInt. The attached patch does this.

After making this change, I've had no more failures of test002 or test004 in ghci.

Changed 9 years ago by gwright

Attachment: Time.hsc.dpatch added

comment:7 Changed 9 years ago by gwright

Status: newpatch

comment:8 Changed 9 years ago by igloo

Milestone: 7.2.1

Great stuff, thanks Greg!

comment:9 Changed 9 years ago by igloo

Owner: gwright deleted
Priority: normalhigh
Status: patchnew

It looks like the right thing to do is to define a Haskell type for the C suseconds_t, and use that. At least, http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/time.h.html and my Linux gettimeofday(2) agree on that.

comment:10 Changed 9 years ago by igloo

Owner: set to igloo

comment:11 in reply to:  9 Changed 9 years ago by gwright

Replying to igloo:

It looks like the right thing to do is to define a Haskell type for the C suseconds_t, and use that. At least, http://pubs.opengroup.org/onlinepubs/009695399/basedefs/sys/time.h.html and my Linux gettimeofday(2) agree on that.

That should work on OS X as well, since __darwin_suseconds_t is a typedef for suseconds_t (defined in /usr/include/sys/types.h); __darwin_suseconds_t is the type of the tv_usec field in struct timeval.

comment:12 Changed 9 years ago by igloo

See also #4247.

comment:13 Changed 9 years ago by AntoineLatter

Cc: AntoineLatter added

I'm getting this issue in compiled code - the benchmark suite in the binary package produces great values for MB/s throughput.

A simple:

import System.CPUTime
import Control.Concurrent

main :: IO ()
main = do
  t1 <- getCPUTime   -- in picoseconds
  threadDelay (10^6) -- in microseconds
  t2 <- getCPUTime
  print $ fromIntegral (t2 - t1) / 10^12  -- should print ~1

produces nearly random output when compiled. At minimum the output should never be negative :-)

Changed 9 years ago by altaic

Attachment: CUSeconds_CSUSeconds.dpatch added

Changed 9 years ago by altaic

Attachment: whitespace.dpatch added

Changed 9 years ago by altaic

Changed 9 years ago by altaic

Attachment: getClockTime.dpatch added

comment:14 Changed 9 years ago by altaic

Cc: william.knop.nospam@… added
Status: newpatch

I've created four bitesize patches against HEAD which add Haskell types for useconds_t and suseconds_t (respectively CUSeconds and CSUSeconds), and then fix getClockTime, getCPUTime, and getCurrentTime. The getCurrentTime bug was not reported, but a quick grep of uses of struct timeval revealed that GHC.Event.Clock used CLong for both fields, which would fail in the same way as the reported bugs.

comment:15 Changed 9 years ago by altaic

I'm in the process of moving over to the git repo, so here are equivalent git patches for your convenience.

comment:16 Changed 9 years ago by warren

Cc: warrensomebody@… added

comment:17 Changed 9 years ago by igloo

See also #5148

comment:18 Changed 9 years ago by igloo

New types proposed here, with 7 May deadline: http://www.haskell.org/pipermail/libraries/2011-April/016254.html

comment:19 Changed 9 years ago by Blaisorblade

Cc: p.giarrusso@… added

comment:20 Changed 9 years ago by igloo

Resolution: fixed
Status: patchclosed

No objections; applied.

Note: See TracTickets for help on using tickets.