Opened 8 years ago

Closed 8 years ago

#5559 closed bug (fixed)

heap profile character encoding confusion

Reported by: guest
Owned by: simonmar
Priority: high
Milestone: 7.4.1
Component: Profiling
Version: 7.0.3
Keywords: heap profile, character encoding
Cc: claudiusmaximus@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case: profiling/T5559

Description

Heap profiling this UTF-8 source file (where ø is encoded as C3 B8) with ghc-7.0.3 on GNU/Linux with LANG=en_GB.utf8 seems to give an output .hp file in ISO-8859-1 encoding (where ø is encoded as F8).

føb :: Integer -> Integer
føb n
  | n == 0 = 0
  | n == 1 = 1
  | n >= 2 = føb (n - 1) + føb (n - 2)

main :: IO ()
main = print (føb 100)

hexdump extract from .hp file:

00000000  28 32 39 33 29 66 f8 62  2f 43 41 46 3a 6c 76 6c  |(293)f.b/CAF:lvl|
00000010  31 5f 72 50 70 09 34 30  0a                       |1_rPp.40.|
00000019
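
The `66 f8 62` bytes in the dump are what you get when each code point of the name is written as a single byte, whereas a UTF-8 writer would emit `66 c3 b8 62`. The following is a minimal sketch of the two encodings for comparison. The helper names (`utf8Char`, `low8`, `hexBytes`) are hypothetical, and whether GHC 7.0.3 truncated code points in exactly this way is an assumption, not something the ticket confirms.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Encode one character as UTF-8 (1 to 4 bytes).
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    top, cont :: Int -> Word8
    -- Leading payload bits, to be OR-ed with the length marker.
    top s  = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Assumed legacy behaviour: keep only the low 8 bits of each code point.
-- For code points below 256 this coincides with ISO-8859-1.
low8 :: Char -> Word8
low8 = fromIntegral . ord

-- Render bytes as space-separated two-digit lowercase hex.
hexBytes :: [Word8] -> String
hexBytes = unwords . map hex2
  where
    hex2 b = (if b < 0x10 then "0" else "") ++ showHex b ""
```

With these definitions, `hexBytes (concatMap utf8Char "f\248b")` yields `66 c3 b8 62`, and `hexBytes (map low8 "f\248b")` yields `66 f8 62`, matching the dump above.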

This causes some problems for heap profile visualization programs:

  • hp2ps: viewing the .ps in evince shows a wrong character (slashed-l instead of ø)
  • hp2pretty: viewing the .svg with rsvg aborts with an invalid utf8 error

hp2any-core seemed to handle the character encoding correctly in this test (displayed as "\248") with correct appearance in hp2any-graph's OpenGL window.
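
A viewer that has to cope with both old and new .hp files could try strict UTF-8 first and fall back to treating each byte as a code point. The sketch below illustrates that idea only; it is not the code of hp2ps, hp2pretty or hp2any-core. `decodeUtf8` and `decodeLenient` are hypothetical names, and the decoder is simplified (it does not reject overlong forms or surrogates).

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- Try to decode bytes as UTF-8; Nothing on the first malformed sequence.
-- Simplified: overlong encodings and surrogates are accepted.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
  | otherwise          = Nothing
  where
    -- Consume k continuation bytes after a lead byte with payload `lead`.
    multi :: Int -> Word8 -> Maybe String
    multi k lead =
      let (cs, rest) = splitAt k bs
          cp         = foldl step (fromIntegral lead) cs
      in if length cs == k && all isCont cs && cp <= 0x10FFFF
           then (chr cp :) <$> decodeUtf8 rest
           else Nothing

    isCont :: Word8 -> Bool
    isCont c = c .&. 0xC0 == 0x80

    step :: Int -> Word8 -> Int
    step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)

-- Fall back to one byte per code point (ISO-8859-1) if UTF-8 decoding fails.
decodeLenient :: [Word8] -> String
decodeLenient bytes =
  fromMaybe (map (chr . fromIntegral) bytes) (decodeUtf8 bytes)
```

On the bytes from the dump, `decodeUtf8 [0x66, 0xF8, 0x62]` is `Nothing`, because F8 is not a valid UTF-8 lead byte. That matches the kind of failure reported for rsvg. `decodeLenient` on the same bytes returns `"f\248b"`.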

I'd like to know whether ISO-8859-1 will always be used for .hp files, whether it is a misfeature and UTF-8 will be used in future, or whether the output will eventually follow the current locale settings.

I didn't find any documentation on character encoding here: http://www.haskell.org/ghc/docs/latest/html/users_guide/prof-heap.html

Change History (3)

comment:1 Changed 8 years ago by simonmar

Milestone: 7.4.1
Owner: set to simonmar
Priority: normal → high

I'll look into it. I suspect we should be using UTF-8.

comment:2 Changed 8 years ago by marlowsd@…

commit 630b89551b14324fb1bfea853be700d8f32106c2

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Fri Nov 4 16:02:17 2011 +0000

    Cost centre names are now in UTF-8 (#5559)
    
    So the .prof file will be UTF-8.  This is mostly ok, except that the
    RTS doesn't calculate the column widths correctly (it assumes bytes =
    chars).
    
    hp2ps doesn't do anything sensible with Unicode strings, it just dumps
    the bytes into the .ps file.

 compiler/codeGen/CgProf.hs        |    8 +++++---
 compiler/codeGen/StgCmmProf.hs    |    8 +++++---
 compiler/profiling/CostCentre.lhs |   16 ++++++++++------
 3 files changed, 20 insertions(+), 12 deletions(-)
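
The column-width caveat in the commit message comes from counting bytes where characters are meant. The RTS itself is written in C; the Haskell sketch below only illustrates the difference, with hypothetical helper names (`utf8Length`, `padUtf8`). It assumes well-formed UTF-8 input and counts code points, which is still only an approximation of display width for combining or wide characters.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Number of code points in well-formed UTF-8: count every byte that is
-- not a continuation byte (10xxxxxx).
utf8Length :: [Word8] -> Int
utf8Length = length . filter (\b -> b .&. 0xC0 /= 0x80)

-- Pad a UTF-8 encoded name with spaces to `width` characters (not bytes).
-- Names already at or beyond the width are returned unchanged.
padUtf8 :: Int -> [Word8] -> [Word8]
padUtf8 width name = name ++ replicate (width - utf8Length name) 0x20
```

For the UTF-8 bytes of `føb` (`66 c3 b8 62`), `length` gives 4 but `utf8Length` gives 3, so padding by byte count would leave that column one character short.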

comment:3 Changed 8 years ago by simonmar

Component: Documentation → Profiling
Resolution: fixed
Status: new → closed
Test Case: profiling/T5559

Test added.
