Ticket #20 (closed defect: fixed)

Opened 6 years ago

Last modified 17 months ago

We don't handle non-ASCII characters in doc comments

Reported by: waern
Owned by:
Priority: major
Milestone:
Version:
Keywords:
Cc: ddssff@…, pho@…

Description

We don't handle non-ASCII characters in doc comments. (Do we need to specify the encoding in the generated HTML too?)

Attachments

Haddock-Unicode.patch (4.9 kB) - added by batterseapower 19 months ago.
GHC-Haddock-Unicode.patch (3.0 kB) - added by batterseapower 19 months ago.

Change History

  Changed 6 years ago by ross

To handle non-ASCII characters in the source, you need to decide which encoding it is in. There is the encoding-independent workaround of using &#nnn; in the source.

The generated HTML doesn't need an encoding declaration, as non-ASCII characters are rendered as numeric entities by stringToHtmlString.
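
For illustration, the &#nnn; workaround means writing numeric character references by hand in the source comment. A minimal standalone sketch of that mapping in Haskell (illustration only, not Haddock's stringToHtmlString):

import Data.Char (ord)

-- Replace every non-ASCII character with its numeric character reference.
toNumericEntities :: String -> String
toNumericEntities = concatMap escape
  where
    escape c
      | ord c < 128 = [c]
      | otherwise   = "&#" ++ show (ord c) ++ ";"

-- ghci> toNumericEntities "café"
-- "caf&#233;"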

  Changed 5 years ago by david48

  • version changed from 0.x to 2.4.2

The workaround of using &#nnn; in the source is not usable: the comment becomes totally unreadable, which is a real problem in the case of comments in a foreign language.

Haddock should at least be able to handle UTF-8 encoding of the source file without mangling the HTML output.

  Changed 4 years ago by yuriks

  • priority changed from minor to major
  • version 2.4.2 deleted

This is a major pain in the ass for anyone who isn't coding in English; I'm bumping this up.

  Changed 4 years ago by leonelfl

Non-English programmers need this.

While not related to Haskell language capabilities, having tools that work universally gives credibility to the whole platform.

UTF-8 support is necessary. It must be stressed that other people are programming, explaining their programs, and writing interfaces in languages other than English. They do this naturally and expect to do so without any inconvenience.

Let's stop thinking that Haskell is just for Haskell programmers who program for fun and are willing to show each other results in English (the lingua franca). Haskell Platform components need to be usable in environments whose purpose is not Haskell itself.

follow-up: ↓ 6   Changed 4 years ago by ppavel

I vote for this. I'm willing to hack but will need some directions to get started.

in reply to: ↑ 5   Changed 4 years ago by waern

Replying to ppavel:

I vote for this. I'm willing to hack but will need some directions to get started.

Hi Pavel,

I've looked at this briefly and I think it could be related to the fact that we use alexGetChar in the GHC lexer where we should use alexGetChar' instead. You could try changing that and see if it helps.

The lexer is in compiler/parser/Lexer.x in the GHC source tree. Look for functions that read Haddock comments such as multiline_doc_comment, nested_doc_comment, etc.

  Changed 3 years ago by dsf

  • cc ddssff@… added

  Changed 3 years ago by dsf

I don't think alexGetChar' exists any more.

  Changed 3 years ago by PHO

  • cc pho@… added

I vote for this too. Personally I stick to using English in docs even though my native language is Japanese, but I'm really fond of UnicodeSyntax. I want to use UnicodeSyntax in code examples, not only in the code itself.
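
For example, what PHO describes would look roughly like this (a hypothetical Example module; the non-ASCII arrows inside the comment's code block are exactly the kind of characters this ticket is about):

{-# LANGUAGE UnicodeSyntax #-}
module Example (total) where

-- | Sum a list. The code example in this comment uses UnicodeSyntax as well:
--
-- > total ∷ [Int] → Int
-- > total = sum
total ∷ [Int] → Int
total = sum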

  Changed 3 years ago by simonmar

Alex 3 can lex UTF-8 directly, which might make this easier. I made the changes to Haddock to make it work with Alex 3, but I didn't add Unicode support at the time, because I wanted to keep it working with Alex 2.

  Changed 3 years ago by waern

Simon,

I made modifications to the GHC lexer so that Unicode characters are preserved in the comments fed to the Haddock lexer. I then tested with a simple Unicode comment and I can see that it appears in the documentation without getting mangled by the Haddock lexer.

However I'm assuming by your last comment that something still needs to be done in the Haddock lexer for this to work 100%. Do you think we could drop compatibility with Alex 2 by now, and if so could you explain what needs to be done in the lexer?

  Changed 3 years ago by simonmar

The comments from GHC are lexed again by Haddock using an Alex lexer, and I would expect that step to mangle the Unicode. From src/Lex.x:

alexGetByte :: AlexInput -> Maybe (Word8,AlexInput)
alexGetByte (p,c,[]) = Nothing
alexGetByte (p,_,(c:s))  = let p' = alexMove p c
                           in p' `seq` Just (fromIntegral (ord c), (p', c, s))

-- for compat with Alex 2.x:
alexGetChar :: AlexInput -> Maybe (Char,AlexInput)
alexGetChar i = case alexGetByte i of
                  Nothing     -> Nothing
                  Just (b,i') -> Just (chr (fromIntegral b), i')

You can see we apply ord in alexGetByte and chr again in alexGetChar, so Unicode should be squashed to the low 8 bits.
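
A quick standalone way to see the effect (not Haddock code, just the arithmetic): forcing a character through a single Word8 and back keeps only its low 8 bits, so anything outside Latin-1 comes back as a different character.

import Data.Char (chr, ord)
import Data.Word (Word8)

-- Round-trip a character through one byte, as the lexer effectively does.
squash :: Char -> Char
squash c = chr (fromIntegral (fromIntegral (ord c) :: Word8))

main :: IO ()
main = print (map squash "Время")   -- prints "\DC2@5<O", not "Время"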

  Changed 3 years ago by selinger

I agree that this should be fixed. It would be better to assume that all files are UTF-8 than to assume that all files are ASCII.

Either way, users of another encoding first have to do an offline conversion before invoking Haddock. But conversion from, say, Latin-1 to UTF-8 is trivial, whereas conversion from Latin-1 to ASCII with HTML entities requires offline parsing: non-ASCII characters in Haddock comments must be converted to HTML entities, while non-ASCII characters in the code itself must be converted to something else (UTF-8?), because Haddock will croak if it encounters an HTML entity in the code.
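
For what it's worth, the Latin-1 to UTF-8 conversion really is a one-liner with a standard tool (file names here are just examples):

$ iconv -f LATIN1 -t UTF-8 Foo.hs > Foo.utf8.hs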

Moreover, the current HTML entities encoding does not even work correctly; see bug #191.

follow-up: ↓ 15   Changed 3 years ago by SimonHengel

I can reproduce this with Haddock 2.9.2, but the version of Haddock that ships with GHC 7.4.0.20111219 produces proper HTML entities for code points outside the ASCII range.

Are there still any issues left? And if so, what would a minimal test case look like?

in reply to: ↑ 14   Changed 21 months ago by adzeitor

Replying to SimonHengel:

I can reproduce this with Haddock 2.9.2, but the version of Haddock that ships with GHC 7.4.0.20111219 produces proper HTML entities for code points outside the ASCII range. Are there still any issues left? And if so, what would a minimal test case look like?

Haddock version 2.12.0

-- | Это модуль mytime
module MyTime (Time(..),testFunc) where


-- ^  Тип данных время
data Time = Time{ hour :: Int -- ^ Часы
                , mins  :: Int -- ^ Минуты
                }
          deriving(Show)

-- |Тестовая функция, которая всегда возвращает 42
testFunc :: String -- ^ строка
            -> Int -- ^ возвращает число
testFunc x = 42

$ haddock 3.hs -html
Haddock coverage:
doc comment parse failed:   Тип данных время
doc comment parse failed: Тестовая функция, которая всегда возвращает 42
doc comment parse failed:  строка
doc comment parse failed:  возвращает число
  33% (  1 /  3) in 'MyTime'

Changed 19 months ago by batterseapower

  • attachment Haddock-Unicode.patch added

Changed 19 months ago by batterseapower

  • attachment GHC-Haddock-Unicode.patch added

  Changed 19 months ago by batterseapower

These patches implement support for this in Haddock by using Alex 3's native Unicode support.
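
Roughly, the Alex 3 approach keeps the input as a Char stream and UTF-8-encodes each character on demand inside alexGetByte, so the DFA sees well-formed multi-byte sequences instead of truncated code points. A sketch of that pattern, modelled on Alex's standard wrappers rather than on the patches themselves:

import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as its UTF-8 byte sequence.
utf8Encode :: Char -> [Word8]
utf8Encode = map fromIntegral . go . ord
  where
    go c
      | c <= 0x7f   = [c]
      | c <= 0x7ff  = [ 0xc0 + (c `shiftR` 6)
                      , 0x80 + (c .&. 0x3f) ]
      | c <= 0xffff = [ 0xe0 + (c `shiftR` 12)
                      , 0x80 + ((c `shiftR` 6) .&. 0x3f)
                      , 0x80 + (c .&. 0x3f) ]
      | otherwise   = [ 0xf0 + (c `shiftR` 18)
                      , 0x80 + ((c `shiftR` 12) .&. 0x3f)
                      , 0x80 + ((c `shiftR` 6) .&. 0x3f)
                      , 0x80 + (c .&. 0x3f) ]

-- The input carries any bytes of the current character that the lexer
-- has not consumed yet.
type Input = (Char, [Word8], String)   -- (previous char, pending bytes, remaining input)

getByte :: Input -> Maybe (Word8, Input)
getByte (prev, b:bs, s)  = Just (b, (prev, bs, s))
getByte (_,    [],   []) = Nothing
getByte (_,    [],  c:s) = case utf8Encode c of
                             b:bs -> Just (b, (c, bs, s))
                             []   -> Nothing  -- utf8Encode never returns []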

  Changed 17 months ago by waern

  • status changed from new to closed
  • resolution set to fixed