Opened 4 years ago

Closed 18 months ago

Last modified 18 months ago

#10412 closed bug (fixed)

isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Reported by: Artyom.Kazak
Owned by: Azel
Priority: normal
Milestone: 8.6.1
Component: libraries/base
Version: 7.10.1
Keywords: unicode, newcomer
Cc: hvr, ekmett, lelf
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s): Phab:D4593
Wiki Page:

Description

> isMark '\768'
True

> isAlphaNum '\768'
True

> (isAlpha '\768', isNumber '\768')
(False,False)

This behavior comes from this piece in WCsubst.c:

unipred(u_iswalnum,(GENCAT_LT|GENCAT_LU|GENCAT_LL|GENCAT_LM|GENCAT_LO|
		    GENCAT_MC|GENCAT_ME|GENCAT_MN|
		    GENCAT_NO|GENCAT_ND|GENCAT_NL))

I'm not sure what should be done here. Is it a bug in isAlphaNum? Or in isAlpha? How does it correspond to iswalnum's behavior in C++?

(And if it's a feature and not a bug, then it should definitely be documented.)
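To make the category mask quoted from WCsubst.c easier to reason about, here is a minimal Haskell sketch that models it with Data.Char.generalCategory. The names oldAlnumCategories and oldIsAlphaNum are hypothetical and exist only for illustration; the real predicate is implemented in C, and this is only a model of the mask as quoted above.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical model of the mask passed to unipred(u_iswalnum, ...).
-- These names are illustrative and are not part of base.
oldAlnumCategories :: [GeneralCategory]
oldAlnumCategories =
  [ TitlecaseLetter, UppercaseLetter, LowercaseLetter   -- GENCAT_LT, LU, LL
  , ModifierLetter, OtherLetter                         -- GENCAT_LM, LO
  , SpacingCombiningMark, EnclosingMark, NonSpacingMark -- GENCAT_MC, ME, MN
  , OtherNumber, DecimalNumber, LetterNumber            -- GENCAT_NO, ND, NL
  ]

-- True when the character's general category is in the modelled mask.
oldIsAlphaNum :: Char -> Bool
oldIsAlphaNum c = generalCategory c `elem` oldAlnumCategories
```

Under this model, generalCategory '\768' is NonSpacingMark, so oldIsAlphaNum '\768' is True, which matches the session at the top of the description.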

Change History (11)

comment:1 Changed 4 years ago by hvr

For the record, this was already an issue in earlier releases, from GHC 7.0.4 through GHC 7.8.4:

GHCi, version 7.0.4: http://www.haskell.org/ghc/  :? for help
λ> import Data.Char 
λ> length $ filter isMark  $ filter (\c -> isAlphaNum c /= (isAlpha c && isNumber c)) ['\0'..]
1281
GHCi, version 7.8.4: http://www.haskell.org/ghc/  :? for help
λ> import Data.Char 
λ> length $ filter isMark  $ filter (\c -> isAlphaNum c /= (isAlpha c && isNumber c)) ['\0'..]
1498
GHCi, version 7.10.1.20150511: http://www.haskell.org/ghc/  :? for help
λ> import Data.Char 
λ> length $ filter isMark  $ filter (\c -> isAlphaNum c /= (isAlpha c && isNumber c)) ['\0'..]
1830
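The query above compares isAlphaNum against isAlpha c && isNumber c. As far as I can tell, no mark character satisfies isAlpha or isNumber, so && and || both evaluate to False on marks and the counts are unaffected. A sketch of the same check written with || and parameterised over the predicate under test follows; markDisagreements is a hypothetical helper, not part of base.

```haskell
import Data.Char (isAlpha, isMark, isNumber)

-- Hypothetical helper (not part of base): counts the mark characters on
-- which a candidate predicate disagrees with "isAlpha or isNumber".
markDisagreements :: (Char -> Bool) -> Int
markDisagreements p =
  length
    [ c
    | c <- [minBound .. maxBound]
    , isMark c
    , p c /= (isAlpha c || isNumber c)
    ]
```

Evaluating markDisagreements isAlphaNum in GHCi should correspond to the query in this comment. The result depends on the GHC version and on the Unicode tables it ships with.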

comment:2 Changed 20 months ago by lelf

Cc: lelf added

comment:3 Changed 20 months ago by bgamari

Keywords: newcomer added

comment:4 Changed 20 months ago by sighingnow

GENCAT_MC|GENCAT_ME|GENCAT_MN has been included in u_iswalnum for more than 10 years. However, the documentation of isAlphaNum says "Selects alphabetic or numeric digit Unicode characters" and doesn't mention "mark" characters.

Should we fix the documentation of isAlphaNum to include "mark" characters or keep the documentation as it is and fix u_iswalnum?

comment:5 Changed 18 months ago by Azel

From what I can see in various C and C++ documentation (e.g. Microsoft's, glibc's, or cppreference.com's), iswalnum should return True if either iswalpha or iswdigit does, so I guess isAlphaNum ought to do the same. That is, keep the documentation as it is and fix u_iswalnum.

comment:6 Changed 18 months ago by Azel

Looking a bit farther afield, every language I have found that has an isAlphaNum equivalent defines it as returning True if either its isAlpha or its isNumber equivalent does (e.g. Java, the .NET Framework, Common Lisp, Python, or Ada; Python's documentation lists three number-matching functions in isalnum's description, but the first two are subsumed by the third). So I'm willing to have a go at solving this ticket, and I would be in favour of fixing u_iswalnum and keeping the doc mostly as it is. The doc states that isAlphaNum selects alphabetic or numeric digit Unicode characters, but even with the mark characters removed the function would not match only those, because it also matches GENCAT_NO and GENCAT_NL.
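The semantics argued for in this comment can be sketched in Haskell as follows. This is an illustration of the proposal only, not the patch itself, and the names isAlphaNumProposed and isNonDigitNumber are hypothetical.

```haskell
import Data.Char (isAlpha, isDigit, isNumber)

-- Hypothetical name, not part of base: alphanumeric means
-- "alphabetic or numeric", with no special case for mark characters.
isAlphaNumProposed :: Char -> Bool
isAlphaNumProposed c = isAlpha c || isNumber c

-- Numeric characters that are not ASCII digits; this includes
-- characters in the No and Nl general categories.
isNonDigitNumber :: Char -> Bool
isNonDigitNumber c = isNumber c && not (isDigit c)
```

Under this definition '\768' is rejected, while '\189' (vulgar fraction one half, category No) and '\8551' (Roman numeral eight, category Nl) are accepted even though neither is a digit. That is why the "numeric digit" wording in the documentation is too narrow.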

comment:7 Changed 18 months ago by Azel

Owner: set to Azel

comment:8 Changed 18 months ago by Azel

Differential Rev(s): Phab:D4593
Status: new → patch

comment:9 Changed 18 months ago by Ben Gamari <ben@…>

In a26983a3/ghc:

Fixes isAlphaNum re. isAlpha/isNumber and doc fix (trac issue #10412)

Corrects the inconsistency between Data.Char.isAlphaNum,
Data.Char.isAlpha and Data.Char.isNumber. Indeed, isAlphaNum was
returning True not only when isAlpha or isNumber returned True but
also when isMark did. The selectors for the Mn, Mc and Me general
categories were removed from the macro generating u_iswalnum in
ubconfc.

Also, Data.Char.isAlphaNum's documentation was changed to state that
isAlphaNum returns true not only for Unicode number digits but for
Unicode numbers in general in Unicode.hs.

Signed-off-by: ARJANEN Loïc Jean David <arjanen.loic@gmail.com>

Reviewers: hvr, ekmett, lelf, bgamari

Reviewed By: bgamari

Subscribers: thomie, carter

GHC Trac Issues: #10412

Differential Revision: https://phabricator.haskell.org/D4593

comment:10 Changed 18 months ago by bgamari

Milestone: 8.6.1
Resolution: fixed
Status: patch → closed

comment:11 Changed 18 months ago by Ben Gamari <ben@…>

In da74385/ghc:

base: Add a test for T10412

Expects the current behavior, will be updated by D4593 to reflect
desired behavior.

Reviewers: hvr

Subscribers: thomie, carter

GHC Trac Issues: #10412

Differential Revision: https://phabricator.haskell.org/D4610