Opened 11 years ago

Closed 9 years ago

#3307 closed bug (fixed)

System.IO and System.Directory functions not Unicode-aware under Unix

Reported by: YitzGale Owned by:
Priority: normal Milestone: 7.2.1
Component: libraries/base Version: 6.11
Keywords: directory unicode Cc: batterseapower@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Under Unix, file paths are represented as raw bytes in a String. That is not user-friendly, because a String is supposed to be decoded Unicode, and it is conventional in Unix to view those raw bytes as encoded according to the current locale. In addition, this is not consistent with Windows, where file paths are natively Unicode and represented as such in the String. (Well, they will be consistent once #3300 is completed.)

On the other hand, this raises various complications (what about encoding errors, and what if encode . decode is not the identity due to normalisation, etc.).

The following cases ought to work consistently for all file operations in System.IO and System.Directory:

  • A FilePath from getArgs
  • A FilePath from getDirectoryContents
  • A FilePath in Unicode from a String literal
  • A FilePath read from a Handle and decoded into Unicode

See discussion in the thread http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html
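
For example, a small portable program along the following lines (an illustrative sketch, not code from the ticket) is exactly the kind of code that only behaves sensibly if all of the FilePaths above agree on how, or whether, the underlying bytes are decoded:

import Control.Monad (filterM, forM_)
import System.Directory (doesFileExist, getDirectoryContents)
import System.Environment (getArgs)

main :: IO ()
main = do
  [dir] <- getArgs                       -- a FilePath from the command line
  names <- getDirectoryContents dir      -- FilePaths from the filesystem
  files <- filterM doesFileExist [ dir ++ "/" ++ n | n <- names ]
  forM_ files $ \f -> do                 -- use those FilePaths to open files
    s <- readFile f
    putStrLn (f ++ ": " ++ show (length s) ++ " characters")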

Change History (14)

comment:1 Changed 11 years ago by YitzGale

This change needs to be coordinated with #3309 ("getArgs should return Unicode on Unix") so that it will still work to read file paths from the command line and use them to access files.

comment:2 in reply to:  description ; Changed 11 years ago by duncan

Replying to YitzGale:

Under Unix, file paths are represented as raw bytes in a String. That is not user-friendly, because a String is supposed to be decoded Unicode, and it is conventional in Unix to view those raw bytes as encoded according to the current locale.

Unfortunately it is not conventional on Unix to interpret file names as Unicode, decoded from the current locale. When presenting file names to the user in a user interface some decoding is necessary, though there is not universal agreement that the locale is the right one. For example glib uses UTF-8 always, unless you set some special env var to tell it to use the current locale (the latter is considered a compatibility hack that will be phased out).

Certainly it's the case that FilePath as a Haskell String is not accurate for Unix paths (though it is for Windows and OSX). Something more accurate would be (an ADT containing) a pair of the original binary filename and a human-readable Unicode String decoding of it. It needs both because the decoding may be lossy. On Windows and OSX the binary part would not be needed because they use Unicode natively.

The problem with making getArgs and openFile return Unicode is it may be impossible to open certain files passed on the command line (those for which the decoding is lossy).

I would argue the solution is to move FilePath to being opaque, rather than towards it being properly interpreted as a Haskell Unicode String.
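
Roughly, the richer type described above might look like this on Unix (a sketch only; the type and field names are invented for illustration, and on Windows and OSX the fpBytes part would not be needed):

import Data.Word (Word8)

-- Hypothetical richer file path type for Unix: keep the exact bytes the
-- filesystem uses alongside a best-effort, possibly lossy, Unicode rendering.
data PortableFilePath = PortableFilePath
  { fpBytes :: [Word8]  -- the original binary file name, used for actual I/O
  , fpText  :: String   -- human-readable decoding, used only for display
  }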

comment:3 in reply to:  2 ; Changed 11 years ago by YitzGale

Replying to duncan:

Unfortunately it is not conventional on Unix to interpret file names as Unicode, decoded from the current locale.

AFAIK shells running in all modern vterms and xterms display them this way.

For example glib uses UTF-8 always, unless you set some special env var to tell it to use the current locale (the latter is considered a compatibility hack that will be phased out).

Oh really? Is that because we can soon assume that all locales are UTF-8? If so, it makes our work easier, as Ketil pointed out.

What does Qt do?

Something more accurate would be (an adt...

Yes, a richer type would be a tremendous help. But simonmar has pointed out that it would break H98 compatibility, so it doesn't seem to be an option.

The problem with making getArgs and openFile return Unicode is it may be impossible to open certain files passed on the command line (those for which the decoding is lossy).

On the other hand, they are decoded on other platforms. We don't want to make it impossible to write platform-independent code for any program that reads its args.

Would that actually happen for users using any normal UI and any normal input method? It has always been possible in Unix to create weird file names that are very difficult to deal with, but it won't happen in normal usage. We can provide a Unix-specific hack for the odd case.

comment:4 in reply to:  3 Changed 11 years ago by duncan

Replying to YitzGale:

A good reference on what glib does and recommends is here: http://library.gnome.org/devel/glib/stable/glib-Character-Set-Conversion.html See the description section, after the synopsis.

The problem with making getArgs and openFile return Unicode is it may be impossible to open certain files passed on the command line (those for which the decoding is lossy).

On the other hand, they are decoded on other platforms.

They use the native Unicode representation on other platforms. I don't see that that is an argument to use a non-native representation on Unix platforms.

We don't want to make it impossible to write platform-independent code for any program that reads its args.

Unfortunately as it stands it is impossible for platform-independent code to have both of these properties simultaneously:

  • Read all files passed on the command line
  • Display file names to humans accurately in a user interface.

Currently we get the first property and you're proposing to drop that and switch to the second.

It's pretty well ingrained that FilePath is the type for specifying files, e.g. to open them (it's specified by H98). It's a much more recent problem that we want to display Unicode file names in user interfaces. For portable code, how about we add a function:

filePathToString :: FilePath -> String

On Unix this would decode. On Windows and OSX it'd be the identity since on those platforms the string would already have been decoded.

It means we treat FilePath as if it were an ADT (with differing representation on different platforms) but without actually switching to an opaque type.

Would that actually happen for users using any normal UI and any normal input method?

Generating new names is not a huge problem. The user selects a name in Unicode, and if the conversion to a FilePath is impossible or lossy then the user can be prompted to select a different name. Note that this does need another function:

filePathFromString :: String -> Maybe FilePath
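
For illustration, such a pair of conversions might be sketched for Unix using GHC.Foreign and getLocaleEncoding from a modern base. The names and signatures below are assumptions made for the sketch (results live in IO, and failures surface as exceptions rather than Maybe); this shows the idea, not the interface proposed above:

import Data.Char (chr, ord)
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (getLocaleEncoding)

-- On Unix today a FilePath is a String whose Chars are really raw bytes.
toBytes :: FilePath -> [Word8]
toBytes = map (fromIntegral . ord)

fromBytes :: [Word8] -> FilePath
fromBytes = map (chr . fromIntegral)

-- Decode the raw bytes with the current locale for display; this may throw
-- on bytes that the locale encoding cannot decode (the lossy case).
filePathToDisplayString :: FilePath -> IO String
filePathToDisplayString fp = do
  enc <- getLocaleEncoding
  withArrayLen (toBytes fp) $ \len ptr ->
    GHC.peekCStringLen enc (castPtr ptr, len)

-- Encode a human-entered name back into raw bytes; this may throw if some
-- character is not representable in the locale encoding.
filePathFromDisplayString :: String -> IO FilePath
filePathFromDisplayString s = do
  enc <- getLocaleEncoding
  GHC.withCStringLen enc s $ \(ptr, len) ->
    fmap fromBytes (peekArray len (castPtr ptr))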

It has always been possible in Unix to create weird file names that are very difficult to deal with, but it won't happen in normal usage. We can provide a Unix-specific hack for the odd case.

The most frustrating thing for a user would be selecting a file, having the app read it, but be unable to save back to the exact same file because of lossy decoding. That's why such apps are supposed to keep the real file name and translate it into a String only for display; they must hold on to the original name because the decoding can be lossy.

Unfortunately that's not a case we can just provide Unix-specific hacks for; it can happen for almost any portable app. E.g. consider apps that translate .foo files into .bar files (like, say, a compiler or preprocessor). If we decode filename.foo into Unicode but it's a lossy conversion, then saving filename.bar may work, but the file names will no longer correspond, which could break things (think chars replaced by '?').

So my suggestion basically is, keep FilePath as a file path, and convert to/from String for human consumption.

comment:5 Changed 11 years ago by igloo

difficulty: Unknown
Milestone: 6.14.1

comment:6 Changed 9 years ago by igloo

Milestone: 7.0.1 → 7.0.2

comment:7 Changed 9 years ago by igloo

Milestone: 7.0.2 → 7.2.1

comment:8 Changed 9 years ago by batterseapower

Type of failure: None/Unknown

I have been investigating this issue and would like to add some observations.

  • Python 2 does what Haskell does at the moment: it reads command line arguments in as byte strings, and exposes them to the programmer as byte strings. This is consistent with the fact that Python strings aren't "really" a text type, and unicode strings are a separate type.
  • Python 3 changed the behaviour to match its string type being a "real" text type. Now, command line arguments are decoded into Unicode according to the current locale for internal consumption. See the relevant issue at http://bugs.python.org/issue2128
  • Passing command line arguments encoded in anything other than the current locale encoding is weird and fragile. Here is some weirdness I discovered.

First, we create a file with a Big5-encoded name. Set your terminal to decode using Big5 and then:

LC_ALL=zh_TW.big5 bash
touch zw你好 # Be careful here that your IME actually outputs Big5 bytes into a Big5 terminal. It did for me on Ubuntu but not on OS X

As expected, this name will work nicely if we ls. This reflects the fact that Unix stores the file with exactly the Big5 encoded name that we gave it, so when we ls it decodes perfectly in the Big5 terminal.

Now open another terminal set for UTF-8. Assuming your default locale is UTF-8 as well, we can try some fun experiments. First, I wrote a program called encoding.c that lets me observe the command line. Compile this file to ./bytes:

#include <stdio.h>

/* Print the byte values of the first command line argument, so we can see
   exactly what the shell passed to the program. */
int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    for (char *c = argv[1]; *c; c++, len++) {
        /* Bytes >= 0x80 may print as negative numbers where char is signed. */
        printf("%d ", (int)(*c));
    }

    printf("\nLength: %d\n", len);

    return 0;
}

Now for the fun:

  1. ls. You should see some gibberish for the "zw" file because the Big5 doesn't get decoded cleanly as UTF-8 by your terminal. I saw the literal string "zw?A?n" printed.
  2. Type "./bytes zw" and then press tab, then enter. You will get 6 bytes printed because 你好 is 4 bytes long in Big5.
  3. Type "./bytes zw?A?n". Use literal question marks. This is where it gets really weird. The output is *exactly the same as before*. Bash has somehow detected that I "meant" to refer to the file in the current working directory and decided to substitute my 6 bytes of ASCII text (all characters <128) with the Big5 from before (which contains some characters >= 128). I have no idea what happens if the choice of filename is ambiguous. If you rm the file this stops happening, obviously.
  4. Type "./bytes foo=zw" and then press tab and enter. You get 10 bytes: 4 bytes for the Chinese and 6 bytes for the ASCII.
  5. Type "./bytes foo=zw?A?n", with literal question marks. It shows *10 bytes of ASCII*. So Bash's weird encoding-fixing heuristic fails if command line arguments are more complex than just a file name by itself.

In my opinion this is absolutely bonkers behaviour :-).

IMHO C programs should be able to assume all of their command line arguments are in the same encoding - that of the current locale. But with this bash behaviour, some arguments will be in the locale encoding and some will be in another encoding (this happens when tab-completing a filename that is in a non-locale encoding, or when Bash's heuristics automatically rewrite something the user typed into such a filename). The user can't even necessarily predict in advance which arguments will be which, because Bash's heuristic depends on at least the contents of the CWD!

I would like to argue that we should follow the Python 3 behaviour, and not support file names passed to the command line in any encoding other than the current locale. The reasons are:

  1. Support for this scenario is sort-of-but-not-quite there in other tools, including wildly-popular ones such as bash. So if it doesn't really work at the moment, we aren't causing much trouble by having Haskell not support it.
  2. The very popular language Python 3 has exactly the behaviour I propose and (apparently) no one has complained yet.
  3. Most importantly, not making this choice means that we don't do natural things like use the current locale to decode command line arguments. This penalises users of modern systems (i.e. those with UTF-8 everywhere) who expect international text to work seamlessly, for the sake of supporting a very small group of legacy users (those who use non-UTF-8 encodings on non-Windows, non-OS X systems).

comment:9 Changed 9 years ago by batterseapower

Cc: batterseapower@… added

comment:10 in reply to:  8 Changed 9 years ago by simonmar

Replying to batterseapower:

I would like to argue that we should follow the Python 3 behaviour, and not support file names passed to the command line in any encoding other than the current locale.

Seems reasonable to me.

comment:11 in reply to:  8 ; Changed 9 years ago by tsuyoshi

Replying to batterseapower:

  1. Type "./bytes zw?A?n". Use literal question marks. This is where it gets really weird. The output is *exactly the same as before*. Bash has somehow detected that I "meant" to refer to the file in the current working directory and decided to substitute my 6 bytes of ASCII text (all characters <128) with the Big5 from before (which contains some characters >= 128). I have no idea what happens if the choice of filename is ambiguous. If you rm the file this stops happening, obviously.

This has almost nothing to do with character encoding. It happens because the question mark is a special character in shell filename expansion (a single-character wildcard). Apparently in your case Bash matches each question mark against one byte, not one character.

  1. Type "./bytes foo=zw?A?n", with literal question marks. It shows *10 bytes of ASCII*. So Bash's weird encoding-fixing heuristic fails if command line arguments are more complex than just a file name by itself.

In this case, the shell tries to expand “foo=zw?A?n” to a file name. Since there is no such file, Bash passes this argument as it is.

comment:12 in reply to:  11 Changed 9 years ago by batterseapower

Replying to tsuyoshi:

This has almost nothing to do with character encoding. It happens because a question mark happens to be a special character in shell filename expansion (wildcard). Apparently in your case Bash substitutes each question mark to one byte, not one character.

That is interesting, thanks - I've never seen ? used as a shell wildcard. It's certainly a more reassuring explanation than what I wrongly thought was going on!

However, I don't think this changes the argument as to how we should decode the command line. As I see it, the only reasonable thing to do is assume that:

  1. All of argv has the same encoding
  2. The only reasonable encoding to pick is the locale encoding, as that should match the terminal's encoding and hence the encoding in which typed user input will arrive

It is unfortunate that Bash will tab-complete filenames without regard for the current encoding, thus creating a command line that may have mixed-encoding data with no way to tell which bit is which.

comment:13 Changed 9 years ago by batterseapower

Hmm, this is trickier than I thought. Python 3 still provides a way for the dedicated programmer to support filenames that are not decodable in the current locale, by using "surrogate escapes" to tunnel undecodable bytes through strings -- see PEP 383 (http://www.python.org/dev/peps/pep-0383/)

The implications of PEP 383 are far reaching and I'm not sure that I want to implement it, but its existence has I think weakened the case for decoding-by-default. This is a real shame because we so clearly *do* want to decode any *text* entered on the command line by using the current locale.
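
For reference, the core trick of PEP 383 is tiny; here is a rough sketch of the byte-escaping idea in Haskell (an illustration of the PEP itself, not of anything GHC does):

import Data.Char (chr, ord)
import Data.Word (Word8)

-- PEP 383: a byte that fails to decode is smuggled through the String as a
-- lone surrogate in the range U+DC80..U+DCFF (only bytes >= 0x80 can fail).
escapeByte :: Word8 -> Char
escapeByte b = chr (0xDC00 + fromIntegral b)

-- When encoding back to bytes, a lone surrogate in that range becomes the
-- original undecodable byte again, so the round trip is lossless.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | ord c >= 0xDC80 && ord c <= 0xDCFF = Just (fromIntegral (ord c - 0xDC00))
  | otherwise                          = Nothing

(The far-reaching part is making sure that Strings containing such lone surrogates survive everything else a program might do with them, e.g. writing them to a Handle.)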

Anyway, my patch for #3309 to implement locale-decoding behaviour for the CString FFI functions is still useful, even if we don't actually want to use the *CString family for filename decoding. I'm validating it now.

comment:14 Changed 9 years ago by batterseapower

Resolution: fixed
Status: new → closed