Proposal: Add 'text' to the Haskell Platform
Proposal Author: Don Stewart
Maintainer: Bryan O'Sullivan (submitted with his approval)
This is a proposal for the 'text' package to be included in the next major release of the Haskell Platform.
Everyone is invited to review this proposal, following the standard procedure for proposing and reviewing packages.
Review comments should be sent to the libraries mailing list by October 1 so that we have time to discuss and resolve issues before the final deadline for a call for consensus in early November.
Proposal author and package maintainer: Bryan O'Sullivan, originally by Tom Harper, based on ByteString and Vector (fusion) packages.
The following individuals contributed to the review process: Don Stewart, Johan Tibell
The 'text' package provides an efficient packed, immutable Unicode text type (both strict and lazy), with a powerful loop fusion optimization framework.
The 'Text' type represents Unicode character strings in a time- and space-efficient manner. This package provides text processing capabilities that are optimized for performance-critical use, both in terms of large data quantities and high speed.
The 'Text' type provides encoding-aware, type-safe case conversion via whole-string case conversion functions. It also provides a range of functions for converting Text values to and from 'ByteStrings', using several standard encodings (see the 'text-icu' package for a much larger variety of encoding functions). Efficient locale-sensitive support for text IO is also provided. The Data.Text module is intended to be imported qualified, to avoid name clashes with Prelude functions, e.g.
import qualified Data.Text as T
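As a minimal sketch of why the qualified import matters, the following program uses the Text and list versions of similarly named functions side by side. The helper normaliseName and the sample input are hypothetical and chosen only for illustration.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical helper for illustration: trim surrounding whitespace
-- and upper-case a name.
normaliseName :: T.Text -> T.Text
normaliseName = T.toUpper . T.strip

main :: IO ()
main = do
  let name = T.pack "  haskell platform "
  TIO.putStrLn (normaliseName name)
  -- T.length works on Text; the unqualified Prelude length still
  -- refers to the list function, so the two never clash.
  print (T.length (normaliseName name), length (T.words name))
```

Without the qualifier, names such as length, words, map and filter would be ambiguous between Data.Text and the Prelude.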
Documentation and tarball from the hackage page:
darcs get http://code.haskell.org/text/
All package requirements are met.
While Haskell's Char type is capable of representing Unicode code points, the String sequence of such Chars has some drawbacks that prevent its general use:
- Unicode-unaware case conversion (map toUpper is an unsafe case conversion)
- the representation is space inefficient.
- the data structure is element-level lazy, whereas a number of applications require some level of additional strictness
An intermediate solution to these was via 'Data.ByteString' (an efficient byte sequence type, that addresses points 2 and 3), which, when used in conjunction with utf8-string, provides very simple non-latin1 encoding support (though with significant drawbacks in terms of locale and encoding range).
The 'text' package addresses these shortcomings in a number of ways:
- support for whole-string case conversion (and thus type-correct Unicode transformations)
- a space and time efficient representation, based on unboxed Word16 arrays
- either fully strict, or chunk-level lazy data types (in the style of Data.ByteString)
- full support for locale-sensitive, encoding-aware IO.
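The difference between per-character and whole-string case conversion can be sketched as follows. The names upperString and upperText are illustrative only; the sample word "straße" is written with a numeric escape so the example does not depend on the source file encoding.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-character conversion on String: each Char maps to exactly one
-- Char, so a character with no single-character upper-case form
-- cannot be converted correctly.
upperString :: String -> String
upperString = map toUpper

-- Whole-string conversion on Text: the result is allowed to have a
-- different length from the input.
upperText :: T.Text -> T.Text
upperText = T.toUpper

main :: IO ()
main = do
  let s = "stra\223e"  -- "straße"
  putStrLn (upperString s)
  putStrLn (T.unpack (upperText (T.pack s)))
  -- Compare the lengths before and after whole-string conversion.
  print (length s, T.length (upperText (T.pack s)))
```

The whole-string version can expand the sharp s to "SS", which a Char -> Char function has no way to express.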
The 'text' library has rapidly become popular for a number of applications, and is used by more than 50 other Hackage packages. As of Q2 2010, 'text' is ranked 27th of roughly 2200 libraries (approximately the top 1% by popularity), and it is particularly widely used in web programming. It is used by:
- the blaze html pretty printing library
- the hstringtemplate file templating library
- *all* popular web frameworks: happstack, snap, salvia and yesod
- the hexpat and libxml xml parsers
The design is based on experience from Data.Vector and Data.ByteString:
- the underlying type is based on unpinned, packed arrays on the Haskell heap, with an ST interface for memory effects.
- pipelines of operations are optimized via conversion to and from the 'stream' abstraction
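The idea of converting to and from a stream can be sketched in a few lines. This is a simplified illustration in the spirit of the stream fusion paper, not the package's internal code: the names Step, Stream, stream, unstream, mapS and filterS are assumptions made for this example, and plain lists stand in for the packed arrays.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Char (isAlpha, toUpper)

-- One step of a stream: finished, skip without output, or yield a value.
data Step s a = Done | Skip s | Yield a s

-- A stream is a non-recursive stepper function plus a hidden seed state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Convert a list to a stream (stand-in for reading a packed array).
stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs

-- Convert a stream back to a list (stand-in for writing a packed array).
unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield x s' -> x : go s'

-- Transformers only wrap the stepper; they allocate no intermediate data.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'

filterS :: (a -> Bool) -> Stream a -> Stream a
filterS p (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s'
        | p x       -> Yield x s'
        | otherwise -> Skip s'

-- A pipeline expressed on streams: one traversal, one result.
pipeline :: String -> String
pipeline = unstream . mapS toUpper . filterS isAlpha . stream

main :: IO ()
main = putStrLn (pipeline "a1b2 c3")
```

In a fusion framework each public function is defined as unstream . f . stream, and a rewrite rule removes adjacent stream . unstream pairs so that a chain of operations collapses into a single loop like pipeline above.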
A large testsuite, with coverage data, is provided.
The API is broken into several logical pieces, which are self-explanatory:
- combinators for operating on strict, abstract 'text's.
- an equivalent API for chunk-level lazy 'text's.
- encoding transformations, to and from bytestrings.
- support for conversion to Ptr Word16.
- locale-aware IO layer.
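A short sketch of how the encoding and IO pieces fit together, assuming UTF-8 as the encoding. The helper names roundTrip and utf8Length are hypothetical; encodeUtf8, decodeUtf8 and Data.Text.IO.putStrLn are the package's own functions.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO

-- Encode to UTF-8 bytes and decode again. decodeUtf8 throws on
-- invalid input, which cannot occur here because the bytes were
-- produced by encodeUtf8.
roundTrip :: T.Text -> T.Text
roundTrip = TE.decodeUtf8 . TE.encodeUtf8

-- Number of bytes the text occupies when encoded as UTF-8.
utf8Length :: T.Text -> Int
utf8Length = B.length . TE.encodeUtf8

main :: IO ()
main = do
  let t = T.pack "na\239ve"  -- "naïve"
  -- Data.Text.IO writes using the handle's (locale-derived) encoding.
  TIO.putStrLn (roundTrip t)
  -- Characters versus encoded bytes.
  print (T.length t, utf8Length t)
```

The character count and the byte count differ because the accented character needs two bytes in UTF-8, which is why the conversion to ByteString is an explicit, named step.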
- IO and pure combinators are in separate modules.
- Both a fully strict and a partially strict type are provided.
- The underlying optimization framework is stream fusion (not build/foldr), and is hidden from the user.
- Unpinned arrays are used, to prevent fragmentation.
- Large numbers of additional encodings are delegated to the text-icu package.
- An 'IsString' instance is provided for String literals.
- The implementation is OS and architecture neutral (portable).
- The implementation uses a number of language extensions:
CPP MagicHash UnboxedTuples BangPatterns Rank2Types RecordWildCards ScopedTypeVariables ExistentialQuantification DeriveDataTypeable
- The implementation is entirely Haskell (no additional C code or libraries).
- The package provides a QuickCheck/HUnit testsuite, and coverage data.
- The package adds no new dependencies to the HP.
- The package builds with Cabal's Simple build type.
- There is no existing functionality for packed Unicode text in the HP.
- The package has complexity annotations.
- The text-icu package is not part of this proposal, as adding it would make the platform depend on the ICU C library. This is not a blocker.
- Both the text package and the base package provide Unicode encoding/decoding functionality. Perhaps some of this functionality could be merged. This cannot be achieved until the base library makes some types non-abstract. This is not a blocker.
- Naming inconsistencies between bytestring, text and list. Some functions have names similar to functions in the bytestring package but different types (beyond the ByteString vs Text difference). Some functions have the same type but different names.
- Do we need both a strict and lazy version of Text? The strict version needs one less indirection, can be unpacked in function arguments and takes less space when stored in data types. The performance difference is substantial, and the non-strict version can stream large quantities of data in a small footprint in a way not possible with the strict kind. Not a blocker.
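The trade-off between the two types can be sketched as below. The User record, lazyDoc and preview are hypothetical names for illustration, and the string literals rely on the IsString instance via the OverloadedStrings extension.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Strict Text in a record: the field can be strict and unpacked, so the
-- constructor holds the text's fields directly instead of a pointer to
-- a separate heap object.
data User = User
  { userName :: {-# UNPACK #-} !T.Text
  , userAge  :: !Int
  } deriving Show

-- Lazy Text is a lazy sequence of strict chunks.
lazyDoc :: TL.Text
lazyDoc = TL.fromChunks ["chunk one, ", "chunk two, ", "chunk three"]

-- Only the chunks needed to produce the prefix are demanded.
preview :: TL.Text
preview = TL.take 5 lazyDoc

main :: IO ()
main = do
  print (User "alice" 30)
  print (length (TL.toChunks lazyDoc))
  print preview
  -- Converting to strict Text forces and concatenates every chunk.
  print (TL.toStrict lazyDoc)
```

A program that streams a large input would typically keep it as lazy Text and convert to strict Text only for small pieces that are stored in data structures.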
On the naming issue
- One proposal on how to fix the names, and the author's response
- A call for further discussion on the name/type matching issue.
The package maintainer has proposed an updated API. The substring functions are now named:
breakOn :: Text -> Text -> (Text, Text)
breakOnEnd :: Text -> Text -> (Text, Text)
breakOnAll :: Text -> Text -> [(Text, Text)]
splitOn :: Text -> Text -> [Text]
The character predicate functions now match the List names:
break :: (Char -> Bool) -> Text -> (Text, Text)
span :: (Char -> Bool) -> Text -> (Text, Text)
partition :: (Char -> Bool) -> Text -> (Text, Text)
find :: (Char -> Bool) -> Text -> Maybe Char
split :: (Char -> Bool) -> Text -> [Text]
The count function remains unchanged, although it has been suggested that the bytestring version of count could be generalised instead:
count :: Text -> Text -> Int
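The following sketch exercises the proposed names, with the substring functions taking the needle as their first argument. The sample values path and mixed are made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isAlpha, isDigit)
import qualified Data.Text as T

path :: T.Text
path = "a::b::c"

mixed :: T.Text
mixed = "abc123"

main :: IO ()
main = do
  -- Substring (Text -> Text -> ...) functions
  print (T.breakOn "::" path)     -- ("a","::b::c")
  print (T.breakOnEnd "::" path)  -- ("a::b::","c")
  print (T.breakOnAll "::" path)  -- every way to break at the needle
  print (T.splitOn "::" path)     -- ["a","b","c"]
  print (T.count "::" path)       -- 2
  -- Character predicate functions, named as in Data.List
  print (T.break isDigit mixed)      -- ("abc","123")
  print (T.span isAlpha mixed)       -- ("abc","123")
  print (T.partition isDigit "a1b2") -- ("12","ab")
  print (T.find isDigit mixed)       -- Just '1'
  print (T.split (== ',') "x,y,z")   -- ["x","y","z"]
```

Keeping the predicate versions under the list names and giving the substring versions an "On" suffix means the name alone tells the reader which kind of argument is expected.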
The implementation consists of 30 modules, and relies on Cabal's module hiding mechanism to expose only 5 of them. The implementation is around 8000 lines in total.
The public modules expose none of these (?).
The Python standard library provides both a string and a unicode sequence type. These are somewhat analogous to the ByteString/String/Text split.
"Stream Fusion: From Lists to Streams to Nothing at All", Coutts, Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.