
Proposal: Add Data.Text to the Haskell Platform

Proposal Author: Don Stewart

Maintainer: Bryan O'Sullivan (submitted with his approval)

Introduction

This is a proposal for the 'text' package to be included in the next major release of the Haskell Platform.

Everyone is invited to review this proposal, following the standard procedure for proposing and reviewing packages.

http://trac.haskell.org/haskell-platform/wiki/AddingPackages

Review comments should be sent to the libraries mailing list by October 1 so that we have time to discuss and resolve issues before the final deadline on November 1.

http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable

Credits

Package maintainer: Bryan O'Sullivan. The package was originally written by Tom Harper, based on the ByteString and Vector (fusion) packages.

The following individuals contributed to the review process: Don Stewart, Johan Tibell

Abstract

The 'text' package provides an efficient packed, immutable Unicode text type (both strict and lazy), with a powerful loop fusion optimization framework.

The 'Text' type represents Unicode character strings in a time- and space-efficient manner. This package provides text processing capabilities that are optimized for performance-critical use, both in terms of large data quantities and high speed.

The 'Text' type provides character-encoding support and type-safe case conversion (via whole-string case conversion functions). It also provides a range of functions for converting Text values to and from 'ByteString's, using several standard encodings (see the 'text-icu' package for a much larger variety of encoding functions). Efficient, locale-sensitive text IO is also supported. The modules are intended to be imported qualified, to avoid name clashes with Prelude functions, e.g.

    import qualified Data.Text as T

Documentation and tarball from the hackage page:

    http://hackage.haskell.org/package/text

Development repo:

    darcs get http://code.haskell.org/text/

All package requirements are met.

Rationale

While Haskell's Char type is capable of representing Unicode code points, the String type (a list of such Chars) has some drawbacks that prevent its general use:

  1. Unicode-unaware case conversion (map toUpper is an unsafe case conversion)
  2. the representation is space inefficient.
  3. the data structure is element-level lazy, whereas a number of applications require some level of additional strictness

An intermediate solution to these was 'Data.ByteString' (an efficient byte sequence type that addresses points 2 and 3), which, when used in conjunction with utf8-string, provides very simple non-Latin-1 encoding support (though with significant drawbacks in terms of locale and encoding range).

The 'text' package addresses these shortcomings in a number of ways:

  1. support for whole-string case conversion (thus, type-correct Unicode transformations)
  2. a space- and time-efficient representation, based on unboxed Word16 arrays
  3. either fully strict, or chunk-level lazy data types (in the style of Data.ByteString)
  4. full support for locale-sensitive, encoding-aware IO.
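The difference between per-character and whole-string case conversion (point 1) can be sketched as follows. This is a minimal illustration, not code from the package: the names `upperString`, `upperText` and `example` are hypothetical, and the non-ASCII character is written as an escape (`\223` is 'ß') so that the example does not depend on the source file encoding.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-character conversion on String. 'ß' has no single-character
-- uppercase form, so Data.Char.toUpper leaves it unchanged.
upperString :: String -> String
upperString = map toUpper

-- Whole-string conversion on Text. The result may be longer than the
-- input, because one character can map to several.
upperText :: T.Text -> T.Text
upperText = T.toUpper

-- "stra\223e" is the German word "straße".
example :: (String, T.Text)
example = (upperString "stra\223e", upperText (T.pack "stra\223e"))
-- fst example == "STRA\223E"       (the 'ß' is untouched)
-- snd example == T.pack "STRASSE"  (the 'ß' becomes "SS")
```

Because the conversion works on the whole string, a function of type `Char -> Char` cannot express it, which is why the String-based `map toUpper` is described above as unsafe.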

The 'text' library has rapidly become popular for a number of applications, and is used by more than 50 other Hackage packages. As of Q2 2010, 'text' is ranked 27th of roughly 2200 libraries (roughly the top 1% by popularity), and is particularly popular in web programming. It is used by:

  • the blaze html pretty printing library
  • the hstringtemplate file templating library
  • *all* popular web frameworks: happstack, snap, salvia and yesod
  • the hexpat and libxml xml parsers

The design is based on experience from Data.Vector and Data.ByteString:

  • the underlying type is based on unpinned, packed arrays on the Haskell heap, with an ST interface for memory effects.
  • pipelines of operations are optimized via conversion to and from the 'stream' abstraction[1]
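As a rough illustration of what a "pipeline of operations" looks like from the user's side, the sketch below composes three ordinary Data.Text combinators. The name `countLetters` is hypothetical. Whether a particular pipeline is actually fused into a single pass depends on the library version and on GHC's optimisation settings; the result is the same either way.

```haskell
import Data.Char (isAlpha, toUpper)
import qualified Data.Text as T

-- Upper-case every character, keep only the letters, and count them.
-- Written naively, this would build two intermediate Text values; the
-- fusion framework is intended to eliminate such intermediates.
countLetters :: T.Text -> Int
countLetters = T.length . T.filter isAlpha . T.map toUpper
```

The stream representation itself never appears in the types here, which matches the design goal of keeping the fusion machinery hidden from the user.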

A large testsuite, with coverage data, is provided.

The API

The API is broken into several logical pieces, which are self-explanatory:

  • combinators for operating on strict, abstract 'text's.

http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text.html

  • an equivalent API for chunk-level lazy 'text's.

http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy.html

  • encoding transformations, to and from bytestrings:

http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Encoding.html

  • support for conversion to Ptr Word16:

http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Foreign.html

  • locale-aware IO layer:

http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-IO.html

http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-Lazy-IO.html
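To give a feel for how the pieces listed above fit together, here is a small sketch that touches the strict, lazy, encoding and IO modules. The helper names (`roundTrip`, `utf8Size`, `fromPieces`, `flatten`, `greet`) are hypothetical and exist only for this example. Note that `decodeUtf8` assumes valid UTF-8 input and fails on malformed bytes.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO
import qualified Data.Text.Lazy as TL

-- Encode strict Text as a UTF-8 ByteString and decode it again.
roundTrip :: T.Text -> T.Text
roundTrip = TE.decodeUtf8 . TE.encodeUtf8

-- Number of bytes the text occupies when encoded as UTF-8.
utf8Size :: T.Text -> Int
utf8Size = B.length . TE.encodeUtf8

-- A lazy Text is a lazy sequence of strict chunks.
fromPieces :: [T.Text] -> TL.Text
fromPieces = TL.fromChunks

-- Collapse a lazy Text back into a single strict Text.
flatten :: TL.Text -> T.Text
flatten = T.concat . TL.toChunks

-- Write Text to stdout using the handle's current encoding.
greet :: T.Text -> IO ()
greet name = TIO.putStrLn (T.append (T.pack "Hello, ") name)
```

For example, `utf8Size (T.pack "\955x")` is 3, because the lambda character needs two bytes in UTF-8 and 'x' needs one, while `T.length` of the same value is 2.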

Design decisions

  • IO and pure combinators are in separate modules.
  • Both a fully strict and a partially strict (chunk-level lazy) type are provided.
  • The underlying optimization framework is stream fusion (not build/foldr), and is hidden from the user.
  • Unpinned arrays are used, to prevent fragmentation.
  • Large numbers of additional encodings are delegated to the text-icu package.
  • An 'IsString' instance is provided for String literals.
  • The implementation is OS and architecture neutral (portable).
  • The implementation uses a number of language extensions:
    CPP
    MagicHash
    UnboxedTuples
    BangPatterns
    Rank2Types
    RecordWildCards
    ScopedTypeVariables
    ExistentialQuantification
    DeriveDataTypeable
  • The implementation is entirely Haskell (no additional C code or libraries).
  • The package provides a QuickCheck/HUnit testsuite, and coverage data.
  • The package adds no new dependencies to the HP.
  • The package builds with Cabal's Simple build type.
  • There is no existing functionality for packed unicode text in the HP.
  • The package has complexity annotations.
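The 'IsString' instance mentioned in the list above is what allows string literals to be used directly as Text values when GHC's OverloadedStrings extension is enabled. A minimal sketch, with hypothetical names `greeting` and `greetingWords`:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T

-- With OverloadedStrings, the literal is converted through the
-- IsString instance, so no explicit T.pack is needed.
greeting :: T.Text
greeting = "hello, world"

-- The usual combinators then apply directly to the literal's value.
greetingWords :: [T.Text]
greetingWords = T.words greeting
```

Without the extension, the same value would be written as `T.pack "hello, world"`.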

Open issues

The text-icu package is not part of this proposal.

Notes

The implementation consists of 30 modules, and relies on Cabal's module hiding mechanism to expose only 5 modules. The implementation is around 8000 lines of text total.

The public modules expose none of these (?).

The Python standard library provides both a string and a unicode sequence type. These are somewhat analogous to the ByteString/String/Text split.

References

[1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts, Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.