split package proposal
This is a proposal for the split package to be included in the next major release of the Haskell platform.
Everyone is invited to review this proposal, following the standard procedure for proposing and reviewing packages.
Review comments should be sent to the libraries mailing list by September 4.
Proposal author and package maintainer: Brent Yorgey <byorgey at cis.upenn.edu>
The Data.List.Split module contains a wide range of strategies for splitting lists with respect to some sort of delimiter, mostly implemented through a unified combinator interface. The goal is to be a flexible yet simple alternative to the standard 'split' function found in some other mainstream languages.
Documentation and tarball from the hackage page:
darcs get http://hub.darcs.net/byorgey/split
Splitting a list into chunks based on some sort of delimiter(s) is a common need, and is provided in the standard libraries of several mainstream languages (e.g. Python, Ruby, Java). Haskell beginners routinely ask whether such a function exists in the standard libraries. For a long time, the answer was no. Adding such a function to Haskell's standard libraries has been proposed multiple times over the years, but consensus was never reached on the design of such a function. (See, e.g. http://www.haskell.org/pipermail/libraries/2006-July/005504.html, http://www.haskell.org/pipermail/libraries/2006-October/006072.html, and http://www.haskell.org/pipermail/libraries/2008-January/008922.html.)
In December 2008 the split package was released, implementing not just a single split method, but a wide range of splitting strategies.
Since then the split package has gained wide acceptance, with almost 95 reverse dependencies, putting it in the top 40 for number of reverse dependencies on Hackage.
The package is quite stable. Since the 0.1.4 release in April 2011 only very minor updates have been made. It has a large suite of QuickCheck properties; to my recollection no bugs have ever been reported.
For a detailed description of the package API and example usage, see the Haddock documentation.
Most of the library is based around a (rather simple) combinator interface. Combinators are used to build up configuration records (recording options such as whether to keep delimiters, whether to keep blank segments, etc). A configuration record is finally handed off to a function which performs a generic maximally-information-preserving splitting algorithm and then does various postprocessing steps (based on the configuration) to selectively throw information away. It is probably not the fastest way to implement these methods, but speed is explicitly not a design goal: the aim is to provide a reasonably wide range of splitting strategies which can be used simply. Blazing speed (or more complex processing), when needed, can be obtained from a proper parsing package.
Initial discussion of the proposal can be found here.
I have made some changes to the package as a result of discussion, explained below. The changes can be seen in the darcs repository; a new version of the package has not yet been released.
Use of GADTs/ExistentialQuantification
Henning Thielemann asked whether the GADTs extension is really necessary. Twan van Laarhoven suggested a way around it. I have adopted a variant of Twan's proposal, and the package is now fully Haskell 2010 compliant.
Roman Cheplyaka and others expressed distaste at the presence of synonyms for several functions in the library, arguing that it unnecessarily complicates the API and makes it harder to read others' code.
Following a suggestion from Simon Hengel, I have adopted the solution of
- marking the synonyms as deprecated;
- passing the "prune" option to haddock so they are not documented as part of the API.
This effectively does away with any confusion the synonyms may cause, WITHOUT breaking any existing code, since the synonyms are still exported. Note, however, that I do plan to bump the major version number, as recommended by the package versioning policy.
Consistency with existing names
Bryan O'Sullivan pointed out that the text package has an existing function, essentially identical in functionality to splitEvery, called chunksOf. In the interest of consistency with an existing HP package, I have renamed splitEvery to chunksOf (and deprecated and pruned splitEvery, just like the other synonyms as explained above).
Use of GHC.Exts
At the request of a user, the 0.1.4.3 release switched from defining its own version of the standard 'build' function, to importing it from GHC.Exts. This allows GHC to do more optimization, resulting in reported speedups to uses of splitEvery, splitPlaces, and splitPlacesBlanks. However, this makes the library GHC-specific. I outlined the options in a message to the libraries mailing list. Several suggestions were made, including this suggestion by Henning Thielemann:
You could provide two private modules with the same name in different directories, one that re-exports 'build' from GHC.Exts and one with a custom definition of 'build' for non-GHC compilers. Then set 'Hs-Source-Dirs' in Cabal according to the impl(ghc). No CPP needed, only Cabal. One could even think of a separate package for this purpose.
This looks like it could work, but it adds a lot of complication for not much benefit.
In the end, in the interest of simplicity and full Haskell2010 compliance, I simply went back to defining build manually. Henning also suggested that there ought to be a package providing 'build' in a generic way, so that you get the right version depending which compiler you are using. If this ever happens I would be happy to depend on that package.
The specific way that the generic splitting algorithm is implemented does preclude some imaginable splitting strategies. For example, a few years ago I tried adding a strategy that used a predicate on pairs of elements, splitting down the middle of any pairs that satisfy the predicate, but gave up because it simply did not fit.