|Version 2 (modified by duncan, 8 years ago)|
Tar package proposal
This is a proposal for the 'tar' package to be included in the next major release of the Haskell Platform.
Proposal author and package maintainer: Duncan Coutts <duncan at haskell.org>
The "tar" package library is for working with ".tar" archive files. It can read and write a range of common variations of archive format including V7, USTAR, POSIX and GNU formats. It provides support for packing and unpacking portable archives. This makes it suitable for distribution but not backup because details like file ownership and exact permissions are not preserved.
Manipulating tar files is a fairly common need. The tar format and its variants are not trivial, so using an external library or program is sensible. Many existing programs that use the tar format call an external "tar" program. This is not satisfactory because the tar program differs between platforms, and Windows does not come with one. In particular, the archive format that the "tar" program produces varies somewhat between systems. A better solution is to use a library, where we have control over the format and use the same code on all platforms.
A further advantage of using a library is that it allows tar files to be used without unpacking them. It also gives greater flexibility in the relationship between the location of files on disk and the file name paths within a ".tar" file. In particular, programs that currently construct .tar files by preparing a temporary directory of file copies with the desired layout may be able to eliminate the extra set of temporary files and construct the tar file directly.
One particular use case that comes to mind is darcs. The "darcs dist" command calls an external "tar" program. On Windows this does not work unless the user has specially installed a tar.exe. On GNU systems the GNU tar program produces .tar files in GNU tar format, which is not as widely portable as the standard USTAR/POSIX format.
The design and implementation have been tested in real-world use cases in the cabal-install and hackage-server programs.
The cabal-install tool uses (a bundled copy of) the tar code for:
- the "cabal sdist" feature
- unpacking .tar.gz cabal packages
- the hackage index file (00-index.tar.gz)
In particular the last case is one where we need more than simply unpacking a tar file. We read and examine the index file every time the user runs the configure or install commands to discover the set of available packages.
Introduction to the API
Note, the full reference documentation is available from the hackage page.
The API is structured so that simple uses only need to
import qualified Codec.Archive.Tar as Tar
Use cases that need more intimate access to the details of the tar format (such as file times, permissions etc) may also use
import qualified Codec.Archive.Tar.Entry as Tar
import qualified Codec.Archive.Tar.Check as Tar
This protects the casual user against the complexity of the details of the tar format and the various versions of the tar format. Note that the API uses short names and is designed to be used qualified.
Conceptually, ".tar" format files are just a sequence of entries. Entries represent things like files, directories and symlinks. Each entry has a name, and some have content data. All entries have file meta-data like ownership and permissions.
There are four key operations. High level convenience functions and user-defined variations are defined in terms of these.
Firstly there are functions for converting between internal and external representations. The external representation is a lazy ByteString. The internal representation is as a sequence of Entry values:
read :: ByteString -> Entries
write :: [Entry] -> ByteString
The Entries type is almost just [Entry] but it also handles the case of format errors.
The other key pair of operations are for packing and unpacking actual disk files, to and from this internal representation:
pack :: FilePath -> [FilePath] -> IO [Entry]
unpack :: FilePath -> Entries -> IO ()
There are various functions provided, or that the user may define, that operate on 'Entries'. This is the main way that the API provides flexibility. In particular, one may check for certain security or portability conditions as passes with type Entries -> Entries.
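To make the idea of a pass concrete, here is a minimal, self-contained sketch. It does not use the library: the Entries type below is a stand-in with assumed constructor names, entries are represented only by their paths, and checkEntries and rejectUnsafePath are hypothetical helpers written for illustration.

```haskell
import Data.List (isPrefixOf)

-- Stand-in for the library's Entries type (assumed shape):
-- a list of entries with an extra alternative for errors.
data Entries entry
  = Next entry (Entries entry)
  | Done
  | Fail String
  deriving (Eq, Show)

-- Turn a per-entry check into a pass over the whole sequence.
-- Entries before the first failure are still produced lazily.
checkEntries :: (entry -> Maybe String) -> Entries entry -> Entries entry
checkEntries check = go
  where
    go (Next e rest) = case check e of
      Nothing  -> Next e (go rest)
      Just err -> Fail err
    go Done       = Done
    go (Fail err) = Fail err

-- Split a path on '/' (no dependency on the filepath package).
components :: FilePath -> [String]
components p = case break (== '/') p of
  (c, [])       -> [c]
  (c, _ : rest) -> c : components rest

-- Hypothetical security check: reject absolute paths and ".." components.
rejectUnsafePath :: FilePath -> Maybe String
rejectUnsafePath path
  | "/" `isPrefixOf` path       = Just ("absolute path: " ++ path)
  | ".." `elem` components path = Just ("parent directory reference: " ++ path)
  | otherwise                   = Nothing

-- A pass in the sense of the text, specialised to path-only entries.
securityPass :: Entries FilePath -> Entries FilePath
securityPass = checkEntries rejectUnsafePath
```

Because a pass has the same input and output type, several of them can be chained with ordinary function composition before the result is handed to an unpacking step.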
For convenience there are also high level "all in one" operations:
create :: FilePath -> FilePath -> [FilePath] -> IO ()
extract :: FilePath -> FilePath -> IO ()
It is instructive to see how these are defined since they demonstrate the use of the above primitives:
create tar base paths = BS.writeFile tar . Tar.write =<< Tar.pack base paths
extract dir tar = Tar.unpack dir . Tar.read =<< BS.readFile tar
The following are examples of variations on the above that the user may define:
createTarGz tar base paths = BS.writeFile tar . GZip.compress . Tar.write =<< Tar.pack base paths
extractTarGz dir tar = Tar.unpack dir . Tar.read . GZip.decompress =<< BS.readFile tar
Note: these two are not provided by the library because the tar package does not depend on the zlib package. One could argue that it should but it would only save the above trivial compositions and the same argument would apply to a dependency on the bzlib package or other popular compression codecs.
A further example of use is the htar package which is an implementation of a subset of the features of the common 'tar' command line tool. It is a short demo program at only 200 lines (including command line handling) and covers creating, extracting, (de)compression in .tar.gz and .tar.bz2 formats and listing file contents (simple or extended).
- Separation of IO and pure operations. Intermediate data type provides flexibility.
The encoding and decoding of the tar format is completely pure. It uses an intermediate data structure and pure operations on it. This gives the API great flexibility without requiring a large number of primitives since it is possible to inspect, consume or modify the intermediate representation before doing an IO operation like packing or unpacking.
- Most operations are on lazy sequences
The API is fairly compositional yet allows constant space operations in many cases because it uses lazy sequences. This matches the tar format quite well which is designed to be processed linearly and using constant space.
- API partitioned into "simple" and "full" modules.
Christian Maeder complained that my original API was too complex. The problem is that unfortunately the tar format is more complex than we would wish and some applications do need to know about some of the details. In particular some applications need to know the format in use (V7, USTAR or GNU) and need access to meta-data like permissions and timestamps.
The solution we arrived at is the partition. Use cases that need more can import an extra module to get access to the details of what a tar Entry actually consists of.
- API allows constant-space operations and pure exception handling.
There is no need to use exceptions (catch) to handle errors in decoding tar files. It can be done purely. At the same time it is possible to process large tar files in constant space; e.g. create / extract. This is done using the Entries type which is essentially a list data type but with an extra alternative for decoding errors.
This same approach of marrying exceptions and laziness is now used in the zlib package because the previous approach of using exceptions proved insufficient for some applications (notably darcs).
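As an illustration of the pure error handling described above, the following sketch lists the paths in an archive without using catch. It assumes the foldEntries and entryPath functions exported by Codec.Archive.Tar; entryPaths and printPaths are hypothetical names, and the error is rendered with show so that the code does not depend on the exact error type.

```haskell
import qualified Codec.Archive.Tar as Tar
import qualified Data.ByteString.Lazy as BS

-- Collect all entry paths, or the decoder's error message.
-- A format error surfaces as an ordinary Left value.
entryPaths :: BS.ByteString -> Either String [FilePath]
entryPaths = Tar.foldEntries step (Right []) (Left . show) . Tar.read
  where
    step entry rest = fmap (Tar.entryPath entry :) rest

-- Streaming variant: print each path as soon as it is decoded,
-- then report a format error if the decoder hits one.
printPaths :: BS.ByteString -> IO ()
printPaths = Tar.foldEntries step (return ()) failed . Tar.read
  where
    step entry rest = putStrLn (Tar.entryPath entry) >> rest
    failed err      = putStrLn ("format error: " ++ show err)
```

The first function has to see the whole archive before it can return Right, so only the second one keeps the constant-space behaviour; which is preferable depends on whether partial output before an error is acceptable.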
The current design has been through several iterations of API review with Christian Maeder (see the email thread). Some of the discussion is not on the public mailing list. I think we have addressed almost all the concerns that he and I raised in our discussions.
One remaining issue is that the package provides
getDirectoryContentsRecursive :: FilePath -> IO [FilePath]
which might be better in System.Directory. It's pretty useful for the case of constructing a tar archive where you want to use a non-default file path mapping or if you want to filter the list of files included.
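For readers unfamiliar with the function, here is a rough sketch of what such a traversal could look like using only System.Directory and System.FilePath. It is not the package's implementation: listFilesRecursive is a hypothetical name, the sketch is strict rather than lazy, it returns files only, and it follows symlinked directories without checking for loops.

```haskell
import Control.Monad (forM)
import System.Directory (doesDirectoryExist, getDirectoryContents)
import System.FilePath ((</>))

-- List every file below 'root', as paths relative to 'root'.
-- The whole tree is walked before the result is returned.
listFilesRecursive :: FilePath -> IO [FilePath]
listFilesRecursive root = go ""
  where
    go rel = do
      names <- getDirectoryContents (root </> rel)
      -- Drop the "." and ".." pseudo-entries to avoid infinite recursion.
      let children = [rel </> name | name <- names, name `notElem` [".", ".."]]
      nested <- forM children $ \child -> do
        isDir <- doesDirectoryExist (root </> child)
        if isDir then go child else return [child]
      return (concat nested)
```

The resulting list can be filtered or remapped with ordinary list functions before it is used to build an archive.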
Another issue that has been pointed out is that the lazy decoding relies essentially on lazy IO and some people prefer designs that do not use lazy IO. One possible solution would be to redesign the form of the decoder to use a continuation data type like:
data Unfolding = NeedMoreData (Strict.ByteString -> Unfolding) | OutputAvailable Tar.Entry Unfolding
One would start with tarDecode :: Unfolding and unfold it, pushing in blocks of data as requested and getting entries out when available. It should be possible to implement the existing interface on top of this layer and by exposing this layer (probably as a separate module) we might be able to help the people who prefer their left-fold enumerators (but without annoying everyone else ;-) ).
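The push-style interface can be sketched with a toy decoder. The version below is an assumption-laden illustration, not a tar decoder: the Unfolding type is generalised over input and output types, lineDecoder splits String chunks into lines instead of decoding tar blocks, and drive is a hypothetical helper that feeds a list of chunks.

```haskell
-- Generalised form of the continuation type from the text.
data Unfolding input output
  = NeedMoreData (input -> Unfolding input output)
  | OutputAvailable output (Unfolding input output)

-- Toy decoder: emits each complete newline-terminated line,
-- buffering any partial line until more input arrives.
lineDecoder :: Unfolding String String
lineDecoder = go ""
  where
    go buf = case break (== '\n') buf of
      (line, _ : rest) -> OutputAvailable line (go rest)
      (partial, [])    -> NeedMoreData (\chunk -> go (partial ++ chunk))

-- Push chunks in as requested and collect outputs as they appear.
-- Stops when the decoder asks for data and none is left.
drive :: [input] -> Unfolding input output -> [output]
drive inputs (OutputAvailable out next) = out : drive inputs next
drive (chunk : chunks) (NeedMoreData k) = drive chunks (k chunk)
drive []               (NeedMoreData _) = []
```

For example, drive ["ab", "c\nde", "\n", "f"] lineDecoder yields ["abc", "de"]; the trailing "f" is never emitted because its line is incomplete. A real decoder would also need a way to signal end of archive and format errors, which this toy omits.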
The point is, it should be possible to provide this feature (if there is ever demand) without a major redesign of the existing API. So it need not be a blocker for adding the tar package now.