decompress

2023-10-21

No Description

Upstream URL

github.com/se-mz/decompress

Author

Sebastian Melzer <semz@semelz.de>

Maintainer

Sebastian Melzer <semz@semelz.de>

License

MIT
README
decompress - A defensive and fast decompression library in pure Common Lisp

1Overview

:PROPERTIES::CUSTOM_ID: overview:END:

The title speaks for itself. More formats may be added in the future but for now none are planned. A compressor counterpart is highly unlikely. A download can be found here; there is also a Github mirror, but note that this Readme might not display correctly there. On Quicklisp, the corresponding system is called decompress, although the package it defines is semz.decompress.

1.1Features

:PROPERTIES::CUSTOM_ID: features:END:
  • Supported formats: Deflate, zlib, gzip, bzip2, LZMA, LZMA2, XZ.
  • Safety: Edge cases are properly detected and explicitly handled; malformeddata is rejected early.¹ The interface is designed to be hard to misuse; inparticular, single-member and multi-member data can be handled with separatefunctions, defaults try to provide the strongest guarantees, and built-inintegrity checks are run without user intervention. No FFI is used anywhere.
  • No mandatory overreads: When reading a single well-formed member from a bytestream, no byte beyond the end of the compressed data will be read, allowingthis library to be used as a part of complicated stream processing for otherformats.
  • Performance: To my knowledge, this is the second fastest Deflate decoder andthe fastest bzip2 decoder in pure Common Lisp. Under SBCL, file-to-filedecompression usually hits 30%-80% the speed of the C referenceimplementations (depending on the format) and among CL libraries, we're onlybeaten by 3bz. See the performance section for more details and other CLimplementations.

¹ This is a guideline; a hard guarantee would be infeasible.

1.2Quickstart

:PROPERTIES::CUSTOM_ID: quickstart:END:

From example.lisp:

The supported multi-member formats are gzip, bzip2 and XZ. Members of gzip and bzip2 are trivial to handle manually with decompress or make-decompression-stream if you need to do that for some reason; for XZ, you'll need some contortions to deal with padding, but since the spec explicitly recommends not to use multi-member XZ when embedding into a different format, you probably won't run into this.

For more details, see the reference.

2Comparison to other libraries

:PROPERTIES::CUSTOM_ID: comparison:END:

The other libraries under consideration are chipz, which is probably the most popular library; deflate, a compact implementation by Pierre Mai that is notably used by Quicklisp; and 3bz, which is notably used by pngload.

decompress-1.2.0 chipz-20230418 deflate-1.0.4 3bz-20230521
/ <
supported formats deflate, zlib, gzip, bzip2, lzma, lzma2, xz deflate, zlib, gzip, bzip2 deflate, zlib, gzip deflate, zlib, gzip
dependencies alexandria, trivial‑gray‑streams
alexandria, trivial‑features, nibbles, babel, cffi¹, mmap¹
license MIT BSD‑3 MIT MIT
input sources vector, stream simple specialized vector, list, stream, pathname stream (s. s.) vector, stream, foreign memory
output destinations new vector new vector, provided (s. s.) vector, stream, pathname stream (s. s.) vector
Gray stream wrapper yes only around other streams no no
checksum verification automatic automatic automatic if enabled, disabled by default automatic
multi-member handling manual or automatic single-only manual manual
edge case handling strict lenient lenient strict, no library error type
partial input library error library error end-of-file error status flag
trailing data ignored & unread, or library error ignored & read ignored & unread ignored; not read with stream input
preset zlib dictionaries supported unsupported unsupported unsupported
bzip2 block randomization supported unsupported n/a n/a

¹ Optional depending on implementation.

2.1Edge case handling

:PROPERTIES::CUSTOM_ID: edge-cases:END:

2.1.1Deflate / zlib / gzip

:PROPERTIES::CUSTOM_ID: edge-cases-deflate:END:

The Deflate spec has a few rough edges which we deal with as follows.

  • Some decompressors (e.g. GNU gzip) allow back references to go beyond thebeginning of the output data, pretending that the octets out of bounds wereall zeroes. This one always signals an error.
  • Huffman trees are uniquely determined by the associated sequence of codelengths, but not every such sequence corresponds to a tree - some leave codesunassigned and some make codes ambiguous. The spec isn't clear on what to dowith such sequences (outside of two special cases in §3.2.7). While one canargue that leaving codes unassigned is fine as long as they don't appear, wealways signal an error as soon as a sequence doesn't correspond to a(complete) Huffman tree.
  • According to §3.2.7, the situation where only a single distance code is usedmay be represented by having only one code length of one. We enforce that theunique distance code is always encoded as a zero bit and never as a one bit.
  • A dynamic Deflate block can omit the encoding for the end-of-block marker. Iam convinced that this is an oversight in the standard since the onlysituation where an encoder would want to output such a block is if it knows inadvance the input is infinite and that the miniscule inefficiency of encodingan EOB would be worse than the inefficiency that arises from being unable tochange the Huffman tree (no EOB means no new dynamic block).

    However, there's little reason to ban this; since Deflate already allows arbitrarily many blocks of arbitrary length, nothing stops a stream from being infinite anyway and a finite input is going to run into EOF either way.

  • Dynamic Deflate blocks can provide encodings for the illegal literal/lengthcodes 286 and 287 as well as the illegal distance codes 30 and 31, althoughthese may never occur in the data. We allow them in blocks since otherwise thefixed Deflate block code wouldn't be expressible dynamically. Somedecompressors (e.g. zlib) do not allow this at all.

Zlib and gzip do not add many edge cases by themselves; note however that the gzip file format can consist of multiple gzip streams which each come with their own filename/etc info in their respective header. You can access these with decompress if necessary, but most decoders ignore anything except maybe the first header, which is readily accessible via decompress-all.

A more detailed comparison of edge case handling across libraries can be found below. An error is an error of some library-defined type that could be handled by a library user; a basic error is something like a failed assertion or (error "foo"), where the problem is explicitly detected but can't be reasonably handled by a user; an internal error is something like a type error or an array bounds violation that is signalled by the implementation, but not necessarily reliably so.

Deviations from zlib behavior are boldfaced for convenience; note that deviation is not necessarily a bad thing in itself. We consider internal errors deviation since they may cause behavior changes under some implementations.

edge case zlib (zpipe) decompress 3bz deflate chipz
/ <
reserved block type error error basic error error error
uncompressed zero-length block handled handled handled handled handled
uncompressed block with wrong checksum error error basic error error error
uncompressed zero-length block with wrong checksum error error basic error error outputs data
reference goes beyond previous output error error basic error outputs data outputs data
uses illegal length code 286/287 error error basic error error error
uses illegal distance code 30/31 error error basic error error error
dynamic block provides encodings for 286/287 & 30/31 error outputs data internal error outputs data outputs data
dynamic block with one distance code (§3.2.7) handled handled handled handled handled
dynamic block with one distance code, uses the unassigned code error error basic error error error
dynamic block without distance codes (§3.2.7) handled handled handled handled handled
dynamic block without distance codes, tries to use len code error error basic error error error
Huffman tree with too many items error error basic error outputs data outputs data
Huffman tree with too few items error error basic error outputs data outputs data
Huffman tree with too few items, uses an unassigned code n/a n/a n/a error error
dynamic block has code lengths expand across len/dist boundary handled handled handled handled handled
dynamic block has code lengths expand out of bounds error error basic error error error
dynamic block starts by repeating the last code length error error basic error error error
reference stays in bounds but goes beyond zlib window size outputs data error outputs data outputs data outputs data
zlib preset dictionary required but none provided error error basic error error outputs data
wrong Adler-32 checksum error error basic error outputs data by default error
wrong gzip magic bytes error error basic error error error
gzip data with wrong CRC-32 checksum error error basic error outputs data by default error
gzip data with right CRC-32 checksum but wrong modular length error error outputs data outputs data outputs data
gzip data with header checksum handled handled handled handled handled
gzip data with incorrect header checksum error error basic error outputs data outputs data
gzip extra fields handled handled basic error handled basic error
inconsistent length fields in extra fields section outputs data error n/a outputs data n/a

2.1.2Bzip2

:PROPERTIES::CUSTOM_ID: edge-cases-bzip2:END:

Bzip2 has no spec, so we mostly try to match the reference implementation. There is one notable exception: The reference implementation accepts underfull Huffman trees and only complains once unassigned codes are used; we reject such trees right off the bat since it simplifies the code.

I am not aware of any compressor that outputs such trees; Joe Tsai claims they exist, but provides no names. There isn't a good reason why a compressor would output such trees since bzip2 cannot have alphabets of size one, and whenever you have a branch with only one child, it's more efficient to just give the unique child a code that's one bit shorter. The resulting reshuffling of codes has negligible overhead (none if you only assign them at the end). I suppose this case is something to think about once someone runs into it while using this library.

Chipz is essentially a translation of the reference implementation, but doesn't bother to support the block randomization mechanism that has been deprecated for a good two decades. We do because our approach makes it very easy to add. You can find a test file here; it consists of 100MB of highest quality zero octets. The testfile in lbzip2 doesn't actually test block randomization since it is too short for any effects to show up.

Other than this, the format does not allow for many edge cases to begin with since most of its transformations simply cannot have invalid inputs. Just like the reference implementation, we do not care about wasteful encodings such as e.g. defining six Huffman trees despite only using three.

2.1.3LZMA

:PROPERTIES::CUSTOM_ID: edge-cases-lzma:END:

LZMA has a spec, but it's essentially just a commented C++ reference implementation. It leaves very little freedom to the implementer but nevertheless contains some minor pitfalls:

  • The range decoder's finishing state is also a valid intermediate state, so itis necessary to define some EOF handling scheme. LZMA defines two and XZdefines another one; there are no other possible schemes. We default to themost permissive scheme (which can handle all valid inputs for the other two)but let the user optionally override this.
  • Two of the EOF handling schemes require that the decompressed size is known inadvance. If the last Lempel-Ziv match expands beyond the declared decompressedsize, one might either signal an error or truncate it and finish gracefully;we signal an error, just like the reference implementation and XZ Utils.
  • The range decoder's finishing state cannot result in an EOF marker whentreated like an intermediate state. As a result, the "optional EOF marker"scheme (the one we default to) can only be implemented in the way described inthe spec: When the declared size is reached, only try to consume an EOF markerif the range decoder is not yet in a finishing state.

2.1.4LZMA2

:PROPERTIES::CUSTOM_ID: edge-cases-lzma2:END:

LZMA2 has no spec, but is a relatively straightforward wrapper on top of LZMA that adds uncompressed blocks, flush points, and LZMA state resets. We don't expose flush points to the user.

We follow XZ Utils' implementation; even though there have been subtle compatibility issues with the LZMA reference implementation in the past, the XZ Utils implementation is stricter and avoids certain edge cases entirely. It is also much cleaner; when testing some of the edge cases below, the LZMA SDK's implementation would often simply segfault.

LZMA2 is subject to the following edge cases:

  • The data must start with a flush point; the first LZMA block after a flushpoint must specify new LZMA parameters.
  • Every block declares both compressed and decompressed sizes. These could beused to support partial decompression, so they should be verified. We doverify them even though we don't support partial decompression.
  • The underlying data of an LZMA block might contain an EOF marker. XZ Utilssignals an error when this happens, as do we. If an implementation allowsthis, care must be taken to either skip trailing data or signal an error ifany exists.
  • An LZMA state reset does not reset the posState variable. Furthermore,uncompressed blocks will update it appropriately.

    Note that implementing this correctly only matters in the extremely unusual case where an LZMA block is followed by some uncompressed blocks and then another LZMA block, with no flush points in between, and where the second LZMA block doesn't reset the LZMA state; if the second LZMA block resets the state, then the value of posState is completely irrelevant because it merely denotes the starting position in an array of probabilities that are all initialized to the same value anyway.

    LZMA2 data with this sequence of blocks is extremely unlikely to be produced in practice since the encoder can't really know that the second LZMA block benefits from state reuse despite the uncompressed data in the middle. Nevertheless, a base64-encoded XZ test file that triggers this case can be found below. It decodes to the string LOL, but fails to decompress if posState isn't updated by uncompressed blocks.

      /Td6WFoAAAD/EtlBAgAhAQoAAABTxyq54AAAAAUJACX//AAAAgAAT4AAAAAFACfRR0AAAAABKAM7StLkBnKeegEAAAAAAFla
    

2.1.5XZ

:PROPERTIES::CUSTOM_ID: edge-cases-xz:END:

XZ has a decent spec and is mostly a container format. XZ Utils supports partial decompression and skipping the checksum verification of unknown checksum types; we don't. Note that we implement all standard filters, not just LZMA2. A list of edge cases is below; most of these can be found in XZ Utils' elaborate test suite as well.

  • XZ duplicates stream header information and length information for all blocksat the end of the file to support partial decompression; while we do notsupport partial decompression, we nevertheless verify the lengths at the end.
  • XZ's variable length integers must be encoded with the shortest possibleencoding. This is easy to miss because the spec doesn't explicitly mention it,but the example code and the test suite both check for it.
  • XZ files can consist of multiple concatenated members, and there may bepadding after each member, including the final one. The padding must come inblocks of four and consist of zero bytes. Padding is not allowed at the startof the file.
  • Fields whose size isn't necessarily a multiple of four are padded with theminimal number of zero bytes to ensure such a size. The sole exception tothis is the block header padding, which consists of however many bytes arerequired to match the declared size.
  • The total compressed and decompressed data of a single member are each limitedto at most 2^{63} - 1 bytes (~8 exabytes). XZ Utils detects this before theblock that pushes the size over the edge is even decoded; we only detect itonce the read/output data exceeds this limit.
  • SHA-256 is limited to 2^{61} - 1 bytes of input and therefore cannot be usedwith extremely large XZ files. We signal an error when a file using SHA-256exceeds this limit.
  • LZMA2 can only be the last filter in a filter chain; the other standardfilters (i.e. delta and the BCJ filters) can never be the last filter.
  • Embedded LZMA2 data might signal EOF before the declared end of the data. Wesignal an error in this case, as does XZ Utils.

2.2Performance funny numbers

:PROPERTIES::CUSTOM_ID: performance:END:

We're usually the fastest with a significant improvement over chipz; under SBCL, 3bz is slightly faster for gzip data and noticeably faster for zlib data. Preserving trailing data hurts a lot for Deflate but not much for the other formats. Under ABCL, Mai's Deflate is faster for heavily compressed files and preserving trailing data is rather expensive.

comparison implementation SBCL CCL ECL ABCL
/ <c> <c> <c> <c>
/ <
chipz (deflate/zlib/gzip) 1.79x-2.76x 1.34x-1.87x 1.61x-2.30x 1.16x-4.03x
chipz (bzip2) 1.01x-1.32x 2.06x-2.70x 2.17x-2.47x 2.52x-8.52x
deflate (with checksum) 2.04x-3.41x 1.30x-3.91x 1.87x-3.24x 0.48x-1.80x
deflate (no checksum) 1.33x-3.16x 1.21x-1.73x 1.75x-2.09x 0.46x-1.72x
3bz 0.63x-0.93x 1.07x-1.55x 2.12x-7.23x 1.20x-4.65x
zlib 0.31x-0.54x 0.04x-0.12x 1.00%-2.65% 0.18%-0.28%
bzip2 0.66x-0.82x 0.20x-0.22x 4.15%-5.39% 1.08%-2.29%
xz 0.49x-0.62x 0.10x-0.17x 1.04%-1.69% 0.21%-0.42%

2.2.1Details

:PROPERTIES::CUSTOM_ID: performance-details:END:

The performance measurements below are taken on an older x86-64 machine, taking the best of five attempts, using each library's idiomatic method of file-to-file decompression:

  • For decompress, this means with-open-file + Alexandria's copy-stream. Weprovide measurements for both single member decompression and multiple memberdecompression. (None of the benchmark files contain multiple members; thismerely demonstrates the performance loss when preserving trailing data.)
  • For chipz, this means decompress with pathname arguments.
  • For Mai's Deflate, this means with-open-file + the inflate-*-streamfunctions. We provide measurements with and without checksums to reflect afair comparison and the default respectively.
  • For 3bz, we use this code. Note that this involves manual buffering. We don'tprovide a measurement with the significantly slower stream input interfacesince it wouldn't be very informative.

We always use the implementation's default optimization settings. Note that under ECL, one library's declamations can easily affect the optimization settings other libraries are compiled under; special care was taken to use ECL's default ((safety 2) (space 0) (speed 3) (debug 3)) for all libraries.

I tried to keep the benchmark files varied since decompression benchmarks can highly input-dependent, but take the results with a grain of salt regardless. As a rough baseline, measurements with zlib-1.2.13, bzip2-1.0.8, and xz-5.4.2 are provided, respectively using the canonical zpipe example program, bunzip2, and xz.

We use the following real world tarballs:

file input size output size refimpl
/ <
<r> <r> <r>
openjdk‑17.0.6_p10.tar.gz 105,221,267B 672,163,840B 2.618s
openh264‑2.3.1.tar.gz 60,290,897B 73,205,760B 0.280s
gimp‑2.10.32.tar.bz2 31,397,425B 207,185,920B 6.687s
gcc‑11.3.0.tar.xz 81,141,364B 824,309,760B 7.263s
ccl‑1.12.1‑linuxx86.tar.gz 20,872,508B 80,752,640B 0.427s
sbcl‑2.3.3-source.tar.bz2 7,408,264B 43,386,880B 1.408s
Python‑3.11.4.tar.xz 19,954,828B 99,696,640B 1.618s

We also have a bunch of example data that we use to emulate common use cases. html_standard.html is HTML as you'd commonly find it on the web; pepper.dat and mario.dat are PNG pixel data from a well-compressible and and badly-compressible image respectively.

base file base size zlib (.z) bzip2 (.bz2) xz (.lzma) xz (.xz)
/ <
html_standard.html 12.80MB 0.053s 0.373s 0.115s 0.121s
pepper.dat 51.27MB 0.148s 0.512s 0.238s 0.262s
mario.dat 3.053MB 0.018s 0.110s 0.065s 0.070s

2.2.2SBCL-2.3.8

:PROPERTIES::CUSTOM_ID: performance-sbcl:END:

SBCL is unsurprisingly the fastest of the tested implementations. We primarily optimize for it; notably we use a different implementation of the various CRCs.

file refimpl 3bz decom (M) decom (S) chipz defl defl (C)
/ <
<r> <r> <r> <r> <r> <r> <r>
openjdk-17.0.6_p10.tar.gz 2.618s 5.530s 5.981s 7.801s 13.84s 10.07s 12.21s
openh264-2.3.1.tar.gz 0.280s 0.831s 0.892s 1.333s 1.594s 2.815s 3.042s
gimp-2.10.32.tar.bz2 6.687s
9.857s 10.34s 10.19s
gcc-11.3.0.tar.xz 7.263s
14.93s 14.94s
ccl-1.12.1-linuxx86.tar.gz 0.427s 0.923s 1.028s 1.386s 2.182s 1.950s 2.206s
sbcl-2.3.3-source.tar.bz2 1.408s
1.979s 2.098s 2.507s
Python-3.11.4.tar.xz 1.618s
3.090s 3.083s
html_standard.html.z 0.053s 0.073s 0.107s 0.138s 0.270s 0.180s 0.254s
pepper.z 0.148s 0.171s 0.273s 0.311s 0.754s 0.364s 0.657s
mario.z 0.018s 0.032s 0.039s 0.055s 0.091s 0.081s 0.098s
html_standard.html.bz2 0.373s
0.566s 0.587s 0.571s
pepper.bz2 0.512s
0.714s 0.740s 0.802s
mario.bz2 0.110s
0.134s 0.146s 0.177s
html_standard.html.lzma 0.115s
0.186s 0.205s
pepper.lzma 0.238s
0.387s 0.416s
mario.lzma 0.065s
0.108s 0.119s
html_standard.html.xz 0.121s
0.222s 0.222s
pepper.xz 0.262s
0.529s 0.529s
mario.xz 0.070s
0.117s 0.117s

2.2.3CCL-1.12.2

:PROPERTIES::CUSTOM_ID: performance-ccl:END:

CCL has notoriously slow bit operations, resulting in an overall 6.5x runtime for Deflate decompression compared to SBCL and less pronounced differences between libraries. This is apparently being taken care of currently and I'd expect performance to change drastically once these fixes land. Preserving trailing data makes almost no difference.

Bzip2 and LZMA runtimes are about 3-4x those of SBCL, which is pretty typical. XZ runtimes suffer from CCL's weakness around 64-bit arithmetic, as required by CRC-64.

file 3bz decom (M) decom (S) chipz defl defl (C)
/ <
<r> <r> <r> <r> <r> <r>
openjdk-17.0.6_p10.tar.gz 52.87s 36.56s 39.65s 63.12s 48.52s 54.23s
openh264-2.3.1.tar.gz 7.327s 6.236s 6.751s 9.911s 10.04s 10.65s
gimp-2.10.32.tar.bz2
33.84s 34.74s 73.68s
gcc-11.3.0.tar.xz
67.94s 68.16s
ccl-1.12.1-linuxx86.tar.gz 8.518s 6.963s 7.209s 10.18s 8.416s 9.075s
sbcl-2.3.3-source.tar.bz2
6.693s 6.903s 18.04s
Python-3.11.4.tar.xz
13.10s 13.10s
html_standard.html.z 0.847s 0.646s 0.689s 0.940s 0.895s 1.582s
pepper.z 1.907s 1.227s 1.281s 2.292s 2.128s 4.802s
mario.z 0.292s 0.273s 0.293s 0.365s 0.343s 0.503s
html_standard.html.bz2
1.685s 1.725s 3.779s
pepper.bz2
2.307s 2.363s 4.754s
mario.bz2
0.494s 0.515s 1.317s
html_standard.html.lzma
0.696s 0.752s
pepper.lzma
1.418s 1.491s
mario.lzma
0.431s 0.458s
html_standard.html.xz
1.013s 1.017s
pepper.xz
2.602s 2.592s
mario.xz
0.503s 0.503s

2.2.4ECL-21.2.1

:PROPERTIES::CUSTOM_ID: performance-ecl:END:

Deflate decompression sits at some 25x-50x the runtime of SBCL, which is rough, but still usable for smaller data. Bzip2 ends up at a nicer 20x-30x runtime. LZMA/XZ decompression suffers from heavy type check overhead and is slower than bzip2 as a result.

file 3bz decom (M) decom (S) chipz defl defl (C)
/ <
<r> <r> <r> <r> <r> <r>
ccl-1.12.1-linuxx86.tar.gz 128.7s 41.39s 53.99s 76.07s 76.02s 77.23s
sbcl-2.3.3-source.tar.bz2
30.60s 34.43s 75.46s
Python-3.11.4.tar.xz
155.4s 156.2s
html_standard.html.z 16.27s 3.901s 4.938s 6.539s 6.845s 8.401s
pepper.z 40.32s 5.576s 6.706s 12.83s 11.63s 18.06s
mario.z 3.823s 1.804s 2.269s 2.899s 3.211s 3.548s
html_standard.html.bz2
6.931s 7.612s 16.25s
pepper.bz2
10.15s 10.99s 23.02s
mario.bz2
2.652s 3.038s 5.764s
html_standard.html.lzma
8.427s 9.199s
pepper.lzma
14.06s 15.26s
mario.lzma
5.468s 5.886s
html_standard.html.xz
10.28s 10.33s
pepper.xz
21.72s 21.67s
mario.xz
5.898s 5.897s

2.2.5ABCL-1.9.2

:PROPERTIES::CUSTOM_ID: performance-abcl:END:

ABCL's replace and manual array ops are obscenely slow, which punishes copying and Lisp-side buffering. As a result, the straightforward unbuffered code of Mai's Deflate outperforms several buffered implementations, especially for heavily compressed files.

In general, Deflate performance is close to the point of unusability; you're likely better off using java.util.zip. Our bzip2/XZ performance isn't that bad compared to ECL, likely due to the lower amount of bit diddling; note however how harshly the runtime increases for the (highly-compressed, hence copy-heavy) pepper.dat.

file 3bz decom (M) decom (S) chipz defl defl (C)
/ <
<r> <r> <r> <r> <r> <r>
html_standard.html.z 62.56s 23.40s 34.95s 55.66s 39.57s 42.04s
pepper.z 98.34s 81.41s 95.19s 94.55s 37.10s 39.35s
mario.z 29.39s 6.324s 12.18s 25.46s 10.86s 11.01s
html_standard.html.bz2
17.22s 26.33s 103.5s
pepper.bz2
47.34s 57.43s 119.2s
mario.bz2
4.798s 9.454s 40.87s
html_standard.html.lzma
32.37s 41.58s
pepper.lzma
111.3s 130.4s
mario.lzma
17.09s 22.34s
html_standard.html.xz
35.51s 34.80s
pepper.xz
121.5s 120.9s
mario.xz
16.86s 16.83s

3Reference

:PROPERTIES::CUSTOM_ID: reference:END:

In what follows, an octet vector is a one-dimensional array containing octets, i.e. integers x with 0 ≤ x ≤ 255. Octet vectors that are passed to the library need not be specialized or simple; octet vectors that are returned by the library are always both. When we speak of bounding index designators, we use the term with the same meaning as in Common Lisp, but additionally allow a starting index of nil to denote the beginning. Whenever a function takes optional bounding index designators, the default is to denote the entire sequence.

3.1Function: decompress-all

:PROPERTIES::CUSTOM_ID: fn-decompress-all:END:decompress-all format input&key start end output-size=> decompressed-buffer, header
  • format: A format specifier. See list-supported-formats.
  • input: Either a binary-input-stream or an octet vector.
  • start, end: Bounding index designators for input if the latter is avector. Ignored if input is a stream.
  • output-size: A non-negative integer or nil. The default is nil.
  • decompressed-buffer: An octet vector.
  • header: A property list with keyword keys.

The most convenient interface to this library. Returns a fresh octet vector that contains the decompressed form of the format-compressed data in input. If the format allows multiple members, all members are decompressed and concatenated. Trailing data results in a decompression-error.

header is the header information of the first member. See decompression-stream-header.

output-size, if not nil, should be the predicted size of the decompressed data. If correct, this prevents unnecessary copying at the end.

In addition to the options listed above, decompress-all can take format-specific options.

This function is equivalent to calling decompress with all-members-p set; it exists to prevent the subtle errors that result from a forgotten all-members-p. If you want to decompress single members or preserve trailing data, use decompress instead. To handle large data in a streaming fashion, use make-full-decompression-stream.

3.2Function: decompress

:PROPERTIES::CUSTOM_ID: fn-decompress:END:

decompress format input &key start end output-size all-members-p allow-overreads-p => decompressed-buffer, header

  • format: A format specifier. See list-supported-formats.
  • input: Either a binary-input-stream or an octet vector.
  • start, end: Bounding index designators for input if the latter is avector. Ignored if input is a stream.
  • output-size: A non-negative integer or nil. The default is nil.
  • all-members-p: A generalized boolean. The default is nil.
  • allow-overreads-p: A generalized boolean. The default is nil.
  • decompressed-buffer: An octet vector.
  • header: A property list with keyword keys.

Returns a fresh octet vector that contains the decompressed form of the format-compressed data in input. If the format allows multiple members, only the first member is decompressed. Trailing data is ignored and will not be read from stream inputs, so that it can be handled by other code.

header is the header information of the decompressed member. See decompression-stream-header.

output-size, if not nil, should be the predicted size of the decompressed data. If correct, this prevents unnecessary copying at the end.

If all-members-p is true, act like decompress-all instead. If the value of all-members-p is known to be constant and true, it is recommended to use decompress-all over decompress since forgetting to set all-members-p can result in subtle bugs down the line.

If allow-overreads-p is true and input is a stream, trailing data may be read, resulting in a minor speedup. This option rarely has to be specified since it is implied by all-members-p.

In addition to the options listed above, decompress can take format-specific options.

To handle large data in a streaming fashion, use make-decompression-stream. To handle all members at once, use decompress-all.

3.3Condition: decompression-error

:PROPERTIES::CUSTOM_ID: c-decompression-error:END:

Direct superclasses: simple-error

General superclass for errors related to decompression. While the error message gives detailed information about the exact error, we do not distinguish between e.g. failed checksums, out of bounds references or malformed Huffman trees on the type level since this information is basically useless to a programmatic user and usually just an artifact of corruption.

Errors related to trailing data are also a decompression-error since it isn't possible to distinguish a malformed member from trailing data.

Gross user errors such as passing in invalid bounding index designators or returning invalid dictionaries from a dictionary function do not signal a decompression-error.

3.4Condition: eof

:PROPERTIES::CUSTOM_ID: c-eof:END:

Direct superclasses: decompression-error

Signalled when the input stream/buffer is exhausted. Notably implies that we did not detect errors in the data up until this point, but this is not a hard guarantee that the data can be continued in a valid manner since it would be infeasible to verify this.

Even when the input is a stream, it is this condition which is signalled, not end-of-file.

3.5Condition: unrecognized-zlib-dictionary

:PROPERTIES::CUSTOM_ID: c-unrecognized-zlib-dictionary:END:

Direct superclasses: decompression-error

Signalled when zlib decompression involves a preset dictionary whose checksum isn't recognized by the dictionary function.

Note: This condition is only signalled when preset dictionaries are enabled. See format-specific options.

3.6Function: unrecognized-zlib-dictionary-checksum

:PROPERTIES::CUSTOM_ID: fn-unrecognized-zlib-dictionary-checksum:END:

unrecognized-zlib-dictionary-checksum uzd => checksum

Returns the unrecognized checksum that was encountered during zlib decompression, in the form adler-32 outputs. See format-specific options.

3.7Function: list-supported-formats

:PROPERTIES::CUSTOM_ID: fn-list-supported-formats:END:

list-supported-formats => list

  • list: A list of symbols.

Returns a list of symbols that are valid format specifiers, i.e. can be used as format arguments to make-decompression-stream and decompress to specify a compression format.

3.8Class: decompression-stream

:PROPERTIES::CUSTOM_ID: c-decompression-stream:END:

Direct superclasses: fundamental-binary-input-stream

Gray stream wrapper that can be used for complicated stream processing. The end of file is considered reached when the underlying compressed data ends and all associated decompressed data has been read. Users need not close these streams.

3.9Function: make-decompression-stream

:PROPERTIES::CUSTOM_ID: fn-make-decompression-stream:END:

make-decompression-stream format input &key start end allow-overreads-p all-members-p => decompression-stream

  • format: A format specifier. See list-supported-formats.
  • input: Either a binary-input-stream or an octet vector.
  • start, end: Bounding index designators for input if the latter is avector. Ignored if input is a stream.
  • all-members-p: A generalized boolean. The default is nil.
  • allow-overreads-p: A generalized boolean. The default is nil.
  • decompression-stream: A decompression-stream.

The most general interface to this library. All other interfaces are implemented in terms of this one.

Returns a decompression-stream whence the decompressed form of the format-compressed data in input can be read. If the format allows multiple members, only the first member is decompressed. Trailing data is ignored and will not be read from stream inputs, so that it can be handled by other code.

input should not be modified or read from until decompression-stream has reached end of file. If input is a stream, then the caller is responsible for closing it.

Calling this function will read header information from input and make it available via decompression-stream-header, but no decompression will be performed until data is read from decompression-stream.

If all-members-p is true, the stream will decompress and concatenate all compressed members from the stream (if the format defines multiple members) and signal a decompression-error if any trailing data is detected, similar to decompress-all. The provided header information is always that of the first member. If the value of all-members-p is known to be constant and true, it is recommended to use make-full-decompression-stream over make-decompression-stream since forgetting to set all-members-p can result in subtle bugs down the line.

If allow-overreads-p is true and input is a stream, trailing data may be read, resulting in a minor speedup. This option rarely has to be specified since it is implied by all-members-p.

In addition to the options listed above, make-decompression-stream can take format-specific options.

3.10Function: make-full-decompression-stream

:PROPERTIES::CUSTOM_ID: fn-make-full-decompression-stream:END:

make-full-decompression-stream format input &key start end => decompression-stream

  • format: A format specifier. See list-supported-formats.
  • input: Either a binary-input-stream or an octet vector.
  • start, end: Bounding index designators for input if the latter is avector. Ignored if input is a stream.
  • decompression-stream: A decompression-stream.

Wrapper around make-decompression-stream that always passes all-members-p.

3.11Function: decompression-stream-format

:PROPERTIES::CUSTOM_ID: fn-decompression-stream-format:END:

decompression-stream-format stream => format

Returns the format argument used to create stream. format is always an element of the list returned by list-supported-formats.

3.12Function: decompression-stream-header

:PROPERTIES::CUSTOM_ID: fn-decompression-stream-header:END:

decompression-stream-header stream => header

Returns the header metadata of the data associated to stream. Depending on the format, the possible entries are as below.

3.12.1Deflate

:PROPERTIES::CUSTOM_ID: header-deflate:END:

Deflate data doesn't have a header.

3.12.2Zlib

:PROPERTIES::CUSTOM_ID: header-zlib:END:
  • window-size: The declared window size, in bytes.
  • level: The "decompression level", a rough measure of whether speed or sizewere prioritized during compression. Goes from 0 to 3, with 0 meaning fastestand 3 meaning smallest.
  • dictionary: The Adler-32 checksum (as returned by adler-32) of the presetdictionary, if applicable, or nil.

3.12.3Gzip

:PROPERTIES::CUSTOM_ID: header-gzip:END:

We do not bother converting IDs and enumerations into symbols because this doesn't meaningfully reduce complexity - the uses for the affected fields are so niche that someone who wants to use them is probably more familiar with the raw data than anything we come up with.

  • textp: Boolean. If true, the data is supposed to be text and might requireline ending conversion.
  • extra-fields: Alist with two-character-string keys and octet vector values.See gzip spec. Note that keys may be repeated; they have the same order theyhad in the file.
  • filename: String or nil. Filename of the original file, if provided.
  • comment: String or nil. Comment field, if provided.
  • modification-time: Integer. Modification time of the original file or timeof compression, given in Unix time.
  • extra-flags: Integer. See gzip spec.
  • operating-system: Integer. See gzip spec.

3.12.4Bzip2

:PROPERTIES::CUSTOM_ID: header-bzip2:END:
  • block-size: Integer. The BWT block size in bytes.

3.12.5Raw LZMA

:PROPERTIES::CUSTOM_ID: header-raw-lzma:END:

Raw LZMA data doesn't have a header.

3.12.6LZMA

:PROPERTIES::CUSTOM_ID: header-lzma:END:
  • lc, lp, pb: Integer. The corresponding LZMA parameter of the same name.
  • window-size: The declared window size, in bytes.
  • decompressed-size: The declared output size, in bytes. A decompression-error issignalled when this doesn't match the actual output size, so you can use it toprepare suitable buffers.

3.12.7Raw LZMA2

:PROPERTIES::CUSTOM_ID: header-raw-lzma2:END:

Raw LZMA2 data doesn't have a header.

3.12.8LZMA2

:PROPERTIES::CUSTOM_ID: header-lzma2:END:
  • window-size: The declared window size, in bytes.

3.12.9XZ

:PROPERTIES::CUSTOM_ID: header-xz:END:
  • checksum-type: An integer from 0 to 15 indicating the checksum used.Preassigned numbers are 0 for none, 1 for CRC-32, 4 for CRC-64, and 10 forSHA-256.

3.13Format-specific options

:PROPERTIES::CUSTOM_ID: format-specific-options:END:

The following keyword arguments can be passed to make-decompression-stream, make-full-decompression-stream, decompress and decompress-all for specific formats. They are ignored for all other formats.

format mandatory optional
deflate
window-size, prefix, prefix-start, prefix-end
zlib
dictionary
gzip
raw-lzma lc, lp, pb, window-size decompressed-size, eof-mode
lzma
eof-mode
raw-lzma2 window-size
lzma2
xz
  • window-size (deflate, raw-lzma, raw-lzma2): A non-negative integer.Determines how far back references can reach; references that go back furtherresult in a decompression-error. For deflate, the default is 2^{15}, whichresults in no restrictions whatsoever. For raw-lzma and raw-lzma2, thisparameter has no default; it is mandatory.
  • prefix, prefix-start, prefix-end (deflate): Either nil or an octetvector with optional bounding index designators. If not nil, makes theoctets in prefix available to back references as if they were previousdecompressor output. The default is nil.
  • dictionary (zlib): A function or nil. Implements preset dictionaries. Ifnil (the default), preset dictionaries are disabled and signal a genericdecompression-error when encountered. The easiest way to use this option isvia make-simple-zlib-dictionary.

    In general, a function must take a single (unsigned-byte 32), representing an Adler-32 checksum as returned by adler-32, and return three values prefix, prefix-start, prefix-end that are used like the options of the same name for Deflate. If prefix is nil, an unrecognized-zlib-dictionary condition is signalled.

  • lc, lp, pb (raw-lzma): Integers such that 0 ≤ lc ≤ 8 and 0 ≤ lp, pb ≤ 4.Correspond to the LZMA parameters of the same name. These parameters do nothave defaults; they are mandatory.
  • eof-mode (raw-lzma, lzma): One of the keywords :always, :maybe, or:never. Determines whether data whose decompressed size is known isterminated by an end of file marker. The default is :maybe, which can handleall valid inputs from the other two modes; the only reason to change thisoption is to be stricter about inputs.
  • decompressed-size (raw-lzma): Either nil or a non-negative integer. If notnil, this is the declared decompressed size of the data, as in the header dataof the same name for lzma; if the real size differs, a decompression-erroris signalled. Note that this interacts with eof-mode. Not to be confusedwith the output-size option for decompress and decompress-all, which ismerely an optimization hint.

3.14Function: make-simple-zlib-dictionary

:PROPERTIES::CUSTOM_ID: fn-make-simple-zlib-dictionary:END:

make-simple-zlib-dictionary buffers => dictionary

  • buffers: A sequence of octet vectors.
  • dictionary: A function.

Returns a function suitable to be passed as a dictionary argument for zlib decompression which recognizes every octet vector in the sequence buffers and no others. The vectors should not be modified afterwards. An error is signalled if multiple vectors with distinct contents have the same checksum. See format-specific options for details and how to make more complicated dictionaries.

3.15Function: adler-32

:PROPERTIES::CUSTOM_ID: fn-adler-32:END:

adler-32 data &key start end => checksum

  • data: An octet vector.
  • start, end: Bounding index designators for data.
  • checksum: An (unsigned-byte 32).

Returns the Adler-32 checksum of the given data, in canonical s2~ × 2^{16} + ~s1 form. This function is provided to help with setting up more complicated zlib dictionaries; see format-specific options.

Dependencies (2)

  • alexandria
  • trivial-gray-streams

Dependents (0)

    • GitHub
    • Quicklisp