decompress
2024-10-12
No Description
Upstream URL
Author
Maintainer
License
1Overview
:PROPERTIES::CUSTOM_ID: overview:END:The title speaks for itself. More formats may be added in the future but for now
none are planned. A compressor counterpart is highly unlikely. A download can be
found here; there is also a Github mirror, but note that this Readme might not
display correctly there. On Quicklisp, the corresponding system is called
decompress
, although the package it defines is semz.decompress
.
1.1Features
:PROPERTIES::CUSTOM_ID: features:END:- Supported formats: Deflate, zlib, gzip, bzip2, LZMA, LZMA2, XZ.
- Safety: Edge cases are properly detected and explicitly handled; malformeddata is rejected early.¹ The interface is designed to be hard to misuse; inparticular, single-member and multi-member data can be handled with separatefunctions, defaults try to provide the strongest guarantees, and built-inintegrity checks are run without user intervention. No FFI is used anywhere.
- No mandatory overreads: When reading a single well-formed member from a bytestream, no byte beyond the end of the compressed data will be read, allowingthis library to be used as a part of complicated stream processing for otherformats.
- Performance: To my knowledge, this is the second fastest Deflate decoder andthe fastest bzip2 decoder in pure Common Lisp. Under SBCL, file-to-filedecompression usually hits 30%-80% the speed of the C referenceimplementations (depending on the format) and among CL libraries, we're onlybeaten by 3bz. See the performance section for more details and other CLimplementations.
¹ This is a guideline; a hard guarantee would be infeasible.
1.2Quickstart
:PROPERTIES::CUSTOM_ID: quickstart:END:From example.lisp
:
The supported multi-member formats are gzip, bzip2 and XZ. Members of gzip and bzip2 are trivial to handle manually with decompress or make-decompression-stream if you need to do that for some reason; for XZ, you'll need some contortions to deal with padding, but since the spec explicitly recommends not to use multi-member XZ when embedding into a different format, you probably won't run into this.
For more details, see the reference.
2Comparison to other libraries
:PROPERTIES::CUSTOM_ID: comparison:END:The other libraries under consideration are chipz, which is probably the most popular library; deflate, a compact implementation by Pierre Mai that is notably used by Quicklisp; and 3bz, which is notably used by pngload.
decompress-1.2.0 | chipz-20230418 | deflate-1.0.4 | 3bz-20230521 | |
---|---|---|---|---|
/ | < | |||
supported formats | deflate, zlib, gzip, bzip2, lzma, lzma2, xz | deflate, zlib, gzip, bzip2 | deflate, zlib, gzip | deflate, zlib, gzip |
dependencies | alexandria, trivial‑gray‑streams | alexandria, trivial‑features, nibbles, babel, cffi¹, mmap¹ | ||
license | MIT | BSD‑3 | MIT | MIT |
input sources | vector, stream | simple specialized vector, list, stream, pathname | stream | (s. s.) vector, stream, foreign memory |
output destinations | new vector | new vector, provided (s. s.) vector, stream, pathname | stream | (s. s.) vector |
Gray stream wrapper | yes | only around other streams | no | no |
checksum verification | automatic | automatic | automatic if enabled, disabled by default | automatic |
multi-member handling | manual or automatic | single-only | manual | manual |
edge case handling | strict | lenient | lenient | strict, no library error type |
partial input | library error | library error | end-of-file error |
status flag |
trailing data | ignored & unread, or library error | ignored & read | ignored & unread | ignored; not read with stream input |
preset zlib dictionaries | supported | unsupported | unsupported | unsupported |
bzip2 block randomization | supported | unsupported | n/a | n/a |
¹ Optional depending on implementation.
2.1Edge case handling
:PROPERTIES::CUSTOM_ID: edge-cases:END:2.1.1Deflate / zlib / gzip
:PROPERTIES::CUSTOM_ID: edge-cases-deflate:END:The Deflate spec has a few rough edges which we deal with as follows.
- Some decompressors (e.g. GNU gzip) allow back references to go beyond thebeginning of the output data, pretending that the octets out of bounds wereall zeroes. This one always signals an error.
- Huffman trees are uniquely determined by the associated sequence of codelengths, but not every such sequence corresponds to a tree - some leave codesunassigned and some make codes ambiguous. The spec isn't clear on what to dowith such sequences (outside of two special cases in §3.2.7). While one canargue that leaving codes unassigned is fine as long as they don't appear, wealways signal an error as soon as a sequence doesn't correspond to a(complete) Huffman tree.
- According to §3.2.7, the situation where only a single distance code is usedmay be represented by having only one code length of one. We enforce that theunique distance code is always encoded as a zero bit and never as a one bit.
- A dynamic Deflate block can omit the encoding for the end-of-block marker. Iam convinced that this is an oversight in the standard since the onlysituation where an encoder would want to output such a block is if it knows inadvance the input is infinite and that the miniscule inefficiency of encodingan EOB would be worse than the inefficiency that arises from being unable tochange the Huffman tree (no EOB means no new dynamic block).
However, there's little reason to ban this; since Deflate already allows arbitrarily many blocks of arbitrary length, nothing stops a stream from being infinite anyway and a finite input is going to run into EOF either way.
- Dynamic Deflate blocks can provide encodings for the illegal literal/lengthcodes 286 and 287 as well as the illegal distance codes 30 and 31, althoughthese may never occur in the data. We allow them in blocks since otherwise thefixed Deflate block code wouldn't be expressible dynamically. Somedecompressors (e.g. zlib) do not allow this at all.
Zlib and gzip do not add many edge cases by themselves; note however that the
gzip file format can consist of multiple gzip streams which each come with
their own filename/etc info in their respective header. You can access these
with decompress
if necessary, but most decoders ignore anything except maybe
the first header, which is readily accessible via decompress-all
.
A more detailed comparison of edge case handling across libraries can be found
below. An error is an error of some library-defined type that could be handled
by a library user; a basic error is something like a failed assertion or
(error "foo")
, where the problem is explicitly detected but can't be
reasonably handled by a user; an internal error is something like a type error
or an array bounds violation that is signalled by the implementation, but not
necessarily reliably so.
Deviations from zlib behavior are boldfaced for convenience; note that deviation is not necessarily a bad thing in itself. We consider internal errors deviation since they may cause behavior changes under some implementations.
edge case | zlib (zpipe) | decompress | 3bz | deflate | chipz |
---|---|---|---|---|---|
/ | < | ||||
reserved block type | error | error | basic error | error | error |
uncompressed zero-length block | handled | handled | handled | handled | handled |
uncompressed block with wrong checksum | error | error | basic error | error | error |
uncompressed zero-length block with wrong checksum | error | error | basic error | error | outputs data |
reference goes beyond previous output | error | error | basic error | outputs data | outputs data |
uses illegal length code 286/287 | error | error | basic error | error | error |
uses illegal distance code 30/31 | error | error | basic error | error | error |
dynamic block provides encodings for 286/287 & 30/31 | error | outputs data | internal error | outputs data | outputs data |
dynamic block with one distance code (§3.2.7) | handled | handled | handled | handled | handled |
dynamic block with one distance code, uses the unassigned code | error | error | basic error | error | error |
dynamic block without distance codes (§3.2.7) | handled | handled | handled | handled | handled |
dynamic block without distance codes, tries to use len code | error | error | basic error | error | error |
Huffman tree with too many items | error | error | basic error | outputs data | outputs data |
Huffman tree with too few items | error | error | basic error | outputs data | outputs data |
Huffman tree with too few items, uses an unassigned code | n/a | n/a | n/a | error | error |
dynamic block has code lengths expand across len/dist boundary | handled | handled | handled | handled | handled |
dynamic block has code lengths expand out of bounds | error | error | basic error | error | error |
dynamic block starts by repeating the last code length | error | error | basic error | error | error |
reference stays in bounds but goes beyond zlib window size | outputs data | error | outputs data | outputs data | outputs data |
zlib preset dictionary required but none provided | error | error | basic error | error | outputs data |
wrong Adler-32 checksum | error | error | basic error | outputs data by default | error |
wrong gzip magic bytes | error | error | basic error | error | error |
gzip data with wrong CRC-32 checksum | error | error | basic error | outputs data by default | error |
gzip data with right CRC-32 checksum but wrong modular length | error | error | outputs data | outputs data | outputs data |
gzip data with header checksum | handled | handled | handled | handled | handled |
gzip data with incorrect header checksum | error | error | basic error | outputs data | outputs data |
gzip extra fields | handled | handled | basic error | handled | basic error |
inconsistent length fields in extra fields section | outputs data | error | n/a | outputs data | n/a |
2.1.2Bzip2
:PROPERTIES::CUSTOM_ID: edge-cases-bzip2:END:Bzip2 has no spec, so we mostly try to match the reference implementation. There is one notable exception: The reference implementation accepts underfull Huffman trees and only complains once unassigned codes are used; we reject such trees right off the bat since it simplifies the code.
I am not aware of any compressor that outputs such trees; Joe Tsai claims they exist, but provides no names. There isn't a good reason why a compressor would output such trees since bzip2 cannot have alphabets of size one, and whenever you have a branch with only one child, it's more efficient to just give the unique child a code that's one bit shorter. The resulting reshuffling of codes has negligible overhead (none if you only assign them at the end). I suppose this case is something to think about once someone runs into it while using this library.
Chipz is essentially a translation of the reference implementation, but doesn't bother to support the block randomization mechanism that has been deprecated for a good two decades. We do because our approach makes it very easy to add. You can find a test file here; it consists of 100MB of highest quality zero octets. The testfile in lbzip2 doesn't actually test block randomization since it is too short for any effects to show up.
Other than this, the format does not allow for many edge cases to begin with since most of its transformations simply cannot have invalid inputs. Just like the reference implementation, we do not care about wasteful encodings such as e.g. defining six Huffman trees despite only using three.
2.1.3LZMA
:PROPERTIES::CUSTOM_ID: edge-cases-lzma:END:LZMA has a spec, but it's essentially just a commented C++ reference implementation. It leaves very little freedom to the implementer but nevertheless contains some minor pitfalls:
- The range decoder's finishing state is also a valid intermediate state, so itis necessary to define some EOF handling scheme. LZMA defines two and XZdefines another one; there are no other possible schemes. We default to themost permissive scheme (which can handle all valid inputs for the other two)but let the user optionally override this.
- Two of the EOF handling schemes require that the decompressed size is known inadvance. If the last Lempel-Ziv match expands beyond the declared decompressedsize, one might either signal an error or truncate it and finish gracefully;we signal an error, just like the reference implementation and XZ Utils.
- The range decoder's finishing state cannot result in an EOF marker whentreated like an intermediate state. As a result, the "optional EOF marker"scheme (the one we default to) can only be implemented in the way described inthe spec: When the declared size is reached, only try to consume an EOF markerif the range decoder is not yet in a finishing state.
2.1.4LZMA2
:PROPERTIES::CUSTOM_ID: edge-cases-lzma2:END:LZMA2 has no spec, but is a relatively straightforward wrapper on top of LZMA that adds uncompressed blocks, flush points, and LZMA state resets. We don't expose flush points to the user.
We follow XZ Utils' implementation; even though there have been subtle compatibility issues with the LZMA reference implementation in the past, the XZ Utils implementation is stricter and avoids certain edge cases entirely. It is also much cleaner; when testing some of the edge cases below, the LZMA SDK's implementation would often simply segfault.
LZMA2 is subject to the following edge cases:
- The data must start with a flush point; the first LZMA block after a flushpoint must specify new LZMA parameters.
- Every block declares both compressed and decompressed sizes. These could beused to support partial decompression, so they should be verified. We doverify them even though we don't support partial decompression.
- The underlying data of an LZMA block might contain an EOF marker. XZ Utilssignals an error when this happens, as do we. If an implementation allowsthis, care must be taken to either skip trailing data or signal an error ifany exists.
- An LZMA state reset does not reset the
posState
variable. Furthermore,uncompressed blocks will update it appropriately.Note that implementing this correctly only matters in the extremely unusual case where an LZMA block is followed by some uncompressed blocks and then another LZMA block, with no flush points in between, and where the second LZMA block doesn't reset the LZMA state; if the second LZMA block resets the state, then the value of
posState
is completely irrelevant because it merely denotes the starting position in an array of probabilities that are all initialized to the same value anyway.LZMA2 data with this sequence of blocks is extremely unlikely to be produced in practice since the encoder can't really know that the second LZMA block benefits from state reuse despite the uncompressed data in the middle. Nevertheless, a base64-encoded XZ test file that triggers this case can be found below. It decodes to the string
LOL
, but fails to decompress ifposState
isn't updated by uncompressed blocks./Td6WFoAAAD/EtlBAgAhAQoAAABTxyq54AAAAAUJACX//AAAAgAAT4AAAAAFACfRR0AAAAABKAM7StLkBnKeegEAAAAAAFla
2.1.5XZ
:PROPERTIES::CUSTOM_ID: edge-cases-xz:END:XZ has a decent spec and is mostly a container format. XZ Utils supports partial decompression and skipping the checksum verification of unknown checksum types; we don't. Note that we implement all standard filters, not just LZMA2. A list of edge cases is below; most of these can be found in XZ Utils' elaborate test suite as well.
- XZ duplicates stream header information and length information for all blocksat the end of the file to support partial decompression; while we do notsupport partial decompression, we nevertheless verify the lengths at the end.
- XZ's variable length integers must be encoded with the shortest possibleencoding. This is easy to miss because the spec doesn't explicitly mention it,but the example code and the test suite both check for it.
- XZ files can consist of multiple concatenated members, and there may bepadding after each member, including the final one. The padding must come inblocks of four and consist of zero bytes. Padding is not allowed at the startof the file.
- Fields whose size isn't necessarily a multiple of four are padded with theminimal number of zero bytes to ensure such a size. The sole exception tothis is the block header padding, which consists of however many bytes arerequired to match the declared size.
- The total compressed and decompressed data of a single member are each limitedto at most 2^{63} - 1 bytes (~8 exabytes). XZ Utils detects this before theblock that pushes the size over the edge is even decoded; we only detect itonce the read/output data exceeds this limit.
- SHA-256 is limited to 2^{61} - 1 bytes of input and therefore cannot be usedwith extremely large XZ files. We signal an error when a file using SHA-256exceeds this limit.
- LZMA2 can only be the last filter in a filter chain; the other standardfilters (i.e. delta and the BCJ filters) can never be the last filter.
- Embedded LZMA2 data might signal EOF before the declared end of the data. Wesignal an error in this case, as does XZ Utils.
2.2Performance funny numbers
:PROPERTIES::CUSTOM_ID: performance:END:We're usually the fastest with a significant improvement over chipz; under SBCL, 3bz is slightly faster for gzip data and noticeably faster for zlib data. Preserving trailing data hurts a lot for Deflate but not much for the other formats. Under ABCL, Mai's Deflate is faster for heavily compressed files and preserving trailing data is rather expensive.
comparison implementation | SBCL | CCL | ECL | ABCL |
---|---|---|---|---|
/ | <c> | <c> | <c> | <c> |
/ | < | |||
chipz (deflate/zlib/gzip) | 1.79x-2.76x | 1.34x-1.87x | 1.61x-2.30x | 1.16x-4.03x |
chipz (bzip2) | 1.01x-1.32x | 2.06x-2.70x | 2.17x-2.47x | 2.52x-8.52x |
deflate (with checksum) | 2.04x-3.41x | 1.30x-3.91x | 1.87x-3.24x | 0.48x-1.80x |
deflate (no checksum) | 1.33x-3.16x | 1.21x-1.73x | 1.75x-2.09x | 0.46x-1.72x |
3bz | 0.63x-0.93x | 1.07x-1.55x | 2.12x-7.23x | 1.20x-4.65x |
zlib | 0.31x-0.54x | 0.04x-0.12x | 1.00%-2.65% | 0.18%-0.28% |
bzip2 | 0.66x-0.82x | 0.20x-0.22x | 4.15%-5.39% | 1.08%-2.29% |
xz | 0.49x-0.62x | 0.10x-0.17x | 1.04%-1.69% | 0.21%-0.42% |
2.2.1Details
:PROPERTIES::CUSTOM_ID: performance-details:END:The performance measurements below are taken on an older x86-64 machine, taking the best of five attempts, using each library's idiomatic method of file-to-file decompression:
- For decompress, this means
with-open-file
+ Alexandria'scopy-stream
. Weprovide measurements for both single member decompression and multiple memberdecompression. (None of the benchmark files contain multiple members; thismerely demonstrates the performance loss when preserving trailing data.) - For chipz, this means
decompress
with pathname arguments. - For Mai's Deflate, this means
with-open-file
+ theinflate-*-stream
functions. We provide measurements with and without checksums to reflect afair comparison and the default respectively. - For 3bz, we use this code. Note that this involves manual buffering. We don'tprovide a measurement with the significantly slower stream input interfacesince it wouldn't be very informative.
We always use the implementation's default optimization settings. Note that
under ECL, one library's declamations can easily affect the optimization
settings other libraries are compiled under; special care was taken to use ECL's
default ((safety 2) (space 0) (speed 3) (debug 3))
for all libraries.
I tried to keep the benchmark files varied since decompression benchmarks can
highly input-dependent, but take the results with a grain of salt regardless. As
a rough baseline, measurements with zlib-1.2.13, bzip2-1.0.8, and xz-5.4.2 are
provided, respectively using the canonical zpipe
example program, bunzip2
,
and xz
.
We use the following real world tarballs:
file | input size | output size | refimpl |
---|---|---|---|
/ | < | ||
<r> | <r> | <r> | |
openjdk‑17.0.6_p10.tar.gz | 105,221,267B | 672,163,840B | 2.618s |
openh264‑2.3.1.tar.gz | 60,290,897B | 73,205,760B | 0.280s |
gimp‑2.10.32.tar.bz2 | 31,397,425B | 207,185,920B | 6.687s |
gcc‑11.3.0.tar.xz | 81,141,364B | 824,309,760B | 7.263s |
ccl‑1.12.1‑linuxx86.tar.gz | 20,872,508B | 80,752,640B | 0.427s |
sbcl‑2.3.3-source.tar.bz2 | 7,408,264B | 43,386,880B | 1.408s |
Python‑3.11.4.tar.xz | 19,954,828B | 99,696,640B | 1.618s |
We also have a bunch of example data that we use to emulate common use cases.
html_standard.html
is HTML as you'd commonly find it on the web; pepper.dat
and mario.dat
are PNG pixel data from a and
and image respectively.
base file | base size | zlib (.z) | bzip2 (.bz2) | xz (.lzma) | xz (.xz) |
---|---|---|---|---|---|
/ | < | ||||
html_standard.html | 12.80MB | 0.053s | 0.373s | 0.115s | 0.121s |
pepper.dat | 51.27MB | 0.148s | 0.512s | 0.238s | 0.262s |
mario.dat | 3.053MB | 0.018s | 0.110s | 0.065s | 0.070s |
2.2.2SBCL-2.3.8
:PROPERTIES::CUSTOM_ID: performance-sbcl:END:SBCL is unsurprisingly the fastest of the tested implementations. We primarily optimize for it; notably we use a different implementation of the various CRCs.
file | refimpl | 3bz | decom (M) | decom (S) | chipz | defl | defl (C) |
---|---|---|---|---|---|---|---|
/ | < | ||||||
<r> | <r> | <r> | <r> | <r> | <r> | <r> | |
openjdk-17.0.6_p10.tar.gz | 2.618s | 5.530s | 5.981s | 7.801s | 13.84s | 10.07s | 12.21s |
openh264-2.3.1.tar.gz | 0.280s | 0.831s | 0.892s | 1.333s | 1.594s | 2.815s | 3.042s |
gimp-2.10.32.tar.bz2 | 6.687s | 9.857s | 10.34s | 10.19s | |||
gcc-11.3.0.tar.xz | 7.263s | 14.93s | 14.94s | ||||
ccl-1.12.1-linuxx86.tar.gz | 0.427s | 0.923s | 1.028s | 1.386s | 2.182s | 1.950s | 2.206s |
sbcl-2.3.3-source.tar.bz2 | 1.408s | 1.979s | 2.098s | 2.507s | |||
Python-3.11.4.tar.xz | 1.618s | 3.090s | 3.083s | ||||
html_standard.html.z | 0.053s | 0.073s | 0.107s | 0.138s | 0.270s | 0.180s | 0.254s |
pepper.z | 0.148s | 0.171s | 0.273s | 0.311s | 0.754s | 0.364s | 0.657s |
mario.z | 0.018s | 0.032s | 0.039s | 0.055s | 0.091s | 0.081s | 0.098s |
html_standard.html.bz2 | 0.373s | 0.566s | 0.587s | 0.571s | |||
pepper.bz2 | 0.512s | 0.714s | 0.740s | 0.802s | |||
mario.bz2 | 0.110s | 0.134s | 0.146s | 0.177s | |||
html_standard.html.lzma | 0.115s | 0.186s | 0.205s | ||||
pepper.lzma | 0.238s | 0.387s | 0.416s | ||||
mario.lzma | 0.065s | 0.108s | 0.119s | ||||
html_standard.html.xz | 0.121s | 0.222s | 0.222s | ||||
pepper.xz | 0.262s | 0.529s | 0.529s | ||||
mario.xz | 0.070s | 0.117s | 0.117s |
2.2.3CCL-1.12.2
:PROPERTIES::CUSTOM_ID: performance-ccl:END:CCL has notoriously slow bit operations, resulting in an overall 6.5x runtime for Deflate decompression compared to SBCL and less pronounced differences between libraries. This is apparently being taken care of currently and I'd expect performance to change drastically once these fixes land. Preserving trailing data makes almost no difference.
Bzip2 and LZMA runtimes are about 3-4x those of SBCL, which is pretty typical. XZ runtimes suffer from CCL's weakness around 64-bit arithmetic, as required by CRC-64.
file | 3bz | decom (M) | decom (S) | chipz | defl | defl (C) |
---|---|---|---|---|---|---|
/ | < | |||||
<r> | <r> | <r> | <r> | <r> | <r> | |
openjdk-17.0.6_p10.tar.gz | 52.87s | 36.56s | 39.65s | 63.12s | 48.52s | 54.23s |
openh264-2.3.1.tar.gz | 7.327s | 6.236s | 6.751s | 9.911s | 10.04s | 10.65s |
gimp-2.10.32.tar.bz2 | 33.84s | 34.74s | 73.68s | |||
gcc-11.3.0.tar.xz | 67.94s | 68.16s | ||||
ccl-1.12.1-linuxx86.tar.gz | 8.518s | 6.963s | 7.209s | 10.18s | 8.416s | 9.075s |
sbcl-2.3.3-source.tar.bz2 | 6.693s | 6.903s | 18.04s | |||
Python-3.11.4.tar.xz | 13.10s | 13.10s | ||||
html_standard.html.z | 0.847s | 0.646s | 0.689s | 0.940s | 0.895s | 1.582s |
pepper.z | 1.907s | 1.227s | 1.281s | 2.292s | 2.128s | 4.802s |
mario.z | 0.292s | 0.273s | 0.293s | 0.365s | 0.343s | 0.503s |
html_standard.html.bz2 | 1.685s | 1.725s | 3.779s | |||
pepper.bz2 | 2.307s | 2.363s | 4.754s | |||
mario.bz2 | 0.494s | 0.515s | 1.317s | |||
html_standard.html.lzma | 0.696s | 0.752s | ||||
pepper.lzma | 1.418s | 1.491s | ||||
mario.lzma | 0.431s | 0.458s | ||||
html_standard.html.xz | 1.013s | 1.017s | ||||
pepper.xz | 2.602s | 2.592s | ||||
mario.xz | 0.503s | 0.503s |
2.2.4ECL-21.2.1
:PROPERTIES::CUSTOM_ID: performance-ecl:END:Deflate decompression sits at some 25x-50x the runtime of SBCL, which is rough, but still usable for smaller data. Bzip2 ends up at a nicer 20x-30x runtime. LZMA/XZ decompression suffers from heavy type check overhead and is slower than bzip2 as a result.
file | 3bz | decom (M) | decom (S) | chipz | defl | defl (C) |
---|---|---|---|---|---|---|
/ | < | |||||
<r> | <r> | <r> | <r> | <r> | <r> | |
ccl-1.12.1-linuxx86.tar.gz | 128.7s | 41.39s | 53.99s | 76.07s | 76.02s | 77.23s |
sbcl-2.3.3-source.tar.bz2 | 30.60s | 34.43s | 75.46s | |||
Python-3.11.4.tar.xz | 155.4s | 156.2s | ||||
html_standard.html.z | 16.27s | 3.901s | 4.938s | 6.539s | 6.845s | 8.401s |
pepper.z | 40.32s | 5.576s | 6.706s | 12.83s | 11.63s | 18.06s |
mario.z | 3.823s | 1.804s | 2.269s | 2.899s | 3.211s | 3.548s |
html_standard.html.bz2 | 6.931s | 7.612s | 16.25s | |||
pepper.bz2 | 10.15s | 10.99s | 23.02s | |||
mario.bz2 | 2.652s | 3.038s | 5.764s | |||
html_standard.html.lzma | 8.427s | 9.199s | ||||
pepper.lzma | 14.06s | 15.26s | ||||
mario.lzma | 5.468s | 5.886s | ||||
html_standard.html.xz | 10.28s | 10.33s | ||||
pepper.xz | 21.72s | 21.67s | ||||
mario.xz | 5.898s | 5.897s |
2.2.5ABCL-1.9.2
:PROPERTIES::CUSTOM_ID: performance-abcl:END:ABCL's replace
and manual array ops are obscenely slow, which punishes copying
and Lisp-side buffering. As a result, the straightforward unbuffered code of
Mai's Deflate outperforms several buffered implementations, especially for
heavily compressed files.
In general, Deflate performance is close to the point of unusability; you're
likely better off using java.util.zip
. Our bzip2/XZ performance isn't that bad
compared to ECL, likely due to the lower amount of bit diddling; note however
how harshly the runtime increases for the (highly-compressed, hence copy-heavy)
pepper.dat
.
file | 3bz | decom (M) | decom (S) | chipz | defl | defl (C) |
---|---|---|---|---|---|---|
/ | < | |||||
<r> | <r> | <r> | <r> | <r> | <r> | |
html_standard.html.z | 62.56s | 23.40s | 34.95s | 55.66s | 39.57s | 42.04s |
pepper.z | 98.34s | 81.41s | 95.19s | 94.55s | 37.10s | 39.35s |
mario.z | 29.39s | 6.324s | 12.18s | 25.46s | 10.86s | 11.01s |
html_standard.html.bz2 | 17.22s | 26.33s | 103.5s | |||
pepper.bz2 | 47.34s | 57.43s | 119.2s | |||
mario.bz2 | 4.798s | 9.454s | 40.87s | |||
html_standard.html.lzma | 32.37s | 41.58s | ||||
pepper.lzma | 111.3s | 130.4s | ||||
mario.lzma | 17.09s | 22.34s | ||||
html_standard.html.xz | 35.51s | 34.80s | ||||
pepper.xz | 121.5s | 120.9s | ||||
mario.xz | 16.86s | 16.83s |
3Reference
:PROPERTIES::CUSTOM_ID: reference:END:In what follows, an octet vector is a one-dimensional array containing
octets, i.e. integers x with 0 ≤ x ≤ 255. Octet vectors that are passed to the
library need not be specialized or simple; octet vectors that are returned by
the library are always both. When we speak of bounding index designators, we
use the term with the same meaning as in Common Lisp, but additionally allow a
starting index of nil
to denote the beginning. Whenever a function takes
optional bounding index designators, the default is to denote the entire
sequence.
3.1Function: decompress-all
:PROPERTIES::CUSTOM_ID: fn-decompress-all:END:decompress-all format input&key start end output-size=> decompressed-buffer, headerformat
: A format specifier. See list-supported-formats.input
: Either abinary-input-stream
or an octet vector.start
,end
: Bounding index designators forinput
if the latter is avector. Ignored ifinput
is a stream.output-size
: A non-negative integer ornil
. The default isnil
.decompressed-buffer
: An octet vector.header
: A property list with keyword keys.
The most convenient interface to this library. Returns a fresh octet vector that
contains the decompressed form of the format
-compressed data in input
. If
the format allows multiple members, all members are decompressed and
concatenated. Trailing data results in a decompression-error
.
header
is the header information of the first member.
See decompression-stream-header.
output-size
, if not nil
, should be the predicted size of the decompressed
data. If correct, this prevents unnecessary copying at the end.
In addition to the options listed above, decompress-all
can
take format-specific options.
This function is equivalent to calling decompress with all-members-p
set; it
exists to prevent the subtle errors that result from a forgotten
all-members-p
. If you want to decompress single members or preserve trailing
data, use decompress instead. To handle large data in a streaming fashion,
use make-full-decompression-stream.
3.2Function: decompress
:PROPERTIES::CUSTOM_ID: fn-decompress:END:decompress format input &key start end output-size all-members-p allow-overreads-p => decompressed-buffer, header
format
: A format specifier. See list-supported-formats.input
: Either abinary-input-stream
or an octet vector.start
,end
: Bounding index designators forinput
if the latter is avector. Ignored ifinput
is a stream.output-size
: A non-negative integer ornil
. The default isnil
.all-members-p
: A generalized boolean. The default isnil
.allow-overreads-p
: A generalized boolean. The default isnil
.decompressed-buffer
: An octet vector.header
: A property list with keyword keys.
Returns a fresh octet vector that contains the decompressed form of the
format
-compressed data in input
. If the format allows multiple members, only
the first member is decompressed. Trailing data is ignored and will not be read
from stream inputs, so that it can be handled by other code.
header
is the header information of the decompressed member.
See decompression-stream-header.
output-size
, if not nil
, should be the predicted size of the decompressed
data. If correct, this prevents unnecessary copying at the end.
If all-members-p
is true, act like decompress-all instead. If the value of
all-members-p
is known to be constant and true, it is recommended to
use decompress-all over decompress
since forgetting to set all-members-p
can
result in subtle bugs down the line.
If allow-overreads-p
is true and input
is a stream, trailing data may be
read, resulting in a minor speedup. This option rarely has to be specified since
it is implied by all-members-p
.
In addition to the options listed above, decompress
can
take format-specific options.
To handle large data in a streaming fashion, use make-decompression-stream. To handle all members at once, use decompress-all.
3.3Condition: decompression-error
:PROPERTIES::CUSTOM_ID: c-decompression-error:END:Direct superclasses: simple-error
General superclass for errors related to decompression. While the error message gives detailed information about the exact error, we do not distinguish between e.g. failed checksums, out of bounds references or malformed Huffman trees on the type level since this information is basically useless to a programmatic user and usually just an artifact of corruption.
Errors related to trailing data are also a decompression-error
since it isn't
possible to distinguish a malformed member from trailing data.
Gross user errors such as passing in invalid bounding index designators or
returning invalid dictionaries from a dictionary function do not signal a
decompression-error
.
3.4Condition: eof
:PROPERTIES::CUSTOM_ID: c-eof:END:Direct superclasses: decompression-error
Signalled when the input stream/buffer is exhausted. Notably implies that we did not detect errors in the data up until this point, but this is not a hard guarantee that the data can be continued in a valid manner since it would be infeasible to verify this.
Even when the input is a stream, it is this condition which is signalled, not
end-of-file
.
3.5Condition: unrecognized-zlib-dictionary
:PROPERTIES::CUSTOM_ID: c-unrecognized-zlib-dictionary:END:Direct superclasses: decompression-error
Signalled when zlib decompression involves a preset dictionary whose checksum isn't recognized by the dictionary function.
Note: This condition is only signalled when preset dictionaries are enabled. See format-specific options.
3.6Function: unrecognized-zlib-dictionary-checksum
:PROPERTIES::CUSTOM_ID: fn-unrecognized-zlib-dictionary-checksum:END:unrecognized-zlib-dictionary-checksum uzd => checksum
uzd
: An unrecognized-zlib-dictionary condition.checksum
: An(unsigned-byte 32)
.
Returns the unrecognized checksum that was encountered during zlib decompression, in the form adler-32 outputs. See format-specific options.
3.7Function: list-supported-formats
:PROPERTIES::CUSTOM_ID: fn-list-supported-formats:END:list-supported-formats => list
list
: A list of symbols.
Returns a list of symbols that are valid format specifiers, i.e. can be used
as format
arguments to make-decompression-stream and decompress to specify a
compression format.
3.8Class: decompression-stream
:PROPERTIES::CUSTOM_ID: c-decompression-stream:END:Direct superclasses: fundamental-binary-input-stream
Gray stream wrapper that can be used for complicated stream processing. The end of file is considered reached when the underlying compressed data ends and all associated decompressed data has been read. Users need not close these streams.
3.9Function: make-decompression-stream
:PROPERTIES::CUSTOM_ID: fn-make-decompression-stream:END:make-decompression-stream format input &key start end allow-overreads-p all-members-p => decompression-stream
format
: A format specifier. See list-supported-formats.input
: Either abinary-input-stream
or an octet vector.start
,end
: Bounding index designators forinput
if the latter is avector. Ignored ifinput
is a stream.all-members-p
: A generalized boolean. The default isnil
.allow-overreads-p
: A generalized boolean. The default isnil
.decompression-stream
: A decompression-stream.
The most general interface to this library. All other interfaces are implemented in terms of this one.
Returns a decompression-stream whence the decompressed form of the
format
-compressed data in input
can be read. If the format allows multiple
members, only the first member is decompressed. Trailing data is ignored and
will not be read from stream inputs, so that it can be handled by other code.
input
should not be modified or read from until decompression-stream
has
reached end of file. If input
is a stream, then the caller is responsible for
closing it.
Calling this function will read header information from input
and make it
available via decompression-stream-header, but no decompression will be
performed until data is read from decompression-stream
.
If all-members-p
is true, the stream will decompress and concatenate all
compressed members from the stream (if the format defines multiple members) and
signal a decompression-error if any trailing data is detected, similar
to decompress-all. The provided header information is always that of the first
member. If the value of all-members-p
is known to be constant and true, it is
recommended to use make-full-decompression-stream over
make-decompression-stream
since forgetting to set all-members-p
can result
in subtle bugs down the line.
If allow-overreads-p
is true and input
is a stream, trailing data may be
read, resulting in a minor speedup. This option rarely has to be specified since
it is implied by all-members-p
.
In addition to the options listed above, make-decompression-stream
can
take format-specific options.
3.10Function: make-full-decompression-stream
:PROPERTIES::CUSTOM_ID: fn-make-full-decompression-stream:END:make-full-decompression-stream format input &key start end => decompression-stream
format
: A format specifier. See list-supported-formats.input
: Either abinary-input-stream
or an octet vector.start
,end
: Bounding index designators forinput
if the latter is avector. Ignored ifinput
is a stream.decompression-stream
: A decompression-stream.
Wrapper around make-decompression-stream that always passes all-members-p
.
3.11Function: decompression-stream-format
:PROPERTIES::CUSTOM_ID: fn-decompression-stream-format:END:decompression-stream-format stream => format
stream
: A decompression-stream.format
: A format specifier.
Returns the format
argument used to create stream
. format
is always an
element of the list returned by list-supported-formats.
3.12Function: decompression-stream-header
:PROPERTIES::CUSTOM_ID: fn-decompression-stream-header:END:decompression-stream-header stream => header
stream
: A decompression-stream.header
: A property list with keyword keys.
Returns the header metadata of the data associated to stream
. Depending on the
format, the possible entries are as below.
3.12.1Deflate
:PROPERTIES::CUSTOM_ID: header-deflate:END:Deflate data doesn't have a header.
3.12.2Zlib
:PROPERTIES::CUSTOM_ID: header-zlib:END:window-size
: The declared window size, in bytes.level
: The "decompression level", a rough measure of whether speed or sizewere prioritized during compression. Goes from 0 to 3, with 0 meaning fastestand 3 meaning smallest.dictionary
: The Adler-32 checksum (as returned by adler-32) of the presetdictionary, if applicable, ornil
.
3.12.3Gzip
:PROPERTIES::CUSTOM_ID: header-gzip:END:We do not bother converting IDs and enumerations into symbols because this doesn't meaningfully reduce complexity - the uses for the affected fields are so niche that someone who wants to use them is probably more familiar with the raw data than anything we come up with.
textp
: Boolean. If true, the data is supposed to be text and might requireline ending conversion.extra-fields
: Alist with two-character-string keys and octet vector values.See gzip spec. Note that keys may be repeated; they have the same order theyhad in the file.filename
: String ornil
. Filename of the original file, if provided.comment
: String ornil
. Comment field, if provided.modification-time
: Integer. Modification time of the original file or timeof compression, given in Unix time.extra-flags
: Integer. See gzip spec.operating-system
: Integer. See gzip spec.
3.12.4Bzip2
:PROPERTIES::CUSTOM_ID: header-bzip2:END:block-size
: Integer. The BWT block size in bytes.
3.12.5Raw LZMA
:PROPERTIES::CUSTOM_ID: header-raw-lzma:END:Raw LZMA data doesn't have a header.
3.12.6LZMA
:PROPERTIES::CUSTOM_ID: header-lzma:END:lc
,lp
,pb
: Integer. The corresponding LZMA parameter of the same name.window-size
: The declared window size, in bytes.decompressed-size
: The declared output size, in bytes. A decompression-error issignalled when this doesn't match the actual output size, so you can use it toprepare suitable buffers.
3.12.7Raw LZMA2
:PROPERTIES::CUSTOM_ID: header-raw-lzma2:END:Raw LZMA2 data doesn't have a header.
3.12.8LZMA2
:PROPERTIES::CUSTOM_ID: header-lzma2:END:window-size
: The declared window size, in bytes.
3.12.9XZ
:PROPERTIES::CUSTOM_ID: header-xz:END:checksum-type
: An integer from 0 to 15 indicating the checksum used.Preassigned numbers are 0 for none, 1 for CRC-32, 4 for CRC-64, and 10 forSHA-256.
3.13Format-specific options
:PROPERTIES::CUSTOM_ID: format-specific-options:END:The following keyword arguments can be passed to make-decompression-stream, make-full-decompression-stream, decompress and decompress-all for specific formats. They are ignored for all other formats.
format | mandatory | optional |
---|---|---|
deflate | window-size , prefix , prefix-start , prefix-end |
|
zlib | dictionary |
|
gzip | ||
raw-lzma | lc , lp , pb , window-size |
decompressed-size , eof-mode |
lzma | eof-mode |
|
raw-lzma2 | window-size |
|
lzma2 | ||
xz |
window-size
(deflate, raw-lzma, raw-lzma2): A non-negative integer.Determines how far back references can reach; references that go back furtherresult in a decompression-error. Fordeflate
, the default is 2^{15}, whichresults in no restrictions whatsoever. Forraw-lzma
andraw-lzma2
, thisparameter has no default; it is mandatory.prefix
,prefix-start
,prefix-end
(deflate): Eithernil
or an octetvector with optional bounding index designators. If notnil
, makes theoctets inprefix
available to back references as if they were previousdecompressor output. The default isnil
.dictionary
(zlib): A function ornil
. Implements preset dictionaries. Ifnil
(the default), preset dictionaries are disabled and signal a genericdecompression-error when encountered. The easiest way to use this option isvia make-simple-zlib-dictionary.In general, a function must take a single
(unsigned-byte 32)
, representing an Adler-32 checksum as returned by adler-32, and return three valuesprefix
,prefix-start
,prefix-end
that are used like the options of the same name for Deflate. Ifprefix
isnil
, an unrecognized-zlib-dictionary condition is signalled.lc
,lp
,pb
(raw-lzma): Integers such that 0 ≤ lc ≤ 8 and 0 ≤ lp, pb ≤ 4.Correspond to the LZMA parameters of the same name. These parameters do nothave defaults; they are mandatory.eof-mode
(raw-lzma, lzma): One of the keywords:always
,:maybe
, or:never
. Determines whether data whose decompressed size is known isterminated by an end of file marker. The default is:maybe
, which can handleall valid inputs from the other two modes; the only reason to change thisoption is to be stricter about inputs.decompressed-size
(raw-lzma): Either nil or a non-negative integer. If notnil, this is the declared decompressed size of the data, as in the header dataof the same name forlzma
; if the real size differs, a decompression-erroris signalled. Note that this interacts witheof-mode
. Not to be confusedwith theoutput-size
option for decompress and decompress-all, which ismerely an optimization hint.
3.14Function: make-simple-zlib-dictionary
:PROPERTIES::CUSTOM_ID: fn-make-simple-zlib-dictionary:END:make-simple-zlib-dictionary buffers => dictionary
buffers
: A sequence of octet vectors.dictionary
: A function.
Returns a function suitable to be passed as a dictionary
argument for zlib
decompression which recognizes every octet vector in the sequence buffers
and
no others. The vectors should not be modified afterwards. An error is signalled
if multiple vectors with distinct contents have the same checksum.
See format-specific options for details and how to make more complicated
dictionaries.
3.15Function: adler-32
:PROPERTIES::CUSTOM_ID: fn-adler-32:END:adler-32 data &key start end => checksum
data
: An octet vector.start
,end
: Bounding index designators fordata
.checksum
: An(unsigned-byte 32)
.
Returns the Adler-32 checksum of the given data, in canonical
s2~ × 2^{16} + ~s1
form. This function is provided to help with setting up
more complicated zlib dictionaries; see format-specific options.