From: linux@horizon.com
To: git@vger.kernel.org
Subject: Re: Compression and dictionaries
Date: 15 Aug 2006 04:33:03 -0400 [thread overview]
Message-ID: <20060815083303.13253.qmail@science.horizon.com> (raw)
Just to inform this discussion, a brief description of the zlib
"deflate" algorithm. A full description is in RFC1951.
This is LZ77 compression, meaning that what is actually encoded is
a series of instruction to either
- Insert a specified byte, or
- Copy a run of n bytes, starting m bytes ago.
I.e. roughly
union {
struct { bool is_literal, char literal};
struct { bool is_literal; uint8_t length; uint16_t distance; } copy;
};
Note that 100 repetitions of a character can be unambiguously encoded
as the character, followed by "copy 99 chars starting at offset -1".
There are also 257 possible literals, the 256 bytes plus a special EOF token.
The valid copy lengths are 3..258, and the valid distances are 1..32768.
Anyway, the deflate algorithm operates in a couple of steps:
- Translate the input into an (uncompressed) series of literals and
copies.
- Take the literals and the lengths and put them into one big Huffman tree.
- Take the distances (actually, the simplified distances; large
distances are grouped into ranges, which are followed by a binary suffix)
and put them in a separate Huffman tree.
- A Huffman tree can be transmitted by encoding the number of bits assigned
to each symbol, with a apecial value (infinity is mathematically
the most logical but deflate calls it zero) for a symbol that does
not appear in the tree at all.
- The way that deflate sends the lengths involves run-length encoding
and a third Huffman tree, but I won't go into it now.
The finding of the copies is the most expensive part. The minimum
copy length is 3 bytes, so the 3 bytes starting at each position are
hashed and the result put in a hash table, then you search the hash
chain looking for the longest match. There are various heuristics
for aborting the search early, which the compression level adjusts.
-1 doesn't look very hard at all, while -9 always does exhaustive search.
(For example, when searching at all aggressively, when you've found a
match, you search starting at the next byte position in the input as well.
If that produces a longer match, you emit the skipped byte as a literal
and take the longer match and start looking for an improvement on it.)
Anyway, input is divided into blocks. Each block is has a 2-bit
prefix that encodes one of three possibilities:
- Send uncompressed
- Send compressed, with fixed (specified in the standard) literal/length &
distance Huffman code. This "static Huffman tree" case is useful for
short blocks where the overhead of encoding the Huffman tree is
significant.
- Send compressed, using dynamic (encoded in the file) Huffman codes.
This is the most commonly used case for large, compressible files.
Note that "copy" intructions can span blocks; the block boundaries are
only opportunities to change the way that the copy instructions are
coded if the file changes character (like text vs. data parts of an
executable). Deciding when to end one block and start a new one is
another set of heuristics.
Only thing like a "dictionary" is the codes used for Huffman
compression.
Now, what you *could* do is pre-load the "copy from this text" buffer
(hash table) with up to 32K of example data, before you start encoding
data for real. Basically, run some example text through the compresser
to warm it up, but throw away the compressed form.
This isn't a "dictionary" in the LZ78 sense (to start with, there are
no boundaries between entries), but serves the same purpose. Note that
since the distance to copy from is limited to 32768, there's no point
to having more warm-up text than that.
next reply other threads:[~2006-08-15 8:33 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-08-15 8:33 linux [this message]
2006-08-15 13:29 ` Compression and dictionaries Jon Smirl
2006-08-15 14:55 ` Jon Smirl
2006-08-16 0:37 ` linux
[not found] ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
2006-08-16 5:50 ` Jon Smirl
2006-08-16 6:33 ` Johannes Schindelin
2006-08-16 6:55 ` Shawn Pearce
2006-08-16 7:09 ` Johannes Schindelin
2006-08-16 14:43 ` Jon Smirl
2006-08-17 22:33 ` linux
-- strict thread matches above, loose matches on Subject: below --
2006-08-14 3:37 Jon Smirl
2006-08-14 3:56 ` Shawn Pearce
2006-08-14 4:07 ` Jon Smirl
2006-08-14 4:17 ` Shawn Pearce
2006-08-14 7:48 ` Alex Riesen
2006-08-14 10:06 ` Erik Mouw
2006-08-14 12:33 ` Johannes Schindelin
2006-08-14 14:08 ` Jon Smirl
2006-08-14 14:45 ` Johannes Schindelin
2006-08-14 16:15 ` Jon Smirl
2006-08-14 16:32 ` David Lang
2006-08-14 16:55 ` Jakub Narebski
2006-08-14 17:15 ` Jeff Garzik
2006-08-14 17:34 ` David Lang
2006-08-14 17:50 ` Jeff Garzik
2006-08-14 18:48 ` Jon Smirl
2006-08-14 19:08 ` David Lang
2006-08-14 19:38 ` Johannes Schindelin
2006-08-14 15:14 ` Alex Riesen
2006-08-14 15:26 ` Johannes Schindelin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20060815083303.13253.qmail@science.horizon.com \
--to=linux@horizon.com \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).