Re: Compression and dictionaries

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: linux@horizon.com
To: git@vger.kernel.org
Subject: Re: Compression and dictionaries
Date: 15 Aug 2006 04:33:03 -0400	[thread overview]
Message-ID: <20060815083303.13253.qmail@science.horizon.com> (raw)

Just to inform this discussion, a brief description of the zlib
"deflate" algorithm.  A full description is in RFC1951.

This is LZ77 compression, meaning that what is actually encoded is
a series of instruction to either
- Insert a specified byte, or
- Copy a run of n bytes, starting m bytes ago.

I.e. roughly

union {
	struct { bool is_literal, char literal};
	struct { bool is_literal; uint8_t length; uint16_t distance; } copy;
};

Note that 100 repetitions of a character can be unambiguously encoded
as the character, followed by "copy 99 chars starting at offset -1".

There are also 257 possible literals, the 256 bytes plus a special EOF token.

The valid copy lengths are 3..258, and the valid distances are 1..32768.

Anyway, the deflate algorithm operates in a couple of steps:

- Translate the input into an (uncompressed) series of literals and
  copies.
- Take the literals and the lengths and put them into one big Huffman tree.
- Take the distances (actually, the simplified distances; large
  distances are grouped into ranges, which are followed by a binary suffix)
  and put them in a separate Huffman tree.
- A Huffman tree can be transmitted by encoding the number of bits assigned
  to each symbol, with a apecial value (infinity is mathematically
  the most logical but deflate calls it zero) for a symbol that does
  not appear in the tree at all.
- The way that deflate sends the lengths involves run-length encoding
  and a third Huffman tree, but I won't go into it now.

The finding of the copies is the most expensive part.  The minimum
copy length is 3 bytes, so the 3 bytes starting at each position are
hashed and the result put in a hash table, then you search the hash
chain looking for the longest match.  There are various heuristics
for aborting the search early, which the compression level adjusts.
-1 doesn't look very hard at all, while -9 always does exhaustive search.

(For example, when searching at all aggressively, when you've found a
match, you search starting at the next byte position in the input as well.
If that produces a longer match, you emit the skipped byte as a literal
and take the longer match and start looking for an improvement on it.)

Anyway, input is divided into blocks.  Each block is has a 2-bit
prefix that encodes one of three possibilities:
- Send uncompressed
- Send compressed, with fixed (specified in the standard) literal/length &
  distance Huffman code.  This "static Huffman tree" case is useful for
  short blocks where the overhead of encoding the Huffman tree is
  significant.
- Send compressed, using dynamic (encoded in the file) Huffman codes.
  This is the most commonly used case for large, compressible files.

Note that "copy" intructions can span blocks; the block boundaries are
only opportunities to change the way that the copy instructions are
coded if the file changes character (like text vs. data parts of an
executable).  Deciding when to end one block and start a new one is
another set of heuristics.

Only thing like a "dictionary" is the codes used for Huffman
compression.

Now, what you *could* do is pre-load the "copy from this text" buffer
(hash table) with up to 32K of example data, before you start encoding
data for real.   Basically, run some example text through the compresser
to warm it up, but throw away the compressed form.

This isn't a "dictionary" in the LZ78 sense (to start with, there are
no boundaries between entries), but serves the same purpose.  Note that
since the distance to copy from is limited to 32768, there's no point
to having more warm-up text than that.

next             reply	other threads:[~2006-08-15  8:33 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-15  8:33 linux [this message]
2006-08-15 13:29 ` Compression and dictionaries Jon Smirl
2006-08-15 14:55 ` Jon Smirl
2006-08-16  0:37   ` linux
     [not found]     ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
2006-08-16  5:50       ` Jon Smirl
2006-08-16  6:33         ` Johannes Schindelin
2006-08-16  6:55           ` Shawn Pearce
2006-08-16  7:09             ` Johannes Schindelin
2006-08-16 14:43               ` Jon Smirl
2006-08-17 22:33           ` linux
  -- strict thread matches above, loose matches on Subject: below --
2006-08-14  3:37 Jon Smirl
2006-08-14  3:56 ` Shawn Pearce
2006-08-14  4:07   ` Jon Smirl
2006-08-14  4:17     ` Shawn Pearce
2006-08-14  7:48       ` Alex Riesen
2006-08-14 10:06     ` Erik Mouw
2006-08-14 12:33 ` Johannes Schindelin
2006-08-14 14:08   ` Jon Smirl
2006-08-14 14:45     ` Johannes Schindelin
2006-08-14 16:15       ` Jon Smirl
2006-08-14 16:32         ` David Lang
2006-08-14 16:55           ` Jakub Narebski
2006-08-14 17:15             ` Jeff Garzik
2006-08-14 17:34               ` David Lang
2006-08-14 17:50                 ` Jeff Garzik
2006-08-14 18:48           ` Jon Smirl
2006-08-14 19:08             ` David Lang
2006-08-14 19:38               ` Johannes Schindelin
2006-08-14 15:14     ` Alex Riesen
2006-08-14 15:26       ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20060815083303.13253.qmail@science.horizon.com \
    --to=linux@horizon.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).