git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Compression and dictionaries
@ 2006-08-15  8:33 linux
  2006-08-15 13:29 ` Jon Smirl
  2006-08-15 14:55 ` Jon Smirl
  0 siblings, 2 replies; 30+ messages in thread
From: linux @ 2006-08-15  8:33 UTC (permalink / raw)
  To: git

Just to inform this discussion, a brief description of the zlib
"deflate" algorithm.  A full description is in RFC1951.

This is LZ77 compression, meaning that what is actually encoded is
a series of instruction to either
- Insert a specified byte, or
- Copy a run of n bytes, starting m bytes ago.

I.e. roughly

union {
	struct { bool is_literal, char literal};
	struct { bool is_literal; uint8_t length; uint16_t distance; } copy;
};

Note that 100 repetitions of a character can be unambiguously encoded
as the character, followed by "copy 99 chars starting at offset -1".

There are also 257 possible literals, the 256 bytes plus a special EOF token.

The valid copy lengths are 3..258, and the valid distances are 1..32768.

Anyway, the deflate algorithm operates in a couple of steps:

- Translate the input into an (uncompressed) series of literals and
  copies.
- Take the literals and the lengths and put them into one big Huffman tree.
- Take the distances (actually, the simplified distances; large
  distances are grouped into ranges, which are followed by a binary suffix)
  and put them in a separate Huffman tree.
- A Huffman tree can be transmitted by encoding the number of bits assigned
  to each symbol, with a apecial value (infinity is mathematically
  the most logical but deflate calls it zero) for a symbol that does
  not appear in the tree at all.
- The way that deflate sends the lengths involves run-length encoding
  and a third Huffman tree, but I won't go into it now.

The finding of the copies is the most expensive part.  The minimum
copy length is 3 bytes, so the 3 bytes starting at each position are
hashed and the result put in a hash table, then you search the hash
chain looking for the longest match.  There are various heuristics
for aborting the search early, which the compression level adjusts.
-1 doesn't look very hard at all, while -9 always does exhaustive search.

(For example, when searching at all aggressively, when you've found a
match, you search starting at the next byte position in the input as well.
If that produces a longer match, you emit the skipped byte as a literal
and take the longer match and start looking for an improvement on it.)


Anyway, input is divided into blocks.  Each block is has a 2-bit
prefix that encodes one of three possibilities:
- Send uncompressed
- Send compressed, with fixed (specified in the standard) literal/length &
  distance Huffman code.  This "static Huffman tree" case is useful for
  short blocks where the overhead of encoding the Huffman tree is
  significant.
- Send compressed, using dynamic (encoded in the file) Huffman codes.
  This is the most commonly used case for large, compressible files.

Note that "copy" intructions can span blocks; the block boundaries are
only opportunities to change the way that the copy instructions are
coded if the file changes character (like text vs. data parts of an
executable).  Deciding when to end one block and start a new one is
another set of heuristics.


Only thing like a "dictionary" is the codes used for Huffman
compression.

Now, what you *could* do is pre-load the "copy from this text" buffer
(hash table) with up to 32K of example data, before you start encoding
data for real.   Basically, run some example text through the compresser
to warm it up, but throw away the compressed form.

This isn't a "dictionary" in the LZ78 sense (to start with, there are
no boundaries between entries), but serves the same purpose.  Note that
since the distance to copy from is limited to 32768, there's no point
to having more warm-up text than that.

^ permalink raw reply	[flat|nested] 30+ messages in thread
* Compression and dictionaries
@ 2006-08-14  3:37 Jon Smirl
  2006-08-14  3:56 ` Shawn Pearce
  2006-08-14 12:33 ` Johannes Schindelin
  0 siblings, 2 replies; 30+ messages in thread
From: Jon Smirl @ 2006-08-14  3:37 UTC (permalink / raw)
  To: git

>From what I remember from long ago most compression schemes build
dictionaries as a way of achieving significant compression. If so,
since we zlib compress each entry in a pack individually, are there
many copies of very similar dictionaries in the pack?

Some compression schemes support being initialized with a fixed
dictionary and sharing it over all entries. A fixed dictionary could
be built by analysing a large pack file. Sharing a compression
dictionary would probably be a win for me since I have 1M+ entries in
a pack.

Poking around in the zlib it appears that zlib supports precomputed
dictionaries.
http://www.zlib.net/manual.html

  int deflateSetDictionary (z_streamp strm, const Bytef *dictionary,
uInt dictLength);

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2006-08-17 22:33 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-15  8:33 Compression and dictionaries linux
2006-08-15 13:29 ` Jon Smirl
2006-08-15 14:55 ` Jon Smirl
2006-08-16  0:37   ` linux
     [not found]     ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
2006-08-16  5:50       ` Jon Smirl
2006-08-16  6:33         ` Johannes Schindelin
2006-08-16  6:55           ` Shawn Pearce
2006-08-16  7:09             ` Johannes Schindelin
2006-08-16 14:43               ` Jon Smirl
2006-08-17 22:33           ` linux
  -- strict thread matches above, loose matches on Subject: below --
2006-08-14  3:37 Jon Smirl
2006-08-14  3:56 ` Shawn Pearce
2006-08-14  4:07   ` Jon Smirl
2006-08-14  4:17     ` Shawn Pearce
2006-08-14  7:48       ` Alex Riesen
2006-08-14 10:06     ` Erik Mouw
2006-08-14 12:33 ` Johannes Schindelin
2006-08-14 14:08   ` Jon Smirl
2006-08-14 14:45     ` Johannes Schindelin
2006-08-14 16:15       ` Jon Smirl
2006-08-14 16:32         ` David Lang
2006-08-14 16:55           ` Jakub Narebski
2006-08-14 17:15             ` Jeff Garzik
2006-08-14 17:34               ` David Lang
2006-08-14 17:50                 ` Jeff Garzik
2006-08-14 18:48           ` Jon Smirl
2006-08-14 19:08             ` David Lang
2006-08-14 19:38               ` Johannes Schindelin
2006-08-14 15:14     ` Alex Riesen
2006-08-14 15:26       ` Johannes Schindelin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).