Re: Compression and dictionaries

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Compression and dictionaries
@ 2006-08-15  8:33 linux
  2006-08-15 13:29 ` Jon Smirl
  2006-08-15 14:55 ` Jon Smirl
  0 siblings, 2 replies; 30+ messages in thread
From: linux @ 2006-08-15  8:33 UTC (permalink / raw)
  To: git

Just to inform this discussion, a brief description of the zlib
"deflate" algorithm.  A full description is in RFC1951.

This is LZ77 compression, meaning that what is actually encoded is
a series of instruction to either
- Insert a specified byte, or
- Copy a run of n bytes, starting m bytes ago.

I.e. roughly

union {
	struct { bool is_literal, char literal};
	struct { bool is_literal; uint8_t length; uint16_t distance; } copy;
};

Note that 100 repetitions of a character can be unambiguously encoded
as the character, followed by "copy 99 chars starting at offset -1".

There are also 257 possible literals, the 256 bytes plus a special EOF token.

The valid copy lengths are 3..258, and the valid distances are 1..32768.

Anyway, the deflate algorithm operates in a couple of steps:

- Translate the input into an (uncompressed) series of literals and
  copies.
- Take the literals and the lengths and put them into one big Huffman tree.
- Take the distances (actually, the simplified distances; large
  distances are grouped into ranges, which are followed by a binary suffix)
  and put them in a separate Huffman tree.
- A Huffman tree can be transmitted by encoding the number of bits assigned
  to each symbol, with a apecial value (infinity is mathematically
  the most logical but deflate calls it zero) for a symbol that does
  not appear in the tree at all.
- The way that deflate sends the lengths involves run-length encoding
  and a third Huffman tree, but I won't go into it now.

The finding of the copies is the most expensive part.  The minimum
copy length is 3 bytes, so the 3 bytes starting at each position are
hashed and the result put in a hash table, then you search the hash
chain looking for the longest match.  There are various heuristics
for aborting the search early, which the compression level adjusts.
-1 doesn't look very hard at all, while -9 always does exhaustive search.

(For example, when searching at all aggressively, when you've found a
match, you search starting at the next byte position in the input as well.
If that produces a longer match, you emit the skipped byte as a literal
and take the longer match and start looking for an improvement on it.)

Anyway, input is divided into blocks.  Each block is has a 2-bit
prefix that encodes one of three possibilities:
- Send uncompressed
- Send compressed, with fixed (specified in the standard) literal/length &
  distance Huffman code.  This "static Huffman tree" case is useful for
  short blocks where the overhead of encoding the Huffman tree is
  significant.
- Send compressed, using dynamic (encoded in the file) Huffman codes.
  This is the most commonly used case for large, compressible files.

Note that "copy" intructions can span blocks; the block boundaries are
only opportunities to change the way that the copy instructions are
coded if the file changes character (like text vs. data parts of an
executable).  Deciding when to end one block and start a new one is
another set of heuristics.

Only thing like a "dictionary" is the codes used for Huffman
compression.

Now, what you *could* do is pre-load the "copy from this text" buffer
(hash table) with up to 32K of example data, before you start encoding
data for real.   Basically, run some example text through the compresser
to warm it up, but throw away the compressed form.

This isn't a "dictionary" in the LZ78 sense (to start with, there are
no boundaries between entries), but serves the same purpose.  Note that
since the distance to copy from is limited to 32768, there's no point
to having more warm-up text than that.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-15  8:33 Compression and dictionaries linux
@ 2006-08-15 13:29 ` Jon Smirl
  2006-08-15 14:55 ` Jon Smirl
  1 sibling, 0 replies; 30+ messages in thread
From: Jon Smirl @ 2006-08-15 13:29 UTC (permalink / raw)
  To: linux@horizon.com; +Cc: git

On 15 Aug 2006 04:33:03 -0400, linux@horizon.com <linux@horizon.com> wrote:
> Just to inform this discussion, a brief description of the zlib
> "deflate" algorithm.  A full description is in RFC1951.
>
> This is LZ77 compression, meaning that what is actually encoded is
> a series of instruction to either
> - Insert a specified byte, or
> - Copy a run of n bytes, starting m bytes ago.

Shawn and I did some testing overnight. A simple 4KB preload table
yielded a 4% size reduction in my 850MB pack. I'm still hoping for 10%
using an optimal 32K table.

Paper on computing the optimal preload table:
http://www.eecs.harvard.edu/~michaelm/postscripts/dcc2001a.pdf

If anyone is interested in doing some research, it would be
interesting to build a pack engine based on full-text search
technology. For example cLucene,
http://clucene.sourceforge.net/index.php/Main_Page The full-text
search app contains the code for extracting the dictionary and then
encoding all of the blobs using it.

My prediction is that a large global dictionary based scheme will
compress significantly better than zlib is capable of.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-15  8:33 Compression and dictionaries linux
  2006-08-15 13:29 ` Jon Smirl
@ 2006-08-15 14:55 ` Jon Smirl
  2006-08-16  0:37   ` linux
  1 sibling, 1 reply; 30+ messages in thread
From: Jon Smirl @ 2006-08-15 14:55 UTC (permalink / raw)
  To: linux@horizon.com; +Cc: git

I explained our situation to Mark Adler, zlib author, and he recommend
checking out this book for techniques that can be used.

http://www.cs.mu.oz.au/mg/

-----
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-15 14:55 ` Jon Smirl
@ 2006-08-16  0:37   ` linux
       [not found]     ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
  0 siblings, 1 reply; 30+ messages in thread
From: linux @ 2006-08-16  0:37 UTC (permalink / raw)
  To: jonsmirl; +Cc: git, linux

> I explained our situation to Mark Adler, zlib author, and he recommend
> checking out this book for techniques that can be used.
> 
> http://www.cs.mu.oz.au/mg/

I was about to suggest the same thing!  It's an excellent book.

It's about a piece of software and the choices made there, but it
explains in detail many alternatives and why they weren't chosen for
the mg software, but what they might be useful for.

The mg software itself is designed for human-language text, and does
indexing as well.  So it does a fixed breakdown into alternate word and
non-word tokens, builds a lexicon with frequency tables, then uses the
frequencies to build Huffman trees, and Huffman-compresses each token.
The "word" dictionary is also used as part of the search index, as each
word has a (very cleverly compressed to less than one byte on average)
pointer to each document it appears in with a count of the number of
appearances.

The word encoding is straight zero-order Huffman, so inter-word
correlations are not used.

For software, with lots of punctuation and multi-word idioms
("for (i = 0; i < n; i++) {"), the basic design is not very well suited.
(To say nothing of binary data, like some people who want to
check images or multi-megabyte video files into git.)

But there is a good description of many possible algorithms, with
lots of emphasis on practicalities.

And the software *does* support dynamic collections, via auxiliary indexes
and escape codes for any new words not found in the main dictionary.
In fact, generating the Huffman tree from as little as 1/8 of the material
to be compressed only loses you 5.7% of the compression.  (Table 9011,
p. 360.)

However, that is for English text, with a pretty Zipf-like distribution.
In code, we generate new words (new function names) frequently, and
proceed to use them heavily.

It's worth noting the similarity between generating a good base dictionary
with finding a good base version of a file for delta-encoding.
You may end up having to divide the documents into classes (different
source languages for example - C vs. asm vs. perl vs. python vs.
docs vs. GIFs), and use a different dictionary per class.  But that
drives up the cost of compression (you have to see which class the
file to be compressed falls into).

Anyway, highly recommended.

^ permalink raw reply	[flat|nested] 30+ messages in thread

[parent not found: <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>]

* Re: Compression and dictionaries
       [not found]     ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
@ 2006-08-16  5:50       ` Jon Smirl
  2006-08-16  6:33         ` Johannes Schindelin
  0 siblings, 1 reply; 30+ messages in thread
From: Jon Smirl @ 2006-08-16  5:50 UTC (permalink / raw)
  To: John Rigby; +Cc: linux@horizon.com, git

On 8/16/06, John Rigby <jcrigby@gmail.com> wrote:
> Sorry if this is off topic, but could the dictionary be used to make
> git-grep alot faster?

It would be almost instant.

Inverted full-text indices are what Google uses to grep 20B web pages in 1 sec.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-16  5:50       ` Jon Smirl
@ 2006-08-16  6:33         ` Johannes Schindelin
  2006-08-16  6:55           ` Shawn Pearce
  2006-08-17 22:33           ` linux
  0 siblings, 2 replies; 30+ messages in thread
From: Johannes Schindelin @ 2006-08-16  6:33 UTC (permalink / raw)
  To: Jon Smirl; +Cc: John Rigby, linux@horizon.com, git

Hi,

On Wed, 16 Aug 2006, Jon Smirl wrote:

> On 8/16/06, John Rigby <jcrigby@gmail.com> wrote:
> > Sorry if this is off topic, but could the dictionary be used to make
> > git-grep alot faster?
> 
> It would be almost instant.

But only if you are not using a regular expression, but a single word.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-16  6:33         ` Johannes Schindelin
@ 2006-08-16  6:55           ` Shawn Pearce
  2006-08-16  7:09             ` Johannes Schindelin
  2006-08-17 22:33           ` linux
  1 sibling, 1 reply; 30+ messages in thread
From: Shawn Pearce @ 2006-08-16  6:55 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jon Smirl, John Rigby, linux@horizon.com, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
> 
> On Wed, 16 Aug 2006, Jon Smirl wrote:
> 
> > On 8/16/06, John Rigby <jcrigby@gmail.com> wrote:
> > > Sorry if this is off topic, but could the dictionary be used to make
> > > git-grep alot faster?
> > 
> > It would be almost instant.
> 
> But only if you are not using a regular expression, but a single word.

Yes and no.  If the inverted index contains terms broken by some
known pattern (e.g. break on word-type boundaries) and the regex
in question has constant sections (it should, otherwise it might
as well just be '.') then you can reduce your search space to a
fraction of the overall data by looking at the inverted index to
select likely terms, select the related revisions containing those
possible terms, then run the regex only on those revisions.

Sure you would be possibly pulling out a number of false positives
but if the constant sequence(s) in the regex reduce your search
space to below 1/2 of the overall data that's probably a lot less
I/O and CPU required to complete the query, even if you have to
read the entire dictionary and apply each term in the dictionary
to the regex to look for those possible matches.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-16  6:55           ` Shawn Pearce
@ 2006-08-16  7:09             ` Johannes Schindelin
  2006-08-16 14:43               ` Jon Smirl
  0 siblings, 1 reply; 30+ messages in thread
From: Johannes Schindelin @ 2006-08-16  7:09 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jon Smirl, John Rigby, linux@horizon.com, git

Hi,

On Wed, 16 Aug 2006, Shawn Pearce wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > Hi,
> > 
> > On Wed, 16 Aug 2006, Jon Smirl wrote:
> > 
> > > On 8/16/06, John Rigby <jcrigby@gmail.com> wrote:
> > > > Sorry if this is off topic, but could the dictionary be used to make
> > > > git-grep alot faster?
> > > 
> > > It would be almost instant.
> > 
> > But only if you are not using a regular expression, but a single word.
> 
> Yes and no.  If the inverted index contains terms broken by some
> known pattern (e.g. break on word-type boundaries) and the regex
> in question has constant sections (it should, otherwise it might
> as well just be '.') then you can reduce your search space to a
> fraction of the overall data by looking at the inverted index to
> select likely terms, select the related revisions containing those
> possible terms, then run the regex only on those revisions.
> 
> Sure you would be possibly pulling out a number of false positives
> but if the constant sequence(s) in the regex reduce your search
> space to below 1/2 of the overall data that's probably a lot less
> I/O and CPU required to complete the query, even if you have to
> read the entire dictionary and apply each term in the dictionary
> to the regex to look for those possible matches.

So it would speed up the search, but no, in case of regular expressions, 
particularly any interesting one, the result would not be instantaneous.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-16  7:09             ` Johannes Schindelin
@ 2006-08-16 14:43               ` Jon Smirl
  0 siblings, 0 replies; 30+ messages in thread
From: Jon Smirl @ 2006-08-16 14:43 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Shawn Pearce, John Rigby, linux@horizon.com, git

On 8/16/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Wed, 16 Aug 2006, Shawn Pearce wrote:
>
> > Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> > > Hi,
> > >
> > > On Wed, 16 Aug 2006, Jon Smirl wrote:
> > >
> > > > On 8/16/06, John Rigby <jcrigby@gmail.com> wrote:
> > > > > Sorry if this is off topic, but could the dictionary be used to make
> > > > > git-grep alot faster?
> > > >
> > > > It would be almost instant.
> > >
> > > But only if you are not using a regular expression, but a single word.
> >
> > Yes and no.  If the inverted index contains terms broken by some
> > known pattern (e.g. break on word-type boundaries) and the regex
> > in question has constant sections (it should, otherwise it might
> > as well just be '.') then you can reduce your search space to a
> > fraction of the overall data by looking at the inverted index to
> > select likely terms, select the related revisions containing those
> > possible terms, then run the regex only on those revisions.
> >
> > Sure you would be possibly pulling out a number of false positives
> > but if the constant sequence(s) in the regex reduce your search
> > space to below 1/2 of the overall data that's probably a lot less
> > I/O and CPU required to complete the query, even if you have to
> > read the entire dictionary and apply each term in the dictionary
> > to the regex to look for those possible matches.
>
> So it would speed up the search, but no, in case of regular expressions,
> particularly any interesting one, the result would not be instantaneous.

Instant is a relative term. Google is instant compared to running grep
over 10TB of data. How long would that take, a month?

Shawn is correct, the inverted indexes are used to eliminate as many
files as possible. So the response time is a more of a function of how
many hits you have instead of how big the data set is. Of course if
you give it a pattern that matches everything it will just as slow as
grep. Give it a pattern that is only in one file and detectable by the
index and it will be very fast. If you are going to give it a bunch of
patterns that aren't in the index, then we need to adjust how the
index is built.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-16  6:33         ` Johannes Schindelin
  2006-08-16  6:55           ` Shawn Pearce
@ 2006-08-17 22:33           ` linux
  1 sibling, 0 replies; 30+ messages in thread
From: linux @ 2006-08-17 22:33 UTC (permalink / raw)
  To: Johannes.Schindelin, jonsmirl; +Cc: git, jcrigby, linux

Just a note: there are index structures that support regular expression
searching.  In particular, a PAT tree, usually represented implicitly as
a PAT array, can be walked by a finite automaton to find all the places
it matches.

However, there's a lot of code complexity associated with that.  And a
PAT array assumes efficient random access to the text being indexed,
as it does not keep a copy of the text.

Perhaps most importantly, this would be a big change to "git grep", as
it would search every object in the database, not a particular commit.
And mapping objects back to filenames in trees and commits requires
another index.

Compression dictionaries and indexes have some opposing points.  In a
compression dictionary, you prefer common words that appear in many places.
For an index, you prefer rare words that identify a small set of files.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Compression and dictionaries
@ 2006-08-14  3:37 Jon Smirl
  2006-08-14  3:56 ` Shawn Pearce
  2006-08-14 12:33 ` Johannes Schindelin
  0 siblings, 2 replies; 30+ messages in thread
From: Jon Smirl @ 2006-08-14  3:37 UTC (permalink / raw)
  To: git

>From what I remember from long ago most compression schemes build
dictionaries as a way of achieving significant compression. If so,
since we zlib compress each entry in a pack individually, are there
many copies of very similar dictionaries in the pack?

Some compression schemes support being initialized with a fixed
dictionary and sharing it over all entries. A fixed dictionary could
be built by analysing a large pack file. Sharing a compression
dictionary would probably be a win for me since I have 1M+ entries in
a pack.

Poking around in the zlib it appears that zlib supports precomputed
dictionaries.
http://www.zlib.net/manual.html

  int deflateSetDictionary (z_streamp strm, const Bytef *dictionary,
uInt dictLength);

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14  3:37 Jon Smirl
@ 2006-08-14  3:56 ` Shawn Pearce
  2006-08-14  4:07   ` Jon Smirl
  2006-08-14 12:33 ` Johannes Schindelin
  1 sibling, 1 reply; 30+ messages in thread
From: Shawn Pearce @ 2006-08-14  3:56 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> From what I remember from long ago most compression schemes build
> dictionaries as a way of achieving significant compression. If so,
> since we zlib compress each entry in a pack individually, are there
> many copies of very similar dictionaries in the pack?

Yes, possibly.  Every object in the pack has its own dictionary.

But I'm not sure if there would be any savings from sharing
dictionaries.  One problem is you probably don't want a single
massive dictionary for the entire pack as it could be very large,
plus updating it with additions would likely require recompressing
every entry.  Typically once an entry in the pack has been compressed
GIT won't recompress it.

However whenever possible deltas get used between objects.
This allows an object to copy content from another object, with
copy commands typically taking just a couple of bytes to copy a
whole range of bytes from the other object.  This works pretty
well when the current revision of a file is stored with just zlib
compression and older revisions copy their content from the current
revision using the delta format.

I should note that delta compression works on trees, commits and
tags too, however it gets the most benefit out of trees when only
a fraction of the files in the tree are modified.  Commits and tags
are harder to delta as they tend to be mostly different.

My fast-import computes deltas in the order you are feeding
it objects, so each blob is deltafied against the prior object.
Since you are feeding them in reverse RCS order (newest to oldest)
you are probably getting a reasonably good delta compression.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14  3:56 ` Shawn Pearce
@ 2006-08-14  4:07   ` Jon Smirl
  2006-08-14  4:17     ` Shawn Pearce
  2006-08-14 10:06     ` Erik Mouw
  0 siblings, 2 replies; 30+ messages in thread
From: Jon Smirl @ 2006-08-14  4:07 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git

On 8/13/06, Shawn Pearce <spearce@spearce.org> wrote:
> Jon Smirl <jonsmirl@gmail.com> wrote:
> > From what I remember from long ago most compression schemes build
> > dictionaries as a way of achieving significant compression. If so,
> > since we zlib compress each entry in a pack individually, are there
> > many copies of very similar dictionaries in the pack?
>
> Yes, possibly.  Every object in the pack has its own dictionary.
>
> But I'm not sure if there would be any savings from sharing
> dictionaries.  One problem is you probably don't want a single
> massive dictionary for the entire pack as it could be very large,
> plus updating it with additions would likely require recompressing
> every entry.  Typically once an entry in the pack has been compressed
> GIT won't recompress it.

The zlib doc says to put your most common strings into the fixed
dictionary. If a string isn't in the fixed dictionary it will get
handled with an internal dictionary entry.  By default zlib runs with
an empty fixed dictionary and handles everything with the internal
dictionary.

Since we are encoding C many strings will always be present (if,
static, define, const, char, include, int, void, while, continue,
etc).  Do you have any tools to identify the top 500 strings in C
code? The fixed dictionary would get hardcoded into the git apps.

A fixed dictionary could conceivably take 5-10% off the size of each entry.

> However whenever possible deltas get used between objects.
> This allows an object to copy content from another object, with
> copy commands typically taking just a couple of bytes to copy a
> whole range of bytes from the other object.  This works pretty
> well when the current revision of a file is stored with just zlib
> compression and older revisions copy their content from the current
> revision using the delta format.
>
> I should note that delta compression works on trees, commits and
> tags too, however it gets the most benefit out of trees when only
> a fraction of the files in the tree are modified.  Commits and tags
> are harder to delta as they tend to be mostly different.
>
> My fast-import computes deltas in the order you are feeding
> it objects, so each blob is deltafied against the prior object.
> Since you are feeding them in reverse RCS order (newest to oldest)
> you are probably getting a reasonably good delta compression.
>
> --
> Shawn.
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14  4:07   ` Jon Smirl
@ 2006-08-14  4:17     ` Shawn Pearce
  2006-08-14  7:48       ` Alex Riesen
  2006-08-14 10:06     ` Erik Mouw
  1 sibling, 1 reply; 30+ messages in thread
From: Shawn Pearce @ 2006-08-14  4:17 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

Jon Smirl <jonsmirl@gmail.com> wrote:
> The zlib doc says to put your most common strings into the fixed
> dictionary. If a string isn't in the fixed dictionary it will get
> handled with an internal dictionary entry.  By default zlib runs with
> an empty fixed dictionary and handles everything with the internal
> dictionary.
 
> Since we are encoding C many strings will always be present (if,
> static, define, const, char, include, int, void, while, continue,
> etc).  Do you have any tools to identify the top 500 strings in C
> code? The fixed dictionary would get hardcoded into the git apps.

Actually GIT itself may also benefit from other strings beyond
those common found in C-like languages:

	'10644 '
	'40000 '
	'parent '
	'tree '
	'author '
	'committer '

as these occur frequently in trees and commits.
 
> A fixed dictionary could conceivably take 5-10% off the size of each entry.

Could be an interesting experiment to see if that's really true
for common loads (e.g. the kernel repo).  I don't think anyone has
tried it.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14  4:17     ` Shawn Pearce
@ 2006-08-14  7:48       ` Alex Riesen
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Riesen @ 2006-08-14  7:48 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Jon Smirl, git

On 8/14/06, Shawn Pearce <spearce@spearce.org> wrote:
> > A fixed dictionary could conceivably take 5-10% off the size of each entry.
>
> Could be an interesting experiment to see if that's really true
> for common loads (e.g. the kernel repo).  I don't think anyone has
> tried it.

BTW, kenel repo rapidly becomes "less than common load" for git ;)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14  4:07   ` Jon Smirl
  2006-08-14  4:17     ` Shawn Pearce
@ 2006-08-14 10:06     ` Erik Mouw
  1 sibling, 0 replies; 30+ messages in thread
From: Erik Mouw @ 2006-08-14 10:06 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Shawn Pearce, git

On Mon, Aug 14, 2006 at 12:07:45AM -0400, Jon Smirl wrote:
> Since we are encoding C many strings will always be present (if,
> static, define, const, char, include, int, void, while, continue,
> etc).  Do you have any tools to identify the top 500 strings in C
> code? The fixed dictionary would get hardcoded into the git apps.

We are not only encoding C anymore. Git might have started as a tool to
maintain the linux kernel tree, but its use got beyond that.


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14  3:37 Jon Smirl
  2006-08-14  3:56 ` Shawn Pearce
@ 2006-08-14 12:33 ` Johannes Schindelin
  2006-08-14 14:08   ` Jon Smirl
  1 sibling, 1 reply; 30+ messages in thread
From: Johannes Schindelin @ 2006-08-14 12:33 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

Hi,

On Sun, 13 Aug 2006, Jon Smirl wrote:

> > From what I remember from long ago most compression schemes build
> dictionaries as a way of achieving significant compression. If so,
> since we zlib compress each entry in a pack individually, are there
> many copies of very similar dictionaries in the pack?

No, there are no dictionaries in the pack. At least no explicit ones.

The trick is that the dictionary builds up with the data incrementally: 
both the encoding process and the decoding process construct it 
identically, on the fly.

Example: A very primitive example of compression is the arithmetic coder 
of single characters. Given a probability distribution of the characters, 
each character is encoded such that the length of the encoding of one 
character is reciprocal to its probability *Footnote 1*. If you want to 
compress context-free, i.e. without looking at the characters before or 
after the current character when encoding or decoding, this is provably 
optimal.

Now, if you do not have a probability distribution (because you haven't 
looked at the file yet), you can build an histogram as approximation on 
the fly. Whenever a character is encoded, you take the current histogram 
to approximate the probability distribution, and adjust the histogram for 
that character.

For zlib, it is a little more involved (it is not a simple histogram), but 
the principle holds: if you do not have a dictionary, you just build one 
from the data seen so far.

BTW I doubt that an explicit dictionary would be good: Either you 
distribute it with git, and have many people complaining that it is either 
too small, or too big, and in most cases does not fit their data well, or 
you have to create it for each local repository, which takes time.

Further, if the pack-file becomes corrupt, you usually still have the 
pack index, or the start of the pack-file, and can reconstruct most of the 
objects. If you use a dictionary, and just one bit flips in it, you're 
screwed.

Ciao,
Dscho

Footnote 1: If the probabilities are all powers of two (with negative 
exponent), the encodings can be fixed bit strings (because the lengths are 
integer numbers); if that is not the case, it gets a little more 
complicated; basically, you store a fractional bit as a state; that is why 
it is called arithmetic coding.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 12:33 ` Johannes Schindelin
@ 2006-08-14 14:08   ` Jon Smirl
  2006-08-14 14:45     ` Johannes Schindelin
  2006-08-14 15:14     ` Alex Riesen
  0 siblings, 2 replies; 30+ messages in thread
From: Jon Smirl @ 2006-08-14 14:08 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On 8/14/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Sun, 13 Aug 2006, Jon Smirl wrote:
>
> > > From what I remember from long ago most compression schemes build
> > dictionaries as a way of achieving significant compression. If so,
> > since we zlib compress each entry in a pack individually, are there
> > many copies of very similar dictionaries in the pack?
>
> No, there are no dictionaries in the pack. At least no explicit ones.
>
> The trick is that the dictionary builds up with the data incrementally:
> both the encoding process and the decoding process construct it
> identically, on the fly.
>
> Example: A very primitive example of compression is the arithmetic coder
> of single characters. Given a probability distribution of the characters,
> each character is encoded such that the length of the encoding of one
> character is reciprocal to its probability *Footnote 1*. If you want to
> compress context-free, i.e. without looking at the characters before or
> after the current character when encoding or decoding, this is provably
> optimal.
>
> Now, if you do not have a probability distribution (because you haven't
> looked at the file yet), you can build an histogram as approximation on
> the fly. Whenever a character is encoded, you take the current histogram
> to approximate the probability distribution, and adjust the histogram for
> that character.
>
> For zlib, it is a little more involved (it is not a simple histogram), but
> the principle holds: if you do not have a dictionary, you just build one
> from the data seen so far.

Does a zlib dictionary just changes the probabilities in the histogram
or does it turn the dictionary into a pre-loaded encoding tree?

The other compression schemes I looked at let you load in a
precomputed huffman/arithmetic encoding tree. By preloading an
encoding tree you avoid storing the encoding of "void => 010101' in
every  item. Removing 1M encoding maps and using one common one should
be a win. Items not in the map would still be stored using internal
additions to the map.

Changing the probabilities probably won't help much, but there may be
good gains from partially eliminating 1M encoding maps.

>
> BTW I doubt that an explicit dictionary would be good: Either you
> distribute it with git, and have many people complaining that it is either
> too small, or too big, and in most cases does not fit their data well, or
> you have to create it for each local repository, which takes time.
>
> Further, if the pack-file becomes corrupt, you usually still have the
> pack index, or the start of the pack-file, and can reconstruct most of the
> objects. If you use a dictionary, and just one bit flips in it, you're
> screwed.
>
> Ciao,
> Dscho
>
> Footnote 1: If the probabilities are all powers of two (with negative
> exponent), the encodings can be fixed bit strings (because the lengths are
> integer numbers); if that is not the case, it gets a little more
> complicated; basically, you store a fractional bit as a state; that is why
> it is called arithmetic coding.
>


-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 14:08   ` Jon Smirl
@ 2006-08-14 14:45     ` Johannes Schindelin
  2006-08-14 16:15       ` Jon Smirl
  2006-08-14 15:14     ` Alex Riesen
  1 sibling, 1 reply; 30+ messages in thread
From: Johannes Schindelin @ 2006-08-14 14:45 UTC (permalink / raw)
  To: Jon Smirl; +Cc: git

Hi,

On Mon, 14 Aug 2006, Jon Smirl wrote:

> Does a zlib dictionary just changes the probabilities in the histogram 
> or does it turn the dictionary into a pre-loaded encoding tree?

I have to admit that I do not know zlib well enough to tell off the top of 
my head, but I guess it would make more sense to have it as a preloaded 
encoding tree.

> The other compression schemes I looked at let you load in a
> precomputed huffman/arithmetic encoding tree. By preloading an
> encoding tree you avoid storing the encoding of "void => 010101' in
> every  item. Removing 1M encoding maps and using one common one should
> be a win. Items not in the map would still be stored using internal
> additions to the map.
> 
> Changing the probabilities probably won't help much, but there may be
> good gains from partially eliminating 1M encoding maps.

I _think_ that it would not matter much. The deltas have a more important 
impact.

> > Further, if the pack-file becomes corrupt, you usually still have the 
> > pack index, or the start of the pack-file, and can reconstruct most of 
> > the objects. If you use a dictionary, and just one bit flips in it, 
> > you're screwed.

I still think that this is important to think through: Is it worth a 
couple of kilobytes (I doubt that it would be as much as 1MB in _total_), 
and be on the unsafe side?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 14:45     ` Johannes Schindelin
@ 2006-08-14 16:15       ` Jon Smirl
  2006-08-14 16:32         ` David Lang
  0 siblings, 1 reply; 30+ messages in thread
From: Jon Smirl @ 2006-08-14 16:15 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

On 8/14/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> I still think that this is important to think through: Is it worth a
> couple of kilobytes (I doubt that it would be as much as 1MB in _total_),
> and be on the unsafe side?

The maps look something like this:
void => 10101010
char => 111101
int => 1110101
tree => 11010111
commit => 101011001

These maps are repeated in every one of my 1M revisions including the
deltas. I have 1GB pack files with 1M entries in them - 1K each entry.
Each byte saved out of a zlib entry take 1MB off my pack.

Note that the current internal maps aren't the same in each of each of
the zlib blobs since the algorithm that builds the internal maps
depends on the order the identifiers were encountered.

If the git tools add a global dictionary the tools would still be able
to read existing packs. If old tools try to read a new dictionary
based pack they will get the zlib NEED_DICT error.

If the entire file was one big zlib blob there would only be one
dictionary and adding a fixed dictionary wouldn't make any difference.
But since it is 1M little zlib blobs it is has 1M dictionaries.

The only "unsafe" aspect I see to this is if the global dictionary
doesn't contain any of the words in the documents being encoded. In
that case the global dictionary will occupy the short huffman keys
forcing longer internal keys.  The keys for the words in the document
would be longer by a about a bit on average.

A solution for making this work over time would be to store the global
dictionary at the front of the pack file and for the unpack tools to
use the stored copy. This would let us change the global dictionary in
the pack tool with no downside, you could even support multiple
dictionaries in the pack tool.

If someone wants to get fancy you could write a tool that would scan a
pack file and compute an optimal fixed dictionary. Store it at the
front of the pack file and repack using it.

Global dictionaries are common in full text searching. I seem to
recall an article stating that Google's global dictionary has about
250K entries in it. If git packs switch to a global dictionary model
it's not a big leap to add a full text search index. You just need
objects for each word in the dictionary pointing to the revisions that
contain it.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 16:15       ` Jon Smirl
@ 2006-08-14 16:32         ` David Lang
  2006-08-14 16:55           ` Jakub Narebski
  2006-08-14 18:48           ` Jon Smirl
  0 siblings, 2 replies; 30+ messages in thread
From: David Lang @ 2006-08-14 16:32 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Johannes Schindelin, git

On Mon, 14 Aug 2006, Jon Smirl wrote:

> On 8/14/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> I still think that this is important to think through: Is it worth a
>> couple of kilobytes (I doubt that it would be as much as 1MB in _total_),
>> and be on the unsafe side?
>
> The only "unsafe" aspect I see to this is if the global dictionary
> doesn't contain any of the words in the documents being encoded. In
> that case the global dictionary will occupy the short huffman keys
> forcing longer internal keys.  The keys for the words in the document
> would be longer by a about a bit on average.

the other factor that was mentioned was that a single-bit corruption in the 
dictionary would make the entire pack file useless. if this is really a concern 
then just store multiple copies of the dictionary. on a pack with lots of files 
in it it can still be a significant win.

David Lang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 16:32         ` David Lang
@ 2006-08-14 16:55           ` Jakub Narebski
  2006-08-14 17:15             ` Jeff Garzik
  2006-08-14 18:48           ` Jon Smirl
  1 sibling, 1 reply; 30+ messages in thread
From: Jakub Narebski @ 2006-08-14 16:55 UTC (permalink / raw)
  To: git

David Lang wrote:

> the other factor that was mentioned was that a single-bit corruption in the 
> dictionary would make the entire pack file useless. if this is really a concern 
> then just store multiple copies of the dictionary. on a pack with lots of files 
> in it it can still be a significant win.

Or use some error-correcting code for storing dictionary.

-- 
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 16:55           ` Jakub Narebski
@ 2006-08-14 17:15             ` Jeff Garzik
  2006-08-14 17:34               ` David Lang
  0 siblings, 1 reply; 30+ messages in thread
From: Jeff Garzik @ 2006-08-14 17:15 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: git

Jakub Narebski wrote:
> David Lang wrote:
> 
>> the other factor that was mentioned was that a single-bit corruption in the 
>> dictionary would make the entire pack file useless. if this is really a concern 
>> then just store multiple copies of the dictionary. on a pack with lots of files 
>> in it it can still be a significant win.
> 
> Or use some error-correcting code for storing dictionary.

Error-correcting code?  We have sha1 hash to determine validity...

	Jeff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 17:15             ` Jeff Garzik
@ 2006-08-14 17:34               ` David Lang
  2006-08-14 17:50                 ` Jeff Garzik
  0 siblings, 1 reply; 30+ messages in thread
From: David Lang @ 2006-08-14 17:34 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Jakub Narebski, git

On Mon, 14 Aug 2006, Jeff Garzik wrote:

> Date: Mon, 14 Aug 2006 13:15:55 -0400
> From: Jeff Garzik <jeff@garzik.org>
> To: Jakub Narebski <jnareb@gmail.com>
> Cc: git@vger.kernel.org
> Subject: Re: Compression and dictionaries
> 
> Jakub Narebski wrote:
>> David Lang wrote:
>> 
>>> the other factor that was mentioned was that a single-bit corruption in 
>>> the dictionary would make the entire pack file useless. if this is really 
>>> a concern then just store multiple copies of the dictionary. on a pack 
>>> with lots of files in it it can still be a significant win.
>> 
>> Or use some error-correcting code for storing dictionary.
>
> Error-correcting code?  We have sha1 hash to determine validity...

that would only tell you that what you have is garbage (and you need to restore 
from backup(, useing a ECC costs some space, but lets you recover from some 
errors without having to resort to backups.

David Lang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 17:34               ` David Lang
@ 2006-08-14 17:50                 ` Jeff Garzik
  0 siblings, 0 replies; 30+ messages in thread
From: Jeff Garzik @ 2006-08-14 17:50 UTC (permalink / raw)
  To: David Lang; +Cc: Jakub Narebski, git

David Lang wrote:
> that would only tell you that what you have is garbage (and you need to 
> restore from backup(, useing a ECC costs some space, but lets you 
> recover from some errors without having to resort to backups.

ECC permits you to recover from very specific, very-limited-damage 
scenarios like bit errors.

On modern hard drives, single-bit data corruption is very very very rare 
(particularly since ECC is already employed on the platter).

	Jeff

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 16:32         ` David Lang
  2006-08-14 16:55           ` Jakub Narebski
@ 2006-08-14 18:48           ` Jon Smirl
  2006-08-14 19:08             ` David Lang
  1 sibling, 1 reply; 30+ messages in thread
From: Jon Smirl @ 2006-08-14 18:48 UTC (permalink / raw)
  To: David Lang; +Cc: Johannes Schindelin, git

On 8/14/06, David Lang <dlang@digitalinsight.com> wrote:
> On Mon, 14 Aug 2006, Jon Smirl wrote:
>
> > On 8/14/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> >> I still think that this is important to think through: Is it worth a
> >> couple of kilobytes (I doubt that it would be as much as 1MB in _total_),
> >> and be on the unsafe side?
> >
> > The only "unsafe" aspect I see to this is if the global dictionary
> > doesn't contain any of the words in the documents being encoded. In
> > that case the global dictionary will occupy the short huffman keys
> > forcing longer internal keys.  The keys for the words in the document
> > would be longer by a about a bit on average.
>
> the other factor that was mentioned was that a single-bit corruption in the
> dictionary would make the entire pack file useless. if this is really a concern
> then just store multiple copies of the dictionary. on a pack with lots of files
> in it it can still be a significant win.

Bit errors can mess the pack up in lots of ways. If it hits a commit
you won't be able to follow the tree back in time. Packs were never
designed to be error tolerant.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 18:48           ` Jon Smirl
@ 2006-08-14 19:08             ` David Lang
  2006-08-14 19:38               ` Johannes Schindelin
  0 siblings, 1 reply; 30+ messages in thread
From: David Lang @ 2006-08-14 19:08 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Johannes Schindelin, git

On Mon, 14 Aug 2006, Jon Smirl wrote:

> On 8/14/06, David Lang <dlang@digitalinsight.com> wrote:
>> On Mon, 14 Aug 2006, Jon Smirl wrote:
>> 
>> > On 8/14/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>> >> I still think that this is important to think through: Is it worth a
>> >> couple of kilobytes (I doubt that it would be as much as 1MB in 
>> _total_),
>> >> and be on the unsafe side?
>> >
>> > The only "unsafe" aspect I see to this is if the global dictionary
>> > doesn't contain any of the words in the documents being encoded. In
>> > that case the global dictionary will occupy the short huffman keys
>> > forcing longer internal keys.  The keys for the words in the document
>> > would be longer by a about a bit on average.
>> 
>> the other factor that was mentioned was that a single-bit corruption in the
>> dictionary would make the entire pack file useless. if this is really a 
>> concern
>> then just store multiple copies of the dictionary. on a pack with lots of 
>> files
>> in it it can still be a significant win.
>
> Bit errors can mess the pack up in lots of ways. If it hits a commit
> you won't be able to follow the tree back in time. Packs were never
> designed to be error tolerant.

I'm not claiming that this is a problem, I'm reponding to other people's claim 
that useing a global dictionary for a pack is a problem becouse if something 
happens to that dictionary the whole pack is worthless by pointing out that, if 
this is viewed as a real problem, it's easy to solve.

David Lang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 19:08             ` David Lang
@ 2006-08-14 19:38               ` Johannes Schindelin
  0 siblings, 0 replies; 30+ messages in thread
From: Johannes Schindelin @ 2006-08-14 19:38 UTC (permalink / raw)
  To: David Lang; +Cc: Jon Smirl, git

Hi,

On Mon, 14 Aug 2006, David Lang wrote:

> On Mon, 14 Aug 2006, Jon Smirl wrote:
> 
> > Bit errors can mess the pack up in lots of ways. If it hits a commit 
> > you won't be able to follow the tree back in time. Packs were never 
> > designed to be error tolerant.
> 
> I'm not claiming that this is a problem, I'm reponding to other people's 
> claim that useing a global dictionary for a pack is a problem becouse if 
> something happens to that dictionary the whole pack is worthless by 
> pointing out that, if this is viewed as a real problem, it's easy to 
> solve.

Let's not solve problems we do not have. I refuse to think about this 
problem further, before there are actually some hard numbers showing an 
improvement there. If the numbers do not show an improvement, we do not 
have that problem at all!

Make a _global_ dictionary, optimize the heck out of it, _use_ that 
dictionary to repack a sizeable repository (I think linux-2.6.git should 
be enough for first tests), and tell the world about the size of the 
dictionary, the original pack, and the new pack.

You would not need to implement this cleanly, just a hack to prove that 
this idea is worth following up. 

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 14:08   ` Jon Smirl
  2006-08-14 14:45     ` Johannes Schindelin
@ 2006-08-14 15:14     ` Alex Riesen
  2006-08-14 15:26       ` Johannes Schindelin
  1 sibling, 1 reply; 30+ messages in thread
From: Alex Riesen @ 2006-08-14 15:14 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Johannes Schindelin, git

On 8/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> Changing the probabilities probably won't help much, but there may be
> good gains from partially eliminating 1M encoding maps.

will the old git installations, without the maps, still be able to decode the
pack created this way?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Compression and dictionaries
  2006-08-14 15:14     ` Alex Riesen
@ 2006-08-14 15:26       ` Johannes Schindelin
  0 siblings, 0 replies; 30+ messages in thread
From: Johannes Schindelin @ 2006-08-14 15:26 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Jon Smirl, git

Hi,

On Mon, 14 Aug 2006, Alex Riesen wrote:

> On 8/14/06, Jon Smirl <jonsmirl@gmail.com> wrote:
> > Changing the probabilities probably won't help much, but there may be
> > good gains from partially eliminating 1M encoding maps.
> 
> will the old git installations, without the maps, still be able to decode the
> pack created this way?

Probably not. But then, I would _not_ want this in upload-pack, so it is a 
strictly local thing -- either you repacked with a dictionary yourself, or 
there would be no explicit dictionary.

However, as I said, I _think_ this discussion is moot. You'd probably be 
better off making the windows wider, activating stronger compression, 
basically investing more time for the repacker. Plus, this would benefit 
cloned repos as well.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2006-08-17 22:33 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-15  8:33 Compression and dictionaries linux
2006-08-15 13:29 ` Jon Smirl
2006-08-15 14:55 ` Jon Smirl
2006-08-16  0:37   ` linux
     [not found]     ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
2006-08-16  5:50       ` Jon Smirl
2006-08-16  6:33         ` Johannes Schindelin
2006-08-16  6:55           ` Shawn Pearce
2006-08-16  7:09             ` Johannes Schindelin
2006-08-16 14:43               ` Jon Smirl
2006-08-17 22:33           ` linux
  -- strict thread matches above, loose matches on Subject: below --
2006-08-14  3:37 Jon Smirl
2006-08-14  3:56 ` Shawn Pearce
2006-08-14  4:07   ` Jon Smirl
2006-08-14  4:17     ` Shawn Pearce
2006-08-14  7:48       ` Alex Riesen
2006-08-14 10:06     ` Erik Mouw
2006-08-14 12:33 ` Johannes Schindelin
2006-08-14 14:08   ` Jon Smirl
2006-08-14 14:45     ` Johannes Schindelin
2006-08-14 16:15       ` Jon Smirl
2006-08-14 16:32         ` David Lang
2006-08-14 16:55           ` Jakub Narebski
2006-08-14 17:15             ` Jeff Garzik
2006-08-14 17:34               ` David Lang
2006-08-14 17:50                 ` Jeff Garzik
2006-08-14 18:48           ` Jon Smirl
2006-08-14 19:08             ` David Lang
2006-08-14 19:38               ` Johannes Schindelin
2006-08-14 15:14     ` Alex Riesen
2006-08-14 15:26       ` Johannes Schindelin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).