git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>
Cc: git <git@vger.kernel.org>
Subject: Re: Compression and dictionaries
Date: Mon, 14 Aug 2006 10:08:33 -0400	[thread overview]
Message-ID: <9e4733910608140708i45e3d6day6b87676783fd6511@mail.gmail.com> (raw)
In-Reply-To: <Pine.LNX.4.63.0608141415560.10541@wbgn013.biozentrum.uni-wuerzburg.de>

On 8/14/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Sun, 13 Aug 2006, Jon Smirl wrote:
>
> > > From what I remember from long ago most compression schemes build
> > dictionaries as a way of achieving significant compression. If so,
> > since we zlib compress each entry in a pack individually, are there
> > many copies of very similar dictionaries in the pack?
>
> No, there are no dictionaries in the pack. At least no explicit ones.
>
> The trick is that the dictionary builds up with the data incrementally:
> both the encoding process and the decoding process construct it
> identically, on the fly.
>
> Example: A very primitive example of compression is the arithmetic coder
> of single characters. Given a probability distribution of the characters,
> each character is encoded such that the length of the encoding of one
> character is reciprocal to its probability *Footnote 1*. If you want to
> compress context-free, i.e. without looking at the characters before or
> after the current character when encoding or decoding, this is provably
> optimal.
>
> Now, if you do not have a probability distribution (because you haven't
> looked at the file yet), you can build an histogram as approximation on
> the fly. Whenever a character is encoded, you take the current histogram
> to approximate the probability distribution, and adjust the histogram for
> that character.
>
> For zlib, it is a little more involved (it is not a simple histogram), but
> the principle holds: if you do not have a dictionary, you just build one
> from the data seen so far.

Does a zlib dictionary just changes the probabilities in the histogram
or does it turn the dictionary into a pre-loaded encoding tree?

The other compression schemes I looked at let you load in a
precomputed huffman/arithmetic encoding tree. By preloading an
encoding tree you avoid storing the encoding of "void => 010101' in
every  item. Removing 1M encoding maps and using one common one should
be a win. Items not in the map would still be stored using internal
additions to the map.

Changing the probabilities probably won't help much, but there may be
good gains from partially eliminating 1M encoding maps.

>
> BTW I doubt that an explicit dictionary would be good: Either you
> distribute it with git, and have many people complaining that it is either
> too small, or too big, and in most cases does not fit their data well, or
> you have to create it for each local repository, which takes time.
>
> Further, if the pack-file becomes corrupt, you usually still have the
> pack index, or the start of the pack-file, and can reconstruct most of the
> objects. If you use a dictionary, and just one bit flips in it, you're
> screwed.
>
> Ciao,
> Dscho
>
> Footnote 1: If the probabilities are all powers of two (with negative
> exponent), the encodings can be fixed bit strings (because the lengths are
> integer numbers); if that is not the case, it gets a little more
> complicated; basically, you store a fractional bit as a state; that is why
> it is called arithmetic coding.
>


-- 
Jon Smirl
jonsmirl@gmail.com

  reply	other threads:[~2006-08-14 14:08 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-14  3:37 Compression and dictionaries Jon Smirl
2006-08-14  3:56 ` Shawn Pearce
2006-08-14  4:07   ` Jon Smirl
2006-08-14  4:17     ` Shawn Pearce
2006-08-14  7:48       ` Alex Riesen
2006-08-14 10:06     ` Erik Mouw
2006-08-14 12:33 ` Johannes Schindelin
2006-08-14 14:08   ` Jon Smirl [this message]
2006-08-14 14:45     ` Johannes Schindelin
2006-08-14 16:15       ` Jon Smirl
2006-08-14 16:32         ` David Lang
2006-08-14 16:55           ` Jakub Narebski
2006-08-14 17:15             ` Jeff Garzik
2006-08-14 17:34               ` David Lang
2006-08-14 17:50                 ` Jeff Garzik
2006-08-14 18:48           ` Jon Smirl
2006-08-14 19:08             ` David Lang
2006-08-14 19:38               ` Johannes Schindelin
2006-08-14 15:14     ` Alex Riesen
2006-08-14 15:26       ` Johannes Schindelin
  -- strict thread matches above, loose matches on Subject: below --
2006-08-15  8:33 linux
2006-08-15 13:29 ` Jon Smirl
2006-08-15 14:55 ` Jon Smirl
2006-08-16  0:37   ` linux
     [not found]     ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
2006-08-16  5:50       ` Jon Smirl
2006-08-16  6:33         ` Johannes Schindelin
2006-08-16  6:55           ` Shawn Pearce
2006-08-16  7:09             ` Johannes Schindelin
2006-08-16 14:43               ` Jon Smirl
2006-08-17 22:33           ` linux

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9e4733910608140708i45e3d6day6b87676783fd6511@mail.gmail.com \
    --to=jonsmirl@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).