git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Jon Smirl" <jonsmirl@gmail.com>
To: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>
Cc: git <git@vger.kernel.org>
Subject: Re: Compression and dictionaries
Date: Mon, 14 Aug 2006 12:15:29 -0400	[thread overview]
Message-ID: <9e4733910608140915i728004c1p216bf3d74fcc6ab7@mail.gmail.com> (raw)
In-Reply-To: <Pine.LNX.4.63.0608141641330.28360@wbgn013.biozentrum.uni-wuerzburg.de>

On 8/14/06, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> I still think that this is important to think through: Is it worth a
> couple of kilobytes (I doubt that it would be as much as 1MB in _total_),
> and be on the unsafe side?

The maps look something like this:
void => 10101010
char => 111101
int => 1110101
tree => 11010111
commit => 101011001

These maps are repeated in every one of my 1M revisions including the
deltas. I have 1GB pack files with 1M entries in them - 1K each entry.
Each byte saved out of a zlib entry take 1MB off my pack.

Note that the current internal maps aren't the same in each of each of
the zlib blobs since the algorithm that builds the internal maps
depends on the order the identifiers were encountered.

If the git tools add a global dictionary the tools would still be able
to read existing packs. If old tools try to read a new dictionary
based pack they will get the zlib NEED_DICT error.

If the entire file was one big zlib blob there would only be one
dictionary and adding a fixed dictionary wouldn't make any difference.
But since it is 1M little zlib blobs it is has 1M dictionaries.

The only "unsafe" aspect I see to this is if the global dictionary
doesn't contain any of the words in the documents being encoded. In
that case the global dictionary will occupy the short huffman keys
forcing longer internal keys.  The keys for the words in the document
would be longer by a about a bit on average.

A solution for making this work over time would be to store the global
dictionary at the front of the pack file and for the unpack tools to
use the stored copy. This would let us change the global dictionary in
the pack tool with no downside, you could even support multiple
dictionaries in the pack tool.

If someone wants to get fancy you could write a tool that would scan a
pack file and compute an optimal fixed dictionary. Store it at the
front of the pack file and repack using it.

Global dictionaries are common in full text searching. I seem to
recall an article stating that Google's global dictionary has about
250K entries in it. If git packs switch to a global dictionary model
it's not a big leap to add a full text search index. You just need
objects for each word in the dictionary pointing to the revisions that
contain it.

-- 
Jon Smirl
jonsmirl@gmail.com

  reply	other threads:[~2006-08-14 16:15 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-14  3:37 Compression and dictionaries Jon Smirl
2006-08-14  3:56 ` Shawn Pearce
2006-08-14  4:07   ` Jon Smirl
2006-08-14  4:17     ` Shawn Pearce
2006-08-14  7:48       ` Alex Riesen
2006-08-14 10:06     ` Erik Mouw
2006-08-14 12:33 ` Johannes Schindelin
2006-08-14 14:08   ` Jon Smirl
2006-08-14 14:45     ` Johannes Schindelin
2006-08-14 16:15       ` Jon Smirl [this message]
2006-08-14 16:32         ` David Lang
2006-08-14 16:55           ` Jakub Narebski
2006-08-14 17:15             ` Jeff Garzik
2006-08-14 17:34               ` David Lang
2006-08-14 17:50                 ` Jeff Garzik
2006-08-14 18:48           ` Jon Smirl
2006-08-14 19:08             ` David Lang
2006-08-14 19:38               ` Johannes Schindelin
2006-08-14 15:14     ` Alex Riesen
2006-08-14 15:26       ` Johannes Schindelin
  -- strict thread matches above, loose matches on Subject: below --
2006-08-15  8:33 linux
2006-08-15 13:29 ` Jon Smirl
2006-08-15 14:55 ` Jon Smirl
2006-08-16  0:37   ` linux
     [not found]     ` <4b73d43f0608152243i15b37036x7aa50aa3afc2b02f@mail.gmail.com>
2006-08-16  5:50       ` Jon Smirl
2006-08-16  6:33         ` Johannes Schindelin
2006-08-16  6:55           ` Shawn Pearce
2006-08-16  7:09             ` Johannes Schindelin
2006-08-16 14:43               ` Jon Smirl
2006-08-17 22:33           ` linux

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9e4733910608140915i728004c1p216bf3d74fcc6ab7@mail.gmail.com \
    --to=jonsmirl@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).