Re: what's a useful definition of full text index on a repository?

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Jon Smirl" <jonsmirl@gmail.com>
To: "David Tweed" <david.tweed@gmail.com>
Cc: "Git Mailing List" <git@vger.kernel.org>
Subject: Re: what's a useful definition of full text index on a repository?
Date: Tue, 2 Oct 2007 11:48:53 -0400	[thread overview]
Message-ID: <9e4733910710020848l496ee49ej4c243b5f62bca6c0@mail.gmail.com> (raw)
In-Reply-To: <e1dab3980710020234l1951349r38657c68aa7ef5@mail.gmail.com>

On 10/2/07, David Tweed <david.tweed@gmail.com> wrote:
> > Full text indexing can also achieve high levels of compression as
> > stated in the earlier threads. It is full scale dictionary
> > compression. When it is being used for compression you want to apply
> > it to all revisions.
>
> Well, as I say I'm not convinced it makes sense to integrate this with
> existing pack stuff precisely because I don't think it's universally
> useful. So you seem to end up with all the usual tricks, eg, Golomb
> coding inverted indexes, etc, _if_ you treat each blob as completely
> independent. I was wondering if there was anything else you can do
> given the special structure that might be both more useful and more
> compact?

Dictionary compression can be used without full-text indexes. It is
just really easy to build the full-text index if the data is already
dictionary compressed. Dictionary compression works for everything
except binary or random data.

Git is already using a small scale dictionary compressor via zip. I
suspect doing a full scale dictionary for a pack file and then using
arithmetic encoding of the tokens would provide substantially more
compression. The big win is having a single dictionary instead of a
new dictionary each time zip is used.

When we were working on Mozilla, Mozilla changed licenses three times.
The license text ended up taking about 30MB in the current scheme.
With full dictionary compression this would reduce down to a few kb.

More compression is good for git. It means we can keep more data in
RAM and reduce download times. With current hardware it is almost
always better to trade off CPU to reduce IO.

-- 
Jon Smirl
jonsmirl@gmail.com

     prev parent reply	other threads:[~2007-10-02 15:49 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-10-01 16:33 what's a useful definition of full text index on a repository? David Tweed
2007-10-01 17:25 ` Jon Smirl
2007-10-02  9:34   ` David Tweed
2007-10-02 15:48     ` Jon Smirl [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9e4733910710020848l496ee49ej4c243b5f62bca6c0@mail.gmail.com \
    --to=jonsmirl@gmail.com \
    --cc=david.tweed@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).