From: Shawn Pearce <spearce@spearce.org>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Mozilla .git tree
Date: Tue, 29 Aug 2006 23:10:29 -0400
Message-ID: <20060830031029.GA23967@spearce.org>
In-Reply-To: <9e4733910608291958l45c0257dla6e5ebd4176f7164@mail.gmail.com>

Jon Smirl <jonsmirl@gmail.com> wrote:
> sha1s are effectively 20 byte pointer addresses into the pack. With 2M
> objects you can easily get away with 4 byte address and a mapping
> table. Another idea would be to replace the 20 byte sha1 in tree
> objects with 32b file offsets - requiring that anything the tree
> refers to has to already be in the pack before the tree entry can be
> written.

I've thought of that, but when you transfer a "thin" pack over the
wire the base object may not even be in the pack.  Thus you can't
use an offset to reference it.  Otherwise there's probably little
reason why the base couldn't be referenced by its 4 byte offset
rather than its full 20 byte object ID.  Added up over all deltas
in the mozilla pack it saves a whopping 23 MiB.
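
As a rough sanity check on that figure, here is a small sketch of the
arithmetic.  It assumes each delta's base reference shrinks from a 20 byte
object ID to a 4 byte offset, so 16 bytes are saved per delta; the function
and constant names are made up for illustration and are not taken from git.

```python
# Back-of-the-envelope check: how many deltas does a 23 MiB saving imply
# if every delta saves (20 - 4) bytes on its base reference?
# All names here are hypothetical; nothing below is git code.

SHA1_BYTES = 20      # size of a binary object ID
OFFSET_BYTES = 4     # size of the proposed pack offset
MIB = 1024 * 1024


def saving_per_delta() -> int:
    """Bytes saved on one delta's base reference."""
    return SHA1_BYTES - OFFSET_BYTES


def deltas_for_saving(saved_bytes: int) -> int:
    """Number of deltas needed to add up to the given total saving."""
    return saved_bytes // saving_per_delta()


if __name__ == "__main__":
    print(deltas_for_saving(23 * MIB))
```

Under that assumption the 23 MiB works out to about 1.5 million deltas.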

> The Mozilla license has changed at least five times. That makes 250K
> copies of licenses.

Cute.
 
> I suspect a tree specific zlib dictionary will be a good win.  But
> those trees contain a lot of uncompressible data, the sha1. Those
> sha1s are in binary not hex, right?

Yup, binary.
 
> The git tools can be modified to set the compression level to 0 before
> compressing tree deltas. There is no need to change the decoding code.
> Even with compression level 0 they still get slightly larger because
> zlib tacks on a header.

See my followup email to myself; I think we're talking a zlib
overhead of 9.2 bytes on average per tree delta.  That's with a
compression level of -1 (default, which is 6).
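
To see the kind of per-object overhead being discussed, the following sketch
measures how many bytes zlib adds to a small payload at a given level.  It
uses Python's standard zlib module rather than git's own code, and the
payload is a stand-in built from SHA-1 digests (binary, hard to compress),
not a real tree delta; the helper names are hypothetical.

```python
import hashlib
import zlib


def fake_tree_payload(entries: int) -> bytes:
    """Stand-in for tree data: a run of binary SHA-1 digests (assumption)."""
    return b"".join(
        hashlib.sha1(str(i).encode()).digest() for i in range(entries)
    )


def zlib_overhead(payload: bytes, level: int) -> int:
    """Bytes added (positive) or saved (negative) by zlib at this level."""
    return len(zlib.compress(payload, level)) - len(payload)


if __name__ == "__main__":
    payload = fake_tree_payload(10)  # 200 bytes of digest data
    # Level 0 stores the data uncompressed but still adds framing.
    print("level 0 overhead:", zlib_overhead(payload, 0))
    # -1 asks zlib for its default level.
    print("default overhead:", zlib_overhead(payload, zlib.Z_DEFAULT_COMPRESSION))
```

Level 0 output is always larger than the input, since zlib still writes its
header, block framing and checksum; the existing inflate code reads it
unchanged.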
 
> I'm still interested in getting an idea of how much a Clucene type
> dictionary compression would help. It is hard to see how you can get
> smaller than that method. Note that you don't want to include the
> indexing portion of Clucene in the comparison. Just the part where
> everything gets tokenized into a big dictionary, arithmetic encoded
> based on usage frequency, and then the strings in the original
> documents are replaced with the codes. You want to do the diffs before
> replacing everything with codes. Encoding this way is a two pass
> process so it is easiest to work from an existing pack.

From what I was able to gather, I don't think Clucene stores the
documents themselves as the tokenized compressed data.  Or if it
does, you lose everything between the tokens.  There's a number of
things we want to preserve in the original "document" like whitespace
that would be likely stripped when constructing tokens.
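
A minimal sketch of the whitespace concern, assuming a plain dictionary
coder.  This is not Clucene's actual scheme and has no arithmetic coding;
it only shows that splitting on whitespace loses the original bytes, while
treating whitespace runs as tokens keeps the document reconstructible.  All
function names are hypothetical.

```python
import re


def lossy_tokens(text: str) -> list[str]:
    """Split on whitespace; the whitespace itself is discarded."""
    return text.split()


def lossless_tokens(text: str) -> list[str]:
    """Keep whitespace runs as tokens so every character is covered."""
    return re.findall(r"\s+|\S+", text)


def encode(tokens: list[str]) -> tuple[dict[str, int], list[int]]:
    """Replace each token with its index in a shared dictionary."""
    dictionary: dict[str, int] = {}
    codes = [dictionary.setdefault(tok, len(dictionary)) for tok in tokens]
    return dictionary, codes


def decode(dictionary: dict[str, int], codes: list[int], sep: str = "") -> str:
    """Map codes back to tokens and join them with the given separator."""
    by_code = {code: tok for tok, code in dictionary.items()}
    return sep.join(by_code[c] for c in codes)


if __name__ == "__main__":
    source = "int  main(void)\n{\n\treturn 0;\n}\n"
    lossy = decode(*encode(lossy_tokens(source)), sep=" ")
    exact = decode(*encode(lossless_tokens(source)))
    print(lossy == source)   # whitespace was normalised away
    print(exact == source)   # original bytes recovered
```

Keeping the whitespace tokens makes the dictionary larger, which is part of
what a size estimate would need to account for.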

But it shouldn't be that difficult to produce a rough estimate of
what that storage size would be.
 
-- 
Shawn.
