From: Shawn Pearce <spearce@spearce.org>
To: Nicolas Pitre <nico@cam.org>
Cc: Jon Smirl <jonsmirl@gmail.com>, git <git@vger.kernel.org>
Subject: Re: Huge win, compressing a window of delta runs as a unit
Date: Mon, 21 Aug 2006 03:06:09 -0400
Message-ID: <20060821070609.GC24054@spearce.org>
In-Reply-To: <Pine.LNX.4.64.0608181057440.11359@localhost.localdomain>
Nicolas Pitre <nico@cam.org> wrote:
> On Fri, 18 Aug 2006, Jon Smirl wrote:
>
> > On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> > > A better way to get such a size saving is to increase the window and
> > > depth parameters. For example, a window of 20 and depth of 20 can
> > > usually provide a pack size saving greater than 11% with none of the
> > > disadvantages mentioned above.
> >
> > Our window size is effectively infinite. I am handing him all of the
> > revisions from a single file in optimal order. This includes branches.
>
> In GIT packing terms this is infinite delta _depth_ not _window_.
We're not using infinite anything.
fast-import is basically doing window=1 and depth=10.
We only examine the last blob to see if we can get a delta against
it. If we do we write that delta out; otherwise we reset our delta
chain and write the complete object. We also reset our chain after
writing out 10 deltas, each of which used the immediately prior
object as its base.
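A rough sketch of that window=1, depth=10 policy, in Python rather
than the C of fast-import.c. The names ChainWriter and toy_delta are
hypothetical, and toy_delta is only a stand-in for a real delta
encoder; it is not git's diff-delta code.

```python
from typing import List, Optional, Tuple

MAX_DEPTH = 10  # the "depth=10" from the text above


def toy_delta(base: bytes, target: bytes) -> Optional[bytes]:
    """Hypothetical stand-in for a delta encoder.

    Encodes target as "shared prefix length, shared suffix length, new
    middle bytes" and returns None when that is not smaller than target.
    """
    limit = min(len(base), len(target))
    p = 0
    while p < limit and base[p] == target[p]:
        p += 1
    s = 0
    while s < limit - p and base[-1 - s] == target[-1 - s]:
        s += 1
    delta = b"%d,%d:" % (p, s) + target[p:len(target) - s]
    return delta if len(delta) < len(target) else None


class ChainWriter:
    def __init__(self, max_depth: int = MAX_DEPTH):
        self.max_depth = max_depth
        self.last: Optional[bytes] = None  # the only candidate base: window=1
        self.depth = 0                     # deltas written since the last full object
        self.out: List[Tuple[str, bytes]] = []

    def write_blob(self, data: bytes) -> None:
        delta = None
        if self.last is not None and self.depth < self.max_depth:
            delta = toy_delta(self.last, data)
        if delta is not None:
            self.out.append(("delta", delta))
            self.depth += 1
        else:
            # No usable delta, or the chain is full: write the whole
            # object and start a new chain.
            self.out.append(("full", data))
            self.depth = 0
        self.last = data
```

Feeding this twelve near-identical revisions should give one full
object, ten deltas, then another full object as the chain resets.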
Since I just found out that the Mozilla repository has 1000s of
revisions per file in some cases[*1*] and only 1 revision per file
in others, we should probably adjust this depth to a maximum of 500.
We should also have the frontend send us an "I'm switching files
now" marker so we know not to even bother trying to delta the new
blob against the last blob, as the two are unlikely to delta
well[*2*].
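To make the proposal concrete, here is a small Python sketch of the
bookkeeping it implies. This is an assumption about how it could look,
not what fast-import.c does; DeltaChain, switch_file, should_try_delta
and record are hypothetical names.

```python
class DeltaChain:
    """Tracks whether the next blob may be deltified against the last one."""

    def __init__(self, max_depth: int = 500):
        self.max_depth = max_depth  # proposed maximum chain depth
        self.has_base = False       # is there a previous blob from the same file?
        self.depth = 0              # deltas written since the last full object

    def switch_file(self) -> None:
        """The frontend's "I'm switching files now" marker."""
        self.has_base = False
        self.depth = 0

    def should_try_delta(self) -> bool:
        return self.has_base and self.depth < self.max_depth

    def record(self, wrote_delta: bool) -> None:
        """Call after each blob is written, as a delta or as a full object."""
        self.depth = self.depth + 1 if wrote_delta else 0
        self.has_base = True
```

The frontend would call switch_file() before the first revision of each
file, so the first blob of a new file is always written in full and
never compared against the previous file's last revision.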
> Default delta params (window=10 depth=10)            : 122103455
> Aggressive deltas (window=50 depth=5000)             : 105870516
> Aggressive and grouped deltas (window=50 depth=5000) :  99860685
Although complex, the aggressive and grouped deltas appear to
have saved you 18.2% on this repository. That's not something
to ignore. A reasonably optimal local pack dictionary could save
at least another 4%[*3*]. Whacking 22% off a 400 MB pack saves 88 MB.
Transferring that over the network on an initial clone is like
downloading all of Eclipse. Or an uncompressed kernel tarball...
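The arithmetic behind those figures can be checked with a few lines of
Python. The three pack sizes are the ones quoted above; the helper name
saving_pct is made up for this example, and adding the 4% dictionary
estimate straight onto the 18.2% is the same rough approximation the
text uses.

```python
default_size = 122_103_455     # window=10 depth=10
aggressive_size = 105_870_516  # window=50 depth=5000
grouped_size = 99_860_685      # window=50 depth=5000, grouped deltas


def saving_pct(before: int, after: int) -> float:
    """Percentage saved going from `before` bytes to `after` bytes."""
    return 100.0 * (before - after) / before


aggressive_pct = saving_pct(default_size, aggressive_size)
grouped_pct = saving_pct(default_size, grouped_size)

# Rough combined estimate: grouped deltas plus a 4% dictionary win,
# rounded down to 22% and applied to a 400 MB pack.
saved_mb = 400 * 22 / 100

print(f"aggressive: {aggressive_pct:.1f}%")
print(f"aggressive + grouped: {grouped_pct:.1f}%")
print(f"22% of 400 MB: {saved_mb:.0f} MB")
```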
[*1*] Jon noted this in another email in this thread but I'm too
lazy to look up the hyperlink right now.
[*2*] Granted, in some cases they may delta very well against each
other, but I think the probability of that occurring is low
enough that it's not worth worrying about in fast-import.c;
we can let repack's strategy deal with it instead.
[*3*] I wrote a brain-dead simple local dictionary selector in Perl.
It's horribly far from being ideal. But it is consistently
saving us 4% on the GIT and the Mozilla repositories and it's
pretty darn fast. Shockingly, the C keywords didn't gain
us very much here; it's project-specific text that's the
real win.
Looking at chunks which are frequently copied in deltas
from base objects and breaking those chunks up into
smaller common chunks, then loading those most frequent
common chunks into the pack dictionary would most likely
produce far better results.
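For illustration only, here is a Python sketch of the general idea of
a preset compression dictionary, using zlib's real preset-dictionary
support (the zdict argument in Python's zlib module). It is not the
Perl script mentioned above, and it picks whole identifier-like tokens
by frequency rather than delta copy chunks; build_dictionary, deflate
and inflate are hypothetical names.

```python
import re
import zlib
from collections import Counter
from typing import Iterable


def build_dictionary(samples: Iterable[bytes],
                     max_size: int = 32768,
                     min_len: int = 4) -> bytes:
    """Pick the most frequent word-like tokens across the sample blobs."""
    counts: Counter = Counter()
    for blob in samples:
        counts.update(t for t in re.findall(rb"\w+", blob) if len(t) >= min_len)
    picked, size = [], 0
    for token, _ in counts.most_common():
        if size + len(token) + 1 > max_size:
            break
        picked.append(token)
        size += len(token) + 1
    # zlib's documentation recommends putting the most common strings
    # at the end of the dictionary, so emit the least frequent first.
    return b" ".join(reversed(picked))


def deflate(data: bytes, zdict: bytes) -> bytes:
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()


def inflate(data: bytes, zdict: bytes) -> bytes:
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(data) + d.flush()
```

The same dictionary has to be available when decompressing, which is
why it would need to be stored with (or derivable from) the pack.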
--
Shawn.