git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nicolas Pitre <nico@cam.org>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: Shawn Pearce <spearce@spearce.org>, git <git@vger.kernel.org>
Subject: Re: Huge win, compressing a window of delta runs as a unit
Date: Mon, 21 Aug 2006 13:48:59 -0400 (EDT)	[thread overview]
Message-ID: <Pine.LNX.4.64.0608211309270.3851@localhost.localdomain> (raw)
In-Reply-To: <9e4733910608210914s1157f47eta821584928ce4dd5@mail.gmail.com>

On Mon, 21 Aug 2006, Jon Smirl wrote:

> How about making the length of delta chains an exponential function of
> the number of revs? In Mozilla configure.in has 1,700 revs and it is a
> 250K file. If you store a full copy every 10 revs that is 43MB
> (prezip) of data that almost no one is going to look at.
> The chains
> lengths should reflect the relative probability that someone is going
> to ask to see the revs. That is not at all a uniform function.

1) You can do that already with stock git-repack.

2) The current git-pack-objects code can and does produce a delta "tree" 
   off of a single base object.  It doesn't have to be a linear list.

Therefore, even if the default depth is 10, you may as well have many 
deltas pointing to the _same_ base object effectively making the 
compression ratio much larger than 1/10.

If for example each object has 2 delta childs, and each of those deltas 
also have 2 delta childs, you could have up to 39366 delta objects 
attached to a _single_ undeltified base object.

And of course the git-pack-objects code doesn't limit the number of 
delta childs in any way so this could theoritically be infinite even 
though the max depth is 10.  OK the delta matching window limits that 
somehow but nothing prevents you from repacking with a larger window 
since that parameter has no penalty on the reading of objects out of the 
pack.

> Personally I am still in favor of a two pack system. One archival pack
> stores everything in a single chain and size, not speed, is it's most
> important attribute. It is marked readonly and only functions as an
> archive; git-repack never touches it. It might even use a more compact
> compression algorithm.
> 
> The second pack is for storing more recent revisions. The archival
> pack would be constructed such that none of the files needed for the
> head revisions of any branch are in it. They would all be in the
> second pack.

Personally I'm still against that.  All arguments put forward for a 
different or multiple packing system are based on unproven assumptions 
so far and none of those arguments actually present significant 
advantages over the current system.  For example, I was really intriged 
by the potential of object grouping into a single zlib stream at first, 
but it turns out not to be so great after actual testing.

I still think that a global zlib dictionary is a good idea.  Not because 
it looks like it'll make packs enormously smaller, but rather because it 
impose no performance regression over the current system.  And I 
strongly believe that you have to have a really strong case for 
promoting a solution that carries performance regression, something like 
over 25% smaller packs for example.  But so far it didn't happen.

> This may be a path to partial repositories. Instead of downloading the
> real archival pack I could download just an index for it. The index
> entries would be marked to indicate that these objects are valid but
> not-present.

An index without the actual objects is simply useless.  You could do 
without the index entirely in that case anyway since the mere presence 
of an entry in the index doesn't give you anything really useful.


Nicolas

  reply	other threads:[~2006-08-21 17:49 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-16 17:20 Huge win, compressing a window of delta runs as a unit Jon Smirl
2006-08-17  4:07 ` Shawn Pearce
2006-08-17  7:56   ` Johannes Schindelin
2006-08-17  8:07     ` Johannes Schindelin
2006-08-17 14:36       ` Jon Smirl
2006-08-17 15:45         ` Johannes Schindelin
2006-08-17 16:33           ` Nicolas Pitre
2006-08-17 17:05             ` Johannes Schindelin
2006-08-17 17:22             ` Jon Smirl
2006-08-17 18:15               ` Nicolas Pitre
2006-08-17 17:17           ` Jon Smirl
2006-08-17 17:32             ` Nicolas Pitre
2006-08-17 18:06               ` Jon Smirl
2006-08-17 17:22   ` Nicolas Pitre
2006-08-17 18:03     ` Jon Smirl
2006-08-17 18:24       ` Nicolas Pitre
2006-08-18  4:03 ` Nicolas Pitre
2006-08-18 12:53   ` Jon Smirl
2006-08-18 16:30     ` Nicolas Pitre
2006-08-18 16:56       ` Jon Smirl
2006-08-21  3:45         ` Nicolas Pitre
2006-08-21  6:46           ` Shawn Pearce
2006-08-21 10:24             ` Jakub Narebski
2006-08-21 16:23             ` Jon Smirl
2006-08-18 13:15   ` Jon Smirl
2006-08-18 13:36     ` Johannes Schindelin
2006-08-18 13:50       ` Jon Smirl
2006-08-19 19:25         ` Linus Torvalds
2006-08-18 16:25     ` Nicolas Pitre
2006-08-21  7:06       ` Shawn Pearce
2006-08-21 14:07         ` Jon Smirl
2006-08-21 15:46         ` Nicolas Pitre
2006-08-21 16:14           ` Jon Smirl
2006-08-21 17:48             ` Nicolas Pitre [this message]
2006-08-21 17:55               ` Nicolas Pitre
2006-08-21 18:01                 ` Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0608211309270.3851@localhost.localdomain \
    --to=nico@cam.org \
    --cc=git@vger.kernel.org \
    --cc=jonsmirl@gmail.com \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).