From: Nicolas Pitre <nico@cam.org>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: Shawn Pearce <spearce@spearce.org>, git <git@vger.kernel.org>
Subject: Re: Huge win, compressing a window of delta runs as a unit
Date: Sun, 20 Aug 2006 23:45:42 -0400 (EDT)
Message-ID: <Pine.LNX.4.64.0608202257020.3682@localhost.localdomain>
In-Reply-To: <9e4733910608180956n64e3362fm5c72d652e6b6243a@mail.gmail.com>

On Fri, 18 Aug 2006, Jon Smirl wrote:

> On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> > On Fri, 18 Aug 2006, Jon Smirl wrote:
> >
> > > I attached Shawn's code. He is gone until Monday and can't defend it.
> >
> > I will have a look at it next week as I'll be gone for the weekend as
> > well.
> 
> I looked at it some and couldn't see anything obviously wrong with it,
> but it wasn't a detailed inspection.

I looked at it too and the code looks OK.

This doesn't mean there is no problem at a higher level though.  The 
deltification process is extremely crude and I think this is the cause 
of the original pack size.

For example, last April we discovered that a small change in the 
heuristics to determine base delta objects in git-pack-objects could 
create a pack size regression up to 4x the size of the same pack created 
before such change.

It is also possible to have a denser delta stream that, once deflated, 
ends up larger than a less dense delta would have been.

Just to say that many tweaks and heuristics have been implemented and 
studied in git-pack-objects for over a year now in order to get the 
really small packs we have today.  And a really subtle and 
innocent-looking change can break it size-wise.

So what I think is happening with the fastimport code is that the delta 
selection is not really good.  It is certainly much better than no delta 
at all but still not optimal, which smells like deja vu to me.  Then by 
deflating them all together the redundant information that the bad delta 
set still carries along is eliminated -- thanks to zlib sort of 
mitigating the real issue.
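
A minimal sketch of that effect, using made-up data rather than real pack 
deltas: the hypothetical `fake_bad_deltas` helper builds chunks that all 
repeat the same literal block (standing in for poorly chosen deltas), and 
the two other functions compare deflating each chunk on its own against 
deflating all of them as one zlib stream.  All names and sizes here are 
assumptions for illustration, not anything taken from fastimport.

```python
import hashlib
import zlib


def fake_bad_deltas(count=20, shared_lines=100):
    """Build hypothetical 'poorly chosen' deltas that all repeat one literal block."""
    # Hex digests give text that deflate cannot shrink much on its own.
    shared = "".join(
        hashlib.sha1(f"shared-{i}".encode()).hexdigest() + "\n"
        for i in range(shared_lines)
    ).encode()
    deltas = []
    for n in range(count):
        unique = hashlib.sha1(f"unique-{n}".encode()).hexdigest().encode()
        deltas.append(shared + unique + b"\n")
    return deltas


def deflated_individually(chunks):
    """Total size when every chunk gets its own zlib stream."""
    return sum(len(zlib.compress(c)) for c in chunks)


def deflated_together(chunks):
    """Size when all chunks are concatenated into a single zlib stream."""
    return len(zlib.compress(b"".join(chunks)))


if __name__ == "__main__":
    deltas = fake_bad_deltas()
    print("individually:", deflated_individually(deltas))
    print("together:    ", deflated_together(deltas))
```

The single stream comes out smaller only because zlib finds the repeated 
block across chunk boundaries; a delta set without that redundancy would 
leave much less for the grouping to recover.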

But... as my recent experiments show, the grouping of related deltas 
into a single zlib stream doesn't produce significant improvements when 
implemented directly into git-pack-objects.  Certainly not worth the 
inconvenience and costs it brings along.  I even think that if you used 
git-repack -a -f on the pack produced by the import process, with each 
delta deflated individually just like it did originally, then the 
repacked pack would _also_ shrink significantly.  Most probably around 
4x just like you observed with the grouping of deltas in the same zlib 
stream.

Not only would git-repack make it much smaller, but it also provides a 
much better layout where all objects for recent commits are stored 
together at the beginning of the pack.  The fastimport code instead 
scatters the objects of every commit all over the pack by placing all 
revisions of each file next to each other, which will cause horrible 
access patterns and really bad IO.
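
The layout argument can be illustrated with a toy model.  This sketch 
assumes, as a deliberate simplification, that every revision touches every 
file and that all objects occupy one slot each; the function names are 
hypothetical.  It measures how wide a stretch of the pack must be covered 
to read all objects of the newest revision under each ordering.

```python
def layout_by_file(num_files, num_revs):
    """Pack order with all revisions of each file adjacent (import-style layout)."""
    return [(f, r) for f in range(num_files) for r in range(num_revs)]


def layout_by_recency(num_files, num_revs):
    """Pack order with the objects of the newest revision stored first."""
    return [(f, r) for r in reversed(range(num_revs)) for f in range(num_files)]


def span_for_latest(layout, num_revs):
    """Number of object slots between the first and last object of the newest revision."""
    latest = num_revs - 1
    positions = [i for i, (_, r) in enumerate(layout) if r == latest]
    return positions[-1] - positions[0] + 1


if __name__ == "__main__":
    files, revs = 1000, 50
    print("by file:   ", span_for_latest(layout_by_file(files, revs), revs))
    print("by recency:", span_for_latest(layout_by_recency(files, revs), revs))
```

In this model the by-file ordering spreads the newest revision over nearly 
the whole pack, while the recency ordering keeps it in one contiguous run 
at the front.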

So I think that trying to make fastimport too clever is wrong.  It 
should instead focus on creating an initial pack as fast as possible and 
then rely on a final git-repack pass to produce the shrunken pack.  I 
really doubt the import code could ever do a better job than 
git-pack-objects does.

If I can make a suggestion, you should forget about this multiple deltas 
in one zlib stream for now and focus on making the import process work 
all the way to tree and commit objects instead.  Then, only then, if 
git-repack -a -f doesn't produce satisfactory pack size we could look at 
better pack encoding.  And so far the grouping of related deltas in one 
zlib stream is _not_ a better encoding given the rather small 
improvement over unmodified git-pack-objects vs the inconvenience and 
cost it brings with it.

> As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB.
> The 295MB git pack number does not have commits and trees in it, it is
> revisions only.

Running git-repack -a -f from a recent GIT on the Mozilla repo converted 
through cvsps and friends produces a pack smaller than 500MB.  I even 
brought it down to 430MB by using a non-default delta window and depth.
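
For reference, a sketch of how such a repack could be driven from a 
script.  The message does not say which window and depth were used, so the 
values below are placeholders only; `repack_command` is a hypothetical 
helper, and the modern `git repack` spelling is used in place of 
`git-repack`.

```python
import subprocess


def repack_command(window, depth):
    """Build a full repack command line that recomputes all deltas."""
    return [
        "git", "repack",
        "-a",                  # pack everything into a single pack
        "-f",                  # do not reuse existing deltas
        f"--window={window}",  # delta search window
        f"--depth={depth}",    # maximum delta chain depth
    ]


if __name__ == "__main__":
    # Placeholder values, not the ones used for the 430MB result.
    cmd = repack_command(window=50, depth=50)
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)  # run inside the repository to repack
```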


Nicolas

