From: Nicolas Pitre <nico@cam.org>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: Shawn Pearce <spearce@spearce.org>, git <git@vger.kernel.org>
Subject: Re: Huge win, compressing a window of delta runs as a unit
Date: Sun, 20 Aug 2006 23:45:42 -0400 (EDT)
Message-ID: <Pine.LNX.4.64.0608202257020.3682@localhost.localdomain>
In-Reply-To: <9e4733910608180956n64e3362fm5c72d652e6b6243a@mail.gmail.com>
On Fri, 18 Aug 2006, Jon Smirl wrote:
> On 8/18/06, Nicolas Pitre <nico@cam.org> wrote:
> > On Fri, 18 Aug 2006, Jon Smirl wrote:
> >
> > > I attached Shawn's code. He is gone until Monday and can't defend it.
> >
> > I will have a look at it next week as I'll be gone for the weekend as
> > well.
>
> I looked at it some and couldn't see anything obviously wrong with it,
> but it wasn't a detailed inspection.
I looked at it too and the code looks OK.
This doesn't mean there is no problem at a higher level though. The
deltification process is extremely crude and I think this is the cause
of the original pack size.
For example, last April we discovered that a small change in the
heuristics used to pick delta base objects in git-pack-objects could
cause a pack size regression of up to 4x compared with the same pack
created before the change.
It is also possible for a denser delta stream to end up larger, once
deflated, than a less dense delta stream would have been.
All this to say that many tweaks and heuristics have been implemented
and studied in git-pack-objects for over a year now in order to get the
really small packs we have today. And a really subtle and
innocent-looking change can break it size-wise.
So what I think is happening with the fastimport code is that the delta
selection is not really good. It is certainly much better than no delta
at all but still not optimal, which smells like deja vu to me. Then by
deflating them all together the redundant information that the bad delta
set still carries along is eliminated -- thanks to zlib sort of
mitigating the real issue.
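To illustrate the effect described above, here is a minimal sketch,
not taken from the fastimport code. It assumes a hypothetical set of
"poor" deltas that each re-insert the same text, and compares the total
size when every delta gets its own zlib stream against the size when
they all share one stream. The delta contents and the helper names are
made up for the example.

```python
import zlib

def deflate_individually(deltas):
    """Total size when every delta is deflated in its own zlib stream."""
    return sum(len(zlib.compress(d)) for d in deltas)

def deflate_grouped(deltas):
    """Size when all deltas are concatenated into a single zlib stream."""
    return len(zlib.compress(b"".join(deltas)))

# Hypothetical badly chosen deltas: each one carries the same redundant
# text again instead of copying it from a well-chosen base object.
redundant = b"/* The contents of this file are subject to a license */\n"
deltas = [redundant + b"revision %d\n" % i for i in range(50)]

individual = deflate_individually(deltas)
grouped = deflate_grouped(deltas)
print("individually deflated:", individual, "bytes")
print("one zlib stream:      ", grouped, "bytes")
```

The grouped stream comes out much smaller here only because the deltas
themselves are redundant. With a better delta selection there would be
far less shared text left for zlib to remove.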
But... as my recent experiments show, the grouping of related deltas
into a single zlib stream doesn't produce significant improvements when
implemented directly in git-pack-objects. Certainly not worth the
inconveniences and costs it brings along. I even think that if you used
git-repack -a -f on the pack produced by the import process, with each
delta deflated individually just like it was originally, then the
repacked pack would _also_ shrink significantly. Most probably around
4x just like you observed with the grouping of deltas in the same zlib
stream.
Not only would git-repack make it much smaller, but it also provides a
much better layout where all objects for recent commits are stored
together at the beginning of the pack. The fastimport code instead
scatters the objects of every commit all over the pack by placing all
revisions of each file next to each other, which will cause horrible
access patterns and really bad IO.
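The layout difference can be pictured with a toy model. This is only a
sketch under simplifying assumptions: a pack is modelled as a flat list
of (file, revision) slots, every object is the same size, and delta
chains are ignored. The function names and the numbers are hypothetical.

```python
def by_file_layout(n_files, n_revs):
    """All revisions of each file stored next to each other."""
    return [(f, r) for f in range(n_files) for r in range(n_revs)]

def by_recency_layout(n_files, n_revs):
    """Newest revision of every file first, then progressively older ones."""
    return [(f, r) for r in reversed(range(n_revs)) for f in range(n_files)]

def checkout_span(layout, rev):
    """Number of slots between the first and last object read for one revision."""
    slots = [i for i, (_, r) in enumerate(layout) if r == rev]
    return max(slots) - min(slots) + 1

n_files, n_revs = 100, 50
latest = n_revs - 1
span_by_file = checkout_span(by_file_layout(n_files, n_revs), latest)
span_by_recency = checkout_span(by_recency_layout(n_files, n_revs), latest)
print("per-file layout, span for latest revision:", span_by_file)
print("recency layout, span for latest revision: ", span_by_recency)
```

In this model, reading the latest revision of every file touches slots
spread over nearly the whole pack with the per-file layout, but only the
first n_files slots with the recency-ordered layout.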
So I think that trying to make fastimport too clever is wrong. It
should instead focus on creating an initial pack as fast as possible and
then rely on a final git-repack pass to produce the shrunken pack. I
really doubt the import code could ever do a better job than
git-pack-objects does.
If I can make a suggestion, you should forget about this idea of
multiple deltas in one zlib stream for now and focus on making the
import process work all the way to tree and commit objects instead.
Then, and only then, if git-repack -a -f doesn't produce a satisfactory
pack size, we could look at a better pack encoding. And so far the
grouping of related deltas in one zlib stream is _not_ a better encoding
given the rather small improvement over unmodified git-pack-objects vs
the inconveniences and cost it brings with it.
> As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB.
> The 295MB git pack number does not have commits and trees in it, it is
> revisions only.
Running git-repack -a -f from a recent GIT on the Mozilla repo converted
through cvsps and friends produces a pack smaller than 500MB. I even
brought it down to 430MB by using a non-default delta window and depth.
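For anyone scripting the final repack pass after an import, a minimal
sketch follows. It assumes git is on the PATH and that git-repack
accepts --window and --depth alongside -a -f. The values 50 and 50 are
placeholders, since the message does not say which window and depth
produced the 430MB pack, and the helper names are hypothetical.

```python
import subprocess

def repack_command(window=None, depth=None):
    """Build the argv for a full repack that recomputes all deltas."""
    cmd = ["git", "repack", "-a", "-f"]
    if window is not None:
        cmd.append(f"--window={window}")
    if depth is not None:
        cmd.append(f"--depth={depth}")
    return cmd

def repack(repo_dir, window=None, depth=None):
    """Run the repack inside repo_dir; raises CalledProcessError on failure."""
    subprocess.run(repack_command(window, depth), cwd=repo_dir, check=True)

# Only print the commands here; call repack() on a real repository to run them.
print(" ".join(repack_command()))
print(" ".join(repack_command(window=50, depth=50)))
```

A larger window and depth make the repack slower, so the default run is
a reasonable first measurement before tuning either value.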
Nicolas