From: "Dana How" <danahow@gmail.com>
To: "Junio C Hamano" <junkio@cox.net>
Cc: "Git Mailing List" <git@vger.kernel.org>, danahow@gmail.com
Subject: Re: [PATCH] Prevent megablobs from gunking up git packs
Date: Tue, 22 May 2007 01:00:06 -0700
Message-ID: <56b7f5510705220100h77e91196r1784b33772911660@mail.gmail.com>
In-Reply-To: <7vtzu58i4c.fsf@assigned-by-dhcp.cox.net>

On 5/21/07, Junio C Hamano <junkio@cox.net> wrote:
> Dana How <danahow@gmail.com> writes:
> > git stores data in loose blobs or in packfiles.  The former
> > has essentially now become an exception mechanism,  to store
> > exceptionally *young* blobs.  Why not use this to store
> > exceptionally *large* blobs as well?  This allows us to
> > re-use all the "exception" machinery with only a small change.
> Well, I had an impression that mmapping a single loose object
> (and then munmapping it after done) would be more expensive than
> mmapping a whole pack and accessing that object through window,
> as long as you touch the same set of objects and the object in
> the pack is not deltified.
I agree with your comparison.  However, if I'm processing a 100MB+
blob, I doubt the extra open/mmap/munmap/close calls are going
to matter to me.  What I think _helped_ me was that, with the megablobs
pushed out of the pack, git-log etc. could play around inside a
"tiny" 13MB packfile very quickly.  This packfile contained all the
commits, all the trees, and all the blobs < 256KB.
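
To make the split concrete, here is a minimal Python sketch of the
selection rule described above.  It is only an illustration, not the
patch itself: the function name partition_objects and the
(name, type, size) tuples are assumptions made for the example; the
256KB threshold is the one used in the experiment.

```python
# Hypothetical sketch of the size-based split; not code from the patch.
MAX_BLOB_SIZE = 256 * 1024  # 256KB, the threshold used in the experiment


def partition_objects(objects, max_blob_size=MAX_BLOB_SIZE):
    """Split (name, type, size) tuples into packed and loose object names.

    Commits, trees and blobs below the threshold go to the pack;
    blobs at or above the threshold stay loose.
    """
    packed, loose = [], []
    for name, obj_type, size in objects:
        if obj_type == "blob" and size >= max_blob_size:
            loose.append(name)
        else:
            packed.append(name)
    return packed, loose


if __name__ == "__main__":
    sample = [
        ("c1", "commit", 300),
        ("t1", "tree", 1200),
        ("small", "blob", 4096),
        ("mega", "blob", 100 * 1024 * 1024),
    ]
    print(partition_objects(sample))
```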

> > Repacking the entire repository with a max-blob-size of 256KB
> > resulted in a single 13.1MB packfile,  as well as 2853 loose
> > objects totaling 15.4GB compressed and 100.08GB uncompressed,
> > 11 files per objects/xx directory on average.  All was created
> > in half the runtime of the previous yet with standard
> > --window=10 and --depth=50 parameters.  The data in the
> > packfile was 270MB uncompressed in 35976 blobs.  Operations
> > such as "git-log --pretty=oneline" were about 30X faster
> > on a cold cache and 2 to 3X faster otherwise.  Process sizes
> > remained reasonable.
>
> I think more reasonable comparison to figure out what is really
> going on would be to create such a pack with the same 0/0 window
> and depth (i.e. "keeping the huge objects out of the pack" would
> be the only difference with the "horrible" case).  With huge
> packs, I wouldn't be surprised if seeking to extract base object
> from a far away part of a packfile takes a lot longer than
> reading delta and applying the delta to base object that is kept
> in the in-core delta base cache.
Yes, changing only one variable at a time would be better.
I will do that experiment.  However, the huge pack _did_ have
0/0, and the small pack had default/default, which I think is the
reverse of what you concluded above.  So the experiment should
make things no better for the huge pack case.

> Also if you mean by "process size" the total VM size, not RSS, I
> think it is a wrong measure.  As long as you do not touch the
> rest of the pack, even if you mmap a huge packfile, you would
> not bring that much data actually into your main memory, would
> you?  Well, assuming that your mmap() implementation and virtual
> memory subsystem does a decent job... maybe we are spoiled by
> Linux here...
You are right that the VM number was more shocking, but both
were too high.  But let's compare using 12GB+ of packfiles versus 13MB.
In the former case, I'm depending on the sliding mmap windows doing
the right thing in an operating regime no one uses (which is why
Shawn was asking about my packedGitLimit settings etc.), and in the
latter case, the packfile is <10% of the linux2.6 packfile but I have
to endure an extra open/mmap/munmap/close sequence when accessing
enormous files.  The small extra cost of the latter is more attractive
to me than an unknown amount of tuning to get the former right,
and in the former case I still have to figure out how to *create*
the packfiles efficiently.
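
For reference, the extra open/mmap/munmap/close sequence for one loose
object looks roughly like the Python sketch below.  It is an
illustration under stated assumptions, not git's own code: the helper
name read_loose_object is made up, and it relies only on the loose
object layout of a zlib-compressed "<type> <size>\0<content>" stream.

```python
# Hypothetical sketch of reading one loose object; not git's implementation.
import mmap
import os
import zlib


def read_loose_object(path):
    """Return (type, content) for a zlib-compressed loose object file."""
    fd = os.open(path, os.O_RDONLY)                      # open
    try:
        # mmap the whole file read-only; leaving the "with" block unmaps it.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
            raw = zlib.decompress(mapped)
    finally:
        os.close(fd)                                     # close
    # The decompressed stream is "<type> <size>\0<content>".
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content
```

That is four syscalls plus one inflate per object, which is small next
to the cost of handling a 100MB+ blob.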

There's actually an even more extreme example from my day job.
The software team has a project whose files/revisions would be
similar to those in the linux kernel (larger commits, I'm sure).
But they have *ONE* 500MB file they check in because it takes
2 or 3 days to generate and different people use different versions of it.
I'm sure it has 50+ revisions now.  If they converted to git and included
these blobs in their packfile, that's a 25GB uncompressed increase!
*Every* git operation would then have to wade through 10X -- 100X more
packfile.  Or the revisions could be kept as 50+ loose objects in
objects/xx, requiring a few extra syscalls by each user to get a new
version.
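
The 25GB figure is just revisions times file size.  A small Python
sketch of that arithmetic follows; the function name is made up for
the example, and 50 revisions of 500MB are the rough numbers quoted
above, using decimal units (1GB = 1000MB).

```python
# Hypothetical helper for the back-of-the-envelope estimate above.
def uncompressed_growth_gb(file_size_mb, revisions):
    """Uncompressed size, in GB, of storing every revision in full."""
    return file_size_mb * revisions / 1000


if __name__ == "__main__":
    print(uncompressed_growth_gb(500, 50))
```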

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

