From: "Dana How" <danahow@gmail.com>
To: "Jakub Narebski" <jnareb@gmail.com>
Cc: git@vger.kernel.org, danahow@gmail.com,
	"Junio C Hamano" <junkio@cox.net>
Subject: Re: [PATCH] Prevent megablobs from gunking up git packs
Date: Tue, 22 May 2007 09:59:29 -0700	[thread overview]
Message-ID: <56b7f5510705220959x1b37a4adk537cc0cba1a27530@mail.gmail.com> (raw)
In-Reply-To: <f2uigr$ufj$1@sea.gmane.org>

On 5/22/07, Jakub Narebski <jnareb@gmail.com> wrote:
> Dana How wrote:
> > There's actually an even more extreme example from my day job.
> > The software team has a project whose files/revisions would be
> > similar to those in the linux kernel (larger commits, I'm sure).
> > But they have *ONE* 500MB file they check in because it takes
> > 2 or 3 days to generate and different people use different versions of it.
> > I'm sure it has 50+ revisions now. If they converted to git and included
> > these blobs in their packfile, that's a 25GB uncompressed increase!
> > *Every* git operation must wade through 10X -- 100X more packfile.
> > Or it could be kept in 50+ loose objects in objects/xx ,
> > requiring a few extra syscalls by each user to get a new version.
> Or keep those large objects in a separate, _kept_ packfile containing
> only those objects (which can delta well, even if they are large).

Yes, I experimented with various changes to git-repack, including
having it create .keep files, just before coming up with the
maxblobsize approach.  The problem with a 12GB+ repo is not only the
long repack time, but the fact that repack time keeps growing with
the repo size.  So, with split packs, I had repack create .keep files
for all new packs except the last (fragmentary) one.  The next repack
would then only repack new objects plus the single fragmentary pack,
keeping repack time from growing (until you deleted the .keep files
[just the ones with "repack" in them] to start over from scratch).
But this approach is not going to distribute commits and trees all
that well.
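
For the record, the bookkeeping after such a split repack amounts to
something like this (an illustrative sketch, not the actual script;
the real change lived inside git-repack):

  # Sketch only: after a split repack, mark every new pack except
  # the newest (fragmentary) one as kept.  git only checks that the
  # .keep file exists; writing "repack" into it lets a later cleanup
  # find and delete just these auto-generated keeps.
  # Run from the top of the work tree.
  cd .git/objects/pack &&
  for p in $(ls -t pack-*.pack | tail -n +2)
  do
      echo repack >"${p%.pack}.keep"
  done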

Last night, before signing off, Junio proposed some partitioning
ideas.  He presented them as ordering things *within* one pack; what
I had tried was making repack operate in two passes: the first would
create pack(s) containing commits+trees+tags, the second would create
pack(s) containing only blobs.  Of course the first group would
contain only one tiny pack, and the second six or seven enormous
packs.  I also combined this with the .keep scheme from the previous
paragraph, putting .keep files on all but the last pack in each
group.  Then the metadata always got repacked, and the blob data only
got its "tail" repacked.
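
In outline, the two passes look something like this (a slow,
illustrative sketch -- one cat-file fork per object is fine for
showing the idea but hopeless on 12GB; the real change lived inside
git-repack):

  # Pass 1: pack everything except blobs (commits, trees, tags).
  # The "pack" base name keeps the standard pack-<sha1>.pack naming
  # so git can still discover the result.
  git rev-list --objects --all |
  while read sha1 path
  do
      case $(git cat-file -t "$sha1") in
      blob) ;;                      # skipped in this pass
      *) echo "$sha1 $path" ;;
      esac
  done |
  git pack-objects .git/objects/pack/pack

  # Pass 2: the same walk, keeping only the blobs.
  git rev-list --objects --all |
  while read sha1 path
  do
      case $(git cat-file -t "$sha1") in
      blob) echo "$sha1 $path" ;;
      esac
  done |
  git pack-objects .git/objects/pack/pack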

Let's just stipulate that you've convinced me that putting everything
in packs, and not ejecting megablobs, is better or equivalent on
the "central" git repository which will replace (part of) our Perforce
repository.  What about the users' repositories?

Each person at my day job has his own workstation.  The workstations
are all on a grid and are constantly running jobs in the background.
Each person would have at least one personal repo.  What should the
packing strategy be there?

(1) If we must put everything in packs, then we could:
(1a) Repack everything in local repos, incurring large local runtimes.
       This extra work denies CPU cycles to the grid, which WILL be
       noticed and cause much whining.  So the response will be to
       reduce window and/or turn on nodelta for some group of
       objects, worsening packing and failing to squash the whining.
       This happens across 20 to 30 workstations.  Or we reduce the
       frequency of repacking and stagger it across the network.
       Since a daily pull/fetch/checkout ("sync" in p4 parlance)
       grabs 400+ new revisions, weekly repacking leaves
       400*5/2 = 1000 extra loose blobs on average (the backlog grows
       from 0 to 2000 over a 5-day week), and there will still be
       whining.  Why not just set maxblobsize to some size resulting
       in ~1000 loose blobs, leave window/depth at default, and enjoy
       <1hr repacking?
(1b) Repack everything ONLY in the central repo, and have the users'
       repos point to it as an alternate (see the sketch after this
       list).  Now we have enormous network traffic.  However, this
       is better than (1a), and was what I thought I'd be stuck with.
       We still have the possible problem of excessive packing time
       on the central repo, but it's easier to solve/hide in just
       one place.
(2) We repack everything but leave megablobs loose.  Now packfiles
      are 13MB, repack time with default window/depth is <1hr, and
      we can repack each user's repository from his own cron job
      (see the sketch after this list).  This will be noticed, but
      it won't cause too much complaining.  Most git operations by
      users will be against their local repos, but the server's db
      will still be an alternate from which to fetch at least the
      megablobs.  This is no problem compared to Perforce, which
      stores *NO* repository state locally at all.
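
For concreteness: hooking a user's repo up to the central one in
(1b) uses the standard alternates mechanism (the central path below
is made up):

  # Borrow objects from the central repository instead of copying
  # them; run inside the user's repo.
  echo /central/project.git/objects >.git/objects/info/alternates

and the per-user repack in (2) is just a crontab entry along these
lines (time and path are illustrative):

  # Repack quietly every night at 3am; with megablobs kept loose
  # this stays under an hour.
  0 3 * * * cd $HOME/project && git repack -a -d -q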

I really think megablob ejection from packs makes a lot of sense for
local repos on a network of workstations.  It lets me keep almost all
repo state locally very cheaply.  It is just another consequence of
the fact that an adequate solution operating principally on only 13MB
of data doesn't have to work as hard, or as carefully, as one
operating on the full 12GB -- three orders of magnitude more.

If there's interest, I could submit my other alterations to
git-repack.  They still have bugs which would take a while to work
out, since each run operates on 12GB of data.  Thanks to its much
shorter runtimes, maxblobsize was far quicker to debug even though
I made more stupid mistakes at first ;-)

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

