Re: Git performance results on a large repository

From: Sam Vilain <sam@vilain.net>
To: Joshua Redstone <joshua.redstone@fb.com>
Cc: Nguyen Thai Ngoc Duy <pclouds@gmail.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Git performance results on a large repository
Date: Mon, 06 Feb 2012 13:17:15 -0800
Message-ID: <4F30435B.5070709@vilain.net>
In-Reply-To: <243C23AF01622E49BEA3F28617DBF0AD5912CA85@SC-MBX02-5.TheFacebook.com>

 > Sam Vilain: Thanks for the pointer, i didn't realize that
 > fast-import was bi-directional.  I used it for generating the
 > synthetic repo.  Will look into using it the other way around.
 > Though that still won't speed up things like git-blame,
 > presumably?

It could, because blame is an operation which primarily works on
the source history with little reference to the working copy.  Of
course this will depend on the quality of the implementation
server-side.  Blame should suit distribution over a cluster, as
it is mostly involved with scanning candidate revisions for
string matches, which is the compute-intensive part.  Coming up
with candidate revisions has its own cost and can probably also
be distributed, but just working on the lowest loop level might
be a good place to start.
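
As a rough illustration of that lowest loop level, here is a toy
sketch.  It is not how git blame works internally: real blame follows
lines through diffs, while this only checks whether a line is present
in a revision.  The names toy_blame and lines_present are made up for
the example, and the thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def lines_present(revision, wanted):
    """Return (rev_id, subset of the wanted lines found in this revision)."""
    rev_id, content = revision
    return rev_id, wanted & set(content.splitlines())


def toy_blame(history, final_text, workers=4):
    """Attribute each line of final_text to the oldest revision containing it.

    history is a list of (rev_id, file_text) pairs, oldest first.
    """
    wanted = set(final_text.splitlines())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each revision is scanned independently of the others; this is the
        # part that could be farmed out to separate machines.
        scanned = list(pool.map(lambda rev: lines_present(rev, wanted), history))

    origin = {}
    for rev_id, found in scanned:  # oldest first, so the first hit wins
        for line in found:
            origin.setdefault(line, rev_id)
    return [(origin.get(line), line) for line in final_text.splitlines()]
```

The scan step shares nothing between revisions, which is why it
parallelises easily; only the final merge needs the results in order.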

What it doesn't help with is local filesystem operations.  For
this I think a different approach is required: if you can tie
into fam or a similar inode change notification system, then you
should be able to avoid the entire recursive stat on 'git
status'.  I'm not sure --assume-unchanged on its own is a good
idea; you could easily miss things.  Those stats are useful.
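
To make the notification idea concrete, here is a minimal sketch.  It
assumes some watcher (fam, inotify or similar) calls notify() for each
path it sees change; that hook, and the DirtyTracker class itself, are
hypothetical and not part of git.

```python
import os


class DirtyTracker:
    def __init__(self):
        self.recorded = {}  # path -> (mtime_ns, size) when last recorded
        self.dirty = set()  # paths the notifier has reported since then

    def record(self, path):
        """Remember the current stat data, as the index would."""
        st = os.stat(path)
        self.recorded[path] = (st.st_mtime_ns, st.st_size)
        self.dirty.discard(path)

    def notify(self, path):
        """Called by the (hypothetical) change-notification hook."""
        self.dirty.add(path)

    def status(self):
        """Stat only the reported paths instead of walking the whole tree."""
        changed = []
        for path in sorted(self.dirty):
            try:
                st = os.stat(path)
                current = (st.st_mtime_ns, st.st_size)
            except FileNotFoundError:
                current = None  # deleted since it was recorded
            if self.recorded.get(path) != current:
                changed.append(path)
        return changed
```

Note that the stat still happens for each reported path, so a spurious
event costs little; a missed event, however, means a missed change,
which is the same risk as with --assume-unchanged.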

Making the index able to hold just changes to the checked-out
tree, as others have mentioned, would also save the massive reads
and writes you've identified.  Perhaps a higher-performance
back-end could be developed.

 > The sparse-checkout issue you mention is a good one.

It's actually been on the table since at least GitTogether 2008;
there's been some design discussion on it and I think it's just
one of those features which doesn't have enough demand yet for it
to be built.  It keeps coming up but not from anyone with the
inclination or resources to make it happen.  There is a protocol
issue, but this should be able to fit into the current extension
system.

 > There is a good question of how to support quick checkout,
 > branch switching, clone, push and so forth.

Sure.  It will be much more network intensive as you are
replacing the part which normally has a very fast link through
the buffercache to pack files etc.  A hybrid approach is also
possible, where objects are fetched individually via fast-import
and cached in a local .git repo.  And I have a hunch that LZOP
compression of the stream may also be a win, but as with all of
these ideas, it should be done only after profiling identifies it
as a choke point, rather than just because it sounds good.
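
The hybrid approach amounts to a read-through cache.  A minimal
sketch, assuming a fetch_remote callable that returns the bytes for an
object id (in practice it would talk to the server; here it is just a
placeholder) and a flat cache directory rather than a real .git
object store:

```python
import os


class CachingObjectStore:
    def __init__(self, cache_dir, fetch_remote):
        self.cache_dir = cache_dir
        self.fetch_remote = fetch_remote  # callable: object id -> bytes
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, oid):
        """Return the object, going to the network only on a cache miss."""
        path = os.path.join(self.cache_dir, oid)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch_remote(oid)
        # Write to a temporary name first so a crash never leaves a
        # truncated object under its final name.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
        return data
```

Objects are immutable once named by their id, so the cache never needs
invalidating, only pruning.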

 > I'll look into the approaches you suggest.  One consideration
 > is coming up with a high-leverage approach - i.e. not doing
 > heavy dev work if we can avoid it.

Right.  You don't actually need to port the whole of git to Hadoop
initially.  To begin with, it can just pass all commands through to a
server-side git fast-import process.  When you find specific operations
which are slow, those operations can be implemented using a
Hadoop back-end, with the rest handed to standard git.  If done using
a useful plug-in system, these systems could be accepted by the core
project as an enterprise scaling option.
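
The pass-through-with-overrides shape could look something like the
sketch below.  CommandRouter and register() are invented names, not an
existing git plug-in interface, and the fallback here simply runs the
local git executable as a stand-in for the server-side process.

```python
import subprocess


def run_standard_git(args):
    """Fallback: hand the command to the ordinary git executable."""
    result = subprocess.run(["git", *args], capture_output=True, text=True)
    return result.stdout


class CommandRouter:
    def __init__(self, fallback=run_standard_git):
        self.fallback = fallback
        self.accelerated = {}  # subcommand name -> specialised handler

    def register(self, subcommand, handler):
        """Install a specialised back-end for one slow subcommand."""
        self.accelerated[subcommand] = handler

    def run(self, args):
        """Use the specialised handler if one exists, else pass through."""
        handler = self.accelerated.get(args[0]) if args else None
        if handler is not None:
            return handler(args[1:])
        return self.fallback(args)
```

Starting with an empty table means everything goes to standard git on
day one; handlers get registered only for the operations profiling
shows to be slow.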

This could let you get going with the knowledge that the scaling option
is there should it come to that.

 > On the other hand, it would be nice if we (including the entire
 > community:) ) improve git in areas that others that share
 > similar issues benefit from as well.

Like I say, a lot of people have run into this already...

HTH,
Sam
