Re: Git performance results on a large repository

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Sam Vilain <sam@vilain.net>
To: Joshua Redstone <joshua.redstone@fb.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: Git performance results on a large repository
Date: Fri, 03 Feb 2012 14:40:54 -0800	[thread overview]
Message-ID: <4F2C6276.1070100@vilain.net> (raw)
In-Reply-To: <CB5179E9.3B751%joshua.redstone@fb.com>

Joshua,

You have an interesting use case.

If I were you I'd consider investigating the git fast-import protocol. 
It has become bi–directional, and is essentially socket access to a git 
repository with read and transactional update capability.  With a few 
more commands implemented, it may even be capable of providing all 
functionality required for command–line git use.

It is already possible that the ".git" directory can be a file: this 
case is used for submodules in git 1.7.8 and higher.  For this use case, 
there would be an extra field to the ".git" file which is created.  It 
would indicate the hostname (and port) to connect its internal 
'fast-import' stream to.  'clone' would consist of creating this file, 
and then getting the server to stream the objects from its pack to the 
client.

With the hard–working part of git on the other end of a network service, 
you could back it by a re–implementation of git which is written to be 
distributed in Hadoop.  There are at least two similar implementations 
of git that are like this: one for cassandra which was written by github 
as a research project, and Google's implementation on top of their 
BigTable/GFS/whatever.  As the git object storage model is write–only 
and content–addressed, it should git this kind of scaling well.

There have also been designs at various times for sparse check–outs; ie 
check–outs where you don't check out the root of the repository but a 
sub–tree.

With both of these features, clients could easily check out a small part 
of the repository very quickly.  This is probably the only case which 
SVN still does better than git at, which is a particular blocker for use 
cases like repositories with large binaries in them and for projects 
such as the one you have (another one with a similar problem was KDE, 
where their projects moved around the repository a lot, and refactoring 
touched many projects simultaneously at times).

It's a large undertaking, alright.

Sam,
just another git community propeller–head.


On 2/3/12 9:00 AM, Joshua Redstone wrote:
> Hi Ævar,
>
>
> Thanks for the comments.  I've included a bunch more info on the test repo
> below.  It is based on a growth model of two of our current repositories
> (I.e., it's not a perforce import). We already have some of the easily
> separable projects in separate repositories, like HPHP.   If we could
> split our largest repos into multiple ones, that would help the scaling
> issue.  However, the code in those repos is rather interdependent and we
> believe it'd hurt more than help to split it up, at least for the
> medium-term future.  We derive a fair amount of benefit from the code
> sharing and keeping things together in a single repo, so it's not clear
> when it'd make sense to get more aggressive splitting things up.
>
> Some more information on the test repository:   The working directory is
> 9.5 GB, the median file size is 2 KB.  The average depth of a directory
> (counting the number of '/'s) is 3.6 levels and the average depth of a
> file is 4.6.  More detailed histograms of the repository composition is
> below:
>
> ------------------------
>
> Histogram of depth of every directory in the repo (dirs=`find . -type d` ;
> (for dir in $dirs; do t=${dir//[^\/]/}; echo ${#t} ; done) |
> ~/tmp/histo.py)
> * The .git directory itself has only 161 files, so although included,
> doesn't affect the numbers significantly)
>
> [0.0 - 1.3): 271
> [1.3 - 2.6): 9966
> [2.6 - 3.9): 56595
> [3.9 - 5.2): 230239
> [5.2 - 6.5): 67394
> [6.5 - 7.8): 22868
> [7.8 - 9.1): 6568
> [9.1 - 10.4): 420
> [10.4 - 11.7): 45
> [11.7 - 13.0]: 21
> n=394387 mean=4.671830, median=5.000000, stddev=1.272658
>
>
> Histogram of depth of every file in the repo (files=`git ls-files` ; (for
> file in $files; do t=${file//[^\/]/}; echo ${#t} ; done) | ~/tmp/histo.py)
> * 'git ls-files' does not prefix entries with ./, like the 'find' command
> above, does, hence why the average appears to be the same as the directory
> stats
>
> [0.0 - 1.3]: 1274
> [1.3 - 2.6]: 35353
> [2.6 - 3.9]: 196747
> [3.9 - 5.2]: 786647
> [5.2 - 6.5]: 225913
> [6.5 - 7.8]: 77667
> [7.8 - 9.1]: 22130
> [9.1 - 10.4]: 1599
> [10.4 - 11.7]: 164
> [11.7 - 13.0]: 118
> n=1347612 mean=4.655750, median=5.000000, stddev=1.278399
>
>
> Histogram of file sizes (for first 50k files - this command takes a
> while):  files=`git ls-files` ; (for file in $files; do stat -c%s $file ;
> done) | ~/tmp/histo.py
>
> [ 0.0 - 4.7): 0
> [ 4.7 - 22.5): 2
> [ 22.5 - 106.8): 0
> [ 106.8 - 506.8): 0
> [ 506.8 - 2404.7): 31142
> [ 2404.7 - 11409.9): 17837
> [ 11409.9 - 54137.1): 942
> [ 54137.1 - 256866.9): 53
> [ 256866.9 - 1218769.7): 18
> [ 1218769.7 - 5782760.0]: 5
> n=49999 mean=3590.953239, median=1772.000000, stddev=42835.330259
>
> Cheers,
> Josh
>
>
>
>
>
>
> On 2/3/12 9:56 AM, "Ævar Arnfjörð Bjarmason"<avarab@gmail.com>  wrote:
>
>> On Fri, Feb 3, 2012 at 15:20, Joshua Redstone<joshua.redstone@fb.com>
>> wrote:
>>
>>> We (Facebook) have been investigating source control systems to meet our
>>> growing needs.  We already use git fairly widely, but have noticed it
>>> getting slower as we grow, and we want to make sure we have a good story
>>> going forward.  We're debating how to proceed and would like to solicit
>>> people's thoughts.
>>
>> Where I work we also have a relatively large Git repository. Around
>> 30k files, a couple of hundred thousand commits, clone size around
>> half a GB.
>>
>> You haven't supplied background info on this but it really seems to me
>> like your testcase is converting something like a humongous Perforce
>> repository directly to Git.
>>
>> While you /can/ do this it's not a good idea, you should split up
>> repositories at the boundaries code or data doesn't directly cross
>> over, e.g. there's no reason why you need HipHop PHP in the same
>> repository as Cassandra or the Facebook chat system, is there?
>>
>> While Git could better with large repositories (in particular applying
>> commits in interactive rebase seems to be to slow down on bigger
>> repositories) there's only so much you can do about stat-ing 1.3
>> million files.
>>
>> A structure that would make more sense would be to split up that giant
>> repository into a lot of other repositories, most of them probably
>> have no direct dependencies on other components, but even those that
>> do can sometimes just use some other repository as a submodule.
>>
>> Even if you have the requirement that you'd like to roll out
>> *everything* at a certain point in time you can still solve that with
>> a super-repository that has all the other ones as submodules, and
>> creates a tag for every rollout or something like that.
>
> N�����r��y���b�X��ǧv�^�)޺{.n�+����ا�\x17��ܨ}���Ơz�&j:+v���\a����zZ+��+zf���h���~����i���z�\x1e�w���?����&�)ߢ^[fl===

next prev parent reply	other threads:[~2012-02-03 22:51 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-03 14:20 Git performance results on a large repository Joshua Redstone
2012-02-03 14:56 ` Ævar Arnfjörð Bjarmason
2012-02-03 17:00   ` Joshua Redstone
2012-02-03 22:40     ` Sam Vilain [this message]
2012-02-03 22:57       ` Sam Vilain
2012-02-07  1:19       ` Nguyen Thai Ngoc Duy
2012-02-03 23:05     ` Matt Graham
2012-02-04  1:25   ` Evgeny Sazhin
2012-02-03 23:35 ` Chris Lee
2012-02-04  0:01 ` Zeki Mokhtarzada
2012-02-04  5:07 ` Joey Hess
2012-02-04  6:53 ` Nguyen Thai Ngoc Duy
2012-02-04 18:05   ` Joshua Redstone
2012-02-05  3:47     ` Nguyen Thai Ngoc Duy
2012-02-06 15:40       ` Joey Hess
2012-02-07 13:43         ` Nguyen Thai Ngoc Duy
2012-02-09 21:06           ` Joshua Redstone
2012-02-10  7:12             ` Nguyen Thai Ngoc Duy
2012-02-10  9:39               ` Christian Couder
2012-02-10 12:24                 ` Nguyen Thai Ngoc Duy
2012-02-06  7:10     ` David Mohs
2012-02-06 16:23     ` Matt Graham
2012-02-06 20:50       ` Joshua Redstone
2012-02-06 21:07         ` Greg Troxel
2012-02-07  1:28         ` david
2012-02-06 21:17     ` Sam Vilain
2012-02-04 20:05   ` Joshua Redstone
2012-02-05 15:01   ` Tomas Carnecky
2012-02-05 15:17     ` Nguyen Thai Ngoc Duy
2012-02-04  8:57 ` slinky
2012-02-04 21:42 ` Greg Troxel
2012-02-05  4:30 ` david
2012-02-05 11:24   ` David Barr
2012-02-07  8:58 ` Emanuele Zattin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F2C6276.1070100@vilain.net \
    --to=sam@vilain.net \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=joshua.redstone@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).