git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Stephen Morton <stephen.c.morton@gmail.com>
To: git@vger.kernel.org
Subject: Re: Git Scaling: What factors most affect Git performance for a large repo?
Date: Fri, 20 Feb 2015 11:06:44 -0500	[thread overview]
Message-ID: <CAH8BJxEWDb0SDHPS_ZnPzz0QEbryw2GCv2RtJm2u_6rPH566hg@mail.gmail.com> (raw)
In-Reply-To: <20150220000320.GD5021@vauxhall.crustytoothpaste.net>

This is fantastic. I really appreciate all the answers. And it's great
that I think I've sparked some general discussion that could lead
somewhere too.

Notes:

I'm currently using 2.1.3. I'll move to 2.3.x

I'm experimenting with git-annex to reduce repo size on disk. We'll see.

I could remove all tags older than /n/ years old in the active repo
and just maintain them in the historical repo. (We have quite a lot of
CI-generated tags.) It sounds like that might improve performance.

Questions:

1. Ævar : I'm a bit concerned by your statement that git rebases take
about 1-2 s per commit. Does that mean that a "git pull --rebase", if
it is picking up say 120 commits (not at all unrealistic), could
potentially take 4 minutes to complete? Or have I misinterpreted your
comment.

2. I'd not heard about bitmap indexes before this thread but it sounds
like they should help me. In limited searching I can't find much
useful documentation about them. It is also not clear to me if I have
to explicitly run "git repack --write-bitmap-indexes" or if git will
automatically detect when they're needed; first experiments seem to
indicate that I need to explicitly generate them. I assume that once
the index is there, git will just use it automatically.


Steve


On Thu, Feb 19, 2015 at 7:03 PM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On Thu, Feb 19, 2015 at 04:26:58PM -0500, Stephen Morton wrote:
>> I posted this to comp.version-control.git.user and didn't get any response. I
>> think the question is plumbing-related enough that I can ask it here.
>>
>> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
>> large repo. [1] We will have a central repo using GitLab (or similar) that
>> everybody works with. Forks, code sharing, pull requests etc. will be done
>> through this central server.
>>
>> By 'performance', I guess I mean speed of day to day operations for devs.
>>
>>    * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>>    * Will a few simultaneous clones from the central server also slow down
>>      other concurrent operations for other users?
>
> This hasn't been a problem for us at $DAYJOB.  Git doesn't lock anything
> on fetches, so each process is independent.  We probably have about
> sixty developers (and maybe twenty other occasional users) that manage
> to interact with our Git server all day long.  We also have probably
> twenty smoker (CI) systems pulling at two hour intervals, or, when
> there's nothing to do, every two minutes, plus probably fifteen to
> twenty build systems pulling hourly.
>
> I assume you will provide adequate resources for your server.
>
>>    * Will 'git pull' be slow?
>>    * 'git push'?
>
> The most pathological case I've seen for git push is a branch with a
> single commit merged into the main development branch.  As of Git 2.3.0,
> the performance regression here is fixed.
>
> Obviously, the speed of your network connection will affect this.  Even
> at 30 MB/s, cloning several gigabytes of data takes time.  Git tries
> hard to eliminate sending a lot of data, so if your developers keep
> reasonably up-to-date, the cost of establishing the connection will tend
> to dominate.
>
> I see pull and push times that are less than 2 seconds in most cases.
>
>>    * 'git commit'? (It is listed as slow in reference [3].)
>>    * 'git stautus'? (Slow again in reference 3 though I don't see it.)
>
> These can be slow with slow disks or over remote file systems.  I
> recommend not doing that.  I've heard rumbles that disk performance is
> better on Unix, but I don't use Windows so I can't say.
>
> You should keep your .gitignore files up-to-date to avoid enumerating
> untracked files.  There's some work towards making this less of an
> issue.
>
> git blame can be somewhat slow, but it's not something I use more than
> about once a day, so it doesn't bother me that much.
>
>> Assuming I can put lots of resources into a central server with lots of CPU,
>> RAM, fast SSD, fast networking, what aspects of the repo are most likely to
>> affect devs' experience?
>>    * Number of commits
>>    * Sheer disk space occupied by the repo
>
> The number of files can impact performance due to the number of stat()s
> required.
>
>>    * Number of tags.
>>    * Number of branches.
>
> The number of tags and branches individually is really less relevant
> than the total number of refs (tags, branches, remote branches, etc).
> Very large numbers of refs can impact performance on pushes and pulls
> due to the need to enumerate them all.
>
>>    * Binary objects in the repo that cause it to bloat in size [1]
>>    * Other factors?
>
> If you want good performance, I'd recommend the latest version of Git
> both client- and server-side.  Newer versions of Git provide pack
> bitmaps, which can dramatically speed up clones and fetches, and Git
> 2.3.0 fixes a performance regression with large numbers of refs in
> non-shallow repositories.
>
> It is totally worth it to roll your own packages of git if your vendor
> provides old versions.
>
>> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
>> networking-- which is most critical here?
>
> I generally find that having a good disk cache is important with large
> repositories.  It may be advantageous to make sure the developer
> machines have adequate memory.  Performance is notably better on
> development machines (VMs) with 2 GB or 4 GB of memory instead of 1 GB.
>
> I can't speak to the server side, as I'm not directly involved with its
> deployment.
>
>> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo,
>> 50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up
>> branches" which are just one little dangling commit required to change the code
>> a little bit between a commit a tag that was not quite made from it.)
>
> I routinely work on a repo that's 1.9 GB packed, with 25k (and rapidly
> growing) refs.  Other developers work on a repo that's 9 GB packed, with
> somewhat fewer refs.  We don't tend to have problems with this.
>
> Obviously, performance is better on some of our smaller repos, but it's
> not unacceptable on the larger ones.  I generally find that the 940 KB
> repo with huge numbers of files performs worse than the 1.9 GB repo with
> somewhat fewer.  If you can split your repository into multiple logical
> repositories, that will certainly improve performance.
>
> If you end up having pain points, we're certainly interested in
> working through those.  I've brought up performance problems and people
> are generally responsive.
> --
> brian m. carlson / brian with sandals: Houston, Texas, US
> +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
> OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

  reply	other threads:[~2015-02-20 16:06 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-19 21:26 Git Scaling: What factors most affect Git performance for a large repo? Stephen Morton
2015-02-19 22:21 ` Stefan Beller
2015-02-19 23:06   ` Stephen Morton
2015-02-19 23:15     ` Stefan Beller
2015-02-19 23:29 ` Ævar Arnfjörð Bjarmason
2015-02-20  0:04   ` Duy Nguyen
2015-02-20 12:09     ` Ævar Arnfjörð Bjarmason
2015-02-20 12:11       ` Ævar Arnfjörð Bjarmason
2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
2015-02-20 21:04         ` Junio C Hamano
2015-03-02 19:36           ` Ævar Arnfjörð Bjarmason
2015-03-02 20:15             ` Junio C Hamano
2015-02-20 22:02         ` Sebastian Schuberth
2015-02-24 12:44         ` Michael Haggerty
2015-03-02 19:42           ` Ævar Arnfjörð Bjarmason
2015-02-21  3:51       ` Duy Nguyen
2015-02-19 23:38 ` Duy Nguyen
2015-02-20  0:42   ` David Turner
2015-02-20 20:59     ` Junio C Hamano
2015-02-23 20:23       ` David Turner
2015-02-21  4:01     ` Duy Nguyen
2015-02-25 12:02       ` Duy Nguyen
2015-02-20  0:03 ` brian m. carlson
2015-02-20 16:06   ` Stephen Morton [this message]
2015-02-20 16:38     ` Matthieu Moy
2015-02-20 17:16     ` brian m. carlson
2015-02-20 22:08   ` Sebastian Schuberth
2015-02-20 22:58     ` brian m. carlson
  -- strict thread matches above, loose matches on Subject: below --
2015-02-20  6:57 Martin Fick
2015-02-20 18:29 ` David Turner
2015-02-20 20:37   ` Martin Fick
2015-02-21  0:41     ` David Turner
2015-02-20 19:27 ` Randall S. Becker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAH8BJxEWDb0SDHPS_ZnPzz0QEbryw2GCv2RtJm2u_6rPH566hg@mail.gmail.com \
    --to=stephen.c.morton@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).