From: Stephen Morton <stephen.c.morton@gmail.com>
To: git@vger.kernel.org
Subject: Re: Git Scaling: What factors most affect Git performance for a large repo?
Date: Fri, 20 Feb 2015 11:06:44 -0500
Message-ID: <CAH8BJxEWDb0SDHPS_ZnPzz0QEbryw2GCv2RtJm2u_6rPH566hg@mail.gmail.com>
In-Reply-To: <20150220000320.GD5021@vauxhall.crustytoothpaste.net>
This is fantastic. I really appreciate all the answers. It's also great
that this seems to have sparked some general discussion that could lead
somewhere.
Notes:
I'm currently using 2.1.3. I'll move to 2.3.x.
I'm experimenting with git-annex to reduce repo size on disk. We'll see.
I could remove all tags older than /n/ years from the active repo
and just maintain them in the historical repo. (We have quite a lot of
CI-generated tags.) It sounds like that might improve performance.
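As a rough sketch of how the old tags might be identified: old_tags below
is a hypothetical helper name, and it assumes a Git recent enough for
for-each-ref to support the creatordate field, which may not be available
in 2.3.x.

```shell
# Hypothetical helper: print the names of tags created before a cutoff date.
# Assumption: "git for-each-ref" supports the creatordate field.
# Must be run from inside the repository.
old_tags() {
    cutoff=$1   # ISO date such as 2012-01-01; ISO dates sort correctly as strings
    git for-each-ref --format='%(creatordate:short) %(refname:short)' refs/tags |
        awk -v cutoff="$cutoff" '$1 < cutoff { print $2 }'
}

# Usage: old_tags 2012-01-01
```

The output is only a list of candidate names, so it can be reviewed before
anything is actually removed with "git tag -d".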
Questions:
1. Ævar: I'm a bit concerned by your statement that git rebases take
about 1-2 s per commit. Does that mean that a "git pull --rebase", if
it is picking up say 120 commits (not at all unrealistic), could
potentially take 4 minutes to complete? Or have I misinterpreted your
comment?
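To spell out the arithmetic behind that figure, assuming the cost really
is a constant 1-2 s per commit (rebase_estimate is a made-up helper name,
not a git command):

```shell
# Hypothetical helper: estimate total rebase time for N commits,
# assuming a constant cost of 1 to 2 seconds per commit.
rebase_estimate() {
    commits=$1
    low=$((commits * 1))
    high=$((commits * 2))
    echo "${low}-${high} seconds ($((low / 60))-$((high / 60)) minutes)"
}

rebase_estimate 120   # prints "120-240 seconds (2-4 minutes)"
```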
2. I'd not heard about bitmap indexes before this thread, but it sounds
like they should help me. In a limited search I couldn't find much
useful documentation about them. It is also not clear to me whether I have
to explicitly run "git repack --write-bitmap-index" or whether git will
automatically detect when they're needed; first experiments seem to
indicate that I need to generate them explicitly. I assume that once
the index is there, git will just use it automatically.
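For reference, a minimal sketch of generating the bitmap explicitly.
write_bitmaps is a hypothetical wrapper name; the commands inside it are
ordinary git commands, and setting repack.writeBitmaps is my assumption
about how to have later full repacks keep the bitmap up to date.

```shell
# Hypothetical wrapper: repack a repository into a single pack with a
# reachability bitmap, and ask future full repacks to write one as well.
write_bitmaps() {
    repo=$1
    # -a -d: put everything into one pack and drop the now-redundant ones;
    # --write-bitmap-index only makes sense together with a full repack.
    git -C "$repo" repack -a -d --write-bitmap-index &&
    git -C "$repo" config repack.writeBitmaps true
}

# Usage: write_bitmaps /path/to/repo
```

Afterwards a .bitmap file should sit next to the .pack file under
objects/pack in the repository.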
Steve
On Thu, Feb 19, 2015 at 7:03 PM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On Thu, Feb 19, 2015 at 04:26:58PM -0500, Stephen Morton wrote:
>> I posted this to comp.version-control.git.user and didn't get any response. I
>> think the question is plumbing-related enough that I can ask it here.
>>
>> I'm evaluating the feasibility of moving my team from SVN to git. We have a very
>> large repo. [1] We will have a central repo using GitLab (or similar) that
>> everybody works with. Forks, code sharing, pull requests etc. will be done
>> through this central server.
>>
>> By 'performance', I guess I mean speed of day to day operations for devs.
>>
>> * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
>> * Will a few simultaneous clones from the central server also slow down
>> other concurrent operations for other users?
>
> This hasn't been a problem for us at $DAYJOB. Git doesn't lock anything
> on fetches, so each process is independent. We probably have about
> sixty developers (and maybe twenty other occasional users) that manage
> to interact with our Git server all day long. We also have probably
> twenty smoker (CI) systems pulling at two hour intervals, or, when
> there's nothing to do, every two minutes, plus probably fifteen to
> twenty build systems pulling hourly.
>
> I assume you will provide adequate resources for your server.
>
>> * Will 'git pull' be slow?
>> * 'git push'?
>
> The most pathological case I've seen for git push is a branch with a
> single commit merged into the main development branch. As of Git 2.3.0,
> the performance regression here is fixed.
>
> Obviously, the speed of your network connection will affect this. Even
> at 30 MB/s, cloning several gigabytes of data takes time. Git tries
> hard to eliminate sending a lot of data, so if your developers keep
> reasonably up-to-date, the cost of establishing the connection will tend
> to dominate.
>
> I see pull and push times that are less than 2 seconds in most cases.
>
>> * 'git commit'? (It is listed as slow in reference [3].)
>> * 'git status'? (Slow again in reference [3] though I don't see it.)
>
> These can be slow with slow disks or over remote file systems. I
> recommend not doing that. I've heard rumbles that disk performance is
> better on Unix, but I don't use Windows so I can't say.
>
> You should keep your .gitignore files up-to-date to avoid enumerating
> untracked files. There's some work towards making this less of an
> issue.
>
> git blame can be somewhat slow, but it's not something I use more than
> about once a day, so it doesn't bother me that much.
>
>> Assuming I can put lots of resources into a central server with lots of CPU,
>> RAM, fast SSD, fast networking, what aspects of the repo are most likely to
>> affect devs' experience?
>> * Number of commits
>> * Sheer disk space occupied by the repo
>
> The number of files can impact performance due to the number of stat()s
> required.
>
>> * Number of tags.
>> * Number of branches.
>
> The number of tags and branches individually is really less relevant
> than the total number of refs (tags, branches, remote branches, etc).
> Very large numbers of refs can impact performance on pushes and pulls
> due to the need to enumerate them all.
>
>> * Binary objects in the repo that cause it to bloat in size [1]
>> * Other factors?
>
> If you want good performance, I'd recommend the latest version of Git
> both client- and server-side. Newer versions of Git provide pack
> bitmaps, which can dramatically speed up clones and fetches, and Git
> 2.3.0 fixes a performance regression with large numbers of refs in
> non-shallow repositories.
>
> It is totally worth it to roll your own packages of git if your vendor
> provides old versions.
>
>> Of the various HW items listed above --CPU speed, number of cores, RAM, SSD,
>> networking-- which is most critical here?
>
> I generally find that having a good disk cache is important with large
> repositories. It may be advantageous to make sure the developer
> machines have adequate memory. Performance is notably better on
> development machines (VMs) with 2 GB or 4 GB of memory instead of 1 GB.
>
> I can't speak to the server side, as I'm not directly involved with its
> deployment.
>
>> Assume ridiculous numbers. Let me exaggerate: say 1 million commits, 15 GB repo,
>> 50k tags, 1,000 branches. (Due to historical code fixups, another 5,000 "fix-up
>> branches" which are just one little dangling commit required to change the code
>> a little bit between a commit and a tag that was not quite made from it.)
>
> I routinely work on a repo that's 1.9 GB packed, with 25k (and rapidly
> growing) refs. Other developers work on a repo that's 9 GB packed, with
> somewhat fewer refs. We don't tend to have problems with this.
>
> Obviously, performance is better on some of our smaller repos, but it's
> not unacceptable on the larger ones. I generally find that the 940 KB
> repo with huge numbers of files performs worse than the 1.9 GB repo with
> somewhat fewer. If you can split your repository into multiple logical
> repositories, that will certainly improve performance.
>
> If you end up having pain points, we're certainly interested in
> working through those. I've brought up performance problems and people
> are generally responsive.
> --
> brian m. carlson / brian with sandals: Houston, Texas, US
> +1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
> OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187