git.vger.kernel.org archive mirror
* Git Scaling: What factors most affect Git performance for a large repo?
@ 2015-02-19 21:26 Stephen Morton
  2015-02-19 22:21 ` Stefan Beller
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Stephen Morton @ 2015-02-19 21:26 UTC (permalink / raw)
  To: git

I posted this to comp.version-control.git.user and didn't get any response. I
think the question is plumbing-related enough that I can ask it here.

I'm evaluating the feasibility of moving my team from SVN to git. We have a very
large repo. [1] We will have a central repo using GitLab (or similar) that
everybody works with. Forks, code sharing, pull requests etc. will be done
through this central server.

By 'performance', I guess I mean speed of day to day operations for devs.

   * (Obviously, trivially, a (non-local) clone will be slow with a large repo.)
   * Will a few simultaneous clones from the central server also slow down
     other concurrent operations for other users?
   * Will 'git pull' be slow?
   * 'git push'?
   * 'git commit'? (It is listed as slow in reference [3].)
   * 'git status'? (Slow again in reference [3], though I don't see it myself.)
   * Some operations might not seem to be day-to-day, but if they are called
     frequently by the web front-end to GitLab/Stash/GitHub etc., they can
     become bottlenecks. (E.g. 'git branch --contains' seems to degrade badly
     with large numbers of branches.)
   * Others?
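The per-operation questions above can be answered empirically. A minimal timing harness might look like the sketch below (the command list and repo path are placeholders, not a recommendation of which operations matter most):

```python
import subprocess
import sys
import time

def time_cmd(argv, cwd=None, runs=3):
    """Run a command several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, cwd=cwd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__" and len(sys.argv) > 1:
    repo = sys.argv[1]  # path to the repo under test
    for argv in (["git", "status"], ["git", "branch", "--contains", "HEAD"]):
        print(" ".join(argv), "-> %.3fs" % time_cmd(argv, cwd=repo))
```

Taking the best of several runs reduces noise from cold caches, which matters a lot for 'git status' on a large working tree.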


Assuming I can put lots of resources into a central server with lots of CPU,
RAM, fast SSD, fast networking, what aspects of the repo are most likely to
affect devs' experience?
   * Number of commits
   * Sheer disk space occupied by the repo
   * Number of tags.
   * Number of branches.
   * Binary objects in the repo that cause it to bloat in size [1]
   * Other factors?

Of the various HW items listed above (CPU speed, number of cores, RAM, SSD,
networking), which is most critical here?

(Stash recommends available RAM of 1.5 x repo_size x number of concurrent
clones. I assume that is good advice in general.)
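Taking that rule of thumb at face value, the arithmetic is trivial to sketch (the figures plugged in below are only an example):

```python
def stash_ram_gib(repo_size_gib, concurrent_clones, factor=1.5):
    """Stash's rule of thumb: RAM >= 1.5 x repo size x concurrent clones."""
    return factor * repo_size_gib * concurrent_clones

# A 15 GiB repo serving 4 concurrent clones would want 90 GiB of RAM.
print(stash_ram_gib(15, 4))  # 90.0
```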

Assume ridiculous numbers. Let me exaggerate: say 1 million commits, a 15 GB
repo, 50k tags, 1,000 branches. (Due to historical code fixups, there are
another 5,000 "fix-up branches", each just one little dangling commit required
to change the code slightly between a commit and a tag that was not quite made
from it.)

While there's lots of information online, much of it is old [3], and with git
constantly evolving I don't know how valid it still is. Then there's anecdotal
evidence of questionable value. [2]

Are many/all of the issues Facebook identified [3] resolved? (Yes, I
understand Facebook went with Mercurial. But I imagine the git team
nevertheless took their analysis to heart.)


Thanks,
Steve


[1] (Yes, I'm investigating ways to make our repo not so large etc. That's
    beyond the scope of the discussion I'd like to have with this
    question. Thanks.)
[2] The large amounts of anecdotal evidence relate to the "why don't you try it
    yourself?" response to my question. I will if I have to, but setting up a
    properly methodical study is time-consuming and difficult (I don't want to
    produce poor anecdotal numbers that don't really hold up), and if somebody's
    already done the work, then I should leverage it.
[3] http://thread.gmane.org/gmane.comp.version-control.git/189776

* Re: Git Scaling: What factors most affect Git performance for a large repo?
@ 2015-02-20  6:57 Martin Fick
  2015-02-20 18:29 ` David Turner
  2015-02-20 19:27 ` Randall S. Becker
  0 siblings, 2 replies; 33+ messages in thread
From: Martin Fick @ 2015-02-20  6:57 UTC (permalink / raw)
  To: David Turner; +Cc: Git Mailing List, Stephen Morton, Duy Nguyen

On Feb 19, 2015 5:42 PM, David Turner <dturner@twopensource.com> wrote:
>
> On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote: 
> > >    * 'git push'? 
> > 
> > This one is not affected by how deep your repo's history is, or how 
> > wide your tree is, so should be quick.. 
> > 
> > Ah the number of refs may affect both git-push and git-pull. I think 
> > Stefan knows better than I in this area. 
>
> I can tell you that this is a bit of a problem for us at Twitter.  We 
> have over 100k refs, which adds ~20MiB of downstream traffic to every 
> push. 
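That ~20 MiB figure is plausible from back-of-the-envelope math: in the traditional protocol the server advertises every ref as a pkt-line of roughly "4-byte length prefix + 40-hex sha + space + refname + LF". A rough estimator (the average refname length here is an assumption, not a measured value):

```python
def advertisement_bytes(ref_count, avg_refname_len=60):
    """Rough size of a ref advertisement: pkt-len(4) + sha(40) + SP + name + LF per ref."""
    per_ref = 4 + 40 + 1 + avg_refname_len + 1
    return ref_count * per_ref

# 100k refs at ~60-char names is already ~10 MiB, before capabilities
# and peeled annotated-tag lines (the ^{} entries), which can add more.
print(advertisement_bytes(100_000) / 2**20)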
>
> I added a hack to improve this locally inside Twitter: The client sends 
> a bloom filter of shas that it believes that the server knows about; the 
> server sends only the sha of master and any refs that are not in the 
> bloom filter.  The client uses its local version of the server's refs 
> as if they had just been sent.  This means that some packs will be 
> suboptimal, due to false positives in the bloom filter leading some new 
> refs to not be sent.  Also, if there were a repack between the pull and 
> the push, some refs might have been deleted on the server; we repack 
> rarely enough and pull frequently enough that this is hopefully not an 
> issue. 
>
> We're still testing to see if this works.  But due to the number of 
> assumptions it makes, it's probably not that great an idea for general 
> use. 
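For readers unfamiliar with the mechanism being described, a toy Bloom filter over ref shas behaves like this (a sketch only; the bit and hash counts are arbitrary, and Twitter's actual wire format is not shown):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-1 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(b"%d:%s" % (i, item.encode())).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return True for an item never added (false positive), but
        # never returns False for an added item (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# The client adds every sha it believes the server knows; the server then
# skips advertising refs whose sha tests positive in the filter.
bf = BloomFilter()
bf.add("e83c5163316f89bfbde7d9ab23ca2e25604af290")
print("e83c5163316f89bfbde7d9ab23ca2e25604af290" in bf)  # True
```

The asymmetry of the errors is what makes the scheme safe: a false positive only makes the resulting pack suboptimal, as David notes, while the absence of false negatives means no ref the client genuinely lacks is silently skipped.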

Good to hear that others are starting to experiment with solutions to this problem!  I hope to hear more updates on this.

I have a prototype of a simpler and, I believe, more robust solution, though aimed at a smaller use case. On connecting, the client sends a sha computed over all of its refs/shas as defined by a refspec (which it also sends to the server), covering the refs/shas it believes the server might hold the same values for. The server then calculates the corresponding sha over its own refs matching that refspec; if the "verification" sha matches, it omits sending those refs and instead sends only a confirmation that they matched (along with any refs outside of the refspec). On a match, the client can inject its local values of the refs that met the refspec and be guaranteed that they match the server's values.
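The "verification sha" idea can be sketched as below (the function name and the hashing of sorted "name sha" lines are my assumptions for illustration, not Gerrit's actual implementation):

```python
import fnmatch
import hashlib

def verification_sha(refs, refspec="refs/*"):
    """Hash all (refname, sha) pairs matching refspec, in sorted order, into one digest."""
    h = hashlib.sha1()
    for name in sorted(refs):
        if fnmatch.fnmatch(name, refspec):
            h.update(("%s %s\n" % (name, refs[name])).encode())
    return h.hexdigest()

client = {"refs/heads/master": "a" * 40, "refs/tags/v1.0": "b" * 40}
server = dict(client)

# Identical matching refs on both sides -> identical digests; the server
# can reply "match" instead of re-sending every ref.
print(verification_sha(client) == verification_sha(server))  # True
```

Sorting before hashing is what makes the digest independent of enumeration order on each side; any single differing sha or ref name changes the digest and falls back to a full advertisement.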

This optimization is aimed at the worst case scenario (and is thus potentially the best case "compression"): when the client and server match for all refs (a refs/* refspec). This happens often on Gerrit server startup, when it verifies that its mirrors are up-to-date. One reason I chose this as a starting optimization is that it is a use case which will not actually benefit from "fixing" the git protocol to send only relevant refs, since all the refs are relevant here! So something like this will likely be needed in any future git protocol for it to be efficient for this use case, and I believe this use case is likely to stick around.

With a minor tweak, this optimization should also work when replicating actual expected updates, by excluding the expected-to-update refs from the verification so that the server always sends their values (they will likely not match and would wreck the optimization). However, for this use case it is not clear whether the non-updating refs are even worth caring about. In theory, knowledge of the non-updating refs can reduce the amount of data transmitted, but I suspect that as the ref count increases this has diminishing returns and mostly ends up chewing up CPU and memory in a vain attempt to reduce network traffic.

Please do keep us up-to-date of your results,

-Martin


Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project


Thread overview: 33+ messages
2015-02-19 21:26 Git Scaling: What factors most affect Git performance for a large repo? Stephen Morton
2015-02-19 22:21 ` Stefan Beller
2015-02-19 23:06   ` Stephen Morton
2015-02-19 23:15     ` Stefan Beller
2015-02-19 23:29 ` Ævar Arnfjörð Bjarmason
2015-02-20  0:04   ` Duy Nguyen
2015-02-20 12:09     ` Ævar Arnfjörð Bjarmason
2015-02-20 12:11       ` Ævar Arnfjörð Bjarmason
2015-02-20 14:25       ` Ævar Arnfjörð Bjarmason
2015-02-20 21:04         ` Junio C Hamano
2015-03-02 19:36           ` Ævar Arnfjörð Bjarmason
2015-03-02 20:15             ` Junio C Hamano
2015-02-20 22:02         ` Sebastian Schuberth
2015-02-24 12:44         ` Michael Haggerty
2015-03-02 19:42           ` Ævar Arnfjörð Bjarmason
2015-02-21  3:51       ` Duy Nguyen
2015-02-19 23:38 ` Duy Nguyen
2015-02-20  0:42   ` David Turner
2015-02-20 20:59     ` Junio C Hamano
2015-02-23 20:23       ` David Turner
2015-02-21  4:01     ` Duy Nguyen
2015-02-25 12:02       ` Duy Nguyen
2015-02-20  0:03 ` brian m. carlson
2015-02-20 16:06   ` Stephen Morton
2015-02-20 16:38     ` Matthieu Moy
2015-02-20 17:16     ` brian m. carlson
2015-02-20 22:08   ` Sebastian Schuberth
2015-02-20 22:58     ` brian m. carlson
  -- strict thread matches above, loose matches on Subject: below --
2015-02-20  6:57 Martin Fick
2015-02-20 18:29 ` David Turner
2015-02-20 20:37   ` Martin Fick
2015-02-21  0:41     ` David Turner
2015-02-20 19:27 ` Randall S. Becker
