git.vger.kernel.org archive mirror
From: David Turner <dturner@twopensource.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Duy Nguyen <pclouds@gmail.com>,
	Stephen Morton <stephen.c.morton@gmail.com>,
	Git Mailing List <git@vger.kernel.org>
Subject: Re: Git Scaling: What factors most affect Git performance for a large repo?
Date: Mon, 23 Feb 2015 15:23:16 -0500
Message-ID: <1424722996.27803.29.camel@leckie>
In-Reply-To: <xmqqegpkz4cf.fsf@gitster.dls.corp.google.com>


On Fri, 2015-02-20 at 12:59 -0800, Junio C Hamano wrote:
> David Turner <dturner@twopensource.com> writes:
> 
> > On Fri, 2015-02-20 at 06:38 +0700, Duy Nguyen wrote:
> >> >    * 'git push'?
> >> 
> >> This one is not affected by how deep your repo's history is, or how
> >> wide your tree is, so should be quick..
> >> 
> >> Ah the number of refs may affect both git-push and git-pull. I think
> >> Stefan knows better than I in this area.
> >
> > I can tell you that this is a bit of a problem for us at Twitter.  We
> > have over 100k refs, which adds ~20MiB of downstream traffic to every
> > push.
> >
> > I added a hack to improve this locally inside Twitter: The client sends
> > a bloom filter of shas that it believes that the server knows about; the
> > server sends only the sha of master and any refs that are not in the
> > bloom filter.  The client uses its local version of the server's refs
> > as if they had just been sent....
> 
> Interesting.
> 
> Care to extend the discussion to improve the protocol exchange,
> which starts at $gmane/263932 [*1*], where I list the known issues
> around the current protocol (and a possible way to correct them in
> footnotes)?

At Twitter, we changed to an entirely different clone strategy for our
largest repo: instead of using git clone, we use BitTorrent (on a
tarball of the repo).  For git pull, we maintain a journal of all pushes
ever made to the server (data and ref updates); each client keeps track
of its own position in that journal.  So now pull does not require any
computation on the server; the client just requests the segment of the
journal that it doesn't have yet, and then replays that segment locally.
This scheme isn't perfect: clients end up with data about even
transitory and long-dead branches, and there is presently no way to
redact data (although that would be possible to add).  And of course
shallow and sparse clones are impossible.  But it works quite well for
Twitter's needs.  As I understand it, the hope is to implement redaction
and then submit patches upstream.
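
To make the journal idea concrete, here is a minimal sketch in Python.
It is not the actual implementation, which is not shown in this thread;
the Journal and Client classes, their fields, and the dict-based
"objects" store are all hypothetical names chosen for illustration.  It
only shows the shape of the scheme: the server appends every push to a
log, and a pull is a slice of that log followed by a local replay.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Server side: append-only log of pushes (hypothetical structure)."""
    entries: list = field(default_factory=list)

    def append(self, ref_updates, objects):
        # Record one push: its ref updates plus the object data it carried.
        # A ref mapped to None means the push deleted that ref.
        self.entries.append({"refs": dict(ref_updates),
                             "objects": dict(objects)})

    def segment(self, start):
        # No per-client computation: return everything past the offset.
        return self.entries[start:]


@dataclass
class Client:
    """Client side: remembers how far into the journal it has replayed."""
    position: int = 0
    refs: dict = field(default_factory=dict)
    objects: dict = field(default_factory=dict)

    def pull(self, journal):
        segment = journal.segment(self.position)
        for entry in segment:
            # Objects are only ever added, never removed on replay.
            self.objects.update(entry["objects"])
            for ref, sha in entry["refs"].items():
                if sha is None:
                    self.refs.pop(ref, None)
                else:
                    self.refs[ref] = sha
        self.position += len(segment)
        return len(segment)
```

Note that in this sketch a deleted branch disappears from the client's
refs but its objects stay in the local store, which mirrors the
drawback described above: clients keep data for transitory and
long-dead branches.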

I say "we", but I personally did not do any of the above work.  Because
I haven't looked into most of these issues personally, I'm reluctant to
say too much on protocol improvements.  I would want to better
understand the constraints.  I do think there is value in having a
diversity of possible protocols to handle different use cases.  As
repositories grow, traditional full-repo clones become less viable.
Network transfer and client-side performance both suffer.  In a repo the
size of (say) WebKit, the traditional model works.  In a repo the size
of Facebook's monorepo, it starts to break down.  So Facebook does
entirely shallow clones (using hg, but the problems are similar in git).
Commands like log and blame instead call out to a server to gather
history data.  At Google, whose repo is, I think, two or three orders of
magnitude larger than WebKit, all local copies are both shallow and
sparse; there is also support for "sparse commits" -- so that a commit
that affects (say) ten thousand files across the entire tree can be kept
to a reasonable size.

<end digression>

Twitter's journal scheme explains why I implemented Bloom filter pushes
-- the number of refs does not significantly affect pull performance,
but pushes still go through the normal git machinery, so we wanted an
optimization to reduce latency there.
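
For readers unfamiliar with the trick, the following Python sketch
shows roughly how a Bloom filter can shrink the ref advertisement.  It
is an illustration under assumptions, not the code used at Twitter: the
BloomFilter class, its sizing parameters, and the refs_to_advertise
helper are hypothetical, and the real change lives inside git's
push/receive exchange rather than in standalone functions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; the sizes are arbitrary illustrative defaults."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def refs_to_advertise(server_refs, client_filter,
                      always=("refs/heads/master",)):
    # Server side: always send master, and otherwise send only refs whose
    # sha the client's filter does not claim to know already.
    return {ref: sha for ref, sha in server_refs.items()
            if ref in always or sha not in client_filter}
```

The client would build the filter from the shas in its local copy of
the server's refs and send it along with the push request.  A Bloom
filter has no false negatives, so a sha the client really has is never
re-sent; a false positive means the server skips a ref the client does
not actually know, so a real implementation needs to tolerate that
case.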
