From: Junio C Hamano <gitster@pobox.com>
To: Martin Fick <mfick@codeaurora.org>
Cc: Jeff King <peff@peff.net>,
Michael Haggerty <mhagger@alum.mit.edu>,
git discussion list <git@vger.kernel.org>
Subject: Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)
Date: Tue, 22 May 2012 16:23:17 -0700 [thread overview]
Message-ID: <7v8vgj3imy.fsf@alter.siamese.dyndns.org> (raw)
In-Reply-To: <201205221619.31738.mfick@codeaurora.org> (Martin Fick's message of "Tue, 22 May 2012 16:19:30 -0600")
Martin Fick <mfick@codeaurora.org> writes:
> Android consists of approximately 400 projects, and if
> anyone tags their builds regularly, that means that they
> will be tagging 400 projects per build. We have something
> like 10K tags on average across 400 projects, so when we do
> a simple No Op sync just to see if all 400 projects are up
> to date, this yields about 200MB of data over the wire (4M
> refs)!!!
> ...
> As you can imagine, we really would like to see this
> eliminated from our workflows. If we want to check 400
> projects to see if they are up to date, it should be 400
> refs, not 400 * 10K.
I think we can ignore that 400 part from the above for now. As they are
independent "projects", it is unreasonable to expect that any solution
would add costs less than linearly when you add more projects. But it
should be possible to make the cost of update discovery comparable between
the cases where you have 10K existing refs that both ends have seen and
one ref got updated vs you started from 20K common refs. The latter does
not have to cost twice (or 4 times if an inefficient way is quadratic).
I think the assumption in your build-bot scenario needs to be a bit more
clarified to give context. I am assuming, when you say "see if projects
are up to date", you mean your build-bot is interested in a single branch
in each repository, possibly together with new tags that have become
reachable from the updated tip of the branch (if the branch tip moved). I
also assume that the build-bot polls very often and that is why 200MB (or
divided by 400 it would be 0.5MB, which still is a lot when you talk about
a single repository) hurts a lot.
Everything below is what I think you already know about; I am giving an
overview of _my_ understanding of the issue for the benefit of others (and
have you sanity check if I misunderstood your pain points).
Even an attempt to optimize "can we skip updating" by asking "ls-remote"
about a single branch is futile with the current protocol, as we expect
the server to respond with the full ref advertisement and filter out the
result on our end. In the upload-pack protocol, the serving side talks
first and that is "here are all the refs I have and their values". The
other side that asks for objects does not have a way to tell it that it is
only interested in a subset, even if it wanted to. Unlike the capability
exchange that happens after the initial connection, there is no gap in
the protocol for the side that initiates the connection to tell the
serving side to skip/delay the initial ref advertisement.
What the above means is that the transition has to happen by keeping the
current upload-pack and a new protocol has to be invoked by a different
program ("upload-pack-2" or something needs to be taught to the git-daemon
and also invoked by ssh) even if an updated protocol is designed. The
updated connection initiator (i.e. ls-remote and fetch) may be able to try
upload-pack-2 first and fall back to upload-pack after getting a failure,
but afaik nobody figured out if this is doable by checking if a failure
signal that comes from currently deployed software is detailed enough for
our side to tell if the failure is because the other end does not support
upload-pack-2 or because of some other failure (e.g. auth failed, no such
repo, etc.). Regardless of auto-fallback, an updated connection initiator
needs a configuration switch to be explicitly told which protocol to talk
to with which remote.
What exact protocol "upload-pack-2" talks may be an interesting subject on
its own. It may still have the serving side talk first (probably show its
capabilities and tips of branches if there are not too many of them), or
it may instead let the connection initiator talk first and ask for
specific set of refs. But that is of lessor importance from the
high-level point of view and I won't go into the details in this thread.
next prev parent reply other threads:[~2012-05-22 23:23 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-05-21 8:13 remove_duplicates() in builtin/fetch-pack.c is O(N^2) Michael Haggerty
2012-05-21 9:09 ` Junio C Hamano
2012-05-21 9:42 ` demerphq
2012-05-21 17:45 ` Jeff King
2012-05-21 22:14 ` Jeff King
2012-05-21 22:17 ` [PATCH 1/5] fetch-pack: sort incoming heads Jeff King
2012-05-22 20:08 ` Junio C Hamano
2012-05-22 20:23 ` Jeff King
2012-05-24 6:04 ` Jeff King
2012-05-21 22:17 ` [PATCH 2/5] fetch-pack: avoid quadratic behavior in remove_duplicates Jeff King
2012-05-21 22:19 ` [PATCH 3/5] add sorting infrastructure for list refs Jeff King
2012-05-21 22:19 ` [PATCH 4/5] fetch-pack: sort the list of incoming refs Jeff King
2012-05-21 22:23 ` [PATCH 5/5] fetch-pack: avoid quadratic loop in filter_refs Jeff King
2012-05-22 20:16 ` Junio C Hamano
2012-05-21 23:52 ` remove_duplicates() in builtin/fetch-pack.c is O(N^2) Jeff King
2012-05-22 0:07 ` Jeff King
2012-05-22 3:59 ` Michael Haggerty
2012-05-22 4:11 ` Jeff King
2012-05-22 7:15 ` Michael Haggerty
2012-05-22 7:37 ` Jeff King
2012-05-22 13:28 ` Michael Haggerty
2012-05-22 17:33 ` Jeff King
2012-05-24 12:05 ` Michael Haggerty
2012-05-25 0:17 ` Martin Fick
2012-05-25 0:39 ` Jeff King
2012-05-25 0:54 ` Martin Fick
2012-05-25 1:04 ` Jeff King
2012-05-25 1:32 ` Martin Fick
2012-05-25 6:50 ` Jeff King
2012-05-22 12:18 ` Nguyen Thai Ngoc Duy
2012-05-22 13:35 ` Michael Haggerty
2012-05-22 17:01 ` Jeff King
2012-05-22 17:35 ` Junio C Hamano
2012-05-22 17:46 ` Jeff King
2012-05-24 4:54 ` Michael Haggerty
2012-05-23 1:20 ` Nguyen Thai Ngoc Duy
2012-05-22 16:16 ` Junio C Hamano
2012-05-21 18:15 ` Martin Fick
2012-05-21 19:41 ` Jeff King
2012-05-21 22:08 ` Junio C Hamano
2012-05-21 22:24 ` Jeff King
2012-05-22 5:51 ` Martin Fick
2012-05-22 18:21 ` Jeff King
2012-05-22 22:19 ` Martin Fick
2012-05-22 23:23 ` Junio C Hamano [this message]
2012-05-23 0:46 ` Martin Fick
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7v8vgj3imy.fsf@alter.siamese.dyndns.org \
--to=gitster@pobox.com \
--cc=git@vger.kernel.org \
--cc=mfick@codeaurora.org \
--cc=mhagger@alum.mit.edu \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).