Re: How to still kill git fetch with too many refs

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jeff King <peff@peff.net>
To: Martin Fick <mfick@codeaurora.org>
Cc: git@vger.kernel.org
Subject: Re: How to still kill git fetch with too many refs
Date: Tue, 2 Jul 2013 00:07:58 -0400	[thread overview]
Message-ID: <20130702040758.GA7068@sigill.intra.peff.net> (raw)
In-Reply-To: <201307012102.31384.mfick@codeaurora.org>

On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote:

> A simple synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days.  However, in our
> experience with our internal git repo, we still have performance
> issues related to having too many refs, in our kernel/msm instance we
> have around 400K.

I'm not too surprised. There's O(n^2) behavior in fetch-pack's
mark_complete function, as it adds each of the N ref tips to a
commit_list, sorting by date (so each insertion is O(n)).

I posted two alternative patches in May of 2011.  The first simply
avoids adding duplicate objects, which is simple and covers many
real-world cases (e.g., an "alternates" repository which has a bunch of
copies of the same tags, one per fork). The second one switches the
commit_list out for a heap-based priority queue.

We ended up taking the first (as ea5f220), since it was trivial and
obviously correct, but didn't bother with the second since:

  1. There had been no real-world reports of it.

  2. While in theory a priority queue implementation would be used in
     other spots, too, it ended up being a pain to use it, as most of
     the callers wanted list-like splicing.

You can see the original here:

  http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005

Though it probably doesn't apply cleanly anymore.  However, I've kept it
rebased over the years at:

  git://github.com/peff/git.git jk/fast-commit-list

Junio recently added a priority queue implementation in b4b594a
(prio-queue: priority queue of pointers to structs, 2013-06-06), which
is currently in next. So a modern version of that series would build on
top of that, rather than my priority queue.

And yet another alternative would be to keep the list unsorted during
the mark_complete calls, and then sort it at the end. Like this:

diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
 		struct commit *commit = (struct commit *)o;
 		if (!(commit->object.flags & COMPLETE)) {
 			commit->object.flags |= COMPLETE;
-			commit_list_insert_by_date(commit, &complete);
+			commit_list_insert(commit, &complete);
 		}
 	}
 	return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
 	if (!args->depth) {
 		for_each_ref(mark_complete, NULL);
 		for_each_alternate_ref(mark_alternate_complete, NULL);
+		commit_list_sort_by_date(&complete);
 		if (cutoff)
 			mark_recent_complete_commits(args, cutoff);
 	}

(If you're wondering why we didn't do this trivial bit at the time, it
was because back then we did not yet have the René's nice linked-list
mergesort that backs commit_list_sort_by_date).

> The result, a copy of linus' repo with a million unique 
> valid refs and a git fetch of a single updated ref taking a 
> very long time (55mins and it did not complete yet).  Note, 
> with 100K refs it completes in about 2m40s.  It is likely 
> not linear since 2m40s * 10 would be ~26m (but the 
> difference could also just be how the data in the sha1s are 
> ordered).

That sounds like the O(n^2) problem. My timings back then with 100K refs
were 1-2 minutes. Does the patch above fix it for you?

-Peff

next prev parent reply	other threads:[~2013-07-02  4:07 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-02  3:02 How to still kill git fetch with too many refs Martin Fick
2013-07-02  4:07 ` Jeff King [this message]
2013-07-02  4:41   ` Jeff King
2013-07-02  5:01     ` Jeff King
2013-07-02  5:19       ` Junio C Hamano
2013-07-02  5:28         ` Jeff King
2013-07-02  6:11           ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Jeff King
2013-07-02  6:16             ` [PATCH 1/3] fetch-pack: avoid quadratic list insertion in mark_complete Jeff King
2013-07-02  6:21             ` [PATCH 2/3] commit.c: make compare_commits_by_commit_date global Jeff King
2013-07-02  6:24             ` [PATCH 3/3] fetch-pack: avoid quadratic behavior in rev_list_push Jeff King
2013-07-02  7:52               ` Eric Sunshine
2013-07-02 17:45             ` [PATCH 0/3] avoid quadratic behavior in fetch-pack Martin Fick
2013-07-02 17:52       ` How to still kill git fetch with too many refs Brandon Casey
2013-07-02  9:24 ` Michael Haggerty
2013-07-02 16:58   ` Martin Fick

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:abe5ffb dfblob:4df8abd )
 OR (
bs:"Re: How to still kill git fetch with too many refs" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130702040758.GA7068@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=mfick@codeaurora.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).