git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Jeff King <peff@peff.net>
Cc: Adrian Bunk <bunk@kernel.org>, git@vger.kernel.org
Subject: Re: git-revert is a memory hog
Date: Tue, 29 Jan 2008 15:19:28 -0800	[thread overview]
Message-ID: <7vwspskynz.fsf@gitster.siamese.dyndns.org> (raw)
In-Reply-To: <7vfxwgmf87.fsf@gitster.siamese.dyndns.org> (Junio C. Hamano's message of "Tue, 29 Jan 2008 14:36:24 -0800")

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> On Wed, Jan 30, 2008 at 08:51:09AM +1100, Linus Torvalds wrote:
>>
>>> I definitely can reproduce it, it's horrid.
>>> 
>>> This is from "top" fairly late in the game, but with the thing not even 
>>> done yet. Current git, pretty much fully (and fairly aggressively) packed 
>>> current kernel repo, and using "diff.renamelmit=0".
>>
>> Hrm, setting diff.renamelimit to 0 lets me reproduce (I thought I tried
>> it before, but clearly not...).

How about this fairly obvious patch?

We used to record, for N=src deleted paths and M=dst created
paths, (N x M) "struct diff_score" that records how similar each
of the pair is, and picked the <src,dst> pairs that gives the
best match first, and then went on to worse matches.  This
sorting is so that two destinations are both found to be similar
to a single source, we can process the more similar one first,
and when processing the second one, it can notice "Ah, the
source I was planning to say I am a copy of is already taken by
somebody else" and continue on to match himself with another
source with a lessor match (this matters to a change introduced
between 1.5.3.X series and 1.5.4-rc, that lets the code to favor
unused matches first and then falls back to using already used
matches).

This instead allocates and keeps only M records in core.  For
each dst, we compute similarlity with all sources (so the number
of similarity estimate computations we do is still N x M), but
we keep the best src for each dst.  This is essentially to save
memory drastically by giving up to come up with better pairing.

I guess we could keep a handful best candidates per dst, instead
of just one, to further improve on this approach, and such a
change should be fairly straightforward.

---
 diffcore-rename.c |   46 +++++++++++++++++++++++-----------------------
 1 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3d37725..7e8fdcd 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -473,42 +473,42 @@ void diffcore_rename(struct diff_options *options)
 	if (num_create * num_src > rename_limit * rename_limit)
 		goto cleanup;
 
-	mx = xmalloc(sizeof(*mx) * num_create * num_src);
+
+	mx = xmalloc(sizeof(*mx) * num_create);
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		int base = dst_cnt * num_src;
 		struct diff_filespec *two = rename_dst[i].two;
+		struct diff_score *m;
+
 		if (rename_dst[i].pair)
 			continue; /* dealt with exact match already. */
+
+		m = &mx[dst_cnt];
+		m->dst = -1;
 		for (j = 0; j < rename_src_nr; j++) {
 			struct diff_filespec *one = rename_src[j].one;
-			struct diff_score *m = &mx[base+j];
-			m->src = j;
-			m->dst = i;
-			m->score = estimate_similarity(one, two,
-						       minimum_score);
-			m->name_score = basename_same(one, two);
+			int score, name_score;
+			score = estimate_similarity(one, two,
+						    minimum_score);
+			name_score = basename_same(one, two);
+			if (m->dst < 0 ||
+			    score > m->score ||
+			    (score == m->score &&
+			     name_score > m->name_score)) {
+				m->score = score;
+				m->name_score = name_score;
+				m->src = j;
+				m->dst = i;
+			}
 			diff_free_filespec_blob(one);
 		}
 		/* We do not need the text anymore */
 		diff_free_filespec_blob(two);
 		dst_cnt++;
 	}
+
 	/* cost matrix sorted by most to least similar pair */
-	qsort(mx, num_create * num_src, sizeof(*mx), score_compare);
-	for (i = 0; i < num_create * num_src; i++) {
-		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
-		struct diff_filespec *src;
-		if (dst->pair)
-			continue; /* already done, either exact or fuzzy. */
-		if (mx[i].score < minimum_score)
-			break; /* there is no more usable pair. */
-		src = rename_src[mx[i].src].one;
-		if (src->rename_used)
-			continue;
-		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
-		rename_count++;
-	}
-	for (i = 0; i < num_create * num_src; i++) {
+	qsort(mx, num_create, sizeof(*mx), score_compare);
+	for (i = 0; i < num_create; i++) {
 		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
 		if (dst->pair)
 			continue; /* already done, either exact or fuzzy. */

  parent reply	other threads:[~2008-01-29 23:20 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-01-27 17:27 git-revert is a memory hog Adrian Bunk
2008-01-27 17:38 ` Shawn O. Pearce
2008-01-28  6:01   ` Jeff King
2008-01-28  5:59 ` Jeff King
2008-01-29 21:51   ` Linus Torvalds
2008-01-29 22:15     ` Junio C Hamano
2008-01-29 22:20     ` Jeff King
2008-01-29 22:30       ` Jeff King
2008-01-29 22:36       ` Junio C Hamano
2008-01-29 22:45         ` Jeff King
2008-01-29 22:51           ` Jeff King
2008-01-29 22:49         ` Linus Torvalds
2008-01-29 22:54           ` Jeff King
2008-01-29 22:53         ` Junio C Hamano
2008-01-29 22:57           ` Jeff King
2008-01-29 23:19         ` Junio C Hamano [this message]
2008-01-29 23:50           ` Junio C Hamano
2008-01-30  4:40           ` [PATCH] Optimize rename detection for a huge diff Junio C Hamano
2008-01-30  6:57             ` Luke Lu
2008-01-30  7:24               ` Luke Lu
2008-02-13  9:53             ` Junio C Hamano
2008-02-13 10:19               ` David Kastrup
2008-02-13 10:23                 ` Junio C Hamano
2008-02-14  3:00               ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7vwspskynz.fsf@gitster.siamese.dyndns.org \
    --to=gitster@pobox.com \
    --cc=bunk@kernel.org \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).