From: Junio C Hamano <gitster@pobox.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
Jeff King <peff@peff.net>
Cc: Adrian Bunk <bunk@kernel.org>, git@vger.kernel.org
Subject: Re: git-revert is a memory hog
Date: Tue, 29 Jan 2008 15:19:28 -0800 [thread overview]
Message-ID: <7vwspskynz.fsf@gitster.siamese.dyndns.org> (raw)
In-Reply-To: <7vfxwgmf87.fsf@gitster.siamese.dyndns.org> (Junio C. Hamano's message of "Tue, 29 Jan 2008 14:36:24 -0800")
Junio C Hamano <gitster@pobox.com> writes:
> Jeff King <peff@peff.net> writes:
>
>> On Wed, Jan 30, 2008 at 08:51:09AM +1100, Linus Torvalds wrote:
>>
>>> I definitely can reproduce it, it's horrid.
>>>
>>> This is from "top" fairly late in the game, but with the thing not even
>>> done yet. Current git, pretty much fully (and fairly aggressively) packed
>>> current kernel repo, and using "diff.renamelmit=0".
>>
>> Hrm, setting diff.renamelimit to 0 lets me reproduce (I thought I tried
>> it before, but clearly not...).
How about this fairly obvious patch?
We used to record, for N=src deleted paths and M=dst created
paths, (N x M) "struct diff_score" that records how similar each
of the pair is, and picked the <src,dst> pairs that gives the
best match first, and then went on to worse matches. This
sorting is so that two destinations are both found to be similar
to a single source, we can process the more similar one first,
and when processing the second one, it can notice "Ah, the
source I was planning to say I am a copy of is already taken by
somebody else" and continue on to match himself with another
source with a lessor match (this matters to a change introduced
between 1.5.3.X series and 1.5.4-rc, that lets the code to favor
unused matches first and then falls back to using already used
matches).
This instead allocates and keeps only M records in core. For
each dst, we compute similarlity with all sources (so the number
of similarity estimate computations we do is still N x M), but
we keep the best src for each dst. This is essentially to save
memory drastically by giving up to come up with better pairing.
I guess we could keep a handful best candidates per dst, instead
of just one, to further improve on this approach, and such a
change should be fairly straightforward.
---
diffcore-rename.c | 46 +++++++++++++++++++++++-----------------------
1 files changed, 23 insertions(+), 23 deletions(-)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3d37725..7e8fdcd 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -473,42 +473,42 @@ void diffcore_rename(struct diff_options *options)
if (num_create * num_src > rename_limit * rename_limit)
goto cleanup;
- mx = xmalloc(sizeof(*mx) * num_create * num_src);
+
+ mx = xmalloc(sizeof(*mx) * num_create);
for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
- int base = dst_cnt * num_src;
struct diff_filespec *two = rename_dst[i].two;
+ struct diff_score *m;
+
if (rename_dst[i].pair)
continue; /* dealt with exact match already. */
+
+ m = &mx[dst_cnt];
+ m->dst = -1;
for (j = 0; j < rename_src_nr; j++) {
struct diff_filespec *one = rename_src[j].one;
- struct diff_score *m = &mx[base+j];
- m->src = j;
- m->dst = i;
- m->score = estimate_similarity(one, two,
- minimum_score);
- m->name_score = basename_same(one, two);
+ int score, name_score;
+ score = estimate_similarity(one, two,
+ minimum_score);
+ name_score = basename_same(one, two);
+ if (m->dst < 0 ||
+ score > m->score ||
+ (score == m->score &&
+ name_score > m->name_score)) {
+ m->score = score;
+ m->name_score = name_score;
+ m->src = j;
+ m->dst = i;
+ }
diff_free_filespec_blob(one);
}
/* We do not need the text anymore */
diff_free_filespec_blob(two);
dst_cnt++;
}
+
/* cost matrix sorted by most to least similar pair */
- qsort(mx, num_create * num_src, sizeof(*mx), score_compare);
- for (i = 0; i < num_create * num_src; i++) {
- struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
- struct diff_filespec *src;
- if (dst->pair)
- continue; /* already done, either exact or fuzzy. */
- if (mx[i].score < minimum_score)
- break; /* there is no more usable pair. */
- src = rename_src[mx[i].src].one;
- if (src->rename_used)
- continue;
- record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
- rename_count++;
- }
- for (i = 0; i < num_create * num_src; i++) {
+ qsort(mx, num_create, sizeof(*mx), score_compare);
+ for (i = 0; i < num_create; i++) {
struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
if (dst->pair)
continue; /* already done, either exact or fuzzy. */
next prev parent reply other threads:[~2008-01-29 23:20 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-01-27 17:27 git-revert is a memory hog Adrian Bunk
2008-01-27 17:38 ` Shawn O. Pearce
2008-01-28 6:01 ` Jeff King
2008-01-28 5:59 ` Jeff King
2008-01-29 21:51 ` Linus Torvalds
2008-01-29 22:15 ` Junio C Hamano
2008-01-29 22:20 ` Jeff King
2008-01-29 22:30 ` Jeff King
2008-01-29 22:36 ` Junio C Hamano
2008-01-29 22:45 ` Jeff King
2008-01-29 22:51 ` Jeff King
2008-01-29 22:49 ` Linus Torvalds
2008-01-29 22:54 ` Jeff King
2008-01-29 22:53 ` Junio C Hamano
2008-01-29 22:57 ` Jeff King
2008-01-29 23:19 ` Junio C Hamano [this message]
2008-01-29 23:50 ` Junio C Hamano
2008-01-30 4:40 ` [PATCH] Optimize rename detection for a huge diff Junio C Hamano
2008-01-30 6:57 ` Luke Lu
2008-01-30 7:24 ` Luke Lu
2008-02-13 9:53 ` Junio C Hamano
2008-02-13 10:19 ` David Kastrup
2008-02-13 10:23 ` Junio C Hamano
2008-02-14 3:00 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7vwspskynz.fsf@gitster.siamese.dyndns.org \
--to=gitster@pobox.com \
--cc=bunk@kernel.org \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).