git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] Fix ugly magic special case in exact rename detection
@ 2007-10-26 23:51 ` Linus Torvalds
  2007-10-26 23:56   ` [PATCH 2/2] Do the fuzzy rename detection limits with the exact renames removed Linus Torvalds
  0 siblings, 1 reply; 3+ messages in thread
From: Linus Torvalds @ 2007-10-26 23:51 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


For historical reasons, the exact rename detection had populated the
filespecs for the entries it compared, and the rest of the similarity
analysis depended on that.  I hadn't even bothered to debug why that was
the case when I re-did the rename detection, I just made the new one
have the same broken behaviour, with a note about this special case.

This fixes that fixme.  The reason the exact rename detector needed to
fill in the file sizes of the files it checked was that the _inexact_
rename detector was broken, and started comparing file sizes before it
filled them in.

Fixing that allows the exact phase to do the sane thing of never even
caring (since all *it* cares about is really just the SHA1 itself, not
the size nor the contents).

It turns out that this also indirectly fixes a bug: trying to populate
all the filespecs will run out of virtual memory if there is tons and
tons of possible rename options.  The fuzzy similarity analysis does the
right thing in this regard, and free's the blob info after it has
generated the hash tables, so the special case code caused more trouble
than just some extra illogical code.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Without this, when I try to do a "git commit --amend" on my 
horror-repository with the commit that moved around a hundred thousand 
files, I actually have "git runstatus" die() on me, because xmmap() fails 
(mmap() returns NULL due to running out of vm mappings).

Not to mention that this is "obviously correct" (tm) and the "right thing" 
(tm)" to do (famous last words). It's wrong of estimate_similarity() to 
try to compare the sizes of the filespec entries before they have been 
filled in, and it really shouldn't expect the exact rename detection to 
fill them in, since the exact rename detection doesn't even care!

 diffcore-rename.c |   27 ++++++++++++++-------------
 1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3946932..7ed5ef8 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -144,6 +144,20 @@ static int estimate_similarity(struct diff_filespec *src,
 	if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
 		return 0;
 
+	/*
+	 * Need to check that source and destination sizes are
+	 * filled in before comparing them.
+	 *
+	 * If we already have "cnt_data" filled in, we know it's
+	 * all good (avoid checking the size for zero, as that
+	 * is a possible size - we really should have a flag to
+	 * say whether the size is valid or not!)
+	 */
+	if (!src->cnt_data && diff_populate_filespec(src, 0))
+		return 0;
+	if (!dst->cnt_data && diff_populate_filespec(dst, 0))
+		return 0;
+
 	max_size = ((src->size > dst->size) ? src->size : dst->size);
 	base_size = ((src->size < dst->size) ? src->size : dst->size);
 	delta_size = max_size - base_size;
@@ -159,11 +173,6 @@ static int estimate_similarity(struct diff_filespec *src,
 	if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
 		return 0;
 
-	if ((!src->cnt_data && diff_populate_filespec(src, 0))
-		|| (!dst->cnt_data && diff_populate_filespec(dst, 0)))
-		return 0; /* error but caught downstream */
-
-
 	delta_limit = (unsigned long)
 		(base_size * (MAX_SCORE-minimum_score) / MAX_SCORE);
 	if (diffcore_count_changes(src, dst,
@@ -270,19 +279,11 @@ static int find_identical_files(struct file_similarity *src,
 	return renames;
 }
 
-/*
- * Note: the rest of the rename logic depends on this
- * phase also populating all the filespecs for any
- * entry that isn't matched up with an exact rename.
- */
 static void free_similarity_list(struct file_similarity *p)
 {
 	while (p) {
 		struct file_similarity *entry = p;
 		p = p->next;
-
-		/* Stupid special case, see note above! */
-		diff_populate_filespec(entry->filespec, 0);
 		free(entry);
 	}
 }

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH 2/2] Do the fuzzy rename detection limits with the exact renames removed
  2007-10-26 23:51 ` [PATCH 1/2] Fix ugly magic special case in exact rename detection Linus Torvalds
@ 2007-10-26 23:56   ` Linus Torvalds
  2007-10-27  0:12     ` Linus Torvalds
  0 siblings, 1 reply; 3+ messages in thread
From: Linus Torvalds @ 2007-10-26 23:56 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


When we do the fuzzy rename detection, we don't care about the
destinations that we already handled with the exact rename detector.
And, in fact, the code already knew that - but the rename limiter, which
used to run *before* exact renames were detected, did not.

This fixes it so that the rename detection limiter now bases its
decisions on the *remaining* rename counts, rather than the original
ones.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Pretty obvious. It just moves the tests around a bit, and uses the updated 
counters.

Works For Me(tm), although this isn't all that obviously testable (ie I 
should find a test that is border-line, and triggers the rename detection 
limits, but then has enough exact renames that the rename detection array 
goes away).

 diffcore-rename.c |   32 ++++++++++++++++++--------------
 1 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7ed5ef8..f9ebea5 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -435,33 +435,37 @@ void diffcore_rename(struct diff_options *options)
 	 */
 	rename_count = find_exact_renames();
 
+	/* Did we only want exact renames? */
+	if (minimum_score == MAX_SCORE)
+		goto cleanup;
+
+	/*
+	 * Calculate how many renames are left (but all the source
+	 * files still remain as options for rename/copies!)
+	 */
+	num_create = (rename_dst_nr - rename_count);
+	num_src = rename_src_nr;
+
+	/* All done? */
+	if (!num_create)
+		goto cleanup;
+
 	/*
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
-	 *    rename_dst_nr * rename_src_nr > rename_limit * rename_limit
+	 *    num_create * num_src > rename_limit * rename_limit
 	 *
 	 * but handles the potential overflow case specially (and we
 	 * assume at least 32-bit integers)
 	 */
 	if (rename_limit <= 0 || rename_limit > 32767)
 		rename_limit = 32767;
-	if (rename_dst_nr > rename_limit && rename_src_nr > rename_limit)
+	if (num_create > rename_limit && num_src > rename_limit)
 		goto cleanup;
-	if (rename_dst_nr * rename_src_nr > rename_limit * rename_limit)
+	if (num_create * num_src > rename_limit * rename_limit)
 		goto cleanup;
 
-	/* Have we run out the created file pool?  If so we can avoid
-	 * doing the delta matrix altogether.
-	 */
-	if (rename_count == rename_dst_nr)
-		goto cleanup;
-
-	if (minimum_score == MAX_SCORE)
-		goto cleanup;
-
-	num_create = (rename_dst_nr - rename_count);
-	num_src = rename_src_nr;
 	mx = xmalloc(sizeof(*mx) * num_create * num_src);
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
 		int base = dst_cnt * num_src;

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH 2/2] Do the fuzzy rename detection limits with the exact renames removed
  2007-10-26 23:56   ` [PATCH 2/2] Do the fuzzy rename detection limits with the exact renames removed Linus Torvalds
@ 2007-10-27  0:12     ` Linus Torvalds
  0 siblings, 0 replies; 3+ messages in thread
From: Linus Torvalds @ 2007-10-27  0:12 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


On Fri, 26 Oct 2007, Linus Torvalds wrote:
> 
> Works For Me(tm), although this isn't all that obviously testable (ie I 
> should find a test that is border-line, and triggers the rename detection 
> limits, but then has enough exact renames that the rename detection array 
> goes away).

Actually, I just tested it. I used that same 100,000 file thing, but added 
one more (larger) file, and did another commit that moved the 100,000 
files exactly, and the one larger file with a small change.

The code before this patch (but with my linear-time rename changes) would 
find the 100,000 exact renames, but would *not* find the one that had a 
small change, because it would hit the rename limit, and wouldn't even 
try.

With these two patches in place, it finds all the exact renames, and once 
it has done that, the resulting rename array is small enough (one single 
unknown target remaining, even if there are 100,001 possible source files) 
that it doesn't trigger the rename limit, and it now finds the remaining 
non-exact rename too.

So it not only looked obvious, it seems to work too.

		Linus

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2007-10-27  0:13 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <alpine.LFD.0.999.0710261646230.30120@woody.linux-foundation.or g>
2007-10-26 23:51 ` [PATCH 1/2] Fix ugly magic special case in exact rename detection Linus Torvalds
2007-10-26 23:56   ` [PATCH 2/2] Do the fuzzy rename detection limits with the exact renames removed Linus Torvalds
2007-10-27  0:12     ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).