[PATCH 0/6] Cleaned-up rename detection patch-series

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] Cleaned-up rename detection patch-series
@ 2007-10-25 18:15 ` Linus Torvalds
  2007-10-25 18:16   ` [PATCH 1/6] Add 'diffcore.h' to LIB_H Linus Torvalds
                     ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 18:15 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List

Ok, here's the patch-series to do rename detection of exact renames in 
linear time, except it's cleaned up into a nice series of 6 patches. The 
end result is identical to the previous patches (which got smushed down 
into just two patches in Junio's tree), apart from a fixed dependency in 
the Makefile that caused me grief and a broken test-suite due to some 
object files simply not being correctly recompiled.

The end result should be more understandable this way.

		Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/6] Add 'diffcore.h' to LIB_H
  2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
@ 2007-10-25 18:16   ` Linus Torvalds
  2007-10-25 18:17   ` [PATCH 2/6] Split out "exact content match" phase of rename detection Linus Torvalds
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 18:16 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 1/6] Add 'diffcore.h' to LIB_H

The diffcore.h header file is included by more than just the internal
diff generation files, and needs to be part of the proper dependencies.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This caused me pain.

I spent way too much time trying to debug something that wasn't buggy to 
begin with (which made it really hard to see the bug ;).

We really should have a proper (automatic) dependency generation phase. 
This patch does *not* do that, it just fixes up the crappy stuff we have 
now.

 Makefile |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index b728920..8a0082d 100644
--- a/Makefile
+++ b/Makefile
@@ -290,7 +290,7 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h transport.h
+	mailmap.h remote.h transport.h diffcore.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -917,7 +917,6 @@ git-http-push$X: revision.o http.o http-push.o $(GITLIBS)
 
 $(LIB_OBJS) $(BUILTIN_OBJS): $(LIB_H)
 $(patsubst git-%$X,%.o,$(PROGRAMS)): $(LIB_H) $(wildcard */*.h)
-$(DIFF_OBJS): diffcore.h
 
 $(LIB_FILE): $(LIB_OBJS)
 	$(QUIET_AR)$(RM) $@ && $(AR) rcs $@ $(LIB_OBJS)
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/6] Split out "exact content match" phase of rename detection
  2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
  2007-10-25 18:16   ` [PATCH 1/6] Add 'diffcore.h' to LIB_H Linus Torvalds
@ 2007-10-25 18:17   ` Linus Torvalds
  2007-10-25 18:19   ` [PATCH 3/6] Ref-count the filespecs used by diffcore Linus Torvalds
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 18:17 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 2/6] Split out "exact content match" phase of rename detection

This makes the exact content match a separate function of its own.
Partly to cut down a bit on the size of the diffcore_rename() function
(which is too complex as it is), and partly because there are smarter
ways to do this than an O(m*n) loop over it all, and that function
should be rewritten to take that into account.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This is, I think, identical to the patch you already have. Feel free to 
skip this, and base the rest of the patch-series on your existing one.

 diffcore-rename.c |   90 +++++++++++++++++++++++++++++++++--------------------
 1 files changed, 56 insertions(+), 34 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 142e537..2077a9b 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -262,6 +262,58 @@ static int compute_stays(struct diff_queue_struct *q,
 	return 1;
 }
 
+/*
+ * Find exact renames first.
+ *
+ * The first round matches up the up-to-date entries,
+ * and then during the second round we try to match
+ * cache-dirty entries as well.
+ *
+ * Note: the rest of the rename logic depends on this
+ * phase also populating all the filespecs for any
+ * entry that isn't matched up with an exact rename,
+ * see "is_exact_match()".
+ */
+static int find_exact_renames(void)
+{
+	int rename_count = 0;
+	int contents_too;
+
+	for (contents_too = 0; contents_too < 2; contents_too++) {
+		int i;
+
+		for (i = 0; i < rename_dst_nr; i++) {
+			struct diff_filespec *two = rename_dst[i].two;
+			int j;
+
+			if (rename_dst[i].pair)
+				continue; /* dealt with an earlier round */
+			for (j = 0; j < rename_src_nr; j++) {
+				int k;
+				struct diff_filespec *one = rename_src[j].one;
+				if (!is_exact_match(one, two, contents_too))
+					continue;
+
+				/* see if there is a basename match, too */
+				for (k = j; k < rename_src_nr; k++) {
+					one = rename_src[k].one;
+					if (basename_same(one, two) &&
+						is_exact_match(one, two,
+							contents_too)) {
+						j = k;
+						break;
+					}
+				}
+
+				record_rename_pair(i, j, (int)MAX_SCORE);
+				rename_count++;
+				break; /* we are done with this entry */
+			}
+		}
+	}
+	return rename_count;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -270,12 +322,11 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_queue_struct *q = &diff_queued_diff;
 	struct diff_queue_struct outq;
 	struct diff_score *mx;
-	int i, j, rename_count, contents_too;
+	int i, j, rename_count;
 	int num_create, num_src, dst_cnt;
 
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
-	rename_count = 0;
 
 	for (i = 0; i < q->nr; i++) {
 		struct diff_filepair *p = q->queue[i];
@@ -318,40 +369,11 @@ void diffcore_rename(struct diff_options *options)
 	if (rename_dst_nr * rename_src_nr > rename_limit * rename_limit)
 		goto cleanup;
 
-	/* We really want to cull the candidates list early
+	/*
+	 * We really want to cull the candidates list early
 	 * with cheap tests in order to avoid doing deltas.
-	 * The first round matches up the up-to-date entries,
-	 * and then during the second round we try to match
-	 * cache-dirty entries as well.
 	 */
-	for (contents_too = 0; contents_too < 2; contents_too++) {
-		for (i = 0; i < rename_dst_nr; i++) {
-			struct diff_filespec *two = rename_dst[i].two;
-			if (rename_dst[i].pair)
-				continue; /* dealt with an earlier round */
-			for (j = 0; j < rename_src_nr; j++) {
-				int k;
-				struct diff_filespec *one = rename_src[j].one;
-				if (!is_exact_match(one, two, contents_too))
-					continue;
-
-				/* see if there is a basename match, too */
-				for (k = j; k < rename_src_nr; k++) {
-					one = rename_src[k].one;
-					if (basename_same(one, two) &&
-						is_exact_match(one, two,
-							contents_too)) {
-						j = k;
-						break;
-					}
-				}
-
-				record_rename_pair(i, j, (int)MAX_SCORE);
-				rename_count++;
-				break; /* we are done with this entry */
-			}
-		}
-	}
+	rename_count = find_exact_renames();
 
 	/* Have we run out the created file pool?  If so we can avoid
 	 * doing the delta matrix altogether.
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/6] Ref-count the filespecs used by diffcore
  2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
  2007-10-25 18:16   ` [PATCH 1/6] Add 'diffcore.h' to LIB_H Linus Torvalds
  2007-10-25 18:17   ` [PATCH 2/6] Split out "exact content match" phase of rename detection Linus Torvalds
@ 2007-10-25 18:19   ` Linus Torvalds
  2007-10-25 18:20   ` [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops Linus Torvalds
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 18:19 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List



From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 3/6] Ref-count the filespecs used by diffcore

Rather than copy the refcounts when introducing new versions of them
(for rename or copy detection), use a refcount and increment the count
when reusing the diff_filespec.

This avoids unnecessary allocations, but the real reason behind this is
a future enhancement: we will want to track shared data across the
copy/rename detection.  In order to efficiently notice when a filespec
is used by a rename, the rename machinery wants to keep track of a
rename usage count which is shared across all different users of the
filespec.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This was sent out in the original series as a fix on top of broken code. 
Now it sets up the infrastructure so that the next patch won't be broken.

 diff.c            |   15 +++++++++++----
 diffcore-rename.c |   16 ++++++----------
 diffcore.h        |    2 ++
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/diff.c b/diff.c
index dfb8595..0b320f6 100644
--- a/diff.c
+++ b/diff.c
@@ -1440,9 +1440,18 @@ struct diff_filespec *alloc_filespec(const char *path)
 	memset(spec, 0, sizeof(*spec));
 	spec->path = (char *)(spec + 1);
 	memcpy(spec->path, path, namelen+1);
+	spec->count = 1;
 	return spec;
 }
 
+void free_filespec(struct diff_filespec *spec)
+{
+	if (!--spec->count) {
+		diff_free_filespec_data(spec);
+		free(spec);
+	}
+}
+
 void fill_filespec(struct diff_filespec *spec, const unsigned char *sha1,
 		   unsigned short mode)
 {
@@ -2435,10 +2444,8 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *queue,
 
 void diff_free_filepair(struct diff_filepair *p)
 {
-	diff_free_filespec_data(p->one);
-	diff_free_filespec_data(p->two);
-	free(p->one);
-	free(p->two);
+	free_filespec(p->one);
+	free_filespec(p->two);
 	free(p);
 }
 
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 2077a9b..3da06b7 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -209,21 +209,19 @@ static int estimate_similarity(struct diff_filespec *src,
 
 static void record_rename_pair(int dst_index, int src_index, int score)
 {
-	struct diff_filespec *one, *two, *src, *dst;
+	struct diff_filespec *src, *dst;
 	struct diff_filepair *dp;
 
 	if (rename_dst[dst_index].pair)
 		die("internal error: dst already matched.");
 
 	src = rename_src[src_index].one;
-	one = alloc_filespec(src->path);
-	fill_filespec(one, src->sha1, src->mode);
+	src->count++;
 
 	dst = rename_dst[dst_index].two;
-	two = alloc_filespec(dst->path);
-	fill_filespec(two, dst->sha1, dst->mode);
+	dst->count++;
 
-	dp = diff_queue(NULL, one, two);
+	dp = diff_queue(NULL, src, dst);
 	dp->renamed_pair = 1;
 	if (!strcmp(src->path, dst->path))
 		dp->score = rename_src[src_index].score;
@@ -526,10 +524,8 @@ void diffcore_rename(struct diff_options *options)
 		}
 	}
 
-	for (i = 0; i < rename_dst_nr; i++) {
-		diff_free_filespec_data(rename_dst[i].two);
-		free(rename_dst[i].two);
-	}
+	for (i = 0; i < rename_dst_nr; i++)
+		free_filespec(rename_dst[i].two);
 
 	free(rename_dst);
 	rename_dst = NULL;
diff --git a/diffcore.h b/diffcore.h
index eb618b1..30055ac 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -29,6 +29,7 @@ struct diff_filespec {
 	void *cnt_data;
 	const char *funcname_pattern_ident;
 	unsigned long size;
+	int count;               /* Reference count */
 	int xfrm_flags;		 /* for use by the xfrm */
 	unsigned short mode;	 /* file mode */
 	unsigned sha1_valid : 1; /* if true, use sha1 and trust mode;
@@ -43,6 +44,7 @@ struct diff_filespec {
 };
 
 extern struct diff_filespec *alloc_filespec(const char *);
+extern void free_filespec(struct diff_filespec *);
 extern void fill_filespec(struct diff_filespec *, const unsigned char *,
 			  unsigned short);
 
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops
  2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
                     ` (2 preceding siblings ...)
  2007-10-25 18:19   ` [PATCH 3/6] Ref-count the filespecs used by diffcore Linus Torvalds
@ 2007-10-25 18:20   ` Linus Torvalds
  2007-10-26 22:53     ` Junio C Hamano
  2007-10-25 18:23   ` [PATCH 5/6] Do linear-time/space rename logic for exact renames Linus Torvalds
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 18:20 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops

The core rename detection had some rather stupid code to check if a
pathname was used by a later modification or rename, which basically
walked the whole pathname space for all renames for each rename, in
order to tell whether it was a pure rename (no remaining users) or
should be considered a copy (other users of the source file remaining).

That's really silly, since we can just keep a count of users around, and
replace all those complex and expensive loops with just testing that
simple counter (but this all depends on the previous commit that shared
the diff_filespec data structure by using a separate reference count).

Note that the reference count is not the same as the rename count: they
behave otherwise rather similarly, but the reference count is tied to
the allocation (and decremented at de-allocation, so that when it turns
zero we can get rid of the memory), while the rename count is tied to
the renames and is decremented when we find a rename (so that when it
turns zero we know that it was a rename, not a copy).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This fixes the generic diffcode infrastructure, but doesn't do the actual 
*detection* any faster. But as you can tell, it's a huge cleanup on its 
own, and basically removes the "incidental" and stupid support costs of 
rename detection.

 diff.c            |   40 +++++++++++++------------------
 diffcore-rename.c |   68 +++++++++++++---------------------------------------
 diffcore.h        |    2 +-
 3 files changed, 35 insertions(+), 75 deletions(-)

diff --git a/diff.c b/diff.c
index 0b320f6..af85b94 100644
--- a/diff.c
+++ b/diff.c
@@ -2597,9 +2597,9 @@ void diff_debug_filepair(const struct diff_filepair *p, int i)
 {
 	diff_debug_filespec(p->one, i, "one");
 	diff_debug_filespec(p->two, i, "two");
-	fprintf(stderr, "score %d, status %c stays %d broken %d\n",
+	fprintf(stderr, "score %d, status %c rename_used %d broken %d\n",
 		p->score, p->status ? p->status : '?',
-		p->source_stays, p->broken_pair);
+		p->one->rename_used, p->broken_pair);
 }
 
 void diff_debug_queue(const char *msg, struct diff_queue_struct *q)
@@ -2617,8 +2617,8 @@ void diff_debug_queue(const char *msg, struct diff_queue_struct *q)
 
 static void diff_resolve_rename_copy(void)
 {
-	int i, j;
-	struct diff_filepair *p, *pp;
+	int i;
+	struct diff_filepair *p;
 	struct diff_queue_struct *q = &diff_queued_diff;
 
 	diff_debug_queue("resolve-rename-copy", q);
@@ -2640,27 +2640,21 @@ static void diff_resolve_rename_copy(void)
 		 * either in-place edit or rename/copy edit.
 		 */
 		else if (DIFF_PAIR_RENAME(p)) {
-			if (p->source_stays) {
-				p->status = DIFF_STATUS_COPIED;
-				continue;
-			}
-			/* See if there is some other filepair that
-			 * copies from the same source as us.  If so
-			 * we are a copy.  Otherwise we are either a
-			 * copy if the path stays, or a rename if it
-			 * does not, but we already handled "stays" case.
+			/*
+			 * A rename might have re-connected a broken
+			 * pair up, causing the pathnames to be the
+			 * same again. If so, that's not a rename at
+			 * all, just a modification..
+			 *
+			 * Otherwise, see if this source was used for
+			 * multiple renames, in which case we decrement
+			 * the count, and call it a copy.
 			 */
-			for (j = i + 1; j < q->nr; j++) {
-				pp = q->queue[j];
-				if (strcmp(pp->one->path, p->one->path))
-					continue; /* not us */
-				if (!DIFF_PAIR_RENAME(pp))
-					continue; /* not a rename/copy */
-				/* pp is a rename/copy from the same source */
+			if (!strcmp(p->one->path, p->two->path))
+				p->status = DIFF_STATUS_MODIFIED;
+			else if (--p->one->rename_used > 0)
 				p->status = DIFF_STATUS_COPIED;
-				break;
-			}
-			if (!p->status)
+			else
 				p->status = DIFF_STATUS_RENAMED;
 		}
 		else if (hashcmp(p->one->sha1, p->two->sha1) ||
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3da06b7..edb2424 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -55,12 +55,10 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
 static struct diff_rename_src {
 	struct diff_filespec *one;
 	unsigned short score; /* to remember the break score */
-	unsigned src_path_left : 1;
 } *rename_src;
 static int rename_src_nr, rename_src_alloc;
 
 static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
-						   int src_path_left,
 						   unsigned short score)
 {
 	int first, last;
@@ -92,7 +90,6 @@ static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 			(rename_src_nr - first - 1) * sizeof(*rename_src));
 	rename_src[first].one = one;
 	rename_src[first].score = score;
-	rename_src[first].src_path_left = src_path_left;
 	return &(rename_src[first]);
 }
 
@@ -216,6 +213,7 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 		die("internal error: dst already matched.");
 
 	src = rename_src[src_index].one;
+	src->rename_used++;
 	src->count++;
 
 	dst = rename_dst[dst_index].two;
@@ -227,7 +225,6 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 		dp->score = rename_src[src_index].score;
 	else
 		dp->score = score;
-	dp->source_stays = rename_src[src_index].src_path_left;
 	rename_dst[dst_index].pair = dp;
 }
 
@@ -245,21 +242,6 @@ static int score_compare(const void *a_, const void *b_)
 	return b->score - a->score;
 }
 
-static int compute_stays(struct diff_queue_struct *q,
-			 struct diff_filespec *one)
-{
-	int i;
-	for (i = 0; i < q->nr; i++) {
-		struct diff_filepair *p = q->queue[i];
-		if (strcmp(one->path, p->two->path))
-			continue;
-		if (DIFF_PAIR_RENAME(p)) {
-			return 0; /* something else is renamed into this */
-		}
-	}
-	return 1;
-}
-
 /*
  * Find exact renames first.
  *
@@ -338,15 +320,25 @@ void diffcore_rename(struct diff_options *options)
 				locate_rename_dst(p->two, 1);
 		}
 		else if (!DIFF_FILE_VALID(p->two)) {
-			/* If the source is a broken "delete", and
+			/*
+			 * If the source is a broken "delete", and
 			 * they did not really want to get broken,
 			 * that means the source actually stays.
+			 * So we increment the "rename_used" score
+			 * by one, to indicate ourselves as a user
+			 */
+			if (p->broken_pair && !p->score)
+				p->one->rename_used++;
+			register_rename_src(p->one, p->score);
+		}
+		else if (detect_rename == DIFF_DETECT_COPY) {
+			/*
+			 * Increment the "rename_used" score by
+			 * one, to indicate ourselves as a user.
 			 */
-			int stays = (p->broken_pair && !p->score);
-			register_rename_src(p->one, stays, p->score);
+			p->one->rename_used++;
+			register_rename_src(p->one, p->score);
 		}
-		else if (detect_rename == DIFF_DETECT_COPY)
-			register_rename_src(p->one, 1, p->score);
 	}
 	if (rename_dst_nr == 0 || rename_src_nr == 0)
 		goto cleanup; /* nothing to do */
@@ -472,16 +464,7 @@ void diffcore_rename(struct diff_options *options)
 					pair_to_free = p;
 			}
 			else {
-				for (j = 0; j < rename_dst_nr; j++) {
-					if (!rename_dst[j].pair)
-						continue;
-					if (strcmp(rename_dst[j].pair->
-						   one->path,
-						   p->one->path))
-						continue;
-					break;
-				}
-				if (j < rename_dst_nr)
+				if (p->one->rename_used)
 					/* this path remains */
 					pair_to_free = p;
 			}
@@ -507,23 +490,6 @@ void diffcore_rename(struct diff_options *options)
 	*q = outq;
 	diff_debug_queue("done collapsing", q);
 
-	/* We need to see which rename source really stays here;
-	 * earlier we only checked if the path is left in the result,
-	 * but even if a path remains in the result, if that is coming
-	 * from copying something else on top of it, then the original
-	 * source is lost and does not stay.
-	 */
-	for (i = 0; i < q->nr; i++) {
-		struct diff_filepair *p = q->queue[i];
-		if (DIFF_PAIR_RENAME(p) && p->source_stays) {
-			/* If one appears as the target of a rename-copy,
-			 * then mark p->source_stays = 0; otherwise
-			 * leave it as is.
-			 */
-			p->source_stays = compute_stays(q, p->one);
-		}
-	}
-
 	for (i = 0; i < rename_dst_nr; i++)
 		free_filespec(rename_dst[i].two);
 
diff --git a/diffcore.h b/diffcore.h
index 30055ac..cc96c20 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -31,6 +31,7 @@ struct diff_filespec {
 	unsigned long size;
 	int count;               /* Reference count */
 	int xfrm_flags;		 /* for use by the xfrm */
+	int rename_used;         /* Count of rename users */
 	unsigned short mode;	 /* file mode */
 	unsigned sha1_valid : 1; /* if true, use sha1 and trust mode;
 				  * if false, use the name and read from
@@ -58,7 +59,6 @@ struct diff_filepair {
 	struct diff_filespec *two;
 	unsigned short int score;
 	char status; /* M C R N D U (see Documentation/diff-format.txt) */
-	unsigned source_stays : 1; /* all of R/C are copies */
 	unsigned broken_pair : 1;
 	unsigned renamed_pair : 1;
 	unsigned is_unmerged : 1;
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/6] Do linear-time/space rename logic for exact renames
  2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
                     ` (3 preceding siblings ...)
  2007-10-25 18:20   ` [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops Linus Torvalds
@ 2007-10-25 18:23   ` Linus Torvalds
  2007-10-25 19:43     ` Daniel Barkalow
  2007-10-25 18:24   ` [PATCH 6/6] Do exact rename detection regardless of rename limits Linus Torvalds
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 18:23 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 5/6] Do linear-time/space rename logic for exact renames

This implements a smarter rename detector for exact renames, which
rather than doing a pairwise comparison (time O(m*n)) will just hash the
files into a hash-table (size O(n+m)), and only do pairwise comparisons
to renames that have the same hash (time O(n+m) except for unrealistic
hash collissions, which we just cull aggressively).

Admittedly the exact rename case is not nearly as interesting as the
generic case, but it's an important case none-the-less.A similar general
approach should work for the generic case too, but even then you do need
to handle the exact renames/copies separately (to avoid the inevitable
added cost factor that comes from the _size_ of the file), so this is
worth doing.

In the expectation that we will indeed do the same hashing trick for the
general rename case, this code uses a generic hash-table implementation
that can be used for other things too.  In fact, we might be able to
consolidate some of our existing hash tables with the new generic code
in hash.[ch].

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This is obviously the actual meat of the new exact rename detection: 
everything else was just leading up to this.

I could have split out "hash.[ch]" as a separate patch just to introduce 
the infrastructure, but I hate patches that add code that isn't used at 
all. But hash.[ch] really is potentially independently useful, and has 
nothing that is specific to the rename detection per se in it.

 Makefile          |    4 +-
 diffcore-rename.c |  211 +++++++++++++++++++++++++++++++++++++----------------
 hash.c            |  110 +++++++++++++++++++++++++++
 hash.h            |   43 +++++++++++
 4 files changed, 303 insertions(+), 65 deletions(-)
 create mode 100644 hash.c
 create mode 100644 hash.h

diff --git a/Makefile b/Makefile
index 8a0082d..845f811 100644
--- a/Makefile
+++ b/Makefile
@@ -290,7 +290,7 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h transport.h diffcore.h
+	mailmap.h remote.h transport.h diffcore.h hash.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -300,7 +300,7 @@ DIFF_OBJS = \
 LIB_OBJS = \
 	blob.o commit.o connect.o csum-file.o cache-tree.o base85.o \
 	date.o diff-delta.o entry.o exec_cmd.o ident.o \
-	interpolate.o \
+	interpolate.o hash.o \
 	lockfile.o \
 	patch-ids.o \
 	object.o pack-check.o pack-write.o patch-delta.o path.o pkt-line.o \
diff --git a/diffcore-rename.c b/diffcore-rename.c
index edb2424..e7e370b 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,6 +4,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "hash.h"
 
 /* Table of rename/copy destinations */
 
@@ -93,29 +94,6 @@ static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 	return &(rename_src[first]);
 }
 
-static int is_exact_match(struct diff_filespec *src,
-			  struct diff_filespec *dst,
-			  int contents_too)
-{
-	if (src->sha1_valid && dst->sha1_valid &&
-	    !hashcmp(src->sha1, dst->sha1))
-		return 1;
-	if (!contents_too)
-		return 0;
-	if (diff_populate_filespec(src, 1) || diff_populate_filespec(dst, 1))
-		return 0;
-	if (src->size != dst->size)
-		return 0;
-	if (src->sha1_valid && dst->sha1_valid)
-	    return !hashcmp(src->sha1, dst->sha1);
-	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
-		return 0;
-	if (src->size == dst->size &&
-	    !memcmp(src->data, dst->data, src->size))
-		return 1;
-	return 0;
-}
-
 static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
 {
 	int src_len = strlen(src->path), dst_len = strlen(dst->path);
@@ -242,56 +220,163 @@ static int score_compare(const void *a_, const void *b_)
 	return b->score - a->score;
 }
 
+struct file_similarity {
+	int src_dst, index;
+	struct diff_filespec *filespec;
+	struct file_similarity *next;
+};
+
+static int find_identical_files(struct file_similarity *src,
+				struct file_similarity *dst)
+{
+	int renames = 0;
+
+	/*
+	 * Walk over all the destinations ...
+	 */
+	do {
+		struct diff_filespec *one = dst->filespec;
+		struct file_similarity *p, *best;
+		int i = 100;
+
+		/*
+		 * .. to find the best source match
+		 */
+		best = NULL;
+		for (p = src; p; p = p->next) {
+			struct diff_filespec *two = p->filespec;
+
+			/* False hash collission? */
+			if (hashcmp(one->sha1, two->sha1))
+				continue;
+			/* Non-regular files? If so, the modes must match! */
+			if (!S_ISREG(one->mode) || !S_ISREG(two->mode)) {
+				if (one->mode != two->mode)
+					continue;
+			}
+			best = p;
+			if (basename_same(one, two))
+				break;
+
+			/* Too many identical alternatives? Pick one */
+			if (!--i)
+				break;
+		}
+		if (best) {
+			record_rename_pair(dst->index, best->index, MAX_SCORE);
+			renames++;
+		}
+	} while ((dst = dst->next) != NULL);
+	return renames;
+}
+
+/*
+ * Note: the rest of the rename logic depends on this
+ * phase also populating all the filespecs for any
+ * entry that isn't matched up with an exact rename.
+ */
+static void free_similarity_list(struct file_similarity *p)
+{
+	while (p) {
+		struct file_similarity *entry = p;
+		p = p->next;
+
+		/* Stupid special case, see note above! */
+		diff_populate_filespec(entry->filespec, 0);
+		free(entry);
+	}
+}
+
+static int find_same_files(void *ptr)
+{
+	int ret;
+	struct file_similarity *p = ptr;
+	struct file_similarity *src = NULL, *dst = NULL;
+
+	/* Split the hash list up into sources and destinations */
+	do {
+		struct file_similarity *entry = p;
+		p = p->next;
+		if (entry->src_dst < 0) {
+			entry->next = src;
+			src = entry;
+		} else {
+			entry->next = dst;
+			dst = entry;
+		}
+	} while (p);
+
+	/*
+	 * If we have both sources *and* destinations, see if
+	 * we can match them up
+	 */
+	ret = (src && dst) ? find_identical_files(src, dst) : 0;
+
+	/* Free the hashes and return the number of renames found */
+	free_similarity_list(src);
+	free_similarity_list(dst);
+	return ret;
+}
+
+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+	unsigned int hash;
+	if (!filespec->sha1_valid) {
+		if (diff_populate_filespec(filespec, 0))
+			return 0;
+		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
+	}
+	memcpy(&hash, filespec->sha1, sizeof(hash));
+	return hash;
+}
+
+static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
+{
+	void **pos;
+	unsigned int hash;
+	struct file_similarity *entry = xmalloc(sizeof(*entry));
+
+	entry->src_dst = src_dst;
+	entry->index = index;
+	entry->filespec = filespec;
+	entry->next = NULL;
+
+	hash = hash_filespec(filespec);
+	pos = insert_hash(hash, entry, table);
+
+	/* We already had an entry there? */
+	if (pos) {
+		entry->next = *pos;
+		*pos = entry;
+	}
+}
+
 /*
  * Find exact renames first.
  *
  * The first round matches up the up-to-date entries,
  * and then during the second round we try to match
  * cache-dirty entries as well.
- *
- * Note: the rest of the rename logic depends on this
- * phase also populating all the filespecs for any
- * entry that isn't matched up with an exact rename,
- * see "is_exact_match()".
  */
 static int find_exact_renames(void)
 {
-	int rename_count = 0;
-	int contents_too;
-
-	for (contents_too = 0; contents_too < 2; contents_too++) {
-		int i;
-
-		for (i = 0; i < rename_dst_nr; i++) {
-			struct diff_filespec *two = rename_dst[i].two;
-			int j;
-
-			if (rename_dst[i].pair)
-				continue; /* dealt with an earlier round */
-			for (j = 0; j < rename_src_nr; j++) {
-				int k;
-				struct diff_filespec *one = rename_src[j].one;
-				if (!is_exact_match(one, two, contents_too))
-					continue;
+	int i;
+	struct hash_table file_table;
 
-				/* see if there is a basename match, too */
-				for (k = j; k < rename_src_nr; k++) {
-					one = rename_src[k].one;
-					if (basename_same(one, two) &&
-						is_exact_match(one, two,
-							contents_too)) {
-						j = k;
-						break;
-					}
-				}
-
-				record_rename_pair(i, j, (int)MAX_SCORE);
-				rename_count++;
-				break; /* we are done with this entry */
-			}
-		}
-	}
-	return rename_count;
+	init_hash(&file_table);
+	for (i = 0; i < rename_src_nr; i++)
+		insert_file_table(&file_table, -1, i, rename_src[i].one);
+
+	for (i = 0; i < rename_dst_nr; i++)
+		insert_file_table(&file_table, 1, i, rename_dst[i].two);
+
+	/* Find the renames */
+	i = for_each_hash(&file_table, find_same_files);
+
+	/* .. and free the hash data structure */
+	free_hash(&file_table);
+
+	return i;
 }
 
 void diffcore_rename(struct diff_options *options)
diff --git a/hash.c b/hash.c
new file mode 100644
index 0000000..7b492d4
--- /dev/null
+++ b/hash.c
@@ -0,0 +1,110 @@
+/*
+ * Some generic hashing helpers.
+ */
+#include "cache.h"
+#include "hash.h"
+
+/*
+ * Look up a hash entry in the hash table. Return the pointer to
+ * the existing entry, or the empty slot if none existed. The caller
+ * can then look at the (*ptr) to see whether it existed or not.
+ */
+static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash_table *table)
+{
+	unsigned int size = table->size, nr = hash % size;
+	struct hash_table_entry *array = table->array;
+
+	while (array[nr].ptr) {
+		if (array[nr].hash == hash)
+			break;
+		nr++;
+		if (nr >= size)
+			nr = 0;
+	}
+	return array + nr;
+}
+
+
+/*
+ * Insert a new hash entry pointer into the table.
+ *
+ * If that hash entry already existed, return the pointer to
+ * the existing entry (and the caller can create a list of the
+ * pointers or do anything else). If it didn't exist, return
+ * NULL (and the caller knows the pointer has been inserted).
+ */
+static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
+{
+	struct hash_table_entry *entry = lookup_hash_entry(hash, table);
+
+	if (!entry->ptr) {
+		entry->ptr = ptr;
+		entry->hash = hash;
+		table->nr++;
+		return NULL;
+	}
+	return &entry->ptr;
+}
+
+static void grow_hash_table(struct hash_table *table)
+{
+	unsigned int i;
+	unsigned int old_size = table->size, new_size;
+	struct hash_table_entry *old_array = table->array, *new_array;
+
+	new_size = alloc_nr(old_size);
+	new_array = xcalloc(sizeof(struct hash_table_entry), new_size);
+	table->size = new_size;
+	table->array = new_array;
+	table->nr = 0;
+	for (i = 0; i < old_size; i++) {
+		unsigned int hash = old_array[i].hash;
+		void *ptr = old_array[i].ptr;
+		if (ptr)
+			insert_hash_entry(hash, ptr, table);
+	}
+	free(old_array);
+}
+
+void *lookup_hash(unsigned int hash, struct hash_table *table)
+{
+	if (!table->array)
+		return NULL;
+	return &lookup_hash_entry(hash, table)->ptr;
+}
+
+void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
+{
+	unsigned int nr = table->nr;
+	if (nr >= table->size/2)
+		grow_hash_table(table);
+	return insert_hash_entry(hash, ptr, table);
+}
+
+int for_each_hash(struct hash_table *table, int (*fn)(void *))
+{
+	int sum = 0;
+	unsigned int i;
+	unsigned int size = table->size;
+	struct hash_table_entry *array = table->array;
+
+	for (i = 0; i < size; i++) {
+		void *ptr = array->ptr;
+		array++;
+		if (ptr) {
+			int val = fn(ptr);
+			if (val < 0)
+				return val;
+			sum += val;
+		}
+	}
+	return sum;
+}
+
+void free_hash(struct hash_table *table)
+{
+	free(table->array);
+	table->array = NULL;
+	table->size = 0;
+	table->nr = 0;
+}
diff --git a/hash.h b/hash.h
new file mode 100644
index 0000000..a8b0fbb
--- /dev/null
+++ b/hash.h
@@ -0,0 +1,43 @@
+#ifndef HASH_H
+#define HASH_H
+
+/*
+ * These are some simple generic hash table helper functions.
+ * Not necessarily suitable for all users, but good for things
+ * where you want to just keep track of a list of things, and
+ * have a good hash to use on them.
+ *
+ * It keeps the hash table at roughly 50-75% free, so the memory
+ * cost of the hash table itself is roughly
+ *
+ *	3 * 2*sizeof(void *) * nr_of_objects
+ *
+ * bytes.
+ *
+ * FIXME: on 64-bit architectures, we waste memory. It would be
+ * good to have just 32-bit pointers, requiring a special allocator
+ * for hashed entries or something.
+ */
+struct hash_table_entry {
+	unsigned int hash;
+	void *ptr;
+};
+
+struct hash_table {
+	unsigned int size, nr;
+	struct hash_table_entry *array;
+};
+
+extern void *lookup_hash(unsigned int hash, struct hash_table *table);
+extern void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table);
+extern int for_each_hash(struct hash_table *table, int (*fn)(void *));
+extern void free_hash(struct hash_table *table);
+
+static inline void init_hash(struct hash_table *table)
+{
+	table->size = 0;
+	table->nr = 0;
+	table->array = NULL;
+}
+
+#endif
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/6] Do exact rename detection regardless of rename limits
  2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
                     ` (4 preceding siblings ...)
  2007-10-25 18:23   ` [PATCH 5/6] Do linear-time/space rename logic for exact renames Linus Torvalds
@ 2007-10-25 18:24   ` Linus Torvalds
  2007-10-25 18:37   ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
  2007-10-25 19:08   ` Jeff King
  7 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 18:24 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List

From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: [PATCH 6/6] Do exact rename detection regardless of rename limits

Now that the exact rename detection is linear-time (with a very small
constant factor to boot), there is no longer any reason to limit it by
the number of files involved.

In some trivial testing, I created a repository with a directory that
had a hundred thousand files in it (all with different contents), and
then moved that directory to show the effects of renaming 100,000 files.

With the new code, that resulted in

	[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
	400006

	real    0m2.071s
	user    0m1.520s
	sys     0m0.576s

ie the code can correctly detect the hundred thousand renames in about 2
seconds (the number "400006" comes from four lines for each rename:

	diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
	similarity index 100%
	rename from really-big-dir/file-1-1-1-1-1
	rename to moved-big-dir/file-1-1-1-1-1

and the extra six lines is from a one-liner commit message and all the
commit information and spacing).

Most of those two seconds weren't even really the rename detection, it's
really all the other stuff needed to get there.

With the old code, this wouldn't have been practically possible.  Doing
a pairwise check of the ten billion possible pairs would have been
prohibitively expensive.  In fact, even with the rename limiter in
place, the old code would waste a lot of time just on the diff_filespec
checks, and despite not even trying to find renames, it used to look
like:

	[torvalds@woody big-rename]$ time git show -C | wc -l
	1400006

	real    0m12.337s
	user    0m12.285s
	sys     0m0.192s

ie we used to take 12 seconds for this load and not even do any rename
detection! (The number 1400006 comes from fourteen lines per file moved:
seven lines each for the delete and the create of a one-liner file, and
the same extra six lines of commit information).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This obviously just moves the rename detection call. It made sense as a 
separate patch largely because of the explanation that goes along with it, 
but it conceptually is very different from actually improving the rename 
detection logic itself. So it got a patch of its own.

 diffcore-rename.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e7e370b..3946932 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -429,6 +429,12 @@ void diffcore_rename(struct diff_options *options)
 		goto cleanup; /* nothing to do */

 	/*
+	 * We really want to cull the candidates list early
+	 * with cheap tests in order to avoid doing deltas.
+	 */
+	rename_count = find_exact_renames();
+
+	/*
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
@@ -444,12 +450,6 @@ void diffcore_rename(struct diff_options *options)
 	if (rename_dst_nr * rename_src_nr > rename_limit * rename_limit)
 		goto cleanup;

-	/*
-	 * We really want to cull the candidates list early
-	 * with cheap tests in order to avoid doing deltas.
-	 */
-	rename_count = find_exact_renames();
-
 	/* Have we run out the created file pool?  If so we can avoid
 	 * doing the delta matrix altogether.
 	 */
-- 
1.5.3.4.330.g1dec6

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] Cleaned-up rename detection patch-series
  2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
                     ` (5 preceding siblings ...)
  2007-10-25 18:24   ` [PATCH 6/6] Do exact rename detection regardless of rename limits Linus Torvalds
@ 2007-10-25 18:37   ` Linus Torvalds
  2007-10-25 19:08   ` Jeff King
  7 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 18:37 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List



Final notes:

 - all of these were done on top of the current 'master' branch.  Not that 
   it likely matters much, but I thought I'd mention it in case there are 
   conflicts with anything currently cooking in 'next'. I didn't even 
   check.

 - The commits depend on each other, but apart from the ordering, they all 
   stand on their own. They pass the test-suite at all stages.

 - In particular, the first four commits are cleanups and infrastructure 
   that really is totally independent of the last two. I'd argue that the 
   first one (the dependency fix) could/should be merged into stable, and 
   the next three can probably be merged more aggressively than the last 
   two.

 - I *think* this series is good to go. I've used the new rename detection 
   code in my own "production" environment (ie the kernel) since sending 
   it out, and the end result of all these patches is identical - apart 
   from the the dependency fix - to what I've been running for the last 
   week. But maybe it makes more sense to merge the first four into 
   'next', and leave the last two in 'pu', for example (or 'master' and 
   'next' respectively, depending on how comfy people are with the 
   patches).

Splitting the thing up definitely makes the series more readable, and 
maybe that means that somebody will actually comment on it. I don't think 
I got any comments on the original series once I fixed the stupid bugs. 
Hopefully the cleaner series is more amenable to people reviewing it.

			Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/6] Cleaned-up rename detection patch-series
  2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
                     ` (6 preceding siblings ...)
  2007-10-25 18:37   ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
@ 2007-10-25 19:08   ` Jeff King
  7 siblings, 0 replies; 18+ messages in thread
From: Jeff King @ 2007-10-25 19:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Thu, Oct 25, 2007 at 11:15:16AM -0700, Linus Torvalds wrote:

> Ok, here's the patch-series to do rename detection of exact renames in 
> linear time, except it's cleaned up into a nice series of 6 patches. The 
> end result is identical to the previous patches (which got smushed down 
> into just two patches in Junio's tree), apart from a fixed dependency in 
> the Makefile that caused me grief and a broken test-suite due to some 
> object files simply not being correctly recompiled.

Thanks, I have been basing my inexact rename work off of your messy
patches, so I will rebase onto this.

Sorry I am lagging behind on getting that work out, but I hope to have
something to show soon.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
  2007-10-25 18:23   ` [PATCH 5/6] Do linear-time/space rename logic for exact renames Linus Torvalds
@ 2007-10-25 19:43     ` Daniel Barkalow
  2007-10-25 19:48       ` Jeff King
  2007-10-25 20:25       ` Linus Torvalds
  0 siblings, 2 replies; 18+ messages in thread
From: Daniel Barkalow @ 2007-10-25 19:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

> +/*
> + * Insert a new hash entry pointer into the table.
> + *
> + * If that hash entry already existed, return the pointer to
> + * the existing entry (and the caller can create a list of the
> + * pointers or do anything else). If it didn't exist, return
> + * NULL (and the caller knows the pointer has been inserted).
> + */

Creating a list of the pointers doesn't work correctly with the grow 
implementation, because growing the hash may turn a collision into a 
non-collision, at which point items other than the first cannot be found 
(since they're listed inside a bucket that's now wrong for them). AFAIK, 
resizing a hash table requires being able to figure out what happened with 
collisions.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
  2007-10-25 19:43     ` Daniel Barkalow
@ 2007-10-25 19:48       ` Jeff King
  2007-10-25 20:23         ` Daniel Barkalow
  2007-10-25 20:25       ` Linus Torvalds
  1 sibling, 1 reply; 18+ messages in thread
From: Jeff King @ 2007-10-25 19:48 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List

On Thu, Oct 25, 2007 at 03:43:46PM -0400, Daniel Barkalow wrote:

> Creating a list of the pointers doesn't work correctly with the grow 
> implementation, because growing the hash may turn a collision into a 
> non-collision, at which point items other than the first cannot be found 
> (since they're listed inside a bucket that's now wrong for them). AFAIK, 
> resizing a hash table requires being able to figure out what happened with 
> collisions.

I thought this at first, too, but there are two types of collisions in
this hash implementation: those that come from having the actual 32-bit
hash collide, and those that come from not having enough buckets.

The client code gets a pointer kicked back to it when there is a
collision on the actual hash value (i.e., two things had the exact same
hash value). The number of buckets grows when you simply have more
buckets filled than you like. Two different hashes that would be in the
same bucket don't actually occupy the same bucket -- the second one to
arrive gets shoved into the next available bucket.

-Peff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
  2007-10-25 19:48       ` Jeff King
@ 2007-10-25 20:23         ` Daniel Barkalow
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Barkalow @ 2007-10-25 20:23 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List

On Thu, 25 Oct 2007, Jeff King wrote:

> On Thu, Oct 25, 2007 at 03:43:46PM -0400, Daniel Barkalow wrote:
> 
> > Creating a list of the pointers doesn't work correctly with the grow 
> > implementation, because growing the hash may turn a collision into a 
> > non-collision, at which point items other than the first cannot be found 
> > (since they're listed inside a bucket that's now wrong for them). AFAIK, 
> > resizing a hash table requires being able to figure out what happened with 
> > collisions.
> 
> I thought this at first, too, but there are two types of collisions in
> this hash implementation: those that come from having the actual 32-bit
> hash collide, and those that come from not having enough buckets.
> 
> The client code gets a pointer kicked back to it when there is a
> collision on the actual hash value (i.e., two things had the exact same
> hash value). The number of buckets grows when you simply have more
> buckets filled than you like. Two different hashes that would be in the
> same bucket don't actually occupy the same bucket -- the second one to
> arrive gets shoved into the next available bucket.

Ah, right, nevermind. The comment might be a bit misleading in that case, 
if we both missed this at first.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
  2007-10-25 19:43     ` Daniel Barkalow
  2007-10-25 19:48       ` Jeff King
@ 2007-10-25 20:25       ` Linus Torvalds
  2007-10-25 21:37         ` Daniel Barkalow
  1 sibling, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2007-10-25 20:25 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, Git Mailing List

On Thu, 25 Oct 2007, Daniel Barkalow wrote:
> 
> Creating a list of the pointers doesn't work correctly with the grow 
> implementation, because growing the hash may turn a collision into a 
> non-collision, at which point items other than the first cannot be found 
> (since they're listed inside a bucket that's now wrong for them). AFAIK, 
> resizing a hash table requires being able to figure out what happened with 
> collisions.

Nope. 

The hash algorithm is much smarter than that.

I *always* uses a full 32-bit hash, and no amount of resizing is ever 
going to change that. The index into the hash-table is in fact entirely 
unused.

This has several good properties:

 - it means that hash-table resizing is a non-event

 - it means that you always have the full 32-bit hash, and a collision in 
   the hash size never causes unnecessary work apart from the fact that 
   the code walks the hash table a bit more.

 - because the hash is embedded in the table itself, it has relatively 
   good cache behaviour when compared to something that needs to actually 
   follow the pointer to validate the full data. So assuming that the full 
   32-bit hash is good enough to effectively never have any collisions 
   (or, assuming you don't even *care* about the collisions, which is the 
   case when you're just generating content fingerprints for lines when 
   comparing the data in two files), you never end up with unnecessarily 
   following pointers to cachelines that you are not interested in.

The last point at least somewhat mitigates the (inevitably) bad cache 
behaviour that hash tables tend to have. It's not like it's going to be 
wonderful in the cache, but at least it's less horrid than the more common 
implementation that needs to follow the pointer to validate each hash 
entry that may or may not be a collision.

but the important part is #1, which is what allows the code to be a 
generic hash algorithm that resizes the hash table without even 
understanding or caring what is behind the pointer.

		Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 5/6] Do linear-time/space rename logic for exact renames
  2007-10-25 20:25       ` Linus Torvalds
@ 2007-10-25 21:37         ` Daniel Barkalow
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Barkalow @ 2007-10-25 21:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Thu, 25 Oct 2007, Linus Torvalds wrote:

> On Thu, 25 Oct 2007, Daniel Barkalow wrote:
> > 
> > Creating a list of the pointers doesn't work correctly with the grow 
> > implementation, because growing the hash may turn a collision into a 
> > non-collision, at which point items other than the first cannot be found 
> > (since they're listed inside a bucket that's now wrong for them). AFAIK, 
> > resizing a hash table requires being able to figure out what happened with 
> > collisions.
> 
> Nope. 
> 
> The hash algorithm is much smarter than that.

Ah, yes. I was just confused by the comment suggesting the possibility of 
a collision when the only way to have two calls get the same bucket is 
when the key that the library gets is actually identical.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops
  2007-10-25 18:20   ` [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops Linus Torvalds
@ 2007-10-26 22:53     ` Junio C Hamano
  2007-10-26 23:10       ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2007-10-26 22:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> @@ -2640,27 +2640,21 @@ static void diff_resolve_rename_copy(void)
>  		 * either in-place edit or rename/copy edit.
>  		 */
>  		else if (DIFF_PAIR_RENAME(p)) {
> +			/*
> +			 * A rename might have re-connected a broken
> +			 * pair up, causing the pathnames to be the
> +			 * same again. If so, that's not a rename at
> +			 * all, just a modification..
> +			 *
> +			 * Otherwise, see if this source was used for
> +			 * multiple renames, in which case we decrement
> +			 * the count, and call it a copy.
>  			 */
> +			if (!strcmp(p->one->path, p->two->path))
> +				p->status = DIFF_STATUS_MODIFIED;
> +			else if (--p->one->rename_used > 0)
>  				p->status = DIFF_STATUS_COPIED;
> +			else
>  				p->status = DIFF_STATUS_RENAMED;
>  		}
>  		else if (hashcmp(p->one->sha1, p->two->sha1) ||

The interaction between the above and ...

> @@ -338,15 +320,25 @@ void diffcore_rename(struct diff_options *options)
>  				locate_rename_dst(p->two, 1);
>  		}
>  		else if (!DIFF_FILE_VALID(p->two)) {
> +			/*
> +			 * If the source is a broken "delete", and
>  			 * they did not really want to get broken,
>  			 * that means the source actually stays.
> +			 * So we increment the "rename_used" score
> +			 * by one, to indicate ourselves as a user
> +			 */
> +			if (p->broken_pair && !p->score)
> +				p->one->rename_used++;
> +			register_rename_src(p->one, p->score);
> +		}
> +		else if (detect_rename == DIFF_DETECT_COPY) {
> +			/*
> +			 * Increment the "rename_used" score by
> +			 * one, to indicate ourselves as a user.
>  			 */
> +			p->one->rename_used++;
> +			register_rename_src(p->one, p->score);
>  		}
>  	}
>  	if (rename_dst_nr == 0 || rename_src_nr == 0)
>  		goto cleanup; /* nothing to do */

... this part feels a bit too subtle for a still-jet-lagged
brain to grok.  I wonder what happens if the preimage of a
broken pair is used as the rename source for more than two
postimages.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops
  2007-10-26 22:53     ` Junio C Hamano
@ 2007-10-26 23:10       ` Linus Torvalds
  2007-10-26 23:27         ` Junio C Hamano
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2007-10-26 23:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List



On Fri, 26 Oct 2007, Junio C Hamano wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > @@ -2640,27 +2640,21 @@ static void diff_resolve_rename_copy(void)
> >  		 * either in-place edit or rename/copy edit.
> >  		 */
> >  		else if (DIFF_PAIR_RENAME(p)) {
> > +			/*
> > +			 * A rename might have re-connected a broken
> > +			 * pair up, causing the pathnames to be the
> > +			 * same again. If so, that's not a rename at
> > +			 * all, just a modification..
> > +			 *
> > +			 * Otherwise, see if this source was used for
> > +			 * multiple renames, in which case we decrement
> > +			 * the count, and call it a copy.
> >  			 */
> > +			if (!strcmp(p->one->path, p->two->path))
> > +				p->status = DIFF_STATUS_MODIFIED;
> > +			else if (--p->one->rename_used > 0)
> >  				p->status = DIFF_STATUS_COPIED;
> > +			else
> >  				p->status = DIFF_STATUS_RENAMED;
> >  		}
> >  		else if (hashcmp(p->one->sha1, p->two->sha1) ||
> 
> The interaction between the above and ...

Heh.

I'm pretty sure it's right, because:

> > @@ -338,15 +320,25 @@ void diffcore_rename(struct diff_options *options)
> >  				locate_rename_dst(p->two, 1);
> >  		}
> >  		else if (!DIFF_FILE_VALID(p->two)) {
> > +			/*
> > +			 * If the source is a broken "delete", and
> >  			 * they did not really want to get broken,
> >  			 * that means the source actually stays.
> > +			 * So we increment the "rename_used" score
> > +			 * by one, to indicate ourselves as a user
> > +			 */
> > +			if (p->broken_pair && !p->score)
> > +				p->one->rename_used++;
> > +			register_rename_src(p->one, p->score);
> > +		}
> > +		else if (detect_rename == DIFF_DETECT_COPY) {
> > +			/*
> > +			 * Increment the "rename_used" score by
> > +			 * one, to indicate ourselves as a user.
> >  			 */
> > +			p->one->rename_used++;
> > +			register_rename_src(p->one, p->score);
> >  		}
> >  	}
> >  	if (rename_dst_nr == 0 || rename_src_nr == 0)
> >  		goto cleanup; /* nothing to do */
> 
> ... this part feels a bit too subtle for a still-jet-lagged
> brain to grok.  I wonder what happens if the preimage of a
> broken pair is used as the rename source for more than two
> postimages.

The nice thing about the whole counting thing is that we really don't even 
care. What happens is:

 - *if* any rename at all happens, the rename detection will increment the 
   "rename_used" thing even more for the source (once for each rename)

 - so if the rename_used started out non-zero, it will never become zero 
   in diff_resolve_rename_copy(), and every single detected "rename" will 
   be considered a copy - exactly like we want.

 - in other words, a "rename_used++" before rename-detection guarantees 
   that you never see a rename, only a copy of the source.

The above is actually true *even*if* the 

	if (!strcmp(p->one->path, p->two->path))

code in diff_resolve_rename_copy() actually triggers, so we could have 
(and at one point I did) done the "--p->one->rename_used > 0" test before 
the strcmp() test, and it would all continue to work fine. However, the 
reason that I put the strcmp() first is that it needs to be done whether 
we decide to consider it a "copy" or a "rename" (so we cannot avoid it 
anyway), and *if* it triggers, we actually want to avoid the rename_used 
going down to zero anyway (not that it would, because I think it's bound 
to be one of the cases where we pre-incremented the count), so not doing 
the decrement there is equivalent to doing another pre-increment.

So I think it's all right, and more obviously right than it used to be. 
But hey, it's possible that I missed something.

		Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops
  2007-10-26 23:10       ` Linus Torvalds
@ 2007-10-26 23:27         ` Junio C Hamano
  2007-10-26 23:36           ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2007-10-26 23:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> The nice thing about the whole counting thing is that we really don't even 
> care. What happens is:
>
>  - *if* any rename at all happens, the rename detection will increment the 
>    "rename_used" thing even more for the source (once for each rename)
>
>  - so if the rename_used started out non-zero, it will never become zero 
>    in diff_resolve_rename_copy(), and every single detected "rename" will 
>    be considered a copy - exactly like we want.
>
>  - in other words, a "rename_used++" before rename-detection guarantees 
>    that you never see a rename, only a copy of the source.

Ok, and that is because we _know_ a broken pair means that the
path itself remains in the postimage and any other postimage
that takes from its preimage can never "rename the path away".
They can only be copies.  Which makes sense.

> The above is actually true *even*if* the 
>
> 	if (!strcmp(p->one->path, p->two->path))
>
> code in diff_resolve_rename_copy() actually triggers,...
> ..., so not doing 
> the decrement there is equivalent to doing another pre-increment.

Yes, this is what confused me.  So for a broken pair, the actual
value of rename_used does not really matter.  We only care about
it not going down to zero.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops
  2007-10-26 23:27         ` Junio C Hamano
@ 2007-10-26 23:36           ` Linus Torvalds
  0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2007-10-26 23:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Fri, 26 Oct 2007, Junio C Hamano wrote:
> 
> So for a broken pair, the actual value of rename_used does not really 
> matter.  We only care about it not going down to zero.

Correct. The rename_used count really is immaterial *except* for the magic 
distinction between zero ("it's a rename, no original source file left") 
and non-zero ("it's a copy, original source file remains").

Which is why the new counter is so fundamentally simple: by decrementing 
it for each rename we encounter, we automatically get that behaviour of 
"only the last user turns into a 'rename' if the source file really went 
away" that we want. 

The old code did it all with some really expensive loops over the 
remaining renames.

		Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-10-26 23:36 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <alpine.LFD.0.999.0710251112120.30120@woody.linux-foundation.or g>
2007-10-25 18:15 ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
2007-10-25 18:16   ` [PATCH 1/6] Add 'diffcore.h' to LIB_H Linus Torvalds
2007-10-25 18:17   ` [PATCH 2/6] Split out "exact content match" phase of rename detection Linus Torvalds
2007-10-25 18:19   ` [PATCH 3/6] Ref-count the filespecs used by diffcore Linus Torvalds
2007-10-25 18:20   ` [PATCH 4/6] copy vs rename detection: avoid unnecessary O(n*m) loops Linus Torvalds
2007-10-26 22:53     ` Junio C Hamano
2007-10-26 23:10       ` Linus Torvalds
2007-10-26 23:27         ` Junio C Hamano
2007-10-26 23:36           ` Linus Torvalds
2007-10-25 18:23   ` [PATCH 5/6] Do linear-time/space rename logic for exact renames Linus Torvalds
2007-10-25 19:43     ` Daniel Barkalow
2007-10-25 19:48       ` Jeff King
2007-10-25 20:23         ` Daniel Barkalow
2007-10-25 20:25       ` Linus Torvalds
2007-10-25 21:37         ` Daniel Barkalow
2007-10-25 18:24   ` [PATCH 6/6] Do exact rename detection regardless of rename limits Linus Torvalds
2007-10-25 18:37   ` [PATCH 0/6] Cleaned-up rename detection patch-series Linus Torvalds
2007-10-25 19:08   ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).