[PATCH, take 1] Linear-time/space rename logic (exact renames only)

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH, take 1] Linear-time/space rename logic (exact renames only)
@ 2007-10-21 23:59 Linus Torvalds
  2007-10-22  0:31 ` David Symonds
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Linus Torvalds @ 2007-10-21 23:59 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano, Shawn O. Pearce
  Cc: David Kastrup, Jeff King


This is a first effort at avoiding various O(n*m) effects in the rename 
detection. Right now it does so only for the exact renames, which is 
admittedly a rather easier case to handle, but having emailed a bit with 
Andy Chu, I think it's possible to do even the non-exact renames using a 
similar approach.

This depends on the previous diffcore-rename() cleanup, and introduces a 
new set of helpers for doing hash tables (hash.[ch]). We could probably 
move some of the other of our hash table users over to this (it's designed 
to be fairly generic), but that's a separate issue.

What it does is to rather than iterate over all sources and destinations 
and checking if they are identical (which is O(src*dst)), it hashes each 
of the sources and destinations into a hash table, using the SHA1 hash of 
the contents as the hash. That's O(n+m). It then walks the hash table 
(which is also O(m+n) in size), and only pairs up files for comparison 
that hashed to the same spot.

Doing this for more than just the exact same contents would be basically 
the same thing, except it starts hashing up fingerprints of the contents 
and linking up file pairs that get linked up by those fingerprints. More 
involved, but not impossible.

I tried a trivial case where I moved 100,000 files from one directory to 
another, and this patch speeds that up from ~13s to just under 2s for me.

However! Please note:
 - it looks ok, and I've tested it some, but this needs more people 
   looking at it.
 - because it only helps the exact rename case, it by no means "solves" 
   the rename cost issue. It just makes one particular case go much 
   faster.
 - in fact, the big optimization isn't the actual hash table, but the 
   independent and much simpler "diff_filespec->used" optimization for a 
   deleted filename that was used for a rename/copy.

But I'd like to have people give it a look,

		Linus

---

 Makefile          |    4 +-
 diffcore-rename.c |  214 ++++++++++++++++++++++++++++++++++------------------
 diffcore.h        |    1 +
 hash.c            |  110 +++++++++++++++++++++++++++
 hash.h            |   43 +++++++++++
 5 files changed, 296 insertions(+), 76 deletions(-)

diff --git a/Makefile b/Makefile
index 8db4dbe..17c31ba 100644
--- a/Makefile
+++ b/Makefile
@@ -291,7 +291,7 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h
+	mailmap.h remote.h hash.o
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -301,7 +301,7 @@ DIFF_OBJS = \
 LIB_OBJS = \
 	blob.o commit.o connect.o csum-file.o cache-tree.o base85.o \
 	date.o diff-delta.o entry.o exec_cmd.o ident.o \
-	interpolate.o \
+	interpolate.o hash.o \
 	lockfile.o \
 	patch-ids.o \
 	object.o pack-check.o pack-write.o patch-delta.o path.o pkt-line.o \
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 2077a9b..05d39db 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,6 +4,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "hash.h"
 
 /* Table of rename/copy destinations */
 
@@ -96,29 +97,6 @@ static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 	return &(rename_src[first]);
 }
 
-static int is_exact_match(struct diff_filespec *src,
-			  struct diff_filespec *dst,
-			  int contents_too)
-{
-	if (src->sha1_valid && dst->sha1_valid &&
-	    !hashcmp(src->sha1, dst->sha1))
-		return 1;
-	if (!contents_too)
-		return 0;
-	if (diff_populate_filespec(src, 1) || diff_populate_filespec(dst, 1))
-		return 0;
-	if (src->size != dst->size)
-		return 0;
-	if (src->sha1_valid && dst->sha1_valid)
-	    return !hashcmp(src->sha1, dst->sha1);
-	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
-		return 0;
-	if (src->size == dst->size &&
-	    !memcmp(src->data, dst->data, src->size))
-		return 1;
-	return 0;
-}
-
 static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
 {
 	int src_len = strlen(src->path), dst_len = strlen(dst->path);
@@ -216,6 +194,7 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 		die("internal error: dst already matched.");
 
 	src = rename_src[src_index].one;
+	src->used = 1;
 	one = alloc_filespec(src->path);
 	fill_filespec(one, src->sha1, src->mode);
 
@@ -262,56 +241,152 @@ static int compute_stays(struct diff_queue_struct *q,
 	return 1;
 }
 
+struct file_similarity {
+	int src_dst, index;
+	struct diff_filespec *filespec;
+	struct file_similarity *next;
+};
+
+static int find_identical_files(struct file_similarity *src,
+				struct file_similarity *dst)
+{
+	int renames = 0;
+	do {
+		struct diff_filespec *one = src->filespec;
+		struct file_similarity *p, *best;
+		int i = 100;
+
+		best = NULL;
+		for (p = dst; p; p = p->next) {
+			struct diff_filespec *two = p->filespec;
+
+			/* Already picked as a destination? */
+			if (!p->src_dst)
+				continue;
+			/* False hash collission? */
+			if (hashcmp(one->sha1, two->sha1))
+				continue;
+			best = p;
+			if (basename_same(one, two))
+				break;
+
+			/* Too many identical alternatives? Pick one */
+			if (!--i)
+				break;
+		}
+		if (best) {
+			best->src_dst = 0;
+			record_rename_pair(best->index, src->index, MAX_SCORE);
+			renames++;
+		}
+	} while ((src = src->next) != NULL);
+	return renames;
+}
+
+static int find_same_files(void *ptr)
+{
+	struct file_similarity *p = ptr;
+	struct file_similarity *src = NULL, *dst = NULL;
+
+	/* Split the hash list up into sources and destinations */
+	do {
+		struct file_similarity *entry = p;
+		p = p->next;
+		if (entry->src_dst < 0) {
+			entry->next = src;
+			src = entry;
+		} else {
+			entry->next = dst;
+			dst = entry;
+		}
+	} while (p);
+
+	/*
+	 * If we have both sources *and* destinations, see if
+	 * we can match them up
+	 */
+	return (src && dst) ? find_identical_files(src, dst) : 0;
+}
+
+/*
+ * Note: the rest of the rename logic depends on this
+ * phase also populating all the filespecs for any
+ * entry that isn't matched up with an exact rename.
+ */
+static int free_file_table(void *ptr)
+{
+	struct file_similarity *p = ptr;
+	do {
+		struct file_similarity *entry = p;
+		p = p->next;
+
+		/* Stupid special case, see note above! */
+		diff_populate_filespec(entry->filespec, 0);
+		free(entry);
+	} while (p);
+	return 0;
+}
+
+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+	unsigned int hash;
+	if (!filespec->sha1_valid) {
+		if (diff_populate_filespec(filespec, 0))
+			return 0;
+		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
+	}
+	memcpy(&hash, filespec->sha1, sizeof(hash));
+	return hash;
+}
+
+static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
+{
+	void **pos;
+	unsigned int hash;
+	struct file_similarity *entry = xmalloc(sizeof(*entry));
+
+	entry->src_dst = src_dst;
+	entry->index = index;
+	entry->filespec = filespec;
+	entry->next = NULL;
+
+	hash = hash_filespec(filespec);
+	pos = insert_hash(hash, entry, table);
+
+	/* We already had an entry there? */
+	if (pos) {
+		entry->next = *pos;
+		*pos = entry;
+	}
+}
+
 /*
  * Find exact renames first.
  *
  * The first round matches up the up-to-date entries,
  * and then during the second round we try to match
  * cache-dirty entries as well.
- *
- * Note: the rest of the rename logic depends on this
- * phase also populating all the filespecs for any
- * entry that isn't matched up with an exact rename,
- * see "is_exact_match()".
  */
 static int find_exact_renames(void)
 {
-	int rename_count = 0;
-	int contents_too;
-
-	for (contents_too = 0; contents_too < 2; contents_too++) {
-		int i;
-
-		for (i = 0; i < rename_dst_nr; i++) {
-			struct diff_filespec *two = rename_dst[i].two;
-			int j;
-
-			if (rename_dst[i].pair)
-				continue; /* dealt with an earlier round */
-			for (j = 0; j < rename_src_nr; j++) {
-				int k;
-				struct diff_filespec *one = rename_src[j].one;
-				if (!is_exact_match(one, two, contents_too))
-					continue;
-
-				/* see if there is a basename match, too */
-				for (k = j; k < rename_src_nr; k++) {
-					one = rename_src[k].one;
-					if (basename_same(one, two) &&
-						is_exact_match(one, two,
-							contents_too)) {
-						j = k;
-						break;
-					}
-				}
-
-				record_rename_pair(i, j, (int)MAX_SCORE);
-				rename_count++;
-				break; /* we are done with this entry */
-			}
-		}
-	}
-	return rename_count;
+	int i;
+	struct hash_table file_table;
+
+	init_hash(&file_table);
+	for (i = 0; i < rename_src_nr; i++)
+		insert_file_table(&file_table, -1, i, rename_src[i].one);
+
+	for (i = 0; i < rename_dst_nr; i++)
+		insert_file_table(&file_table, 1, i, rename_dst[i].two);
+
+	/* Find the renames */
+	i = for_each_hash(&file_table, find_same_files);
+
+	/* .. and free the hash data structures */
+	for_each_hash(&file_table, free_file_table);
+	free_hash(&file_table);
+
+	return i;
 }
 
 void diffcore_rename(struct diff_options *options)
@@ -474,16 +549,7 @@ void diffcore_rename(struct diff_options *options)
 					pair_to_free = p;
 			}
 			else {
-				for (j = 0; j < rename_dst_nr; j++) {
-					if (!rename_dst[j].pair)
-						continue;
-					if (strcmp(rename_dst[j].pair->
-						   one->path,
-						   p->one->path))
-						continue;
-					break;
-				}
-				if (j < rename_dst_nr)
+				if (p->one->used)
 					/* this path remains */
 					pair_to_free = p;
 			}
diff --git a/diffcore.h b/diffcore.h
index eb618b1..a58d345 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -40,6 +40,7 @@ struct diff_filespec {
 	unsigned should_munmap : 1; /* data should be munmap()'ed */
 	unsigned checked_attr : 1;
 	unsigned is_binary : 1; /* data should be considered "binary" */
+	unsigned used : 1;	/* this pathspec was used for copy/delete */
 };
 
 extern struct diff_filespec *alloc_filespec(const char *);
diff --git a/hash.c b/hash.c
new file mode 100644
index 0000000..7b492d4
--- /dev/null
+++ b/hash.c
@@ -0,0 +1,110 @@
+/*
+ * Some generic hashing helpers.
+ */
+#include "cache.h"
+#include "hash.h"
+
+/*
+ * Look up a hash entry in the hash table. Return the pointer to
+ * the existing entry, or the empty slot if none existed. The caller
+ * can then look at the (*ptr) to see whether it existed or not.
+ */
+static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash_table *table)
+{
+	unsigned int size = table->size, nr = hash % size;
+	struct hash_table_entry *array = table->array;
+
+	while (array[nr].ptr) {
+		if (array[nr].hash == hash)
+			break;
+		nr++;
+		if (nr >= size)
+			nr = 0;
+	}
+	return array + nr;
+}
+
+
+/*
+ * Insert a new hash entry pointer into the table.
+ *
+ * If that hash entry already existed, return the pointer to
+ * the existing entry (and the caller can create a list of the
+ * pointers or do anything else). If it didn't exist, return
+ * NULL (and the caller knows the pointer has been inserted).
+ */
+static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
+{
+	struct hash_table_entry *entry = lookup_hash_entry(hash, table);
+
+	if (!entry->ptr) {
+		entry->ptr = ptr;
+		entry->hash = hash;
+		table->nr++;
+		return NULL;
+	}
+	return &entry->ptr;
+}
+
+static void grow_hash_table(struct hash_table *table)
+{
+	unsigned int i;
+	unsigned int old_size = table->size, new_size;
+	struct hash_table_entry *old_array = table->array, *new_array;
+
+	new_size = alloc_nr(old_size);
+	new_array = xcalloc(sizeof(struct hash_table_entry), new_size);
+	table->size = new_size;
+	table->array = new_array;
+	table->nr = 0;
+	for (i = 0; i < old_size; i++) {
+		unsigned int hash = old_array[i].hash;
+		void *ptr = old_array[i].ptr;
+		if (ptr)
+			insert_hash_entry(hash, ptr, table);
+	}
+	free(old_array);
+}
+
+void *lookup_hash(unsigned int hash, struct hash_table *table)
+{
+	if (!table->array)
+		return NULL;
+	return &lookup_hash_entry(hash, table)->ptr;
+}
+
+void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
+{
+	unsigned int nr = table->nr;
+	if (nr >= table->size/2)
+		grow_hash_table(table);
+	return insert_hash_entry(hash, ptr, table);
+}
+
+int for_each_hash(struct hash_table *table, int (*fn)(void *))
+{
+	int sum = 0;
+	unsigned int i;
+	unsigned int size = table->size;
+	struct hash_table_entry *array = table->array;
+
+	for (i = 0; i < size; i++) {
+		void *ptr = array->ptr;
+		array++;
+		if (ptr) {
+			int val = fn(ptr);
+			if (val < 0)
+				return val;
+			sum += val;
+		}
+	}
+	return sum;
+}
+
+void free_hash(struct hash_table *table)
+{
+	free(table->array);
+	table->array = NULL;
+	table->size = 0;
+	table->nr = 0;
+}
diff --git a/hash.h b/hash.h
new file mode 100644
index 0000000..5056c9a
--- /dev/null
+++ b/hash.h
@@ -0,0 +1,43 @@
+#ifndef HASH_H
+#define HASH_H
+
+/*
+ * These are some simple generic hash table helper functions.
+ * Not necessarily suitable for all users, but good for things
+ * where you want to just keep track of a list of things, and
+ * have a good hash to use on them.
+ *
+ * It keeps the hash table at roughly 50-75% free, so the memory
+ * cost of the hash table itself is roughly
+ *
+ *	3 * 2*sizeof(void *) * nr_of_objects
+ *
+ * bytes. 
+ *
+ * FIXME: on 64-bit architectures, we waste memory. It would be
+ * good to have just 32-bit pointers, requiring a special allocator
+ * for hashed entries or something.
+ */
+struct hash_table_entry {
+	unsigned int hash;
+	void *ptr;
+};
+
+struct hash_table {
+	unsigned int size, nr;
+	struct hash_table_entry *array;
+};
+
+extern void *lookup_hash(unsigned int hash, struct hash_table *table);
+extern void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table);
+extern int for_each_hash(struct hash_table *table, int (*fn)(void *));
+extern void free_hash(struct hash_table *table);
+
+static inline void init_hash(struct hash_table *table)
+{
+	table->size = 0;
+	table->nr = 0;
+	table->array = NULL;
+}
+
+#endif

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
  2007-10-21 23:59 [PATCH, take 1] Linear-time/space rename logic (exact renames only) Linus Torvalds
@ 2007-10-22  0:31 ` David Symonds
  2007-10-22  5:41   ` Linus Torvalds
  2007-10-22  6:47 ` Jeff King
  2007-10-22  7:07 ` Sven Verdoolaege
  2 siblings, 1 reply; 13+ messages in thread
From: David Symonds @ 2007-10-22  0:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup,
	Jeff King

On 22/10/2007, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> diff --git a/Makefile b/Makefile
> index 8db4dbe..17c31ba 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -291,7 +291,7 @@ LIB_H = \
>         run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
>         tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
>         utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
> -       mailmap.h remote.h
> +       mailmap.h remote.h hash.o

I assume that should be "hash.h", not "hash.o"?


Dave.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
  2007-10-22  0:31 ` David Symonds
@ 2007-10-22  5:41   ` Linus Torvalds
  0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2007-10-22  5:41 UTC (permalink / raw)
  To: David Symonds
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup,
	Jeff King



On Mon, 22 Oct 2007, David Symonds wrote:
> > @@ -291,7 +291,7 @@ LIB_H = \
> >         run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
> >         tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
> >         utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
> > -       mailmap.h remote.h
> > +       mailmap.h remote.h hash.o
> 
> I assume that should be "hash.h", not "hash.o"?

Oops.

Yes.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
  2007-10-21 23:59 [PATCH, take 1] Linear-time/space rename logic (exact renames only) Linus Torvalds
  2007-10-22  0:31 ` David Symonds
@ 2007-10-22  6:47 ` Jeff King
  2007-10-22  7:07 ` Sven Verdoolaege
  2 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2007-10-22  6:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup

On Sun, Oct 21, 2007 at 04:59:03PM -0700, Linus Torvalds wrote:

> What it does is to rather than iterate over all sources and destinations 
> and checking if they are identical (which is O(src*dst)), it hashes each 
> of the sources and destinations into a hash table, using the SHA1 hash of 
> the contents as the hash. That's O(n+m). It then walks the hash table 
> (which is also O(m+n) in size), and only pairs up files for comparison 
> that hashed to the same spot.
> 
> Doing this for more than just the exact same contents would be basically 
> the same thing, except it starts hashing up fingerprints of the contents 
> and linking up file pairs that get linked up by those fingerprints. More 
> involved, but not impossible.

Hrm. For the inexact case, it seems like you should be able to do
something like this:

 1. put the content fingerprint hashes in a hash table
 2. create a single src * dst table of scores
 3. walk the hash table; for each bucket in which there are collisions,
    find every pair in that bucket, and add to the similarity score in
    the big score table
 4. for each src in the similarity table, find the maximum dest

Step 1 is clearly O(n+m). Step 2 allocates O(n*m) memory, but we only
need a single integer in each slot. The outer walk in step 3 is
O(fingerprints), which is, O(content_size * (n+m)). The inner loop can
actually be worst case O(n*m), but is more likely to be a handful of
pairs (assuming that for any fingerprint chunk, it is only going to
exist in a few files; you would get worst case if you were comparing a
bunch of files with identical contents). And step 4 is actually going to
end up being O(n*m), since you have to find the best match for each.

So there is actually still O(n*m) behavior, but I wonder if we are
tightening the O(n*m) loop enough that we will see improvement.

Hrm. I wonder if we could keep a "best" pointer for each src file, and
when we update the score for a given src/dest pair, check for a new
best. That would add more to step 3, but make step 4 O(n).

And of course the n*m memory is going to bottleneck. I think for 1000 *
1000, it should be fine, but if you want to try 100,000 * 100,000, you
are going to need tens of gigabytes. And it's going to be very sparse.
So perhaps a hash table with keys of src+dst mapped to scores. And then
we could sort the whole table afterwards, and just run through it
linearly, which helps step 4.

>  - it looks ok, and I've tested it some, but this needs more people 
>    looking at it.

Overall it looks like a sane approach. My comments are below.

>  - in fact, the big optimization isn't the actual hash table, but the 
>    independent and much simpler "diff_filespec->used" optimization for a 
>    deleted filename that was used for a rename/copy.

Do you have separate timings? The "diff_filespec->used" optimization
appears to be cutting out an O(n*m) strcmp. If it is most of the
optimization, I wonder if the complexity of the hash change is worth it
(although I find the new code easier to read, so maybe it is worth it on
those grounds alone).

> +struct file_similarity {
> +	int src_dst, index;

It took me a while to figure out all of the meanings of src_dst; maybe a
comment is in order?

> +static int find_identical_files(struct file_similarity *src,
> +				struct file_similarity *dst)

Your function naming is a bit confusing. You have find_identical_files,
find_same_files, and find_exact_renames, all of which do the same thing,
but for different levels of input. Perhaps the names should reflect how
they are different?

> +static int find_identical_files(struct file_similarity *src,
> +				struct file_similarity *dst)
> +{
> +	int renames = 0;
> +	do {
> +		struct diff_filespec *one = src->filespec;
> +		struct file_similarity *p, *best;
> +		int i = 100;
> +
> +		best = NULL;
> +		for (p = dst; p; p = p->next) {
> +			struct diff_filespec *two = p->filespec;
> +
> +			/* Already picked as a destination? */
> +			if (!p->src_dst)
> +				continue;
> +			/* False hash collission? */
> +			if (hashcmp(one->sha1, two->sha1))
> +				continue;
> +			best = p;
> +			if (basename_same(one, two))
> +				break;
> +
> +			/* Too many identical alternatives? Pick one */
> +			if (!--i)
> +				break;
> +		}
> +		if (best) {
> +			best->src_dst = 0;
> +			record_rename_pair(best->index, src->index, MAX_SCORE);
> +			renames++;
> +		}
> +	} while ((src = src->next) != NULL);
> +	return renames;
> +}

The "too many identical alternatives" sanity check is interesting. It
can produce suboptimal results if, e.g., the 101st entry has the same
basename. I suppose it is necessary to prevent a worst case "oops, all
of these files are identical" O(n*m) behavior. But generally I would
favor "always correct, pathological cases slow" to "always fast,
pathological cases incorrect". But maybe the basename heuristic isn't
worth considering "correct" in this case (and honestly, I doubt anyone
will hit the 100 limit anyway).

> +/*
> + * Note: the rest of the rename logic depends on this
> + * phase also populating all the filespecs for any
> + * entry that isn't matched up with an exact rename.
> + */
> +static int free_file_table(void *ptr)
> +{
> +	struct file_similarity *p = ptr;
> +	do {
> +		struct file_similarity *entry = p;
> +		p = p->next;
> +
> +		/* Stupid special case, see note above! */
> +		diff_populate_filespec(entry->filespec, 0);
> +		free(entry);
> +	} while (p);
> +	return 0;
> +}

diff_populate_filespec knows whether or not it needs to do any work. I
wonder if those parts of the rename process that need it should just
call it unconditionally. It seems like a fragile dependency. Besides
which...

> +static unsigned int hash_filespec(struct diff_filespec *filespec)
> +{
> +	unsigned int hash;
> +	if (!filespec->sha1_valid) {
> +		if (diff_populate_filespec(filespec, 0))
> +			return 0;
> +		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
> +	}
> +	memcpy(&hash, filespec->sha1, sizeof(hash));
> +	return hash;
> +}

...if everything has been through this hash function, why do we need to
populate again upon freeing?

> +static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash_table *table)
> +{
> +	unsigned int size = table->size, nr = hash % size;
> +	struct hash_table_entry *array = table->array;
> +
> +	while (array[nr].ptr) {
> +		if (array[nr].hash == hash)
> +			break;
> +		nr++;
> +		if (nr >= size)
> +			nr = 0;
> +	}
> +	return array + nr;
> +}

Rather than having buckets where collisions extend in a list within the
bucket, it looks like you are just overflowing into the next bucket.
It seems like you could end up with pretty bad "runs" of full buckets.
E.g., bucket 1 has a collision, so it bleeds into bucket 2.  Now when we
place something in bucket 2, it has to skip past bucket 1's overflow.
And a collision in bucket 2 means we have to skip the overflow for
buckets 1 _and_ 2. And so on, until we can resync by finding enough free
buckets to accept all of our collisions. So it depends on keeping a lot
of holes in the hash structure, which I see that you do, but I wonder
what the optimal value is.

I assume you chose this method to reduce memory fragmentation, which
makes sense. And maybe there is a simple answer that "50-75% free gives
good results over evenly distributed hashes." Though perhaps we should
also consider clustered hashes (especially if we want to put the
fingerprint hashes directly in here).

> +/*
> + * Insert a new hash entry pointer into the table.
> + *
> + * If that hash entry already existed, return the pointer to
> + * the existing entry (and the caller can create a list of the
> + * pointers or do anything else). If it didn't exist, return
> + * NULL (and the caller knows the pointer has been inserted).
> + */
> +static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)

This calling convention seems a bit clunky. Since all callers have to
check for hash collision anyway, why not just unconditionally return the
pointer to the data ptr (which will itself be NULL if this is the first
entry)? Then you can drop the conditional, and:

  pos = insert_hash(hash, entry, table);
  if (pos) {
    entry->next = *pos;
    *pos = entry;
  }

can simply become:

  pos = insert_hash(hash, table);
  entry->next = *pos;
  *pos = entry;

> +void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
> +{
> +	unsigned int nr = table->nr;
> +	if (nr >= table->size/2)
> +		grow_hash_table(table);
> +	return insert_hash_entry(hash, ptr, table);
> +}

Style nitpick: the local "nr" is rather pointless.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
  2007-10-21 23:59 [PATCH, take 1] Linear-time/space rename logic (exact renames only) Linus Torvalds
  2007-10-22  0:31 ` David Symonds
  2007-10-22  6:47 ` Jeff King
@ 2007-10-22  7:07 ` Sven Verdoolaege
  2007-10-22  7:21   ` Jeff King
  2007-10-22 16:33   ` Linus Torvalds
  2 siblings, 2 replies; 13+ messages in thread
From: Sven Verdoolaege @ 2007-10-22  7:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup,
	Jeff King

On Sun, Oct 21, 2007 at 04:59:03PM -0700, Linus Torvalds wrote:
> +static int find_same_files(void *ptr)
> +{
> +	struct file_similarity *p = ptr;
> +	struct file_similarity *src = NULL, *dst = NULL;
> +
> +	/* Split the hash list up into sources and destinations */
> +	do {
> +		struct file_similarity *entry = p;
> +		p = p->next;
> +		if (entry->src_dst < 0) {
> +			entry->next = src;
> +			src = entry;
> +		} else {
> +			entry->next = dst;
> +			dst = entry;
> +		}
> +	} while (p);

Aren't you truncating the ptr list after the first entry here?
(While you still need the whole list in free_file_table.)

skimo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
  2007-10-22  7:07 ` Sven Verdoolaege
@ 2007-10-22  7:21   ` Jeff King
  2007-10-22 16:33   ` Linus Torvalds
  1 sibling, 0 replies; 13+ messages in thread
From: Jeff King @ 2007-10-22  7:21 UTC (permalink / raw)
  To: skimo
  Cc: Linus Torvalds, Git Mailing List, Junio C Hamano, Shawn O. Pearce,
	David Kastrup

On Mon, Oct 22, 2007 at 09:07:50AM +0200, Sven Verdoolaege wrote:

> Aren't you truncating the ptr list after the first entry here?
> (While you still need the whole list in free_file_table.)

Yes, good eyes. And because we actually reverse the list, it's not as
simple as just sticking the two broken up pieces together again; the
original head must end up as the head of the list after they are glued
together again, but it is actually the tail of one of the lists.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 1] Linear-time/space rename logic (exact renames only)
  2007-10-22  7:07 ` Sven Verdoolaege
  2007-10-22  7:21   ` Jeff King
@ 2007-10-22 16:33   ` Linus Torvalds
  2007-10-22 17:29     ` [PATCH, take 2] " Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2007-10-22 16:33 UTC (permalink / raw)
  To: skimo
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup,
	Jeff King



On Mon, 22 Oct 2007, Sven Verdoolaege wrote:
> 
> Aren't you truncating the ptr list after the first entry here?
> (While you still need the whole list in free_file_table.)

Yes. I didn't have that bug in the first version (I didn't do a separate 
"free_file_table()" at all - I just free'd the src/dst pointer lists at 
the end of that function). But I wanted to "clean up" the thing. Duh.

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH, take 2] Linear-time/space rename logic (exact renames only)
  2007-10-22 16:33   ` Linus Torvalds
@ 2007-10-22 17:29     ` Linus Torvalds
  2007-10-22 19:31       ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2007-10-22 17:29 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano, Shawn O. Pearce
  Cc: David Kastrup, Jeff King, Sven Verdoolaege


Ok, as some people notices, there were a few bugs in the previous patch. I 
didn't free the hashes correctly (stupid) and the Makefile had "hash.o" 
instead of "hash.h".

But more importantly, my "testing" had been totally broken, because I had 
forgotten to actually move the rename limiting code to after the exact 
rename phase, so when I tested the 100,000 file rename, almost none of the 
new code triggered, so my performance testing was totally bogus.

When fixing that, I noticed that while my new exact rename detection was 
essentially instantaneous, there were some O(n*m) effects in the generic 
diff code from the extremely stupid way we handled the "was it a copy or a 
rename" issue.

To fix that, I just made the "was the path used by a rename" be a counter 
instead of a single "it was used"

With that in place, I could actually time the rename detection of 100,000 
files in my big-rename test repository. This is what it looks like when 
you rename a hundred thousand files:

	[torvalds@woody big-rename]$ time ~/git/git show -C | wc -l
	400006

	real    0m2.675s
	user    0m2.148s
	sys     0m0.540s

(each renamed file is 4 lines: they looks like

	diff --git a/really-big-dir/file-1-1-1-1-1 b/moved-big-dir/file-1-1-1-1-1
	similarity index 100%
	rename from really-big-dir/file-1-1-1-1-1
	rename to moved-big-dir/file-1-1-1-1-1

and the extra six lines is from a one-liner commit message and all the 
commit information and spacing).

So two seconds to do that exact rename detection.

Now, I can't really compare it to the "before" stage, because that is just 
so horrible. The rename detection limit triggers, so you never even get 
any renames, but if I were to move the limit check later (like I do in 
this patch) without my other fixes, it would take hours. Trying to do ten 
billion (100k x 100k) SHA1 and pathname compares simply isn't going to 
work.

But I *can* compare it to the old code *with* the rename limiting, which 
still wastes all the time on the whole "copy usage" crap. So here are the 
numbers on that repo without the patch:

	[torvalds@woody big-rename]$ time git show -C | wc -l
	1400006
	
	real    0m12.383s
	user    0m12.365s
	sys     0m0.160s

That's right: we used to take 12 seconds and not even do renames (now you 
see fourteen lines per file moved: seven lines each for the delete and the 
create of a one-liner file, and the same extra six lines of commit 
information).

So it not only makes the rename detection possible in the first place, it 
removes some stupid code that made it take a long time even when it 
failed!

Now, I'd still be careful with this patch, and I'd really like people to 
double-check all my logic, but I think it's worthy of some 'pu' love.

		Linus
---

 Makefile          |    4 +-
 diff.c            |   24 +----
 diffcore-rename.c |  275 ++++++++++++++++++++++++++++++----------------------
 diffcore.h        |    2 +-
 hash.c            |  110 +++++++++++++++++++++
 hash.h            |   43 ++++++++
 6 files changed, 321 insertions(+), 137 deletions(-)

diff --git a/Makefile b/Makefile
index 8db4dbe..b1ca186 100644
--- a/Makefile
+++ b/Makefile
@@ -291,7 +291,7 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h
+	mailmap.h remote.h hash.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -301,7 +301,7 @@ DIFF_OBJS = \
 LIB_OBJS = \
 	blob.o commit.o connect.o csum-file.o cache-tree.o base85.o \
 	date.o diff-delta.o entry.o exec_cmd.o ident.o \
-	interpolate.o \
+	interpolate.o hash.o \
 	lockfile.o \
 	patch-ids.o \
 	object.o pack-check.o pack-write.o patch-delta.o path.o pkt-line.o \
diff --git a/diff.c b/diff.c
index 6648e01..e892030 100644
--- a/diff.c
+++ b/diff.c
@@ -2586,9 +2586,9 @@ void diff_debug_filepair(const struct diff_filepair *p, int i)
 {
 	diff_debug_filespec(p->one, i, "one");
 	diff_debug_filespec(p->two, i, "two");
-	fprintf(stderr, "score %d, status %c stays %d broken %d\n",
+	fprintf(stderr, "score %d, status %c rename_used %d broken %d\n",
 		p->score, p->status ? p->status : '?',
-		p->source_stays, p->broken_pair);
+		p->rename_used, p->broken_pair);
 }
 
 void diff_debug_queue(const char *msg, struct diff_queue_struct *q)
@@ -2606,8 +2606,8 @@ void diff_debug_queue(const char *msg, struct diff_queue_struct *q)
 
 static void diff_resolve_rename_copy(void)
 {
-	int i, j;
-	struct diff_filepair *p, *pp;
+	int i;
+	struct diff_filepair *p;
 	struct diff_queue_struct *q = &diff_queued_diff;
 
 	diff_debug_queue("resolve-rename-copy", q);
@@ -2629,27 +2629,15 @@ static void diff_resolve_rename_copy(void)
 		 * either in-place edit or rename/copy edit.
 		 */
 		else if (DIFF_PAIR_RENAME(p)) {
-			if (p->source_stays) {
-				p->status = DIFF_STATUS_COPIED;
-				continue;
-			}
 			/* See if there is some other filepair that
 			 * copies from the same source as us.  If so
 			 * we are a copy.  Otherwise we are either a
 			 * copy if the path stays, or a rename if it
 			 * does not, but we already handled "stays" case.
 			 */
-			for (j = i + 1; j < q->nr; j++) {
-				pp = q->queue[j];
-				if (strcmp(pp->one->path, p->one->path))
-					continue; /* not us */
-				if (!DIFF_PAIR_RENAME(pp))
-					continue; /* not a rename/copy */
-				/* pp is a rename/copy from the same source */
+			if (--p->one->rename_used > 0)
 				p->status = DIFF_STATUS_COPIED;
-				break;
-			}
-			if (!p->status)
+			else
 				p->status = DIFF_STATUS_RENAMED;
 		}
 		else if (hashcmp(p->one->sha1, p->two->sha1) ||
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 2077a9b..cc105db 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,6 +4,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "hash.h"
 
 /* Table of rename/copy destinations */
 
@@ -55,12 +56,10 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
 static struct diff_rename_src {
 	struct diff_filespec *one;
 	unsigned short score; /* to remember the break score */
-	unsigned src_path_left : 1;
 } *rename_src;
 static int rename_src_nr, rename_src_alloc;
 
 static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
-						   int src_path_left,
 						   unsigned short score)
 {
 	int first, last;
@@ -92,33 +91,9 @@ static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 			(rename_src_nr - first - 1) * sizeof(*rename_src));
 	rename_src[first].one = one;
 	rename_src[first].score = score;
-	rename_src[first].src_path_left = src_path_left;
 	return &(rename_src[first]);
 }
 
-static int is_exact_match(struct diff_filespec *src,
-			  struct diff_filespec *dst,
-			  int contents_too)
-{
-	if (src->sha1_valid && dst->sha1_valid &&
-	    !hashcmp(src->sha1, dst->sha1))
-		return 1;
-	if (!contents_too)
-		return 0;
-	if (diff_populate_filespec(src, 1) || diff_populate_filespec(dst, 1))
-		return 0;
-	if (src->size != dst->size)
-		return 0;
-	if (src->sha1_valid && dst->sha1_valid)
-	    return !hashcmp(src->sha1, dst->sha1);
-	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
-		return 0;
-	if (src->size == dst->size &&
-	    !memcmp(src->data, dst->data, src->size))
-		return 1;
-	return 0;
-}
-
 static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
 {
 	int src_len = strlen(src->path), dst_len = strlen(dst->path);
@@ -216,6 +191,7 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 		die("internal error: dst already matched.");
 
 	src = rename_src[src_index].one;
+	src->rename_used++;
 	one = alloc_filespec(src->path);
 	fill_filespec(one, src->sha1, src->mode);
 
@@ -229,7 +205,6 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 		dp->score = rename_src[src_index].score;
 	else
 		dp->score = score;
-	dp->source_stays = rename_src[src_index].src_path_left;
 	rename_dst[dst_index].pair = dp;
 }
 
@@ -247,19 +222,127 @@ static int score_compare(const void *a_, const void *b_)
 	return b->score - a->score;
 }
 
-static int compute_stays(struct diff_queue_struct *q,
-			 struct diff_filespec *one)
+struct file_similarity {
+	int src_dst, index;
+	struct diff_filespec *filespec;
+	struct file_similarity *next;
+};
+
+static int find_identical_files(struct file_similarity *src,
+				struct file_similarity *dst)
 {
-	int i;
-	for (i = 0; i < q->nr; i++) {
-		struct diff_filepair *p = q->queue[i];
-		if (strcmp(one->path, p->two->path))
-			continue;
-		if (DIFF_PAIR_RENAME(p)) {
-			return 0; /* something else is renamed into this */
+	int renames = 0;
+	do {
+		struct diff_filespec *one = src->filespec;
+		struct file_similarity *p, *best;
+		int i = 100;
+
+		best = NULL;
+		for (p = dst; p; p = p->next) {
+			struct diff_filespec *two = p->filespec;
+
+			/* Already picked as a destination? */
+			if (!p->src_dst)
+				continue;
+			/* False hash collission? */
+			if (hashcmp(one->sha1, two->sha1))
+				continue;
+			best = p;
+			if (basename_same(one, two))
+				break;
+
+			/* Too many identical alternatives? Pick one */
+			if (!--i)
+				break;
 		}
+		if (best) {
+			best->src_dst = 0;
+			record_rename_pair(best->index, src->index, MAX_SCORE);
+			renames++;
+		}
+	} while ((src = src->next) != NULL);
+	return renames;
+}
+
+/*
+ * Note: the rest of the rename logic depends on this
+ * phase also populating all the filespecs for any
+ * entry that isn't matched up with an exact rename.
+ */
+static void free_similarity_list(struct file_similarity *p)
+{
+	while (p) {
+		struct file_similarity *entry = p;
+		p = p->next;
+
+		/* Stupid special case, see note above! */
+		diff_populate_filespec(entry->filespec, 0);
+		free(entry);
+	}
+}
+
+static int find_same_files(void *ptr)
+{
+	int ret;
+	struct file_similarity *p = ptr;
+	struct file_similarity *src = NULL, *dst = NULL;
+
+	/* Split the hash list up into sources and destinations */
+	do {
+		struct file_similarity *entry = p;
+		p = p->next;
+		if (entry->src_dst < 0) {
+			entry->next = src;
+			src = entry;
+		} else {
+			entry->next = dst;
+			dst = entry;
+		}
+	} while (p);
+
+	/*
+	 * If we have both sources *and* destinations, see if
+	 * we can match them up
+	 */
+	ret = (src && dst) ? find_identical_files(src, dst) : 0;
+
+	/* Free the hashes and return the number of renames found */
+	free_similarity_list(src);
+	free_similarity_list(dst);
+	return ret;
+}
+
+static unsigned int hash_filespec(struct diff_filespec *filespec)
+{
+	unsigned int hash;
+	if (!filespec->sha1_valid) {
+		if (diff_populate_filespec(filespec, 0))
+			return 0;
+		hash_sha1_file(filespec->data, filespec->size, "blob", filespec->sha1);
+	}
+	memcpy(&hash, filespec->sha1, sizeof(hash));
+	return hash;
+}
+
+static void insert_file_table(struct hash_table *table, int src_dst, int index, struct diff_filespec *filespec)
+{
+	void **pos;
+	unsigned int hash;
+	struct file_similarity *entry = xmalloc(sizeof(*entry));
+
+	entry->src_dst = src_dst;
+	entry->index = index;
+	entry->filespec = filespec;
+	entry->next = NULL;
+
+	hash = hash_filespec(filespec);
+	pos = insert_hash(hash, entry, table);
+
+	/* We already had an entry there? */
+	if (pos) {
+		entry->next = *pos;
+		*pos = entry;
 	}
-	return 1;
 }
 
 /*
@@ -268,50 +351,26 @@ static int compute_stays(struct diff_queue_struct *q,
  * The first round matches up the up-to-date entries,
  * and then during the second round we try to match
  * cache-dirty entries as well.
- *
- * Note: the rest of the rename logic depends on this
- * phase also populating all the filespecs for any
- * entry that isn't matched up with an exact rename,
- * see "is_exact_match()".
  */
 static int find_exact_renames(void)
 {
-	int rename_count = 0;
-	int contents_too;
-
-	for (contents_too = 0; contents_too < 2; contents_too++) {
-		int i;
-
-		for (i = 0; i < rename_dst_nr; i++) {
-			struct diff_filespec *two = rename_dst[i].two;
-			int j;
-
-			if (rename_dst[i].pair)
-				continue; /* dealt with an earlier round */
-			for (j = 0; j < rename_src_nr; j++) {
-				int k;
-				struct diff_filespec *one = rename_src[j].one;
-				if (!is_exact_match(one, two, contents_too))
-					continue;
-
-				/* see if there is a basename match, too */
-				for (k = j; k < rename_src_nr; k++) {
-					one = rename_src[k].one;
-					if (basename_same(one, two) &&
-						is_exact_match(one, two,
-							contents_too)) {
-						j = k;
-						break;
-					}
-				}
-
-				record_rename_pair(i, j, (int)MAX_SCORE);
-				rename_count++;
-				break; /* we are done with this entry */
-			}
-		}
-	}
-	return rename_count;
+	int i;
+	struct hash_table file_table;
+
+	init_hash(&file_table);
+	for (i = 0; i < rename_src_nr; i++)
+		insert_file_table(&file_table, -1, i, rename_src[i].one);
+
+	for (i = 0; i < rename_dst_nr; i++)
+		insert_file_table(&file_table, 1, i, rename_dst[i].two);
+
+	/* Find the renames */
+	i = for_each_hash(&file_table, find_same_files);
+
+	/* .. and free the hash data structure */
+	free_hash(&file_table);
+
+	return i;
 }
 
 void diffcore_rename(struct diff_options *options)
@@ -340,20 +399,36 @@ void diffcore_rename(struct diff_options *options)
 				locate_rename_dst(p->two, 1);
 		}
 		else if (!DIFF_FILE_VALID(p->two)) {
-			/* If the source is a broken "delete", and
+			/*
+			 * If the source is a broken "delete", and
 			 * they did not really want to get broken,
 			 * that means the source actually stays.
+			 * So we increment the "rename_used" score
+			 * by one, to indicate ourselves as a user
 			 */
-			int stays = (p->broken_pair && !p->score);
-			register_rename_src(p->one, stays, p->score);
+			if (p->broken_pair && !p->score)
+				p->one->rename_used++;
+			register_rename_src(p->one, p->score);
+		}
+		else if (detect_rename == DIFF_DETECT_COPY) {
+			/*
+			 * Increment the "rename_used" score by
+			 * one, to indicate ourselves as a user.
+			 */
+			p->one->rename_used++;
+			register_rename_src(p->one, p->score);
 		}
-		else if (detect_rename == DIFF_DETECT_COPY)
-			register_rename_src(p->one, 1, p->score);
 	}
 	if (rename_dst_nr == 0 || rename_src_nr == 0)
 		goto cleanup; /* nothing to do */
 
 	/*
+	 * We really want to cull the candidates list early
+	 * with cheap tests in order to avoid doing deltas.
+	 */
+	rename_count = find_exact_renames();
+
+	/*
 	 * This basically does a test for the rename matrix not
 	 * growing larger than a "rename_limit" square matrix, ie:
 	 *
@@ -369,12 +444,6 @@ void diffcore_rename(struct diff_options *options)
 	if (rename_dst_nr * rename_src_nr > rename_limit * rename_limit)
 		goto cleanup;
 
-	/*
-	 * We really want to cull the candidates list early
-	 * with cheap tests in order to avoid doing deltas.
-	 */
-	rename_count = find_exact_renames();
-
 	/* Have we run out the created file pool?  If so we can avoid
 	 * doing the delta matrix altogether.
 	 */
@@ -474,16 +543,7 @@ void diffcore_rename(struct diff_options *options)
 					pair_to_free = p;
 			}
 			else {
-				for (j = 0; j < rename_dst_nr; j++) {
-					if (!rename_dst[j].pair)
-						continue;
-					if (strcmp(rename_dst[j].pair->
-						   one->path,
-						   p->one->path))
-						continue;
-					break;
-				}
-				if (j < rename_dst_nr)
+				if (p->one->rename_used)
 					/* this path remains */
 					pair_to_free = p;
 			}
@@ -509,23 +569,6 @@ void diffcore_rename(struct diff_options *options)
 	*q = outq;
 	diff_debug_queue("done collapsing", q);
 
-	/* We need to see which rename source really stays here;
-	 * earlier we only checked if the path is left in the result,
-	 * but even if a path remains in the result, if that is coming
-	 * from copying something else on top of it, then the original
-	 * source is lost and does not stay.
-	 */
-	for (i = 0; i < q->nr; i++) {
-		struct diff_filepair *p = q->queue[i];
-		if (DIFF_PAIR_RENAME(p) && p->source_stays) {
-			/* If one appears as the target of a rename-copy,
-			 * then mark p->source_stays = 0; otherwise
-			 * leave it as is.
-			 */
-			p->source_stays = compute_stays(q, p->one);
-		}
-	}
-
 	for (i = 0; i < rename_dst_nr; i++) {
 		diff_free_filespec_data(rename_dst[i].two);
 		free(rename_dst[i].two);
diff --git a/diffcore.h b/diffcore.h
index eb618b1..ceda932 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -30,6 +30,7 @@ struct diff_filespec {
 	const char *funcname_pattern_ident;
 	unsigned long size;
 	int xfrm_flags;		 /* for use by the xfrm */
+	int rename_used;         /* Count of rename users */
 	unsigned short mode;	 /* file mode */
 	unsigned sha1_valid : 1; /* if true, use sha1 and trust mode;
 				  * if false, use the name and read from
@@ -56,7 +57,6 @@ struct diff_filepair {
 	struct diff_filespec *two;
 	unsigned short int score;
 	char status; /* M C R N D U (see Documentation/diff-format.txt) */
-	unsigned source_stays : 1; /* all of R/C are copies */
 	unsigned broken_pair : 1;
 	unsigned renamed_pair : 1;
 	unsigned is_unmerged : 1;
diff --git a/hash.c b/hash.c
new file mode 100644
index 0000000..7b492d4
--- /dev/null
+++ b/hash.c
@@ -0,0 +1,110 @@
+/*
+ * Some generic hashing helpers.
+ */
+#include "cache.h"
+#include "hash.h"
+
+/*
+ * Look up a hash entry in the hash table. Return the pointer to
+ * the existing entry, or the empty slot if none existed. The caller
+ * can then look at the (*ptr) to see whether it existed or not.
+ */
+static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash_table *table)
+{
+	unsigned int size = table->size, nr = hash % size;
+	struct hash_table_entry *array = table->array;
+
+	while (array[nr].ptr) {
+		if (array[nr].hash == hash)
+			break;
+		nr++;
+		if (nr >= size)
+			nr = 0;
+	}
+	return array + nr;
+}
+
+
+/*
+ * Insert a new hash entry pointer into the table.
+ *
+ * If that hash entry already existed, return the pointer to
+ * the existing entry (and the caller can create a list of the
+ * pointers or do anything else). If it didn't exist, return
+ * NULL (and the caller knows the pointer has been inserted).
+ */
+static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
+{
+	struct hash_table_entry *entry = lookup_hash_entry(hash, table);
+
+	if (!entry->ptr) {
+		entry->ptr = ptr;
+		entry->hash = hash;
+		table->nr++;
+		return NULL;
+	}
+	return &entry->ptr;
+}
+
+static void grow_hash_table(struct hash_table *table)
+{
+	unsigned int i;
+	unsigned int old_size = table->size, new_size;
+	struct hash_table_entry *old_array = table->array, *new_array;
+
+	new_size = alloc_nr(old_size);
+	new_array = xcalloc(sizeof(struct hash_table_entry), new_size);
+	table->size = new_size;
+	table->array = new_array;
+	table->nr = 0;
+	for (i = 0; i < old_size; i++) {
+		unsigned int hash = old_array[i].hash;
+		void *ptr = old_array[i].ptr;
+		if (ptr)
+			insert_hash_entry(hash, ptr, table);
+	}
+	free(old_array);
+}
+
+void *lookup_hash(unsigned int hash, struct hash_table *table)
+{
+	if (!table->array)
+		return NULL;
+	return &lookup_hash_entry(hash, table)->ptr;
+}
+
+void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
+{
+	unsigned int nr = table->nr;
+	if (nr >= table->size/2)
+		grow_hash_table(table);
+	return insert_hash_entry(hash, ptr, table);
+}
+
+int for_each_hash(struct hash_table *table, int (*fn)(void *))
+{
+	int sum = 0;
+	unsigned int i;
+	unsigned int size = table->size;
+	struct hash_table_entry *array = table->array;
+
+	for (i = 0; i < size; i++) {
+		void *ptr = array->ptr;
+		array++;
+		if (ptr) {
+			int val = fn(ptr);
+			if (val < 0)
+				return val;
+			sum += val;
+		}
+	}
+	return sum;
+}
+
+void free_hash(struct hash_table *table)
+{
+	free(table->array);
+	table->array = NULL;
+	table->size = 0;
+	table->nr = 0;
+}
diff --git a/hash.h b/hash.h
new file mode 100644
index 0000000..5056c9a
--- /dev/null
+++ b/hash.h
@@ -0,0 +1,43 @@
+#ifndef HASH_H
+#define HASH_H
+
+/*
+ * These are some simple generic hash table helper functions.
+ * Not necessarily suitable for all users, but good for things
+ * where you want to just keep track of a list of things, and
+ * have a good hash to use on them.
+ *
+ * It keeps the hash table at roughly 50-75% free, so the memory
+ * cost of the hash table itself is roughly
+ *
+ *	3 * 2*sizeof(void *) * nr_of_objects
+ *
+ * bytes. 
+ *
+ * FIXME: on 64-bit architectures, we waste memory. It would be
+ * good to have just 32-bit pointers, requiring a special allocator
+ * for hashed entries or something.
+ */
+struct hash_table_entry {
+	unsigned int hash;
+	void *ptr;
+};
+
+struct hash_table {
+	unsigned int size, nr;
+	struct hash_table_entry *array;
+};
+
+extern void *lookup_hash(unsigned int hash, struct hash_table *table);
+extern void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table);
+extern int for_each_hash(struct hash_table *table, int (*fn)(void *));
+extern void free_hash(struct hash_table *table);
+
+static inline void init_hash(struct hash_table *table)
+{
+	table->size = 0;
+	table->nr = 0;
+	table->array = NULL;
+}
+
+#endif

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 2] Linear-time/space rename logic (exact renames only)
  2007-10-22 17:29     ` [PATCH, take 2] " Linus Torvalds
@ 2007-10-22 19:31       ` Linus Torvalds
  2007-10-22 19:44         ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2007-10-22 19:31 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano, Shawn O. Pearce
  Cc: David Kastrup, Jeff King, Sven Verdoolaege



On Mon, 22 Oct 2007, Linus Torvalds wrote:
> 
> Ok, as some people notices, there were a few bugs in the previous patch. I 
> didn't free the hashes correctly (stupid) and the Makefile had "hash.o" 
> instead of "hash.h".

Ok, there were still more bugs, and before you get too involved with this 
last patch (not that I've seen any comments yet), apply this appended 
patch to actually fix things a bit more first!

Yes, I'm a moron. I hadn't even bothered to run the test-suite on it, and 
that showed several silly problems.

One of the problems was that since the rename detection copied the 
diffspecs around, the "rename_used" count couldn't work right, because 
things got copied around and the count stayed with one diffspec, but not 
the other..

In the kernel, we have a rule that says that any data structure that isn't 
ref-counted is basically a bug, and that was true here too. Instead of 
copying and splitting the diffspecs, just refcount them and keep track of 
how many users there are.

While the above bug was a somewhat subtle issue from me trying to be 
clever in avoiding the O(n*m) file copy/rename reuse issue, there were a 
few issues that were me just being totally braindead: the exact rename 
detection had lost the code that took file modes into account, so it would 
generate "renames" from regular files to symlinks, that the generic diff 
core layer would just split up again.

And even more stupidly, I had matched up the src/dst things when finding 
the rename, which just complicated things (added a totally unnecessary 
need to keep track of a destination being used more than once) and also 
broke the basename matching comparison. Duh.

So here's an incremental patch on top of the previous failed try. And if 
somebody is confused (and that might be me) and cannot get things to 
apply, just holler and I'll send the whole thing again. I might even try 
to clean up the series a bit and do it in stages.

This patch shouldn't change any performance behaviour (well, it might 
speed things up a bit to not allocate those diffspec structures, but it's 
unlikely that is even measurable). It just fixes stuff.

I'm sure there's more to come..

		Linus

---
 diff.c            |   17 ++++++++++++-----
 diffcore-rename.c |   40 ++++++++++++++++++++++------------------
 diffcore.h        |    2 ++
 3 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/diff.c b/diff.c
index e892030..2e74cb3 100644
--- a/diff.c
+++ b/diff.c
@@ -1440,9 +1440,18 @@ struct diff_filespec *alloc_filespec(const char *path)
 	memset(spec, 0, sizeof(*spec));
 	spec->path = (char *)(spec + 1);
 	memcpy(spec->path, path, namelen+1);
+	spec->count = 1;
 	return spec;
 }
 
+void free_filespec(struct diff_filespec *spec)
+{
+	if (!--spec->count) {
+		diff_free_filespec_data(spec);
+		free(spec);
+	}
+}
+
 void fill_filespec(struct diff_filespec *spec, const unsigned char *sha1,
 		   unsigned short mode)
 {
@@ -2431,10 +2440,8 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *queue,
 
 void diff_free_filepair(struct diff_filepair *p)
 {
-	diff_free_filespec_data(p->one);
-	diff_free_filespec_data(p->two);
-	free(p->one);
-	free(p->two);
+	free_filespec(p->one);
+	free_filespec(p->two);
 	free(p);
 }
 
@@ -2588,7 +2595,7 @@ void diff_debug_filepair(const struct diff_filepair *p, int i)
 	diff_debug_filespec(p->two, i, "two");
 	fprintf(stderr, "score %d, status %c rename_used %d broken %d\n",
 		p->score, p->status ? p->status : '?',
-		p->rename_used, p->broken_pair);
+		p->one->rename_used, p->broken_pair);
 }
 
 void diff_debug_queue(const char *msg, struct diff_queue_struct *q)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index cc105db..3946932 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -184,7 +184,7 @@ static int estimate_similarity(struct diff_filespec *src,
 
 static void record_rename_pair(int dst_index, int src_index, int score)
 {
-	struct diff_filespec *one, *two, *src, *dst;
+	struct diff_filespec *src, *dst;
 	struct diff_filepair *dp;
 
 	if (rename_dst[dst_index].pair)
@@ -192,14 +192,12 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 
 	src = rename_src[src_index].one;
 	src->rename_used++;
-	one = alloc_filespec(src->path);
-	fill_filespec(one, src->sha1, src->mode);
+	src->count++;
 
 	dst = rename_dst[dst_index].two;
-	two = alloc_filespec(dst->path);
-	fill_filespec(two, dst->sha1, dst->mode);
+	dst->count++;
 
-	dp = diff_queue(NULL, one, two);
+	dp = diff_queue(NULL, src, dst);
 	dp->renamed_pair = 1;
 	if (!strcmp(src->path, dst->path))
 		dp->score = rename_src[src_index].score;
@@ -232,21 +230,30 @@ static int find_identical_files(struct file_similarity *src,
 				struct file_similarity *dst)
 {
 	int renames = 0;
+
+	/*
+	 * Walk over all the destinations ...
+	 */
 	do {
-		struct diff_filespec *one = src->filespec;
+		struct diff_filespec *one = dst->filespec;
 		struct file_similarity *p, *best;
 		int i = 100;
 
+		/*
+		 * .. to find the best source match
+		 */
 		best = NULL;
-		for (p = dst; p; p = p->next) {
+		for (p = src; p; p = p->next) {
 			struct diff_filespec *two = p->filespec;
 
-			/* Already picked as a destination? */
-			if (!p->src_dst)
-				continue;
 			/* False hash collission? */
 			if (hashcmp(one->sha1, two->sha1))
 				continue;
+			/* Non-regular files? If so, the modes must match! */
+			if (!S_ISREG(one->mode) || !S_ISREG(two->mode)) {
+				if (one->mode != two->mode)
+					continue;
+			}
 			best = p;
 			if (basename_same(one, two))
 				break;
@@ -256,11 +263,10 @@ static int find_identical_files(struct file_similarity *src,
 				break;
 		}
 		if (best) {
-			best->src_dst = 0;
-			record_rename_pair(best->index, src->index, MAX_SCORE);
+			record_rename_pair(dst->index, best->index, MAX_SCORE);
 			renames++;
 		}
-	} while ((src = src->next) != NULL);
+	} while ((dst = dst->next) != NULL);
 	return renames;
 }
 
@@ -569,10 +575,8 @@ void diffcore_rename(struct diff_options *options)
 	*q = outq;
 	diff_debug_queue("done collapsing", q);
 
-	for (i = 0; i < rename_dst_nr; i++) {
-		diff_free_filespec_data(rename_dst[i].two);
-		free(rename_dst[i].two);
-	}
+	for (i = 0; i < rename_dst_nr; i++)
+		free_filespec(rename_dst[i].two);
 
 	free(rename_dst);
 	rename_dst = NULL;
diff --git a/diffcore.h b/diffcore.h
index ceda932..cc96c20 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -29,6 +29,7 @@ struct diff_filespec {
 	void *cnt_data;
 	const char *funcname_pattern_ident;
 	unsigned long size;
+	int count;               /* Reference count */
 	int xfrm_flags;		 /* for use by the xfrm */
 	int rename_used;         /* Count of rename users */
 	unsigned short mode;	 /* file mode */
@@ -44,6 +45,7 @@ struct diff_filespec {
 };
 
 extern struct diff_filespec *alloc_filespec(const char *);
+extern void free_filespec(struct diff_filespec *);
 extern void fill_filespec(struct diff_filespec *, const unsigned char *,
 			  unsigned short);
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 2] Linear-time/space rename logic (exact renames only)
  2007-10-22 19:31       ` Linus Torvalds
@ 2007-10-22 19:44         ` Linus Torvalds
  2007-10-22 21:17           ` Alex Riesen
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2007-10-22 19:44 UTC (permalink / raw)
  To: Git Mailing List, Junio C Hamano, Shawn O. Pearce
  Cc: David Kastrup, Jeff King, Sven Verdoolaege



On Mon, 22 Oct 2007, Linus Torvalds wrote:
> 
> I'm sure there's more to come..

One more detail.. The updated comment explains the issue: if we broke a 
file apart, and rename detection joined it back together, the result is 
neither a rename nor a copy, it's a regular modification (and all 
remaining renames will be copies of the original, so don't bother 
decrementing the "rename_used" count).

		Linus

---
 diff.c |   18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/diff.c b/diff.c
index 2e74cb3..e839f59 100644
--- a/diff.c
+++ b/diff.c
@@ -2636,13 +2636,19 @@ static void diff_resolve_rename_copy(void)
 		 * either in-place edit or rename/copy edit.
 		 */
 		else if (DIFF_PAIR_RENAME(p)) {
-			/* See if there is some other filepair that
-			 * copies from the same source as us.  If so
-			 * we are a copy.  Otherwise we are either a
-			 * copy if the path stays, or a rename if it
-			 * does not, but we already handled "stays" case.
+			/*
+			 * A rename might have re-connected a broken
+			 * pair up, causing the pathnames to be the
+			 * same again. If so, that's not a rename at
+			 * all, just a modification..
+			 *
+			 * Otherwise, see if this source was used for
+			 * multiple renames, in which case we decrement
+			 * the count, and call it a copy.
 			 */
-			if (--p->one->rename_used > 0)
+			if (!strcmp(p->one->path, p->two->path))
+				p->status = DIFF_STATUS_MODIFIED;
+			else if (--p->one->rename_used > 0)
 				p->status = DIFF_STATUS_COPIED;
 			else
 				p->status = DIFF_STATUS_RENAMED;

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 2] Linear-time/space rename logic (exact renames only)
  2007-10-22 19:44         ` Linus Torvalds
@ 2007-10-22 21:17           ` Alex Riesen
  2007-10-22 21:37             ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Riesen @ 2007-10-22 21:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup,
	Jeff King, Sven Verdoolaege

Linus Torvalds, Mon, Oct 22, 2007 21:44:48 +0200:
> 
> 
> On Mon, 22 Oct 2007, Linus Torvalds wrote:
> > 
> > I'm sure there's more to come..
> 
> One more detail.. The updated comment explains the issue: if we broke a 
> file apart, and rename detection joined it back together, the result is 
> neither a rename nor a copy, it's a regular modification (and all 
> remaining renames will be copies of the original, so don't bother 
> decrementing the "rename_used" count).

It breaks t3402-rebase-merge.sh

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 2] Linear-time/space rename logic (exact renames only)
  2007-10-22 21:17           ` Alex Riesen
@ 2007-10-22 21:37             ` Linus Torvalds
  2007-10-22 22:54               ` Alex Riesen
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2007-10-22 21:37 UTC (permalink / raw)
  To: Alex Riesen
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup,
	Jeff King, Sven Verdoolaege



On Mon, 22 Oct 2007, Alex Riesen wrote:
> 
> It breaks t3402-rebase-merge.sh

Hmm. Works for me here. But I will check if there is some incomplete 
dependency in the makefile or something...

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH, take 2] Linear-time/space rename logic (exact renames only)
  2007-10-22 21:37             ` Linus Torvalds
@ 2007-10-22 22:54               ` Alex Riesen
  0 siblings, 0 replies; 13+ messages in thread
From: Alex Riesen @ 2007-10-22 22:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Git Mailing List, Junio C Hamano, Shawn O. Pearce, David Kastrup,
	Jeff King, Sven Verdoolaege

Linus Torvalds, Mon, Oct 22, 2007 23:37:55 +0200:
> On Mon, 22 Oct 2007, Alex Riesen wrote:
> > 
> > It breaks t3402-rebase-merge.sh
> 
> Hmm. Works for me here. But I will check if there is some incomplete 
> dependency in the makefile or something...
> 

Ok, resolved. Turns out I completely missed the last two patches

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-10-22 22:55 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-21 23:59 [PATCH, take 1] Linear-time/space rename logic (exact renames only) Linus Torvalds
2007-10-22  0:31 ` David Symonds
2007-10-22  5:41   ` Linus Torvalds
2007-10-22  6:47 ` Jeff King
2007-10-22  7:07 ` Sven Verdoolaege
2007-10-22  7:21   ` Jeff King
2007-10-22 16:33   ` Linus Torvalds
2007-10-22 17:29     ` [PATCH, take 2] " Linus Torvalds
2007-10-22 19:31       ` Linus Torvalds
2007-10-22 19:44         ` Linus Torvalds
2007-10-22 21:17           ` Alex Riesen
2007-10-22 21:37             ` Linus Torvalds
2007-10-22 22:54               ` Alex Riesen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).