[PATCH/RFC 0/3] faster inexact rename handling

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH/RFC 0/3] faster inexact rename handling
@ 2007-10-30  4:21 Jeff King
  2007-10-30  4:23 ` [PATCH/RFC 1/3] change hash table calling conventions Jeff King
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Jeff King @ 2007-10-30  4:21 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano

This is my first stab at faster rename handling based on Andy's code.
The patches are on top of next (to get Linus' recent work on exact
renames). Most of the interesting stuff is in 2/3.

  1/3: extension of hash interface
  2/3: similarity detection code
  3/3: integrate similarity detection into diffcore-rename

The implementation is pretty basic, so I think there is room for
code optimization (50% of the time is spent in hash lookups, so we might
be able to micro-optimize that) as well as algorithmic improvements (like the
sampling Andy mentioned).

With these patches, I can get my monster binary diff down from about 2
minutes to 17 seconds. And comparing all of linux-2.4 to all of
linux-2.6 (similar to Andy's previous demo) takes about 10 seconds.

There are a few downsides:
  - the current implementation tends to give lower similarity values
    compared to the old code (see discussion in 2/3), but this should be
    tweakable
  - on large datasets, it's more memory hungry than the old code because
    the hash grows very large. This can be helped by bumping up the
    binary chunk size (actually, the 17 seconds quoted above is using
    256-byte chunks rather than 64-byte -- with 64-byte chunks, it's
    more like 24 seconds) as well as sampling.
  - no improvement on smaller datasets. Running "git-whatchanged -M
    --raw -l0" on the linux-2.6 repo takes about the same time with the
    old and new code (presumably the algorithmic savings of the new code
    are lost in a higher constant factor, so when n is small, it is a
    wash).

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH/RFC 1/3] change hash table calling conventions
  2007-10-30  4:21 [PATCH/RFC 0/3] faster inexact rename handling Jeff King
@ 2007-10-30  4:23 ` Jeff King
  2007-10-30  4:24 ` [PATCH/RFC 2/3] introduce generic similarity library Jeff King
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2007-10-30  4:23 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano

The recent hash table code has two limitations in its
calling conventions that are addressed here:

  1. Insertion either inserts a value, returning NULL, or
     returns a pointer to the previously inserted value.
     This is fine if you are making a linked list of the
     colliding values, but is awkward if your goal is to:
       a. modify the value if it already exists
       b. otherwise, allocate and insert the value
     With the old convention, you must either allocate the
     structure (and throw it away in case a), or perform two
     lookups (one to see if the entry exists, then another
     to perform the insertion).

     Instead, insertion no longer inserts any value; it
     simply returns a pointer to where you _can_ insert a
     value (which will be non-NULL if a value already
     existed).

  2. for_each_hash now allows a void 'data' pointer to be
     passed to the callback function along with each hash
     entry.

Signed-off-by: Jeff King <peff@peff.net>
---

The insertion feels kind of hack-ish. Suggestions are welcome.

 diffcore-rename.c |   14 +++++---------
 hash.c            |   18 +++++++++---------
 hash.h            |    5 +++--
 3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index f9ebea5..ba038af 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -288,7 +288,7 @@ static void free_similarity_list(struct file_similarity *p)
 	}
 }
 
-static int find_same_files(void *ptr)
+static int find_same_files(void *ptr, void *data)
 {
 	int ret;
 	struct file_similarity *p = ptr;
@@ -343,13 +343,9 @@ static void insert_file_table(struct hash_table *table, int src_dst, int index,
 	entry->next = NULL;
 
 	hash = hash_filespec(filespec);
-	pos = insert_hash(hash, entry, table);
-
-	/* We already had an entry there? */
-	if (pos) {
-		entry->next = *pos;
-		*pos = entry;
-	}
+	pos = insert_hash(hash, table);
+	entry->next = *pos;
+	*pos = entry;
 }
 
 /*
@@ -372,7 +368,7 @@ static int find_exact_renames(void)
 		insert_file_table(&file_table, 1, i, rename_dst[i].two);
 
 	/* Find the renames */
-	i = for_each_hash(&file_table, find_same_files);
+	i = for_each_hash(&file_table, find_same_files, NULL);
 
 	/* .. and free the hash data structure */
 	free_hash(&file_table);
diff --git a/hash.c b/hash.c
index 7b492d4..dc5d20e 100644
--- a/hash.c
+++ b/hash.c
@@ -33,15 +33,13 @@ static struct hash_table_entry *lookup_hash_entry(unsigned int hash, struct hash
  * pointers or do anything else). If it didn't exist, return
  * NULL (and the caller knows the pointer has been inserted).
  */
-static void **insert_hash_entry(unsigned int hash, void *ptr, struct hash_table *table)
+static void **insert_hash_entry(unsigned int hash, struct hash_table *table)
 {
 	struct hash_table_entry *entry = lookup_hash_entry(hash, table);
 
 	if (!entry->ptr) {
-		entry->ptr = ptr;
 		entry->hash = hash;
 		table->nr++;
-		return NULL;
 	}
 	return &entry->ptr;
 }
@@ -60,8 +58,10 @@ static void grow_hash_table(struct hash_table *table)
 	for (i = 0; i < old_size; i++) {
 		unsigned int hash = old_array[i].hash;
 		void *ptr = old_array[i].ptr;
-		if (ptr)
-			insert_hash_entry(hash, ptr, table);
+		if (ptr) {
+			void **pos = insert_hash_entry(hash, table);
+			*pos = ptr;
+		}
 	}
 	free(old_array);
 }
@@ -73,15 +73,15 @@ void *lookup_hash(unsigned int hash, struct hash_table *table)
 	return &lookup_hash_entry(hash, table)->ptr;
 }
 
-void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table)
+void **insert_hash(unsigned int hash, struct hash_table *table)
 {
 	unsigned int nr = table->nr;
 	if (nr >= table->size/2)
 		grow_hash_table(table);
-	return insert_hash_entry(hash, ptr, table);
+	return insert_hash_entry(hash, table);
 }
 
-int for_each_hash(struct hash_table *table, int (*fn)(void *))
+int for_each_hash(struct hash_table *table, int (*fn)(void *, void *), void *d)
 {
 	int sum = 0;
 	unsigned int i;
@@ -92,7 +92,7 @@ int for_each_hash(struct hash_table *table, int (*fn)(void *))
 		void *ptr = array->ptr;
 		array++;
 		if (ptr) {
-			int val = fn(ptr);
+			int val = fn(ptr, d);
 			if (val < 0)
 				return val;
 			sum += val;
diff --git a/hash.h b/hash.h
index a8b0fbb..f9a50da 100644
--- a/hash.h
+++ b/hash.h
@@ -29,8 +29,9 @@ struct hash_table {
 };
 
 extern void *lookup_hash(unsigned int hash, struct hash_table *table);
-extern void **insert_hash(unsigned int hash, void *ptr, struct hash_table *table);
-extern int for_each_hash(struct hash_table *table, int (*fn)(void *));
+extern void **insert_hash(unsigned int hash, struct hash_table *table);
+extern int for_each_hash(struct hash_table *table, int (*fn)(void *, void*),
+		void *data);
 extern void free_hash(struct hash_table *table);
 
 static inline void init_hash(struct hash_table *table)
-- 
1.5.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH/RFC 2/3] introduce generic similarity library
  2007-10-30  4:21 [PATCH/RFC 0/3] faster inexact rename handling Jeff King
  2007-10-30  4:23 ` [PATCH/RFC 1/3] change hash table calling conventions Jeff King
@ 2007-10-30  4:24 ` Jeff King
  2007-10-30  4:24 ` [PATCH/RFC 3/3] handle renames using similarity engine Jeff King
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2007-10-30  4:24 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano

This library attempts to find similarities among items
efficiently. It treats items as opaque pointers; callers are
responsible for providing an item along with its contents.

The algorithm is roughly:

  1. for each item, create a set of fingerprints for each
     chunk of the item (where each chunk is either delimited
     by a newline or is 64 characters, whichever is smaller
     -- this is the same fingerprint code from
     diffcore-delta.c). A hash stores a mapping of
     fingerprints to items, with each fingerprint having at
     most one 'source' item and one 'dest' item.

  2. for each fingerprint with a source and dest item,
     find the entry with key (source, dest) in a hash table
     and increment its value by the value of the fingerprint

  3. for each (source, dest) pair that had non-zero
     similarity, report the pair to the caller

The program test-similarity is a simple demonstration of the
code. It takes two list of files on stdin, with each file
separated by newlines and the two lists separated by a blank
line. It prints the similarity score of each non-zero pair
on stdout.

There are a few "interesting" design decisions, which should
probably be tweaked:

  - we store only one source and dest for each fingerprint
    item. We need to bound this list so that we don't get
    O(n^2) behavior for common fingerprints. This means that
    some files won't get "credit" for common fingerprints,
    which is probably OK, since those fingerprints are
    probably uninteresting. We could bound at some small
    number greater than one; one was chosen for speed and
    simplicity of implementation. We also don't store any
    overflow bit, so commonly used fingerprints will get
    assigned to whatever file is found with them last.

  - the similarity engine hands back absolute,
    non-normalized scores. That means that bigger items will
    have bigger scores, and the caller is responsible for
    normalizing. This also means that the engine cannot
    adjust normalization to account for chunks which are
    thrown out by the bounding mentioned above (i.e., if a
    file is 80 bytes, 64 of which are ignored as "common",
    then it can at most have 20% similarity (16/80) with
    another file. This means our normalized rename values
    (i.e., percentage similarity) will be lower than other
    algorithms.

Signed-off-by: Jeff King <peff@peff.net>
---
 .gitignore        |    1 +
 Makefile          |    4 +-
 similarity.c      |  152 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 similarity.h      |   24 ++++++++
 test-similarity.c |   54 +++++++++++++++++++
 5 files changed, 233 insertions(+), 2 deletions(-)
 create mode 100644 similarity.c
 create mode 100644 similarity.h
 create mode 100644 test-similarity.c

diff --git a/.gitignore b/.gitignore
index 62afef2..8ce6026 100644
--- a/.gitignore
+++ b/.gitignore
@@ -155,6 +155,7 @@ test-dump-cache-tree
 test-genrandom
 test-match-trees
 test-sha1
+test-similarity
 common-cmds.h
 *.tar.gz
 *.dsc
diff --git a/Makefile b/Makefile
index 2e6fd8f..0568d0d 100644
--- a/Makefile
+++ b/Makefile
@@ -300,7 +300,7 @@ DIFF_OBJS = \
 LIB_OBJS = \
 	blob.o commit.o connect.o csum-file.o cache-tree.o base85.o \
 	date.o diff-delta.o entry.o exec_cmd.o ident.o \
-	interpolate.o hash.o \
+	interpolate.o hash.o similarity.o \
 	lockfile.o \
 	patch-ids.o \
 	object.o pack-check.o pack-write.o patch-delta.o path.o pkt-line.o \
@@ -975,7 +975,7 @@ endif
 
 ### Testing rules
 
-TEST_PROGRAMS = test-chmtime$X test-genrandom$X test-date$X test-delta$X test-sha1$X test-match-trees$X test-absolute-path$X
+TEST_PROGRAMS = test-chmtime$X test-genrandom$X test-date$X test-delta$X test-sha1$X test-match-trees$X test-absolute-path$X test-similarity$X
 
 all:: $(TEST_PROGRAMS)
 
diff --git a/similarity.c b/similarity.c
new file mode 100644
index 0000000..0bd135c
--- /dev/null
+++ b/similarity.c
@@ -0,0 +1,152 @@
+#include "cache.h"
+#include "similarity.h"
+
+struct fingerprint_entry {
+	void *src;
+	void *dst;
+	unsigned weight;
+};
+
+struct score_entry {
+	void *src;
+	void *dst;
+	unsigned score;
+	struct score_entry *next;
+};
+
+void similarity_init(struct similarity *s)
+{
+	init_hash(&s->fingerprints);
+	init_hash(&s->scores);
+}
+
+static int free_one_fingerprint(void *ve, void *data)
+{
+	struct fingerprint_entry *e = ve;
+	free(e);
+	return 0;
+}
+
+static int free_one_score(void *ve, void *data)
+{
+	struct score_entry *e = ve;
+	while (e) {
+		struct score_entry *next = e->next;
+		free(e);
+		e = next;
+	}
+	return 0;
+}
+
+void similarity_free(struct similarity *s)
+{
+	for_each_hash(&s->fingerprints, free_one_fingerprint, NULL);
+	free_hash(&s->fingerprints);
+	for_each_hash(&s->scores, free_one_score, NULL);
+	free_hash(&s->scores);
+}
+
+static void add_fingerprint(struct hash_table *h, unsigned int fp,
+		int type, void *data, unsigned weight)
+{
+	void **pos;
+	struct fingerprint_entry *e;
+
+	pos = insert_hash(fp, h);
+	if (!*pos) {
+		e = xmalloc(sizeof(*e));
+		e->weight = weight;
+		e->src = e->dst = NULL;
+		*pos = e;
+	}
+	else
+		e = *pos;
+
+	if (type == SIMILARITY_SOURCE)
+		e->src = data;
+	else
+		e->dst = data;
+}
+
+void similarity_add(struct similarity *sim, int type, void *data,
+		const char *buf, unsigned long sz, int is_text)
+{
+	int n;
+	unsigned int accum1, accum2, hashval;
+
+	n = 0;
+	accum1 = accum2 = 0;
+	while (sz) {
+		unsigned int c = *buf++;
+		unsigned int old_1 = accum1;
+		sz--;
+
+		/* Ignore CR in CRLF sequence if text */
+		if (!is_text && c == '\r' && sz && *buf == '\n')
+			continue;
+
+		accum1 = (accum1 << 7) ^ (accum2 >> 25);
+		accum2 = (accum2 << 7) ^ (old_1 >> 25);
+		accum1 += c;
+		if (++n < 64 && c != '\n')
+			continue;
+		hashval = accum1 + accum2 * 0x61;
+		add_fingerprint(&sim->fingerprints, hashval, type, data, n);
+		n = 0;
+		accum1 = accum2 = 0;
+	}
+}
+
+static unsigned hash_void_pair(void *a, void *b)
+{
+	return (unsigned)a + (unsigned)b;
+}
+
+static int score_one_entry(void *vfp, void *vsim)
+{
+	struct fingerprint_entry *fp = vfp;
+	struct similarity *sim = vsim;
+	struct score_entry *score;
+	void **pos;
+
+	if (!fp->src || !fp->dst)
+		return 0;
+
+	pos = insert_hash(hash_void_pair(fp->src, fp->dst), &sim->scores);
+	for (score = *pos; score; score = score->next) {
+		if (score->src == fp->src && score->dst == fp->dst) {
+			score->score += fp->weight;
+			return 0;
+		}
+	}
+
+	score = xmalloc(sizeof(*score));
+	score->src = fp->src;
+	score->dst = fp->dst;
+	score->score = fp->weight;
+	score->next = *pos;
+	*pos = score;
+
+	return 0;
+}
+
+void similarity_score(struct similarity *s)
+{
+	for_each_hash(&s->fingerprints, score_one_entry, s);
+}
+
+static int report_one_score(void *ve, void *vdata)
+{
+	struct score_entry *e;
+	void (*fn)(void *, void *, unsigned) = vdata;
+
+	for (e = ve; e; e = e->next)
+		fn(e->src, e->dst, e->score);
+	return 1;
+}
+
+void similarity_report(struct similarity *s,
+		void (*fn)(void *, void *, unsigned))
+{
+	for_each_hash(&s->scores, report_one_score, fn);
+}
diff --git a/similarity.h b/similarity.h
new file mode 100644
index 0000000..6409ab8
--- /dev/null
+++ b/similarity.h
@@ -0,0 +1,24 @@
+#ifndef SIMILARITY_H
+#define SIMILARITY_H
+
+#include "hash.h"
+
+#define SIMILARITY_SOURCE 0
+#define SIMILARITY_DEST   1
+
+struct similarity {
+	struct hash_table fingerprints;
+	struct hash_table scores;
+};
+
+void similarity_init(struct similarity *s);
+void similarity_free(struct similarity *s);
+
+void similarity_add(struct similarity *s, int type, void *data,
+		const char *buf, unsigned long sz, int is_text);
+
+void similarity_score(struct similarity *s);
+void similarity_report(struct similarity *s,
+		void (*fn)(void *, void *, unsigned));
+
+#endif /* SIMILARITY_H */
diff --git a/test-similarity.c b/test-similarity.c
new file mode 100644
index 0000000..5947420
--- /dev/null
+++ b/test-similarity.c
@@ -0,0 +1,54 @@
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+
+#include "git-compat-util.h"
+#include "similarity.h"
+#include "strbuf.h"
+#include "xdiff-interface.h"
+
+static void print_rename(void *vone, void *vtwo, unsigned score) {
+	const char *one = vone, *two = vtwo;
+	printf("%u %s -> %s\n", score, one, two);
+}
+
+static void add_file(struct similarity *sim, int type, const char *fn)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int len;
+
+	len = strbuf_read_file(&sb, fn, 0);
+	if (len < 0)
+		die("unable to read %s: %s\n", fn, strerror(errno));
+
+	similarity_add(sim, type, strdup(fn),
+			sb.buf, sb.len, buffer_is_binary(sb.buf, sb.len));
+
+	strbuf_release(&sb);
+}
+
+int main(int argc, char **argv)
+{
+	struct similarity sim;
+	struct strbuf line;
+
+	similarity_init(&sim);
+	strbuf_init(&line, 0);
+
+	while (strbuf_getline(&line, stdin, '\n') != EOF) {
+		if (!line.len)
+			break;
+		add_file(&sim, SIMILARITY_SOURCE, line.buf);
+	}
+	while (strbuf_getline(&line, stdin, '\n') != EOF)
+		add_file(&sim, SIMILARITY_DEST, line.buf);
+
+	strbuf_release(&line);
+
+	similarity_score(&sim);
+	similarity_report(&sim, print_rename);
+
+	similarity_free(&sim);
+
+	return 0;
+}
-- 
1.5.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH/RFC 3/3] handle renames using similarity engine
  2007-10-30  4:21 [PATCH/RFC 0/3] faster inexact rename handling Jeff King
  2007-10-30  4:23 ` [PATCH/RFC 1/3] change hash table calling conventions Jeff King
  2007-10-30  4:24 ` [PATCH/RFC 2/3] introduce generic similarity library Jeff King
@ 2007-10-30  4:24 ` Jeff King
  2007-10-30  5:06 ` [PATCH/RFC 0/3] faster inexact rename handling Linus Torvalds
  2007-10-31  0:06 ` Andy C
  4 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2007-10-30  4:24 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano

This changes diffcore-rename to use the engine in
similarity.c rather than doing an O(m*n) loop around
diffcore_count_changes.

Signed-off-by: Jeff King <peff@peff.net>
---
 diffcore-rename.c |  215 +++++++++++++++++++----------------------------------
 1 files changed, 76 insertions(+), 139 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index ba038af..6c0d2d0 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -5,12 +5,21 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "hash.h"
+#include "similarity.h"
 
-/* Table of rename/copy destinations */
+/* Table of rename/copy src files */
+static struct diff_rename_src {
+	struct diff_filespec *one;
+	unsigned short score; /* to remember the break score */
+} *rename_src;
+static int rename_src_nr, rename_src_alloc;
 
+/* Table of rename/copy destinations */
 static struct diff_rename_dst {
 	struct diff_filespec *two;
 	struct diff_filepair *pair;
+	struct diff_rename_src *best_match;
+	unsigned score;
 } *rename_dst;
 static int rename_dst_nr, rename_dst_alloc;
 
@@ -49,16 +58,11 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two,
 	rename_dst[first].two = alloc_filespec(two->path);
 	fill_filespec(rename_dst[first].two, two->sha1, two->mode);
 	rename_dst[first].pair = NULL;
+	rename_dst[first].best_match = NULL;
+	rename_dst[first].score = 0;
 	return &(rename_dst[first]);
 }
 
-/* Table of rename/copy src files */
-static struct diff_rename_src {
-	struct diff_filespec *one;
-	unsigned short score; /* to remember the break score */
-} *rename_src;
-static int rename_src_nr, rename_src_alloc;
-
 static struct diff_rename_src *register_rename_src(struct diff_filespec *one,
 						   unsigned short score)
 {
@@ -109,88 +113,6 @@ static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
 		(!dst_len || dst->path[dst_len - 1] == '/');
 }
 
-struct diff_score {
-	int src; /* index in rename_src */
-	int dst; /* index in rename_dst */
-	int score;
-	int name_score;
-};
-
-static int estimate_similarity(struct diff_filespec *src,
-			       struct diff_filespec *dst,
-			       int minimum_score)
-{
-	/* src points at a file that existed in the original tree (or
-	 * optionally a file in the destination tree) and dst points
-	 * at a newly created file.  They may be quite similar, in which
-	 * case we want to say src is renamed to dst or src is copied into
-	 * dst, and then some edit has been applied to dst.
-	 *
-	 * Compare them and return how similar they are, representing
-	 * the score as an integer between 0 and MAX_SCORE.
-	 *
-	 * When there is an exact match, it is considered a better
-	 * match than anything else; the destination does not even
-	 * call into this function in that case.
-	 */
-	unsigned long max_size, delta_size, base_size, src_copied, literal_added;
-	unsigned long delta_limit;
-	int score;
-
-	/* We deal only with regular files.  Symlink renames are handled
-	 * only when they are exact matches --- in other words, no edits
-	 * after renaming.
-	 */
-	if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
-		return 0;
-
-	/*
-	 * Need to check that source and destination sizes are
-	 * filled in before comparing them.
-	 *
-	 * If we already have "cnt_data" filled in, we know it's
-	 * all good (avoid checking the size for zero, as that
-	 * is a possible size - we really should have a flag to
-	 * say whether the size is valid or not!)
-	 */
-	if (!src->cnt_data && diff_populate_filespec(src, 0))
-		return 0;
-	if (!dst->cnt_data && diff_populate_filespec(dst, 0))
-		return 0;
-
-	max_size = ((src->size > dst->size) ? src->size : dst->size);
-	base_size = ((src->size < dst->size) ? src->size : dst->size);
-	delta_size = max_size - base_size;
-
-	/* We would not consider edits that change the file size so
-	 * drastically.  delta_size must be smaller than
-	 * (MAX_SCORE-minimum_score)/MAX_SCORE * min(src->size, dst->size).
-	 *
-	 * Note that base_size == 0 case is handled here already
-	 * and the final score computation below would not have a
-	 * divide-by-zero issue.
-	 */
-	if (base_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
-		return 0;
-
-	delta_limit = (unsigned long)
-		(base_size * (MAX_SCORE-minimum_score) / MAX_SCORE);
-	if (diffcore_count_changes(src, dst,
-				   &src->cnt_data, &dst->cnt_data,
-				   delta_limit,
-				   &src_copied, &literal_added))
-		return 0;
-
-	/* How similar are they?
-	 * what percentage of material in dst are from source?
-	 */
-	if (!dst->size)
-		score = 0; /* should not happen */
-	else
-		score = (int)(src_copied * MAX_SCORE / max_size);
-	return score;
-}
-
 static void record_rename_pair(int dst_index, int src_index, int score)
 {
 	struct diff_filespec *src, *dst;
@@ -215,20 +137,6 @@ static void record_rename_pair(int dst_index, int src_index, int score)
 	rename_dst[dst_index].pair = dp;
 }
 
-/*
- * We sort the rename similarity matrix with the score, in descending
- * order (the most similar first).
- */
-static int score_compare(const void *a_, const void *b_)
-{
-	const struct diff_score *a = a_, *b = b_;
-
-	if (a->score == b->score)
-		return b->name_score - a->name_score;
-
-	return b->score - a->score;
-}
-
 struct file_similarity {
 	int src_dst, index;
 	struct diff_filespec *filespec;
@@ -376,6 +284,67 @@ static int find_exact_renames(void)
 	return i;
 }
 
+static void record_similarity(void *vsrc, void *vdst, unsigned score)
+{
+	struct diff_rename_src *src = vsrc;
+	struct diff_rename_dst *dst = vdst;
+	unsigned max_size = (src->one->size > dst->two->size) ?
+				src->one->size : dst->two->size;
+
+	score = (dst->two->size != 0) ? (score * MAX_SCORE / max_size) : 0;
+
+	/* Is there a match already that is better than we are? */
+	if (dst->best_match) {
+		if (score < dst->score)
+			return;
+		if (score == dst->score && !basename_same(src->one, dst->two))
+			return;
+	}
+
+	dst->best_match = src;
+	dst->score = score;
+}
+
+static void find_approximate_renames(int minimum_score)
+{
+	struct similarity sim;
+	int i;
+
+	similarity_init(&sim);
+
+	for (i = 0; i < rename_src_nr; i++) {
+		struct diff_rename_src *s = &rename_src[i];
+		diff_populate_filespec(s->one, 0);
+		similarity_add(&sim, SIMILARITY_SOURCE, s,
+				s->one->data, s->one->size,
+				diff_filespec_is_binary(s->one));
+		diff_free_filespec_data(s->one);
+	}
+
+	for (i = 0; i < rename_dst_nr; i++) {
+		struct diff_rename_dst *d = &rename_dst[i];
+		if (d->pair)
+			continue;
+		diff_populate_filespec(d->two, 0);
+		similarity_add(&sim, SIMILARITY_DEST, d,
+				d->two->data, d->two->size,
+				diff_filespec_is_binary(d->two));
+		diff_free_filespec_data(d->two);
+	}
+
+	similarity_score(&sim);
+	similarity_report(&sim, record_similarity);
+
+	for (i = 0 ; i < rename_dst_nr; i++) {
+		struct diff_rename_dst *d = &rename_dst[i];
+		if (d->pair)
+			continue;
+		if (d->score < minimum_score)
+			continue;
+		record_rename_pair(i, d->best_match - rename_src, d->score);
+	}
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -383,9 +352,8 @@ void diffcore_rename(struct diff_options *options)
 	int rename_limit = options->rename_limit;
 	struct diff_queue_struct *q = &diff_queued_diff;
 	struct diff_queue_struct outq;
-	struct diff_score *mx;
-	int i, j, rename_count;
-	int num_create, num_src, dst_cnt;
+	int num_create, num_src;
+	int i, rename_count;
 
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -462,38 +430,7 @@ void diffcore_rename(struct diff_options *options)
 	if (num_create * num_src > rename_limit * rename_limit)
 		goto cleanup;
 
-	mx = xmalloc(sizeof(*mx) * num_create * num_src);
-	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		int base = dst_cnt * num_src;
-		struct diff_filespec *two = rename_dst[i].two;
-		if (rename_dst[i].pair)
-			continue; /* dealt with exact match already. */
-		for (j = 0; j < rename_src_nr; j++) {
-			struct diff_filespec *one = rename_src[j].one;
-			struct diff_score *m = &mx[base+j];
-			m->src = j;
-			m->dst = i;
-			m->score = estimate_similarity(one, two,
-						       minimum_score);
-			m->name_score = basename_same(one, two);
-			diff_free_filespec_blob(one);
-		}
-		/* We do not need the text anymore */
-		diff_free_filespec_blob(two);
-		dst_cnt++;
-	}
-	/* cost matrix sorted by most to least similar pair */
-	qsort(mx, num_create * num_src, sizeof(*mx), score_compare);
-	for (i = 0; i < num_create * num_src; i++) {
-		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
-		if (dst->pair)
-			continue; /* already done, either exact or fuzzy. */
-		if (mx[i].score < minimum_score)
-			break; /* there is no more usable pair. */
-		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
-		rename_count++;
-	}
-	free(mx);
+	find_approximate_renames(minimum_score);
 
  cleanup:
 	/* At this point, we have found some renames and copies and they
-- 
1.5.3.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30  4:21 [PATCH/RFC 0/3] faster inexact rename handling Jeff King
                   ` (2 preceding siblings ...)
  2007-10-30  4:24 ` [PATCH/RFC 3/3] handle renames using similarity engine Jeff King
@ 2007-10-30  5:06 ` Linus Torvalds
  2007-10-30  8:29   ` Junio C Hamano
  2007-10-30 13:43   ` Jeff King
  2007-10-31  0:06 ` Andy C
  4 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2007-10-30  5:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Andy C, Junio C Hamano



On Tue, 30 Oct 2007, Jeff King wrote:
>
>   - no improvement on smaller datasets. Running "git-whatchanged -M
>     --raw -l0" on the linux-2.6 repo takes about the same time with the
>     old and new code (presumably the algorithmic savings of the new code
>     are lost in a higher constant factor, so when n is small, it is a
>     wash).

Have you compared the results? IOW, does it find the *same* renames?

I'm a bit worried about the fact that you just pick a single (arbitrary) 
src/dst per fingerprint. Yes, it should be limited, but that seems to be a 
bit too *extremely* limited. But if it gives the same results in practice, 
maybe nobody cares?

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30  5:06 ` [PATCH/RFC 0/3] faster inexact rename handling Linus Torvalds
@ 2007-10-30  8:29   ` Junio C Hamano
  2007-10-30 13:46     ` Jeff King
  2007-10-30 13:43   ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2007-10-30  8:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, git, Andy C

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 30 Oct 2007, Jeff King wrote:
>>
>>   - no improvement on smaller datasets. Running "git-whatchanged -M
>>     --raw -l0" on the linux-2.6 repo takes about the same time with the
>>     old and new code (presumably the algorithmic savings of the new code
>>     are lost in a higher constant factor, so when n is small, it is a
>>     wash).
>
> Have you compared the results? IOW, does it find the *same* renames?
>
> I'm a bit worried about the fact that you just pick a single (arbitrary) 
> src/dst per fingerprint. Yes, it should be limited, but that seems to be a 
> bit too *extremely* limited. But if it gives the same results in practice, 
> maybe nobody cares?

If it always gives the same results in practice, obviously
nobody can even notice.

However, merging this series to 'pu' breaks rebase-merge test
t3402 among other things.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30  5:06 ` [PATCH/RFC 0/3] faster inexact rename handling Linus Torvalds
  2007-10-30  8:29   ` Junio C Hamano
@ 2007-10-30 13:43   ` Jeff King
  2007-10-30 15:38     ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Jeff King @ 2007-10-30 13:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, Andy C, Junio C Hamano

On Mon, Oct 29, 2007 at 10:06:11PM -0700, Linus Torvalds wrote:

> Have you compared the results? IOW, does it find the *same* renames?

>From my limited testing, it generally finds the same pairs. However,
there are a number of renames that it _doesn't_ find, because they are
composed of "uninteresting" lines, dropping them below the minimum
score. Try (in git.git):

  git-show --raw -M -l0 :/'Big tool rename'

with the old and new code. Pairs like Documentation/git-add-script.txt
-> Documentation/git-add.txt are not found, because the file is composed
almost entirely of boilerplate.

Moving the size normalization into the similarity engine should probably
fix that, and will let us compare old and new results more accurately.
I'll try to work on that.

> I'm a bit worried about the fact that you just pick a single (arbitrary) 
> src/dst per fingerprint. Yes, it should be limited, but that seems to be a 
> bit too *extremely* limited. But if it gives the same results in practice, 
> maybe nobody cares?

Yes, I have not convinced myself yet that it's the right approach (but
it seemed like a good place to try first, for simplicity and speed).  As
I noted, this approach seems to be a bit memory hungry on large, so I am
a bit concerned about increasing the size of the fingerprint_entry
structure.  However, Andy's sampling approach might help fix that.

The current code also doesn't bother marking overflow, so common lines
get attributes to some random file (actually, worse than random: if a
bunch of files have the same common lines, _all_ of the lines will go to
the last file, which means we subtly favor renames from the end of the
input list). So probably it should be tested as-is, with an "overflow,
this line is too common to be interesting" bit, and with a small-ish
limit (I had at one point tried 5, but the implementation was naive and
too memory-hungry).

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30  8:29   ` Junio C Hamano
@ 2007-10-30 13:46     ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2007-10-30 13:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git, Andy C

On Tue, Oct 30, 2007 at 01:29:22AM -0700, Junio C Hamano wrote:

> If it always gives the same results in practice, obviously
> nobody can even notice.
> 
> However, merging this series to 'pu' breaks rebase-merge test
> t3402 among other things.

Yes, sorry, I meant to mention the test breakage in the cover letter
(which I think is just related to the size/score normalization). This is
not really meant for applying, but more for "this is taking me a lot
longer than I hoped, so here is what is happening and you might be
interested to comment." I'm not even sure it's pu material. :)

I will continue to refine it as I mentioned in the mail to Linus, but I
am open to suggestions.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30 13:43   ` Jeff King
@ 2007-10-30 15:38     ` Linus Torvalds
  2007-10-30 20:20       ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2007-10-30 15:38 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Andy C, Junio C Hamano

On Tue, 30 Oct 2007, Jeff King wrote:
>
> On Mon, Oct 29, 2007 at 10:06:11PM -0700, Linus Torvalds wrote:
> 
> > Have you compared the results? IOW, does it find the *same* renames?
> 
> From my limited testing, it generally finds the same pairs. However,
> there are a number of renames that it _doesn't_ find, because they are
> composed of "uninteresting" lines, dropping them below the minimum
> score. Try (in git.git):
> 
>   git-show --raw -M -l0 :/'Big tool rename'
> 
> with the old and new code. Pairs like Documentation/git-add-script.txt
> -> Documentation/git-add.txt are not found, because the file is composed
> almost entirely of boilerplate.

Ok, that does imply to me that we cannot just drop boilerplate text, 
because the fact is, lots of files contain boilerplate, but people still 
think they are "similar".

We do actually depend on the similarity analysis being "good" - because it 
matters a lot for things like merging. The old code was actually very 
careful indeed, and while it didn't care about things like the exact 
*ordering* of lines (ie moving functions around in the same file resulted 
in the *exact* same fingerprint for the file!) it cared about everything 
else.

> Moving the size normalization into the similarity engine should probably
> fix that, and will let us compare old and new results more accurately.
> I'll try to work on that.

Hmm. I hope that is sufficient. But I suspect it may well not be. 
Especially since you ignore boiler-plate lines for *some* files but not 
others (ie it depends on which file you happen to find it in first).

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30 15:38     ` Linus Torvalds
@ 2007-10-30 20:20       ` Jeff King
  2007-10-30 20:39         ` Linus Torvalds
  2007-10-31  0:27         ` Andy C
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2007-10-30 20:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, Andy C, Junio C Hamano

On Tue, Oct 30, 2007 at 08:38:24AM -0700, Linus Torvalds wrote:

> > with the old and new code. Pairs like Documentation/git-add-script.txt
> > -> Documentation/git-add.txt are not found, because the file is composed
> > almost entirely of boilerplate.
> 
> Ok, that does imply to me that we cannot just drop boilerplate text, 
> because the fact is, lots of files contain boilerplate, but people still 
> think they are "similar".

Well, the problem is that instead of just "dropping" boilerplate text,
we fail to count it as a similarity, but it still counts towards the
file size. It may be that just dropping it totally is the right thing
(in which case those renames _will_ turn up, because they will be filled
with identical non-boilerplate goodness).

> Hmm. I hope that is sufficient. But I suspect it may well not be. 
> Especially since you ignore boiler-plate lines for *some* files but not 
> others (ie it depends on which file you happen to find it in first).

Yes, that part bothers me a little, so I think a "too common, ignore"
overflow flag would at least be better.

But I think the best thing to do now is for me to shut up and see what
the results look like with the tweaks I have mentioned.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30 20:20       ` Jeff King
@ 2007-10-30 20:39         ` Linus Torvalds
  2007-10-31  0:27         ` Andy C
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2007-10-30 20:39 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Andy C, Junio C Hamano



On Tue, 30 Oct 2007, Jeff King wrote:
> 
> Well, the problem is that instead of just "dropping" boilerplate text,
> we fail to count it as a similarity, but it still counts towards the
> file size. It may be that just dropping it totally is the right thing
> (in which case those renames _will_ turn up, because they will be filled
> with identical non-boilerplate goodness).

Yeah, you may well be right, and the normalization of the scores will just 
solve things.

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30  4:21 [PATCH/RFC 0/3] faster inexact rename handling Jeff King
                   ` (3 preceding siblings ...)
  2007-10-30  5:06 ` [PATCH/RFC 0/3] faster inexact rename handling Linus Torvalds
@ 2007-10-31  0:06 ` Andy C
  4 siblings, 0 replies; 13+ messages in thread
From: Andy C @ 2007-10-31  0:06 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Linus Torvalds, Junio C Hamano

Sorry I have been AWOL... I was going to try to work on this, but I
got abjectly sick (long story).  But it's great to see this out.

On 10/29/07, Jeff King <peff@peff.net> wrote:
> This is my first stab at faster rename handling based on Andy's code.
> The patches are on top of next (to get Linus' recent work on exact
> renames). Most of the interesting stuff is in 2/3.
>
>   1/3: extension of hash interface
>   2/3: similarity detection code
>   3/3: integrate similarity detection into diffcore-rename
>
> The implementation is pretty basic, so I think there is room for
> code optimization (50% of the time is spent in hash lookups, so we might
> be able to micro-optimize that) as well as algorithmic improvements (like the
> sampling Andy mentioned).

For microoptimization, I was thinking that the hash tables could be
implemented without pointers per value (or memory allocation per
value), so everything is in a contiguous block of memory.   In C++ you
can do this trivially by declaring a small struct as the second
template parameter of the container; in C I guess you can simulate it
with a macro or something.

For the inverted indexing step, the values in the hash are going to be
quite small, especially if line_threshold=1.  Then you only need 2
integers for the left side and right side == 4 integers.  The integers
could just be indexes into the lists (like the current code uses).

For the count matrix step, the values are just going to be integers,
so storing it right in the hash table makes sense.

The sampling should be only necessary for binary files, I think.

> With these patches, I can get my monster binary diff down from about 2
> minutes to 17 seconds. And comparing all of linux-2.4 to all of
> linux-2.6 (similar to Andy's previous demo) takes about 10 seconds.

Hopefully that should be close to just reading the files off disk.
The algorithm should take a fraction of the time that simply reading
the files does, which presumably a git diff has to do.

I was timing that by comparing it to doing a "| xargs wc -l" on the
lists of files.

> There are a few downsides:
>   - the current implementation tends to give lower similarity values
>     compared to the old code (see discussion in 2/3), but this should be
>     tweakable
>   - on large datasets, it's more memory hungry than the old code because
>     the hash grows very large. This can be helped by bumping up the
>     binary chunk size (actually, the 17 seconds quoted above is using
>     256-byte chunks rather than 64-byte -- with 64-byte chunks, it's
>     more like 24 seconds) as well as sampling.
>   - no improvement on smaller datasets. Running "git-whatchanged -M
>     --raw -l0" on the linux-2.6 repo takes about the same time with the
>     old and new code (presumably the algorithmic savings of the new code
>     are lost in a higher constant factor, so when n is small, it is a
>     wash).

I think the old code tries to respect the cache as much as possible,
from what I can tell.  The new code has to use hash tables which are
unpredictable of course.  Though for smaller data sets I would expect
the hash table to fit in cache.  What's your definition of small here?
 Are you sure the old code isn't triggering one of the limits that was
there?

thanks,
Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH/RFC 0/3] faster inexact rename handling
  2007-10-30 20:20       ` Jeff King
  2007-10-30 20:39         ` Linus Torvalds
@ 2007-10-31  0:27         ` Andy C
  1 sibling, 0 replies; 13+ messages in thread
From: Andy C @ 2007-10-31  0:27 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, git, Junio C Hamano

On 10/30/07, Jeff King <peff@peff.net> wrote:
> On Tue, Oct 30, 2007 at 08:38:24AM -0700, Linus Torvalds wrote:
>
> > > with the old and new code. Pairs like Documentation/git-add-script.txt
> > > -> Documentation/git-add.txt are not found, because the file is composed
> > > almost entirely of boilerplate.
> >
> > Ok, that does imply to me that we cannot just drop boilerplate text,
> > because the fact is, lots of files contain boilerplate, but people still
> > think they are "similar".
>
> Well, the problem is that instead of just "dropping" boilerplate text,
> we fail to count it as a similarity, but it still counts towards the
> file size. It may be that just dropping it totally is the right thing
> (in which case those renames _will_ turn up, because they will be filled
> with identical non-boilerplate goodness).

Right, in the demo I make an extra pass after the inverted indexing
step to prune the index -- which means eliminating the common lines
*entirely* from the index (so they don't get attributed to a random
file) *and* decrementing all the file sizes by 1.  That way the
similarity scores shouldn't get skewed.

And as you mentioned we could bump the threshold from 1 to some other
small integer.  Intuitively I guess you could say it is common to copy
a file to 2 places or 3 places, and you don't want all the lines to
get thrown out because of that.  But usually you don't copy a file to
10 or 50 places.

Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-10-31  0:27 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-30  4:21 [PATCH/RFC 0/3] faster inexact rename handling Jeff King
2007-10-30  4:23 ` [PATCH/RFC 1/3] change hash table calling conventions Jeff King
2007-10-30  4:24 ` [PATCH/RFC 2/3] introduce generic similarity library Jeff King
2007-10-30  4:24 ` [PATCH/RFC 3/3] handle renames using similarity engine Jeff King
2007-10-30  5:06 ` [PATCH/RFC 0/3] faster inexact rename handling Linus Torvalds
2007-10-30  8:29   ` Junio C Hamano
2007-10-30 13:46     ` Jeff King
2007-10-30 13:43   ` Jeff King
2007-10-30 15:38     ` Linus Torvalds
2007-10-30 20:20       ` Jeff King
2007-10-30 20:39         ` Linus Torvalds
2007-10-31  0:27         ` Andy C
2007-10-31  0:06 ` Andy C

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).