git.vger.kernel.org archive mirror
* Fix up diffcore-rename scoring
@ 2006-03-13  6:26 Linus Torvalds
  2006-03-13  6:44 ` Linus Torvalds
  2006-03-13  6:46 ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13  6:26 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


The "score" calculation for diffcore-rename was totally broken.

It scaled "score" as

	score = src_copied * MAX_SCORE / dst->size;

which means that you got a 100% similarity score even if src and dst were 
different, as long as every byte of dst was copied from src, even if the 
source was much larger than dst (e.g. we had copied 85% of the bytes, but 
_deleted_ the remaining 15%).

That's clearly bogus. We should do the score calculation relative not to 
the destination size, but to the max size of the two.
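
(Concretely, with made-up numbers: say src is 1000 bytes, dst is 850 bytes,
and all 850 bytes of dst are reported as copied from src.  Then

	old:  score = 850 * MAX_SCORE /  850 = MAX_SCORE        (100%)
	new:  score = 850 * MAX_SCORE / 1000 = 0.85 * MAX_SCORE (85%)

so the new scaling reports the 85% similarity we actually have.)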

This seems to fix it.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
diff --git a/diffcore-rename.c b/diffcore-rename.c
index ed99fe2..e992698 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -133,7 +133,7 @@ static int estimate_similarity(struct di
 	 * match than anything else; the destination does not even
 	 * call into this function in that case.
 	 */
-	unsigned long delta_size, base_size, src_copied, literal_added;
+	unsigned long max_size, delta_size, base_size, src_copied, literal_added;
 	unsigned long delta_limit;
 	int score;
 
@@ -144,9 +144,9 @@ static int estimate_similarity(struct di
 	if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
 		return 0;
 
-	delta_size = ((src->size < dst->size) ?
-		      (dst->size - src->size) : (src->size - dst->size));
+	max_size = ((src->size > dst->size) ? src->size : dst->size);
 	base_size = ((src->size < dst->size) ? src->size : dst->size);
+	delta_size = max_size - base_size;
 
 	/* We would not consider edits that change the file size so
 	 * drastically.  delta_size must be smaller than
@@ -174,12 +174,10 @@ static int estimate_similarity(struct di
 	/* How similar are they?
 	 * what percentage of material in dst are from source?
 	 */
-	if (dst->size < src_copied)
-		score = MAX_SCORE;
-	else if (!dst->size)
+	if (!dst->size)
 		score = 0; /* should not happen */
 	else
-		score = src_copied * MAX_SCORE / dst->size;
+		score = src_copied * MAX_SCORE / max_size;
 	return score;
 }
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13  6:26 Fix up diffcore-rename scoring Linus Torvalds
@ 2006-03-13  6:44 ` Linus Torvalds
  2006-03-13  6:46 ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13  6:44 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List



On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> The "score" calculation for diffcore-rename was totally broken.
> 
> It scaled "score" as
> 
> 	score = src_copied * MAX_SCORE / dst->size;
> 
> which means that you got a 100% similarity score even if src and dest were 
> different, if just every byte of dst was copied from src, even if source 
> was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ 
> the remaining 15%).
> 
> That's clearly bogus. We should do the score calculation relative not to 
> the destination size, but to the max size of the two.
> 
> This seems to fix it.

Btw, interestingly, this actually seems to improve on the rename 
detection from your previous version, even though on the face of it, it 
should just have made the scores go down.

I'm not quite sure why, but perhaps the old scaling gave a bogus high score 
to some rename that wasn't very good, and fixing it allows the _real_ rename 
to make itself seen.

Or maybe I made some mistake in testing it.

		Linus

PS. You can still get a "similarity score" of 100 with the fixed scaling 
even if the source and the destination were different. That happens if 
every byte was marked as "copied" by the similarity estimator, which can 
happen if you just move things around in the file - the end result is 
different, but all the bytes are copied from the source.

At least with the fixed heuristic, that "perfect similarity" score can 
_somehow_ be explained. The files are very similar in that they have the 
same content, just in a different order ;)
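
(In the notation used above: in that reordering case src->size, dst->size and
max_size are all equal, and essentially every chunk of dst hashes to something
still present in src, so

	src_copied ~= dst->size = max_size   =>   score ~= MAX_SCORE

i.e. the fixed scaling can still report 100%, but - up to hash collisions -
only when the sizes match and the chunks were merely rearranged.)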

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13  6:26 Fix up diffcore-rename scoring Linus Torvalds
  2006-03-13  6:44 ` Linus Torvalds
@ 2006-03-13  6:46 ` Junio C Hamano
  2006-03-13  7:09   ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2006-03-13  6:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> The "score" calculation for diffcore-rename was totally broken.
>
> It scaled "score" as
>
> 	score = src_copied * MAX_SCORE / dst->size;
>
> which means that you got a 100% similarity score even if src and dest were 
> different, if just every byte of dst was copied from src, even if source 
> was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ 
> the remaining 15%).

Your reading of the code is correct, but that is deliberate.

>  	/* How similar are they?
>  	 * what percentage of material in dst are from source?
>  	 */

I wanted to say in such a case that dst was _really_ derived
from the source.  I think using max may make more sense, but I
need to convince myself by looking at the filepairs this change
stops detecting as renames, and the ones it starts detecting as
renames.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13  6:46 ` Junio C Hamano
@ 2006-03-13  7:09   ` Linus Torvalds
  2006-03-13  7:42     ` Junio C Hamano
  2006-03-13  7:44     ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13  7:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List



On Sun, 12 Mar 2006, Junio C Hamano wrote:
>
> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > The "score" calculation for diffcore-rename was totally broken.
> >
> > It scaled "score" as
> >
> > 	score = src_copied * MAX_SCORE / dst->size;
> >
> > which means that you got a 100% similarity score even if src and dest were 
> > different, if just every byte of dst was copied from src, even if source 
> > was much larger than dst (eg we had copied 85% of the bytes, but _deleted_ 
> > the remaining 15%).
> 
> Your reading of the code is correct, but that is deliberate.
> 
> >  	/* How similar are they?
> >  	 * what percentage of material in dst are from source?
> >  	 */
> 
> I wanted to say in such a case that dst was _really_ derived
> from the source.  I think using max may make more sense, but I
> need to convince myself by looking at filepairs that this change
> stops detecting as renames, and this change starts detecting as
> renames.

Just compare the results. Eye-balling the difference between the rename 
data from 2.6.12 to 2.6.14, the fixed score gets better rename detection: 
it finds 133 renames (as opposed to 132 for the broken one), and the 
renames it finds are more sensible.

For example, the fixed version finds

	drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h

which actually matches the other i2c/chips/ renames, while the broken one 
does

	drivers/i2c/chips/lm75.h -> drivers/media/video/rds.h

which just doesn't make any sense at all.

Now, that said, they _both_ find some pretty funky renames. I think there 
is probably some serious room for improvement regardless (or at least for 
changing the default similarity cut-off to something better ;)

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13  7:09   ` Linus Torvalds
@ 2006-03-13  7:42     ` Junio C Hamano
  2006-03-13  7:44     ` Linus Torvalds
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2006-03-13  7:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Just compare the result...
>
> Now, that said, they _both_ find some pretty funky renames. I think there 
> is probably some serious room for improvement, regardless (or at least 
> changing the default similarity cut-off to something better ;)

Yes.  The "compare with the larger size" scoring seems to cull the
nonsensical ones found by the version in "next" much better.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13  7:09   ` Linus Torvalds
  2006-03-13  7:42     ` Junio C Hamano
@ 2006-03-13  7:44     ` Linus Torvalds
  2006-03-13 10:43       ` Junio C Hamano
  2006-04-06 21:01       ` Geert Bosch
  1 sibling, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13  7:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List



On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> Now, that said, they _both_ find some pretty funky renames. I think there 
> is probably some serious room for improvement, regardless (or at least 
> changing the default similarity cut-off to something better ;)

I'm afraid that _good_ rename detection really ends up wanting to take the 
"longest possible sequence" into account, exactly like the full xdelta does, 
instead of doing a fixed-chunk thing that treats any copy as equivalent to 
any other copy. That's simply not true: it's _much_ better to have one 
24-byte copy than three 8-byte copies, but the new, faster diffcore-delta.c 
just can't see that.

So one big reason why it is fast in the first place is that it 
fundamentally just doesn't do a very good job ;(

It might be that the fast delta thing is a good way to ask "is this even 
worth considering", to cut down the O(m*n) rename/copy detection to 
something much smaller, and then use xdelta() to actually figure out what 
is a good rename and what isn't from a much smaller set of potential 
targets.

That would actually allow us to be even _less_ precise. Screw that big 
hash-table etc, don't even try to be exact. Just try to be fairly fast, 
and then pick the top entries from the similarity array for more precise 
diffing if there are multiple choices that look like they might be 
possible.
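
A minimal, self-contained sketch of that two-pass idea, with toy scoring
functions standing in for the cheap hash-counting filter and the xdelta-style
rescoring; it is separate from the diffcore-delta.c appended below, and none
of the names, constants or scoring rules here are git's:

#include <stdio.h>
#include <string.h>

#define NDST 4	/* number of candidate destinations */
#define KEEP 2	/* how many survivors get the expensive comparison */

/* Cheap pass: fraction of positions with equal bytes, scaled to 0-100.
 * Deliberately crude - it only has to rank candidates, not be exact. */
static int rough_score(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);
	size_t n = la < lb ? la : lb, i, same = 0;

	for (i = 0; i < n; i++)
		same += (a[i] == b[i]);
	return (int)(100 * same / (la > lb ? la : lb));
}

/* Stand-in for the precise pass: longest common prefix, scaled to 0-100.
 * In the real scheme this would be an xdelta-style comparison. */
static int precise_score(const char *a, const char *b)
{
	size_t i = 0, la = strlen(a), lb = strlen(b);

	while (a[i] && a[i] == b[i])
		i++;
	return (int)(100 * i / (la > lb ? la : lb));
}

int main(void)
{
	const char *src = "abcdefghij";
	const char *dst[NDST] = {
		"abcdefxxxx", "zzzzzzzzzz", "abcdefghzz", "jihgfedcba"
	};
	int rough[NDST], order[NDST], i, j;

	/* Pass 1: score every candidate cheaply. */
	for (i = 0; i < NDST; i++) {
		rough[i] = rough_score(src, dst[i]);
		order[i] = i;
	}
	/* Sort candidate indexes by rough score (tiny N, so a trivial sort). */
	for (i = 0; i < NDST; i++)
		for (j = i + 1; j < NDST; j++)
			if (rough[order[j]] > rough[order[i]]) {
				int t = order[i]; order[i] = order[j]; order[j] = t;
			}
	/* Pass 2: only the top KEEP candidates get the expensive scoring. */
	for (i = 0; i < KEEP; i++)
		printf("dst[%d]: rough %d%%, precise %d%%\n",
		       order[i], rough[order[i]], precise_score(src, dst[order[i]]));
	return 0;
}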

The appended alternate "diffcore-delta.c" doesn't do any of the caching 
(ie I wrote it so that it would be easy to change so that the _caller_ 
keeps "src" constant and iterates over the destinations - or the other way 
around - and does the hash setup just once per src).

Still, even with the existing setup, it's pretty fast for me (not much 
slower than your caching version even though it recalculates everything 
every time). And it's not that far off, which tells me that if it was used 
as a "first-pass filter", we could afford to do a better job on the things 
that it says are likely candidates.

Hmm? It really does bother me how the suggested rename detector finds 
stuff that clearly isn't a rename. 

			Linus

----
#include "cache.h"
#include "diff.h"
#include "diffcore.h"

#define CHUNK (16)
#define SILLYSIZE (65537)
static int hashnr[SILLYSIZE];

static void setup_hash(void)
{
	memset(hashnr, 0, sizeof(hashnr));
}

static void insert_hash(unsigned int hashval)
{
	hashval = hashval % SILLYSIZE;
	hashnr[hashval]++;
}

static int find_hash(unsigned int hashval)
{
	hashval = hashval % SILLYSIZE;
	if (hashnr[hashval]) {
		hashnr[hashval]--;
		return 1;
	}
	return 0;
}

int diffcore_count_changes(void *src, unsigned long src_size,
			   void *dst, unsigned long dst_size,
			   void **src_count_p,
			   void **dst_count_p,
			   unsigned long delta_limit,
			   unsigned long *src_copied,
			   unsigned long *literal_added)
{
	unsigned long copied = 0;
	unsigned long literal = 0;

	setup_hash();
	while (src_size >= CHUNK) {
		unsigned int hashval = adler32(0, src, CHUNK);
		insert_hash(hashval);
		src += CHUNK;
		src_size -= CHUNK;
	}

	while (dst_size >= CHUNK) {
		unsigned int hashval = adler32(0, dst, CHUNK);
		if (find_hash(hashval)) {
			copied += CHUNK;
			dst += CHUNK;
			dst_size -= CHUNK;
			continue;
		}
		literal++;
		if (literal > delta_limit)
			return -1;
		dst++;
		dst_size--;
	}
	literal += dst_size;

	*src_copied = copied;
	*literal_added = literal;
	return 0;
}

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13  7:44     ` Linus Torvalds
@ 2006-03-13 10:43       ` Junio C Hamano
  2006-03-13 15:38         ` Linus Torvalds
  2006-04-06 21:01       ` Geert Bosch
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2006-03-13 10:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Instead of doing a fixed-chunk thing and saying that any copy is 
> equivalent to any other copy. That's simply not true. It's _much_ better 
> to have one 24-byte copy than it is to have three 8-byte copies, but the 
> new faster diffcore-delta.c just can't see that.

Exactly.

You know what?  Once we start counting to detect a straight 24-byte
copy and try to distinguish it from 3 separate 8-byte copies, it
will eventually lead us to what we have in diff-delta.c anyway.  I
avoided counting runs of bytes on purpose because I wanted to see
how far we can go without it.

The primary reason I started the jc/diff topic branch was
because we _might_ want to replace what is in the current
diff-delta.c with much finer-grained comparison code, and when
that happens, counting xdelta output for rename detection
purposes would stop making sense.  For now we decided to
postpone it for performance reasons, but we still might want to
do it when Nico comes back with a better implementation.

Now, I know the current diff-delta based similarity estimator we
have in "main" seems to do a reasonable if not perfect job,
within a reasonable amount of time.  And it does know how to
count copying of consecutive bytes.  In the worst case we could
just fork the xdelta part of the code when Nico comes back with
an improved finer-grained delta, and keep using the current
diff-delta code for rename detection.  Knowing we have that
fallback position, I wanted to pursue a different avenue.
Distinguishing a straight 24-byte run from three independent
8-byte runs, using a hash to find the offset in the source and
actually doing maximum string matching, is something we already
know how to do, because that is essentially what the current
diff-delta code does.
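
A rough, self-contained illustration of that "use a hash to find an offset in
the source, then extend to the maximum string match" approach; the 4-byte
window, the hash and the tiny table are all made up, and this only shows the
shape of the technique, not diff-delta's actual code:

#include <stdio.h>
#include <string.h>

#define W    4		/* bytes hashed to index a source position */
#define TAB  256	/* tiny hash table: hash value -> last source offset */

static unsigned hash4(const unsigned char *p)
{
	return (p[0] * 33u * 33u * 33u + p[1] * 33u * 33u + p[2] * 33u + p[3]) % TAB;
}

int main(void)
{
	const unsigned char src[] = "the quick brown fox jumps over the lazy dog";
	const unsigned char dst[] = "a quick brown fox leaps over a lazy dog";
	size_t ls = strlen((const char *)src), ld = strlen((const char *)dst), i;
	long table[TAB];

	for (i = 0; i < TAB; i++)
		table[i] = -1;
	/* Index every source position by the hash of the W bytes starting there. */
	for (i = 0; i + W <= ls; i++)
		table[hash4(src + i)] = (long)i;

	/* For each destination position, look up a candidate source offset and
	 * extend the match as far as it goes (the "maximum string match"). */
	for (i = 0; i + W <= ld; ) {
		long s = table[hash4(dst + i)];
		size_t len = 0;

		if (s >= 0)
			while ((size_t)s + len < ls && i + len < ld &&
			       src[s + len] == dst[i + len])
				len++;
		if (len >= W) {
			printf("dst[%zu..%zu] copies %zu bytes from src[%ld]\n",
			       i, i + len - 1, len, s);
			i += len;
		} else {
			i++;	/* no usable match here; count as literal */
		}
	}
	return 0;
}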

By the way, the reason the diffcore-delta code in "next" does
not do every-eight-bytes hashing on the source material is to
somewhat alleviate the problem that comes from not detecting
copying of consecutive byte ranges.  If you have an 8-byte run
that is copied from source to destination, we would give it one
point (let's for now forget about false matches coming from hash
collisions).  Since the source material is hashed at every byte
offset, if we have a 9-byte run copied from source to destination,
that is awarded two points (we award one point for the first
8-byte sequence, and another point for the 8-byte sequence
starting at the second byte; we are talking about overlapping
ranges).  That way, the code does reward copying consecutive
bytes around more heavily than copying things at random places.
At one extreme, if you copy 7 bytes, throw in a garbage byte,
another 7 bytes, another garbage byte, and keep going, you would
not get any points.

It's really a funky heuristic, and as you have seen, it
sometimes gives spectacularly phony matches.  But in practice,
with some tweaking it seems to do an OK job.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13 10:43       ` Junio C Hamano
@ 2006-03-13 15:38         ` Linus Torvalds
  2006-03-14  0:49           ` Rutger Nijlunsing
  2006-03-14  0:55           ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13 15:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Mon, 13 Mar 2006, Junio C Hamano wrote:
> 
> By the way, the reason the diffcore-delta code in "next" does
> not do every-eight-bytes hash on the source material is to
> somewhat alleviate the problem that comes from not detecting
> copying of consecutive byte ranges.

Yes. However, there are better ways to do that in practice.

The most effective way, and the one generally used, is to not use a fixed 
chunk size, but a terminating character, together with a minimum/maximum 
chunk size.

There's a pretty natural terminating character that works well for 
sources: '\n'.

So the natural way to do similarity detection when most of the code is 
line-based is to do the hashing on chunks that follow the rule "minimum of 
<n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".

So if you don't see any '\n' at all (or the only such one is less than <n> 
bytes into your current window), do the hash over a <2n>-byte chunk (this 
takes care of binaries and/or long lines).

This - for source code - allows you to ignore trivial byte-offset 
differences, because you have a character that is used for synchronization. 
So you don't need to do hashing at every byte in both files - you end up 
doing the hashing only at line boundaries in practice. And it still _works_ 
for binary files, although you effectively need bigger identical chunk sizes 
to find similarities (for text files, it finds similarities of size <n>; for 
binaries the similarities need to effectively be of size 3*n, because you 
chunk it up at ~2*n, and only generate the hash at certain offsets in the 
source binary).
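
A minimal sketch of that chunking rule, with a made-up window size N and
helper name; it is only an illustration of the rule described above, not the
code git ended up with:

#include <stdio.h>
#include <string.h>

#define N 8	/* minimum chunk size; small here only to make the demo visible */

/*
 * Length of the next chunk starting at buf, given len remaining bytes:
 * at least N bytes, at most 2*N, cut just after a '\n' when one occurs
 * at offset N or later in the window.
 */
static size_t next_chunk(const char *buf, size_t len)
{
	const char *nl;

	if (len <= N)
		return len;		/* short tail: take what is left */
	if (len > 2 * N)
		len = 2 * N;		/* never exceed the 2*N maximum */
	nl = memchr(buf + N, '\n', len - N);
	if (nl)
		return (size_t)(nl - buf) + 1;	/* end just after the newline */
	return len;			/* no newline: binary data or a long line */
}

int main(void)
{
	const char text[] = "line one\nline two is quite a bit longer\nshort\n";
	size_t off = 0, len = sizeof(text) - 1;

	while (off < len) {
		size_t n = next_chunk(text + off, len - off);
		printf("chunk of %2zu bytes at offset %2zu\n", n, off);
		off += n;
	}
	return 0;
}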

		Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13 15:38         ` Linus Torvalds
@ 2006-03-14  0:49           ` Rutger Nijlunsing
  2006-03-14  0:55           ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Rutger Nijlunsing @ 2006-03-14  0:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 12313 bytes --]


On Mon, Mar 13, 2006 at 07:38:53AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 13 Mar 2006, Junio C Hamano wrote:
> > 
> > By the way, the reason the diffcore-delta code in "next" does
> > not do every-eight-bytes hash on the source material is to
> > somewhat alleviate the problem that comes from not detecting
> > copying of consecutive byte ranges.
> 
> Yes. However, there are better ways to do that in practice.
> 
> The most effective way that is generally used is to not use a fixed 
> chunk-size, but use a terminating character, together with a 
> minimum/maximum chunksize.
> 
> There's a pretty natural terminating character that works well for 
> sources: '\n'.
> 
> So the natural way to do similarity detection when most of the code is 
> line-based is to do the hashing on chunks that follow the rule "minimum of 
> <n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".
> 
> So if you don't see any '\n' at all (or the only such one is less than <n> 
> bytes into your current window), do the hash over a <2n>-byte chunk (this 
> takes care of binaries and/or long lines).
> 
> This - for source code - allows you to ignore trivial byte offset things, 
> because you have a character that is used for synchronization. So you 
> don't need to do hashing at every byte in both files - you end up doing 
> the hashing only at line boundaries in practice. And it still _works_ for 
> binary files, although you effectively need bigger identical chunk-sizes 
> to find similarities (for text-files, it finds similarities of size <n>, 
> for binaries the similarities need to effectively be of size 3*n, because 
> you chunk it up at ~2*n, and only generate the hash at certain offsets in 
> the source binary).

This looks like something I did last year as an experiment in the
pre-git times. The idea was to generate a patch-with-renames from two
(large) source trees.

Algorithm:
  - determine the md5sum for each file (same idea as git's SHA1 sum),
    recomputed only if the file changed since the last run
  - only look at md5sums which do not match
  - pool files into types, which might depend on extension and/or MIME type.
    This is an optimisation.
  - Only compare filepairs _within_ one pool.
  - The filepair order in one pool is determined by filename-similarity.
    So pair [include/asm-ppc/ioctl.h, include/asm-powerpc/ioctl.h]
    is inspected before pair
       [include/asm-ppc/ioctl.h, arch/arm/plat-omap/clock.h] .
  - For each file, create a hash from String line -> Integer occurrences.
    Similarities are calculated by comparing two hashes.
  - Keep as a rename-match all files which:
    - have at most 50% new lines;
    - have at most 25% lines deleted from them.

I ran the code against v2.6.12 and v2.6.14 to be able to compare it
with the current contenders. Hopefully some ideas are harvestable...

Algorithm differences:
  - '\n' is used as boundary, independent of line length.
    This is bad for binary files, and maybe even bad for text files.
    So don't harvest :)
  - don't look at the intersection percentage, but look at two values:
    - percentage of lines added (default: max. 50%)
    - percentage of lines removed (default: max. 25%)
    This assumes files get bigger during development (at most 50%), and
    not too much code is deleted (at most 25%).
    Disadvantages:
      - Two magic numbers instead of one.
      - It's non-symmetrical. Diff A->B will find different renames from
        diff B->A. This scares me, actually.
  - to speed up the detection:
    - don't start comparing files at random. Start comparing files which
      have the same 'words' in them. So when v2.6.12 has a file called
      arch/arm/mach-omap/clock.c, start comparing with files which have
      the most words in common. Currently, '-', '.', '_' and '/' are used
      as word separators.
      Advantage: we don't settle for the first match that is just above
        the match-threshold.
    (next heuristics are all optional:)
    - only compare files with the same extension. This splits up all files
      into groups, which makes it much faster.
      In general, there's no reason to compare a .h with a .c file.
    - only compare files with the same MIME type. Same as above, but also
      works for files without extensions (so don't compare README with
      Makefile)

Ok, the result:

$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | wc -l
104   <-- That's bad. We're missing some renames here.

$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | sort -k 1.10

+ 0% -23% arch/arm/configs/omnimeter_defconfig -> arch/arm/configs/collie_defconfig
+ 5% - 9% arch/arm/mach-omap/board-generic.c -> arch/arm/mach-omap1/board-generic.c
+ 0% - 8% arch/arm/mach-omap/board-h2.c -> arch/arm/mach-omap1/board-h2.c
+ 0% - 5% arch/arm/mach-omap/board-h3.c -> arch/arm/mach-omap1/board-h3.c
+ 0% - 3% arch/arm/mach-omap/board-innovator.c -> arch/arm/mach-omap1/board-innovator.c
+ 0% - 9% arch/arm/mach-omap/board-netstar.c -> arch/arm/mach-omap1/board-netstar.c
+ 9% -10% arch/arm/mach-omap/board-osk.c -> arch/arm/mach-omap1/board-osk.c
+ 0% - 6% arch/arm/mach-omap/board-perseus2.c -> arch/arm/mach-omap1/board-perseus2.c
+ 3% - 8% arch/arm/mach-omap/board-voiceblue.c -> arch/arm/mach-omap1/board-voiceblue.c
+ 7% - 4% arch/arm/mach-omap/clock.c -> arch/arm/plat-omap/clock.c
+ 0% - 0% arch/arm/mach-omap/clock.h -> arch/arm/plat-omap/clock.h
+ 0% - 5% arch/arm/mach-omap/common.h -> include/asm-arm/arch-omap/common.h
+ 2% - 1% arch/arm/mach-omap/dma.c -> arch/arm/plat-omap/dma.c
+ 0% - 1% arch/arm/mach-omap/fpga.c -> arch/arm/mach-omap1/fpga.c
+11% -11% arch/arm/mach-omap/gpio.c -> arch/arm/plat-omap/gpio.c
+ 2% - 2% arch/arm/mach-omap/irq.c -> arch/arm/mach-omap1/irq.c
+ 0% - 4% arch/arm/mach-omap/leds.c -> arch/arm/mach-omap1/leds.c
+ 0% - 0% arch/arm/mach-omap/leds-h2p2-debug.c -> arch/arm/mach-omap1/leds-h2p2-debug.c
+ 0% - 0% arch/arm/mach-omap/leds-innovator.c -> arch/arm/mach-omap1/leds-innovator.c
+ 0% - 4% arch/arm/mach-omap/leds-osk.c -> arch/arm/mach-omap1/leds-osk.c
+ 0% -25% arch/arm/mach-omap/Makefile.boot -> arch/arm/mach-omap1/Makefile.boot
+ 1% - 2% arch/arm/mach-omap/mcbsp.c -> arch/arm/plat-omap/mcbsp.c
+ 0% - 6% arch/arm/mach-omap/mux.c -> arch/arm/plat-omap/mux.c
+ 0% - 0% arch/arm/mach-omap/ocpi.c -> arch/arm/plat-omap/ocpi.c
+ 1% -18% arch/arm/mach-omap/pm.c -> arch/arm/plat-omap/pm.c
+ 0% -11% arch/arm/mach-omap/sleep.S -> arch/arm/plat-omap/sleep.S
+ 6% - 4% arch/arm/mach-omap/time.c -> arch/arm/mach-omap1/time.c
+ 0% - 1% arch/arm/mach-omap/usb.c -> arch/arm/plat-omap/usb.c
+ 2% - 1% arch/ia64/sn/include/pci/pcibr_provider.h -> include/asm-ia64/sn/pcibr_provider.h
+ 0% - 2% arch/ia64/sn/include/pci/pic.h -> include/asm-ia64/sn/pic.h
+ 0% - 0% arch/ia64/sn/include/pci/tiocp.h -> include/asm-ia64/sn/tiocp.h
+ 3% -23% arch/m68knommu/platform/68VZ328/de2/config.c -> arch/m68knommu/platform/68VZ328/config.c
+ 1% -18% arch/mips/configs/osprey_defconfig -> arch/mips/configs/qemu_defconfig
+ 0% -12% arch/mips/vr41xx/zao-capcella/setup.c -> arch/mips/vr41xx/common/type.c
+ 0% - 0% arch/ppc64/oprofile/op_impl.h -> include/asm-ppc64/oprofile_impl.h
+ 3% -23% arch/ppc/configs/ash_defconfig -> arch/ppc64/configs/bpa_defconfig
+ 2% -21% arch/ppc/configs/beech_defconfig -> arch/ppc/configs/ev64360_defconfig
+ 5% -20% arch/ppc/configs/cedar_defconfig -> arch/ppc/configs/mpc8548_cds_defconfig
+ 9% -17% arch/ppc/configs/k2_defconfig -> arch/ppc/configs/bamboo_defconfig
+ 3% -25% arch/ppc/configs/mcpn765_defconfig -> arch/xtensa/configs/common_defconfig
+ 2% -23% arch/ppc/configs/oak_defconfig -> arch/frv/defconfig
+ 3% -16% arch/ppc/configs/SM850_defconfig -> arch/ppc/configs/mpc86x_ads_defconfig
+ 3% -13% arch/ppc/configs/SPD823TS_defconfig -> arch/ppc/configs/mpc885ads_defconfig
+19% -15% arch/um/kernel/tempfile.c -> arch/um/os-Linux/mem.c
+ 0% - 5% arch/x86_64/kernel/semaphore.c -> lib/semaphore-sleepers.c
+ 0% - 6% drivers/i2c/chips/adm1021.c -> drivers/hwmon/adm1021.c
+ 0% - 4% drivers/i2c/chips/adm1025.c -> drivers/hwmon/adm1025.c
+ 0% -17% drivers/i2c/chips/adm1026.c -> drivers/hwmon/adm1026.c
+ 0% - 3% drivers/i2c/chips/adm1031.c -> drivers/hwmon/adm1031.c
+ 0% - 4% drivers/i2c/chips/asb100.c -> drivers/hwmon/asb100.c
+ 1% - 4% drivers/i2c/chips/ds1621.c -> drivers/hwmon/ds1621.c
+ 0% - 1% drivers/i2c/chips/fscher.c -> drivers/hwmon/fscher.c
+ 0% - 2% drivers/i2c/chips/fscpos.c -> drivers/hwmon/fscpos.c
+ 0% - 2% drivers/i2c/chips/gl518sm.c -> drivers/hwmon/gl518sm.c
+ 0% - 2% drivers/i2c/chips/gl520sm.c -> drivers/hwmon/gl520sm.c
+ 3% -19% drivers/i2c/chips/it87.c -> drivers/hwmon/it87.c
+ 4% -22% drivers/i2c/chips/lm63.c -> drivers/hwmon/lm63.c
+ 0% - 6% drivers/i2c/chips/lm75.c -> drivers/hwmon/lm75.c
+ 0% - 2% drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h
+ 0% - 3% drivers/i2c/chips/lm77.c -> drivers/hwmon/lm77.c
+ 2% - 5% drivers/i2c/chips/lm78.c -> drivers/hwmon/lm78.c
+ 0% - 3% drivers/i2c/chips/lm80.c -> drivers/hwmon/lm80.c
+ 2% -21% drivers/i2c/chips/lm83.c -> drivers/hwmon/lm83.c
+ 0% - 3% drivers/i2c/chips/lm85.c -> drivers/hwmon/lm85.c
+ 0% - 4% drivers/i2c/chips/lm87.c -> drivers/hwmon/lm87.c
+ 4% -20% drivers/i2c/chips/lm90.c -> drivers/hwmon/lm90.c
+ 0% - 3% drivers/i2c/chips/lm92.c -> drivers/hwmon/lm92.c
+ 0% - 3% drivers/i2c/chips/max1619.c -> drivers/hwmon/max1619.c
+ 0% - 7% drivers/i2c/chips/sis5595.c -> drivers/hwmon/sis5595.c
+ 0% -11% drivers/i2c/chips/smsc47b397.c -> drivers/hwmon/smsc47b397.c
+ 0% - 9% drivers/i2c/chips/smsc47m1.c -> drivers/hwmon/smsc47m1.c
+ 0% -23% drivers/i2c/chips/via686a.c -> drivers/hwmon/via686a.c
+ 0% - 4% drivers/i2c/chips/w83627hf.c -> drivers/hwmon/w83627hf.c
+ 1% - 5% drivers/i2c/chips/w83781d.c -> drivers/hwmon/w83781d.c
+ 1% - 3% drivers/i2c/chips/w83l785ts.c -> drivers/hwmon/w83l785ts.c
+14% -17% drivers/i2c/i2c-sensor-vid.c -> drivers/hwmon/hwmon-vid.c
+ 0% - 0% drivers/infiniband/include/ib_cache.h -> include/rdma/ib_cache.h
+ 0% - 3% drivers/infiniband/include/ib_fmr_pool.h -> include/rdma/ib_fmr_pool.h
+ 9% - 7% drivers/infiniband/include/ib_mad.h -> include/rdma/ib_mad.h
+ 0% - 0% drivers/infiniband/include/ib_pack.h -> include/rdma/ib_pack.h
+ 1% - 6% drivers/infiniband/include/ib_sa.h -> include/rdma/ib_sa.h
+ 0% -11% drivers/infiniband/include/ib_smi.h -> include/rdma/ib_smi.h
+ 3% - 6% drivers/infiniband/include/ib_user_mad.h -> include/rdma/ib_user_mad.h
+ 4% - 2% drivers/infiniband/include/ib_verbs.h -> include/rdma/ib_verbs.h
+ 0% -16% include/asm-ppc64/ioctl.h -> include/asm-powerpc/ioctl.h
+ 0% - 9% include/asm-ppc64/ioctls.h -> include/asm-powerpc/ioctls.h
+ 5% - 9% include/asm-ppc64/mc146818rtc.h -> include/asm-powerpc/mc146818rtc.h
+ 0% - 5% include/asm-ppc64/mman.h -> include/asm-powerpc/mman.h
+ 2% -25% include/asm-ppc64/sembuf.h -> include/asm-powerpc/sembuf.h
+ 3% -13% include/asm-ppc64/shmbuf.h -> include/asm-powerpc/shmbuf.h
+ 0% -15% include/asm-ppc64/sockios.h -> include/asm-powerpc/sockios.h
+ 1% - 5% include/asm-ppc64/topology.h -> include/asm-powerpc/topology.h
+ 0% -15% include/asm-ppc64/user.h -> include/asm-powerpc/user.h
+ 0% -21% include/asm-ppc/agp.h -> include/asm-powerpc/agp.h
+12% -16% include/asm-ppc/msgbuf.h -> include/asm-xtensa/msgbuf.h
+ 5% -25% include/asm-ppc/namei.h -> include/asm-powerpc/namei.h
+ 4% -18% include/asm-ppc/param.h -> include/asm-powerpc/param.h
+ 0% -13% include/asm-ppc/poll.h -> include/asm-powerpc/poll.h
+ 0% -24% include/asm-ppc/shmbuf.h -> include/asm-xtensa/shmbuf.h
+ 1% -17% include/asm-ppc/socket.h -> include/asm-powerpc/socket.h
+ 0% - 9% include/asm-ppc/string.h -> include/asm-powerpc/string.h
+ 1% -10% include/asm-ppc/termbits.h -> include/asm-powerpc/termbits.h
+ 0% - 3% include/asm-ppc/termios.h -> include/asm-powerpc/termios.h
+ 5% -22% include/asm-ppc/unaligned.h -> include/asm-powerpc/unaligned.h

Regards,
Rutger.

-- 
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

[-- Attachment #2: shpatch.rb --]
[-- Type: text/plain, Size: 14758 bytes --]

#!/usr/bin/env ruby

# Usage: shpatch.rb --help

require 'md5'
require 'ostruct'
require 'optparse'

$config = OpenStruct.new
$config.command = :PATCH
$config.same_base = false
$config.same_ext = true
$config.same_mime = false
$config.changed_content = true
$config.max_removed = 25	# 0 .. 100
$config.max_added = 50
$config.verbose = false

# Default dirglobs to ignore
ignore_globs = [
  "BitKeeper", "PENDING", "SCCS", "CVS", "*.state", "*.o", "*.a", "*.so",
  "*~", "#*#", "*.orig", "*.dll"
]

# Option parsing
$opts = OptionParser.new
$opts.banner = %Q{\
Generate a shellpatch file, or perform the patch in a shellpatch file.
A shellpatch file is a patch file which contains shell-commands
including 'mv' and 'patch'.

Determining the renames uses a lot of heuristics and a brute-force
approach; your mileage may vary. All trivial file renames are handled
by comparing the complete contents. All remaining files (the list of
added and removed files) are then searched through to find matching
pairs: this is quite costly.
A cache of md5 sums is kept at the root of the repositories to make
finding differences fast.

(c)2005 R. Nijlunsing <shpatch@tux.tmfweb.nl>
License: GPLv2

Usage: shpatch [options]

Defaults options are within [brackets].

}
$opts.separator("Diff options")
$opts.on("-d", "--diff PATH1,PATH2", Array,
  "Generate a shellpatch of the diff", "between two directories") {
  |paths|
  if paths.size != 2
    raise Exception.new("Need two directories for --diff")
  end
  $config.command = :DIFF
  $config.paths = paths
}
$opts.separator("Diff options for heuristics to finding renames with changed content")
$opts.on("--[no-]changed-content",
  "Find renames with changed content [#{$config.changed_content}]" ) { |cc|
  $config.changed_content = cc
}
$opts.on("--[no-]same-base",
  "Rename only to files with same basename [#{$config.same_base}]") { |sb|
  $config.same_base = sb
}
$opts.on("--[no-]same-ext",
  "Rename only to same extension [#{$config.same_ext}]") { |se|
  $config.same_ext = se
}
$opts.on("--[no-]same-mime",
	 "Rename only to same mimetype [#{$config.same_mime}]") { |sm|
  $config.same_mime = sm
}
$opts.on("--max-removed PERC", String,
  "Max. percentage of source file which may",
  "be removed while still being considered",
  "a rename [#{$config.max_removed}]"
) { |perc| $config.max_removed = perc.to_i }
$opts.on("--max-added PERC", String,
  "Max. percentage of destination file which may",
  "be added while still being considered",
  "a rename [#{$config.max_added}]"
) { |perc| $config.max_added = perc.to_i }
$opts.separator("Options to add to current patch")
$opts.on("--mv SOURCE DEST", String, String,
  "Adds a rename to the current patch", "and perform the rename") {
  |path1, path2|
  $config.command = :MV
  $config.paths = [path1, path2]
}
$opts.separator("General options")
$opts.on("--[no-]verbose", "-v", "Be more verbose") { |v| $config.verbose = v }
$opts.on("--help", "-h", "This usage") { puts $opts; exit 1 }
%Q{

Examples:
  shpatch.rb --diff linux-2.6.8,linux-2.6.9 --max-removed 10
    Generate a shellpatch with renames from directories
    linux-2.6.8 to linux-2.6.9 . At most 10% of a file may be removed
    between versions, otherwise they are considered different.
}.split("\n").each { |line| $opts.separator(line) }
begin
  $opts.parse!(ARGV)
rescue Exception
  puts "#{$opts}\n!!! #{$!}"
  exit 1
end

module Shell
  # Escape string string so that it is parsed to the string itself
  # E.g. Shell.escape("what's in a name") = "what\'s\ in\ a\ name"
  # Compare to Regexp.escape
  def Shell.escape(string)
    string.gsub(%r{([^-._0-9a-zA-Z/])}i, '\\\\\1')
  end
end

# One hunk in the patch
class RenameHunk
  attr_accessor :from, :to	# Strings: pathname from and to

  def initialize(from, to)
#    puts "# Found a rename: #{Shell.escape(from)} -> #{Shell.escape(to)}"
    @from = from; @to = to
  end
  def command; "mv"; end
  def to_s; "#{command} #{Shell.escape(@from)} #{Shell.escape(@to)}"; end
  def execute(repo)
    File.rename("#{repo.root}/#@from", "#{repo.root}/#@to")
  end
end

class DeleteHunk
  attr_accessor :pathname
  def initialize(pathname); @pathname = pathname; end
  def command; "rm"; end
  def to_s; "#{command} #{Shell.escape(@pathname)}"; end
  def execute(repo); File.delete("#{repo.root}/#@pathname"); end
end

class PatchHunk
  attr_accessor :from, :to, :contents
  def initialize(repo1, from, repo2, to)
    # Keep references to both repositories; to_s needs their roots.
    @repo1 = repo1; @from = from; @repo2 = repo2; @to = to
  end
  def command; "patch"; end
  def to_s
    long_from = Shell.escape((from[0] == ?/ ? "" : @repo1.root + "/") + from)
    long_to = Shell.escape((to[0] == ?/ ? "" : @repo2.root + "/") + to)
    puts "# Diffing #{long_from} -> #{long_to}" if $config.verbose
    @contents = File.popen("diff --unified #{long_from} #{long_to}") { |io|
      io.read
    }

    mark = "_SHPATCHMARK_"
    # Make mark unique
    mark += rand(10).to_s while @contents.index(mark)
    "#{command} <<#{mark}\n#{@contents}#{mark}"
  end
end

# A filesystem as backing store
class FileSystem
  SHPATCHSTATE_FILE = ".shpatch.state"
  SHPATCHSTATE_VERSION_STRING = "shpatch.rb state version 20050418-2"

  attr_accessor :root
  attr_accessor :cache_file # String: filename with signatures
  attr_accessor :signature_cache # From Fixnum inode to Array [mtime, sig]
  attr_accessor :signature_cache_changed # Boolean

  # Reads the cache. When not readable in current directory, go
  # up a level ('..')
  def read_signatures
    @signature_cache = {}
    @signature_cache_changed = false
    @cache_file = File.expand_path("#@root/#{SHPATCHSTATE_FILE}")
    cache_file = @cache_file
    loop {
      if FileTest.readable?(cache_file)
	File.open(cache_file, "rb") do |file|
	  version_string = file.readline.chomp
	  if version_string == SHPATCHSTATE_VERSION_STRING
	    begin
	      @signature_cache = Marshal.load(file) 
	      puts "# Read signature cache with #{@signature_cache.size} signatures from #{cache_file.inspect}" if $config.verbose
	      @cache_file = cache_file
	      break
	    rescue ArgumentError, EOFError
	      puts "# (error reading state file: rebuilding file...)" if $config.verbose
	    end
	  end
	end
      end
      parent_cache_file = File.expand_path(
	File.dirname(cache_file) + "/../" + File.basename(cache_file)
      )
      break if parent_cache_file == cache_file
      cache_file = parent_cache_file
    }
  end

  def initialize(root)
    raise "#{root.inspect} does not exist" if not File.exists?(root)
    @root = root
    read_signatures
  end

  def save_signatures
    # Save all unsaved signature cache
    return if !@signature_cache_changed
    puts "# Saving #{@signature_cache.size} signatures..." if $config.verbose
    pf = @cache_file
    File.open("#{pf}.new", "wb+") do |file|
      file.puts SHPATCHSTATE_VERSION_STRING
      Marshal.dump(@signature_cache, file)
      File.rename("#{pf}.new", pf)
    end      
  end

  # Returns array of [mtime, one-line signature-string]
  def signature(stat, filename)
    signature = nil
    key = [stat.dev, stat.ino]
    cache = @signature_cache[key]
    if cache and (cache[0] == stat.mtime)
      signature = cache[1]
    else
      if $config.verbose
	why = (cache ? "#{(stat.mtime - cache[0]).to_i}s out of date" : "not indexed")
	puts "# Creating signature for #{filename.inspect} (#{why})" 
      end
      signature = MD5.new(File.read(filename)).digest
      @signature_cache[key] = [stat.mtime, signature]
      @signature_cache_changed = true
    end
    signature
  end

  def signature_from(prefix, res, from, ignoreRe)
    Dir.new("#{prefix}#{from}").entries.each { |elem|
      next if (elem == ".") or (elem == "..")
      fullname = "#{prefix}#{from}/#{elem}"
      if not fullname =~ ignoreRe
	stat = File.stat(fullname)
	if stat.directory?
	  signature_from(prefix, res, "#{from}/#{elem}", ignoreRe) 
	else
	  rel_filename = "#{from}/#{elem}"[1..-1]
	  res[rel_filename] = signature(stat, fullname)
	end
      end
    }
  end

  # Returns all filenames within this filesystem with all signatures
  def signatures(ignoreRe)
    res = {}
    prefix = File.expand_path(@root)
    signature_from(prefix, res, "", ignoreRe)
    save_signatures
    res
  end

  def mime_type(filename)
    path = @root + "/" + filename
    ($mime_cache ||= {})[path] ||=
      File.popen("file --mime #{Shell.escape(path)}") { |io| io.read }.
      gsub(%r{^.*:}, "").strip
  end

  # Read the contents of a file
  def read(filename); File.read(@root + "/" + filename); end
end

patch = []

dir1, dir2 = $config.paths
repo1 = FileSystem.new(dir1)
repo2 = FileSystem.new(dir2)

def re_from_globs(globs)
  Regexp.new(
    "(\\A|/)(" + globs.collect { |glob| 
       Regexp.escape(glob).gsub("\\*", "[^/]*")
    }.join("|") + ")$"
  )
end

ignore_globs += ["BitKeeper/etc/ignore", ".cvsignore"].collect { |a|
  ["#{dir1}/#{a}", "#{dir2}/#{a}"]
}.flatten.find_all { |f| File.exists?(f) }.collect { |f|
  File.readlines(f).collect { |line| line.chomp }
}.flatten
ignore_globs = ignore_globs.uniq.sort
ignoreRe = re_from_globs(ignore_globs)

puts "# Retrieving signatures of #{dir1.inspect}" if $config.verbose
file2sig1 = repo1.signatures(ignoreRe)
puts "# Retrieving signatures of #{dir2.inspect}" if $config.verbose
file2sig2 = repo2.signatures(ignoreRe)
files1 = file2sig1.keys.sort
files2 = file2sig2.keys.sort
common_files = files1 - (files1 - files2)

# Different hash, same filename: patch
common_files.each { |fname|
  if file2sig1[fname] != file2sig2[fname]
    patch << PatchHunk.new(repo1, fname, repo2, fname)
  end
  file2sig1.delete(fname)
  file2sig2.delete(fname)
}

# Same hash, different filename: rename
sig2file1 = file2sig1.invert
sig2file2 = file2sig2.invert
sigs1 = sig2file1.keys
sigs2 = sig2file2.keys
common_sigs = sigs1 - (sigs1 - sigs2)
common_sigs.each { |sig|
  from = sig2file1[sig]
  to = sig2file2[sig]
  patch << RenameHunk.new(from, to)
  sig2file1.delete(sig)
  sig2file2.delete(sig)
  file2sig1.delete(from)
  file2sig2.delete(to)
}

# statistics of contents of a file. Used for quick-compare
class FileContentStats
  attr_accessor :size		# Size of file in lines
  attr_accessor :lines		# Hash from String to Fixnum

  # Counter number of lines removed and added as a percentage
  # of the total file length. These are a measure for the degree
  # of matching between the files.
  def diff_match(other)
    added = 0
    removed = 0
    @lines.each_pair { |line, count|
      delta = other.lines[line] - count
      if delta > 0
	added += delta
      else
	removed += -delta
      end
    }
    other.lines.each_pair { |line, count|
      added += count unless @lines.key?(line)
    }
    [added * 100 / other.size, removed * 100 / self.size]
  end

  def initialize(repo, path)
    @lines = Hash.new(0)
    size = 0
    repo.read(path).delete("\0").each_line { |line|
      @lines[line.intern] += 1
      size += 1
    }
    @size = size
  end

  def self.cached(repo, path)
    @@cache ||= {}
    @@cache[[repo, path]] ||= self.new(repo, path)
  end
end

# Categorize a file based on filename and/or contents
def pool_type(repo, path)
  res = []
  res << File.basename(path) if $config.same_base
  res << File.extname(path) if $config.same_ext
  res << repo.mime_type(path) if $config.same_mime
  res
end

# Determine how much a filename looks like another filename
# by splitting the filenames into words. Then count the
# words which are the same.
def path_correlation(path1, path2)
  comp1 = path1.split(%r{[-._/]})
  comp2 = path2.split(%r{[-._/]})
  (comp1 - (comp1 - comp2)).size
end

class Array
  # The inverse of an array is an hash from contents to index number.
  def inverse; res = {}; each_with_index { |e, idx| res[e] = idx }; res; end
end

# Recompute the added/removed lists here (outside the conditional) so the
# final /dev/null hunks below also work with --no-changed-content.
files1 = file2sig1.keys.sort
files2 = file2sig2.keys.sort
all_added_files = files2 - files1
all_removed_files = files1 - files2

if $config.changed_content

  pools = {}			# Group files into 'pools'
  all_removed_files.each { |removed_file|
    (pools[pool_type(repo1, removed_file)] ||= [[], []])[0] << removed_file
  }
  all_added_files.each { |added_file|
    (pools[pool_type(repo2, added_file)] ||= [[], []])[1] << added_file
  }

  pools.each_pair { |key, pool|
    removed_files, added_files = *pool
    if $config.verbose and not removed_files.empty? and not added_files.empty?
      puts "# Comparing pool type #{key.inspect} with #{pool[0].size}x#{pool[1].size} filepairs" 
    end

    # Determine how 'special' or 'specific' a word is. We start with
    # filenames containing special words.
    words = {}			# Group files by 'words'
    removed_files.each { |removed_file|
      removed_file.split(%r{[-._/]+}).uniq.each { |word|
	words[word] ||= [[], []]
	words[word][0] << removed_file
      }
    }
    added_files.each { |added_file|
      added_file.split(%r{[-._/]+}).uniq.each { |word|
	words[word] ||= [[], []]
	words[word][1] << added_file
      }
    }
    word_importance = words.keys.find_all { |word|
      (words[word][0].size * words[word][1].size) > 0
    }.sort_by { |word|
      words[word][0].size * words[word][1].size
    }.reverse
#    p word_importance
    word_importance = word_importance.inverse
    word_importance.default = 0

    removed_files.sort_by { |removed_file|
      removed_file.split(%r{[-._/]+}).uniq.inject(0) { |s, e|
	[s, word_importance[e]].max
      }
    }.reverse.each { |removed_file|
#      puts removed_file
      removed_file_stats = FileContentStats.new(repo1, removed_file)
      added_files.sort_by { |f| -path_correlation(removed_file, f) }.
        each { |added_file|
	added_file_stats = FileContentStats.cached(repo2, added_file)
	removed_size = removed_file_stats.size
	added_size = added_file_stats.size
	min_added = (added_size - removed_size) * 100 / added_size
	next if min_added > $config.max_added
	min_removed = (removed_size - added_size) * 100 / removed_size
	next if min_removed > $config.max_removed
	
	# Calculate added & removed percentages
	added, removed = removed_file_stats.diff_match(added_file_stats)
	if (added <= $config.max_added) && (removed <= $config.max_removed)
	  # We found a rename-match!
	  puts "+%2i%% -%2i%% #{removed_file} -> #{added_file}" % [added, removed] #if $config.verbose
	  patch << RenameHunk.new(removed_file, added_file)
	  # Don't match again against this added file:
	  added_files -= [added_file]
	  all_added_files -= [added_file]
	  all_removed_files -= [removed_file]
	  patch << PatchHunk.new(repo1, removed_file, repo2, added_file)
	  break
	end
      }
    }
  }
end

all_added_files.each { |added_file|
  patch << PatchHunk.new(repo1, "/dev/null", repo2, added_file)
}
all_removed_files.each { |removed_file|
  patch << PatchHunk.new(repo1, removed_file, repo2, "/dev/null")
}

#patch.each { |hunk| puts hunk.to_s }

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13 15:38         ` Linus Torvalds
  2006-03-14  0:49           ` Rutger Nijlunsing
@ 2006-03-14  0:55           ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2006-03-14  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> There's a pretty natural terminating character that works well for 
> sources: '\n'.

Good to know that great minds think alike ;-).  There is a
version that did this line-oriented hashing, buried in the next
branch.  I'll see how well it performs within the context of the
current somewhat restructured code.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-03-13  7:44     ` Linus Torvalds
  2006-03-13 10:43       ` Junio C Hamano
@ 2006-04-06 21:01       ` Geert Bosch
  2006-04-11 22:04         ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Geert Bosch @ 2006-04-06 21:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 5171 bytes --]


On Mar 13, 2006, at 02:44, Linus Torvalds wrote:
> It might be that the fast delta thing is a good way to ask "is this even
> worth considering", to cut down the O(m*n) rename/copy detection to
> something much smaller, and then use xdelta() to actually figure out what
> is a good rename and what isn't from a much smaller set of potential
> targets.

Here's a possible way to do that first cut. Basically,
compute a short (256-bit) fingerprint for each file, such
that the Hamming distance between two fingerprints is a measure
of their similarity. I'll include a draft write-up below.

My initial implementation seems reasonably fast and works
great for 4000 (decompressed) files (25M) randomly plucked
from an old git.git repository without packs. It works OK for
comparing tar archives for GCC releases, but then it becomes
clear that the random walks aren't that random anymore and
become dominated by repeated information, such as tar headers.

Speed is about 10MB/sec on my PowerBook, but one could cache
fingerprints so they only need to be computed once.
The nice thing is that one can quickly find similar files
using only the fingerprint (and in practice the file size),
with no filenames: this seems to fit the git model well.

I'll attach my test implementation below; it uses
David Mazieres' Rabinpoly code and D. Phillips's fls code.
Please don't mind my C coding, it's not my native language.
Also, this may have some Darwinisms, although it should
work on Linux too.

   -Geert

Estimating Similarity

For estimating similarity between strings A and B, let
SA and SB be the collections of all substrings of length
W in A and B, respectively. Similarity is then defined as the
ratio of the intersection and the union of SA and SB.
The length W of these substrings is the window size, and here is
chosen somewhat arbitrarily to be 48. The idea is to make them not
so short that all context is lost (like counting symbol frequencies),
but not so long that a few small changes can affect a large portion
of substrings.  Of course, a single symbol change may affect up to
48 substrings.

Let "&" be the string concatenation operator.
If A = S2 & S1 & S2 & S3 & S2, and B = S2 & S3 & S2 & S1 & S2,
then if the length of S2 is at least W - 1, the strings
will have the same set of substrings and be considered equal
for purposes of similarity checking.  This behavior is actually
welcome, since reordering sufficiently separated pieces of a
document does not make it substantially different.

Instead of computing the ratio of identical substrings directly,
compute a 1-bit hash for each substring and calculate the difference
between the number of zeroes and ones. If the hashes appear random,
this difference follows a binomial distribution. Two files are
considered "likely similar" if their differences have the same sign.

The assumption that the hashes are randomly distributed is not
true if there are many repeated substrings. For most applications,
it will be sufficient to ignore such repetitions (by using a small
cache of recently encountered hashes) as they do not convey much
actual information. For example, for the purpose of finding small
deltas between strings, duplicating existing text will not significantly
increase the delta.

For a string with N substrings, of which K changed, perform a random
walk of N steps in 1-dimensional space (see [1]): what is the probability
that the origin was crossed an odd number of times in the last K steps?
As the expected distance is Sqrt (2 * N / Pi), this probability
gets progressively smaller for larger N and a given ratio of N and K.
For larger files, the result should be quite stable.


In order to strengthen this similarity check and be able to
quantify the degree of similarity, many independent 1-bit hashes
are computed and counted for each string and assembled into
a bit vector of 256 bits, called the fingerprint. Each bit
of the fingerprint represents the result of an independent
statistical experiment. For similar strings, corresponding bits
are more likely to be the same than for random strings.

For efficiency, a 64-bit hash is computed using an irreducible
Rabin polynomial of degree 63. The algebraic properties
of these allow for efficient calculation over a sliding window
of the input. [2] As the cryptographic advantages of randomly
generated hash functions are not required, a fixed polynomial
has been chosen.

This 64-bit hash is expanded to 256 bits by using three bits
to select 32 of the 256 bits in the fingerprint to update.
So, for every 8-bit character the polynomial needs updating,
and 32 counters are incremented or decremented.
So, each of the 256 counters represents a random walk that
is N / 4, for a string of length N.

The similarity of A and B can now be expressed as the Hamming
distance between the two bit vectors, divided by the expected
distance between two random vectors. This similarity score is
a number between 0 and 2, where smaller values mean the strings
are more similar, and values of 1 or more mean they are dissimilar.

One of the unique properties of this fingerprint is the
ability to compare files in different locations by only
transmitting their fingerprint.



[-- Attachment #2: gsimm.c --]
[-- Type: application/octet-stream, Size: 10801 bytes --]

#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <assert.h>
#include <math.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#include "rabinpoly.h"

/* Length of file message digest (MD) in bytes. Longer MD's are
   better, but increase processing time for diminishing returns.
   Must be multiple of NUM_HASHES_PER_CHAR / 8, and at least 24
   for good results 
*/
#define MD_LENGTH 32
#define MD_BITS (MD_LENGTH * 8)

/* Has to be power of two. Since the Rabin hash only has 63
   usable bits, the number of hashes is limited to 32.
   Lower powers of two could be used for speeding up processing
   of very large files.  */
#define NUM_HASHES_PER_CHAR 32


/* For the final counting, do not count each bit individually, but
   group them. Must be power of two, at most NUM_HASHES_PER_CHAR.
   However, larger sizes result in higher cache usage. Use 8 bits
   per group for efficient processing of large files on fast machines
   with decent caches, or 4 bits for faster processing of small files
   and for machines with small caches.  */
#define GROUP_BITS 4
#define GROUP_COUNTERS (1<<GROUP_BITS)


/* The RABIN_WINDOW_SIZE is the size of fingerprint window used by 
   Rabin algorithm. This is not a modifiable parameter.

   The first RABIN_WINDOW_SIZE - 1 bytes are skipped, in order to ensure
   fingerprints are good hashes. This does somewhat reduce the
   influence of the first few bytes in the file (they're part of
   fewer windows, like the last few bytes), but that actually isn't
   so bad as files often start with fixed content that may bias comparisons.
*/

/* The MIN_FILE_SIZE indicates the absolute minimal file size that
   can be processed. As indicated above, the first and last 
   RABIN_WINDOW_SIZE - 1 bytes are skipped. 
   In order to get at least an average of 12 samples
   per bit in the final message digest, require at least 3 * MD_LENGTH
   complete windows in the file.  */
#define MIN_FILE_SIZE (3 * MD_LENGTH + 2 * (RABIN_WINDOW_SIZE - 1))

/* Limit matching algorithm to files less than 256 MB, so we can use
   32 bit integers everywhere without fear of overflow. For larger
   files we should add logic to mmap the file by piece and accumulate
   the frequency counts. */
#define MAX_FILE_SIZE (256*1024*1024 - 1)

/* Size of cache used to eliminate duplicate substrings.
   Make small enough to comfortably fit in L1 cache.  */
#define DUP_CACHE_SIZE 256

#define MIN(x,y) ((y)<(x) ? (y) : (x))
#define MAX(x,y) ((y)>(x) ? (y) : (x))

typedef struct fileinfo
{ char		*name;
  size_t	length;
  u_char	md[MD_LENGTH];
  int		match;
} File;

int flag_verbose = 0;
int flag_debug = 0;
int flag_warning = 0;
char *flag_relative = 0;

char cmd[12] = "        ...";
char md_strbuf[MD_LENGTH * 2 + 1];
u_char relative_md [MD_LENGTH];

File *file;
int    file_count;
size_t file_bytes;

FILE *msgout;

char hex[17] = "0123456789abcdef";
double pi = 3.14159265358979323844;

int freq[MD_BITS];
u_int64_t freq_dups = 0;

void usage()
{  fprintf (stderr, "usage: %s [-dhvw] [-r fingerprint] file ...\n", cmd);
   fprintf (stderr, " -d\tdebug output, repeat for more verbosity\n");
   fprintf (stderr, " -h\tshow this usage information\n");
   fprintf (stderr, " -r\tshow distance relative to fingerprint "
                    "(%u hex digits)\n", MD_LENGTH * 2);
   fprintf (stderr, " -v\tverbose output, repeat for even more verbosity\n");
   fprintf (stderr, " -w\tenable warnings for suspect statistics\n");
   exit (1);
}

int dist (u_char *l, u_char *r)
{ int j, k;
  int d = 0;

  for (j = 0; j < MD_LENGTH; j++)
  { u_char ch = l[j] ^ r[j];

    for (k = 0; k < 8; k++) d += ((ch & (1<<k)) > 0);
  } 

  return d;
}

char *md_to_str(u_char *md)
{ int j;

  for (j = 0; j < MD_LENGTH; j++)
  { u_char ch = md[j];

    md_strbuf[j*2] = hex[ch >> 4];
    md_strbuf[j*2+1] = hex[ch & 0xF];
  }

  md_strbuf[j*2] = 0;
  return md_strbuf;
}

u_char *str_to_md(char *str, u_char *md)
{ int j;

  if (!md || !str) return 0;

  bzero (md, MD_LENGTH);
  
  for (j = 0; j < MD_LENGTH * 2; j++)
  { char ch = str[j];

    if (ch >= '0' && ch <= '9')
    { md [j/2] = (md [j/2] << 4) + (ch - '0'); 
    }
    else
    { ch |= 32;

      if (ch < 'a' || ch > 'f') break;
      md [j/2] = (md[j/2] << 4) + (ch - 'a' + 10);
  } } 

  return (j != MD_LENGTH * 2 || str[j] != 0) ? 0 : md;
}
    
void freq_to_md(u_char *md)
{ int j, k;
  int num = MD_BITS;

  for (j = 0; j < MD_LENGTH; j++)
  { u_char ch = 0;

    for (k = 0; k < 8; k++) ch = 2*ch + (freq[8*j+k] > 0);
    md[j] = ch;
  }

  if (flag_debug)
  { for (j = 0; j < num; j++)
    { if (j % 8 == 0) printf ("\n%3u: ", j);
      printf ("%7i ", freq[j]);
    }
    printf ("\n");
  }
  bzero (freq, sizeof(freq));
  freq_dups = 0;
}

void process_data (char *name, u_char *data, unsigned len, u_char *md)
{ size_t j = 0;
  u_int32_t ofs;
  u_int32_t dup_cache[DUP_CACHE_SIZE];
  u_int32_t count [MD_BITS * (GROUP_COUNTERS/GROUP_BITS)];
  bzero (dup_cache, DUP_CACHE_SIZE * sizeof (u_int32_t));
  bzero (count, (MD_BITS * (GROUP_COUNTERS/GROUP_BITS) * sizeof (u_int32_t)));

  /* Ignore incomplete substrings */
  while (j < len && j < RABIN_WINDOW_SIZE) rabin_slide8 (data[j++]);

  while (j < len)
  { u_int64_t hash;
    u_int32_t ofs, sum;
    u_char idx;
    int k;

    hash = rabin_slide8 (data[j++]);

    /* In order to update a much larger frequency table
       with only 32 bits of checksum, randomly select a
       part of the table to update. The selection should
       only depend on the content of the represented data,
       and be independent of the bits used for the update.
       
       Instead of updating 32 individual counters, process
       the checksum in MD_BITS / GROUP_BITS groups of 
       GROUP_BITS bits, and count the frequency of each bit pattern.
    */

    idx = (hash >> 32);
    sum = (u_int32_t) hash;
    ofs = idx % (MD_BITS / NUM_HASHES_PER_CHAR) * NUM_HASHES_PER_CHAR;
    idx %= DUP_CACHE_SIZE;
    if (dup_cache[idx] == sum)
    { freq_dups++; 
    }
    else
    { dup_cache[idx] = sum; 
      for (k = 0; k < NUM_HASHES_PER_CHAR / GROUP_BITS; k++)
      { count[ofs * GROUP_COUNTERS / GROUP_BITS + (sum % GROUP_COUNTERS)]++;
        ofs += GROUP_BITS;
        sum >>= GROUP_BITS;
  } } }

  /* Distribute the occurrences of each bit group over the frequency table. */
  for (ofs = 0; ofs < MD_BITS; ofs += GROUP_BITS)
  { int j;
    for (j = 0; j < GROUP_COUNTERS; j++)
    { int k;
      for (k = 0; k < GROUP_BITS; k++)
      { freq[ofs + k] += ((1<<k) & j) 
          ? count[ofs * GROUP_COUNTERS / GROUP_BITS + j]
          : -count[ofs * GROUP_COUNTERS / GROUP_BITS + j];
  } } }
      
  { int j;
    int num = MD_BITS;
    int stat_warn = 0;
    double sum = 0.0;
    double sumsqr = 0.0;
    double average, variance, stddev, bits, exp_average, max_average;

    assert (num >= 2);

    sum = 0;

    for (j = 0; j < num; j++)
    { double f = fabs ((double) freq[j]);
      sum += f;
      sumsqr += f*f;
    }

    variance = (sumsqr - (sum * sum / num)) / (num - 1);
    average = sum / num;
    stddev = sqrt (variance);
    bits = (NUM_HASHES_PER_CHAR * (file[file_count].length - freq_dups)) 
             / (8 * MD_LENGTH);
    /* Random files, or short files with few repetitions should have
       average very close to the expected average. Large deviations
       show there is too much redundancy, or there is another problem
       with the statistical fundamentals of the algorithm. */
    exp_average = sqrt (2 * bits / pi);
    max_average = 2.0 * pow (2 * bits / pi, 0.6);

    stat_warn = flag_warning
      && (average < exp_average * 0.5 || average > max_average);
    if (stat_warn)
    { fprintf (stdout, "%s: warning: "
               "too much redundancy, fingerprint may not be accurate\n",
               file[file_count].name);
      
    }

    if (flag_verbose > 1 || (flag_verbose && stat_warn))
    { printf 
        ("%i frequencies, average %5.1f, std dev %5.1f, %2.1f %% duplicates, "
         "\"%s\"\n",
         num, average, stddev,
         100.0 * freq_dups / (double) file[file_count].length,
         file[file_count].name);
      printf
        ("%1.0f expected bits per frequency, "
         "expected average %1.1f, max average %1.1f\n",
         bits, exp_average, max_average);
  } }

  if (md)
  { rabin_reset();
    freq_to_md (md);
    if (flag_relative)
    { int d = dist (md, relative_md);
      double sim = 1.0 - MIN (1.0, (double) (d) / (MD_LENGTH * 4 - 1));
      fprintf (stdout, "%s %llu %u %s %u %3.1f\n", 
               md_to_str (md), (long long unsigned) 0, len, name, 
               d, 100.0 * sim);
    }
    else
    {
      fprintf (stdout, "%s %llu %u %s\n", 
               md_to_str (md), (long long unsigned) 0, len, name);
} } }

void process_file (char *name)
{ int fd;
  struct stat fs;
  u_char *data;
  File *fi = file + file_count;

  fd = open (name, O_RDONLY, 0);
  if (fd < 0) 
  { perror (name);
    exit (2);
  }

  if (fstat (fd, &fs))
  { perror (name);
    exit (2);
  }

  if (fs.st_size >= MIN_FILE_SIZE
      && fs.st_size <= MAX_FILE_SIZE)
  { fi->length = fs.st_size;
    fi->name = name;

    data = (u_char *) mmap (0, fs.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    if (data == (u_char *) -1)
    { perror (name);
      exit (2);
    }

    process_data (name, data, fs.st_size, fi->md);
    munmap (data, fs.st_size);
    file_bytes += fs.st_size;
    file_count++;
  } else if (flag_verbose) 
  { fprintf (stdout, "skipping %s (size %llu)\n", name,
             (long long unsigned) fs.st_size); }

  close (fd);
}

int main (int argc, char *argv[])
{ int ch, j;

  strncpy (cmd, basename (argv[0]), 8);
  msgout = stdout;

  while ((ch = getopt(argc, argv, "dhr:vw")) != -1)
  { switch (ch) 
    { case 'd': flag_debug++;
		break;
      case 'r': if (!optarg)
                { fprintf (stderr, "%s: missing argument for -r\n", cmd);
                  return 1;
                }
                if (str_to_md (optarg, relative_md)) flag_relative = optarg;
                else
                { fprintf (stderr, "%s: not a valid fingerprint\n", optarg);
                  return 1;
                }
                break;
      case 'v': flag_verbose++;
                break;
      case 'w': flag_warning++;
                break;
      default : usage();
                return (ch != 'h');
  } }

  argc -= optind;
  argv += optind;

  if (argc == 0) usage();

  rabin_reset ();
  if (flag_verbose && flag_relative)
  { fprintf (stdout, "distances are relative to %s\n", flag_relative);
  }

  file = (File *) calloc (argc, sizeof (File));

  for (j = 0; j < argc; j++) process_file (argv[j]);

  if (flag_verbose) 
  { fprintf (stdout, "%lu bytes in %i files\n",
             (unsigned long) file_bytes, file_count);
  }

  return 0;
}

[-- Attachment #3: rabinpoly.c --]
[-- Type: application/octet-stream, Size: 3648 bytes --]

/*
 *
 * Copyright (C) 1999 David Mazieres (dm@uun.org)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 *
 */

  /* Faster generic_fls */
  /* (c) 2002, D.Phillips and Sistina Software */

#include "rabinpoly.h"
#define MSB64 0x8000000000000000ULL

static inline unsigned fls8(unsigned n)
{
       return n & 0xf0?
           n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
           n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
}

static inline unsigned fls16(unsigned n)
{
       return n & 0xff00? fls8(n >> 8) + 8: fls8(n);
}

static inline unsigned fls32(unsigned n)
{
       return n & 0xffff0000? fls16(n >> 16) + 16: fls16(n);
}

static inline unsigned fls64(unsigned long long n) /* should be u64 */
{
       return n & 0xffffffff00000000ULL? fls32(n >> 32) + 32: fls32(n);
}


static u_int64_t polymod (u_int64_t nh, u_int64_t nl, u_int64_t d);
static void      polymult (u_int64_t *php, u_int64_t *plp,
                           u_int64_t x, u_int64_t y);
static u_int64_t polymmult (u_int64_t x, u_int64_t y, u_int64_t d);

static u_int64_t poly = 0xb15e234bd3792f63ull;	// Actual polynomial
static u_int64_t T[256];			// Lookup table for mod
static int shift;

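/* Append byte m to fingerprint p and reduce modulo the polynomial. */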
u_int64_t append8 (u_int64_t p, u_char m) 
{ return ((p << 8) | m) ^ T[p >> shift]; 
}

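/* Remainder of the 128-bit polynomial nh:nl divided by d, over GF(2). */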
static u_int64_t
polymod (u_int64_t nh, u_int64_t nl, u_int64_t d)
{ assert (d);
  int i;
  int k = fls64 (d) - 1;
  d <<= 63 - k;

  if (nh) {
    if (nh & MSB64)
      nh ^= d;
    for (i = 62; i >= 0; i--)
      if (nh & 1ULL << i) {
	nh ^= d >> (63 - i);
	nl ^= d << (i + 1);
      }
  }
  for (i = 63; i >= k; i--)
    if (nl & 1ULL << i)
      nl ^= d >> (63 - i);
  return nl;
}

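/* Carry-less (GF(2)) product of x and y; 128-bit result in *php:*plp. */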
static void
polymult (u_int64_t *php, u_int64_t *plp, u_int64_t x, u_int64_t y)
{ int i;
  u_int64_t ph = 0, pl = 0;
  if (x & 1)
    pl = y;
  for (i = 1; i < 64; i++)
    if (x & (1ULL << i)) {
      ph ^= y >> (64 - i);
      pl ^= y << i;
    }
  if (php)
    *php = ph;
  if (plp)
    *plp = pl;
}

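/* (x * y) mod d over GF(2); used to precompute the lookup tables. */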
static u_int64_t
polymmult (u_int64_t x, u_int64_t y, u_int64_t d)
{
  u_int64_t h, l;
  polymult (&h, &l, x, y);
  return polymod (h, l, d);
}

static int size = RABIN_WINDOW_SIZE;
static u_int64_t fingerprint = 0;
static int bufpos = -1;
static u_int64_t U[256];
static u_char buf[RABIN_WINDOW_SIZE];

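/* Precompute T[] (reduction of the byte shifted out at the top) and
   U[] (contribution of the byte leaving the window) for poly. */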
void rabin_init ()
{ assert (poly >= 0x100);
  u_int64_t sizeshift = 1;
  int xshift = fls64 (poly) - 1;
  int i, j;
  shift = xshift - 8;
  u_int64_t T1 = polymod (0, 1ULL << xshift, poly);
  for (j = 0; j < 256; j++)
    T[j] = polymmult (j, T1, poly) | ((u_int64_t) j << xshift);

  for (i = 1; i < size; i++)
    sizeshift = append8 (sizeshift, 0);
  for (i = 0; i < 256; i++)
    U[i] = polymmult (i, sizeshift, poly);
  bzero (buf, sizeof (buf));
}

void
rabin_reset ()
{ rabin_init();
  fingerprint = 0; 
  bzero (buf, sizeof (buf));
}

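/* Slide the window one byte: drop the oldest byte and append m. */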
u_int64_t
rabin_slide8 (u_char m)
{ u_char om;
  if (++bufpos >= size) bufpos = 0;

  om = buf[bufpos];
  buf[bufpos] = m;
  fingerprint = append8 (fingerprint ^ U[om], m);

  return fingerprint;
}
  

[-- Attachment #4: rabinpoly.h --]
[-- Type: application/octet-stream, Size: 1015 bytes --]

/*
 *
 * Copyright (C) 2000 David Mazieres (dm@uun.org)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 *
 * Translated to C and simplified by Geert Bosch (bosch@gnat.com)
 */

#include <assert.h>
#include <strings.h>
#include <sys/types.h>

#ifndef RABIN_WINDOW_SIZE
#define RABIN_WINDOW_SIZE 48
#endif
void rabin_reset(); 
u_int64_t rabin_slide8(u_char c); 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-04-06 21:01       ` Geert Bosch
@ 2006-04-11 22:04         ` Junio C Hamano
  2006-04-14 17:46           ` Geert Bosch
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2006-04-11 22:04 UTC (permalink / raw)
  To: Geert Bosch; +Cc: git

Geert Bosch <bosch@adacore.com> writes:

> On Mar 13, 2006, at 02:44, Linus Torvalds wrote:
>
>> It might be that the fast delta thing is a good way to ask
>> "is this even worth considering", to cut down the O(m*n)
>> rename/copy detection to something much smaller, and then use
>> xdelta() to actually figure out what is a good rename and
>> what isn't from a much smaller set of potential targets.
>
>
> Here's a possible way to do that first cut. Basically,
> compute a short (256-bit) fingerprint for each file, such
> that the Hamming distance between two fingerprints is a measure
> for their similarity. I'll include a draft write up below.

Thanks for starting this.

There are a few things I need to talk about the way "similarity"
is _used_ in the current algorithms.

Rename/copy detection outputs "similarity" but I suspect what
the algorithm wants is slightly different from what humans think
of "similarity".  It is somewhere between "similarity" and
"commonness".  When you are grading a 130-page report a student
submitted, you would want to notice that the last 30 pages are
an almost verbatim copy from somebody else's report.  The student
in question added 100 pages of original content, so maybe this is
not too bad, but if the report were a 30-page one, and the
entire 30 pages were borrowed from somebody else's 130-page
report, you would _really_ want to notice.
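
To pin the two terms down on the report example, here is a small
sketch, only for illustration and not code from diffcore; assume
the other student's report is 130 pages in both cases:

    /* Both reports share copied = 30 pages with the other report. */
    int copied = 30, src_size = 130;

    /* "Similarity" relative to the larger side: the 130-page and the
       30-page report both score 30 * 100 / 130 = 23, so the wholesale
       copy is not flagged any harder than the mostly-original one. */
    int similarity (int dst_size)
    { int m = dst_size > src_size ? dst_size : src_size;
      return copied * 100 / m;
    }

    /* "Commonness" relative to the smaller side: the 30-page report
       scores 30 * 100 / 30 = 100 and stands out immediately. */
    int commonness (int dst_size)
    { int m = dst_size < src_size ? dst_size : src_size;
      return copied * 100 / m;
    }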

While reorganizing a program, a nontrivial amount of text is
often removed from an existing file and moved to a newly created
file.  Right now, the way the similarity score is calculated has a
heuristic cap to reject two files whose sizes are very
different, but to detect and show this kind of file split, the
sizes of files should matter less.

On the other hand, taking this "commonness matters" to its
extreme is not what we want.  We are producing "diff", so if a
30-line new file was created by moving these 30 lines from an
originally 130-line file (which is now 100 lines long), showing
it as "copy from the-130-line-file" with a diff to remove
100 lines is usually not what we want.  That's why the size cap
makes sense in the rename similarity estimator.

Another place we use "similarity" is to break a file that got
modified too much.  This is done for two independent purposes.

One is to detect a case like this:

	mv B C
        mv A B
        small-edit B

File B's content is not related to what it had originally, but
is derived from what was originally in A.  Usually rename/copy
detection tries to find rename/copy into files that _disappear_
from the result, but with the above sequence, B never
disappears.  By looking at how dissimilar the preimage and
postimage of B are, we tell the rename/copy detector that B,
although it does not disappear, might have been renamed/copied
from somewhere else.

Another is to present the final diff output as a complete
rewrite.  When -B (break) is used without -M (rename) or -C
(copy), or a file that got a lot of edit and got "broken" turned
out to be purely a total edit (i.e. not renamed/copied from
somewhere else), we would present it as diff output that has
only one hunk, with a bunch of '-' (removal) to remove all
original contents first and then '+' (addition) to add all the
new contents, which is often easier to read than an ordinary
unidiff between two unrelated contents that matches up lines
that happen to be the same.  Empirically, it seems to give
better results if the "similarity" threshold to "break" a file
(i.e. to consider it might have been renamed/copied from
somewhere else) is set lower than the threshold to show the
diff as a complete rewrite patch.

Also we can make commonness matter even more in the similarity
used to "break" a file than in the rename detector, because if we are
going to break it, we will not have to worry about the issue of
showing an annoying diff that removes 100 lines after copying a
130-line file.  This implies that the break algorithm needs to
use two different kinds of similarity, one for breaking and then
another for deciding how to show the broken pieces as a diff.

Sorry if this write-up does not make much sense.  It ended up
being a lot less coherent than I hoped it would be.

Anyway, sometime this week I'll find time to play with your code
myself.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Fix up diffcore-rename scoring
  2006-04-11 22:04         ` Junio C Hamano
@ 2006-04-14 17:46           ` Geert Bosch
  0 siblings, 0 replies; 13+ messages in thread
From: Geert Bosch @ 2006-04-14 17:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git


On Apr 11, 2006, at 18:04, Junio C Hamano wrote:
>> Here's a possible way to do that first cut. Basically,
>> compute a short (256-bit) fingerprint for each file, such
>> that the Hamming distance between two fingerprints is a measure
>> for their similarity. I'll include a draft write up below.
>
> Thanks for starting this.
>
> There are a few things I need to talk about the way "similarity"
> is _used_ in the current algorithms.
>
> Rename/copy detection outputs "similarity" but I suspect what
> the algorithm wants is slightly different from what humans think
> of "similarity".  It is somewhere between "similarity" and
> "commonness".  When you are grading a 130-page report a student
> submitted, you would want to notice that the last 30 pages are
> an almost verbatim copy from somebody else's report.  The student
> in question added 100 pages of original content, so maybe this is
> not too bad, but if the report were a 30-page one, and the
> entire 30 pages were borrowed from somebody else's 130-page
> report, you would _really_ want to notice.

There just isn't enough information in a 256-bit fingerprint
to be able to determine if two strings have a long common
substring. Also, when the input gets longer, like a few MB,
or when the input has little information content (compresses
very well), statistical bias will reduce reliability.

Still, I used the similarity test on large tar archives, such
as complete GCC releases, and it does give reasonable
similarity estimates. Unrelated inputs rarely have scores
above 5.

potomac% ../gsimm -r d026c470aab28a1086403768a428358f218bba049d47e7d49f8589c2c0baca0c *.tar
55746560 gcc-2.95.1.tar 123 3.1
55797760 gcc-2.95.2.tar 112 11.8
55787520 gcc-2.95.3.tar 112 11.8
87490560 gcc-3.0.1.tar 112 11.8
88156160 gcc-3.0.2.tar 78 38.6
86630400 gcc-3.0.tar 80 37.0
132495360 gcc-3.1.tar 0 100.0

I'm mostly interested in the data storage aspects of git,
looking bottom-up at the blobs stored and deriving information
from that. My similarity estimator allows one to look at thousands
of large checked-in files and quickly identify similar files.
For example, in the above case, you'd find it makes sense
to store gcc-3.1.tar as a difference from gcc-3.0.tar.
Doing an actual diff between these two archives takes a few
seconds, while the fingerprints can be compared in microseconds.
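
For that kind of scan the comparison loop is trivial; an untested
sketch (File and dist() are the ones from the attached tool, and the
distance cutoff is just a number picked for illustration):

    /* For each file, report the closest other fingerprint as a
       delta-base candidate.  O(n^2) comparisons, but each one only
       XORs and popcounts 256 bits; file contents are never touched. */
    void suggest_bases (File *f, int n)
    { int i, j;

      for (i = 0; i < n; i++)
      { int best = -1, best_d = MD_LENGTH * 8 + 1;

        for (j = 0; j < n; j++)
        { int d;
          if (j == i) continue;
          d = dist (f[i].md, f[j].md);
          if (d < best_d) { best_d = d; best = j; }
        }
        /* Unrelated pairs cluster around MD_LENGTH * 4 differing bits;
           only report pairs clearly below that. */
        if (best >= 0 && best_d < MD_LENGTH * 3)
          printf ("%s: try delta against %s (distance %i)\n",
                  f[i].name, f[best].name, best_d);
    } }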

> While reorganizing a program, a nontrivial amount of text is
> often removed from an existing file and moved to a newly created
> file.  Right now, the way the similarity score is calculated has a
> heuristic cap to reject two files whose sizes are very
> different, but to detect and show this kind of file split, the
> sizes of files should matter less.
The way to do this is to split a file at content-determined
breakpoints: check the last n bits of a cyclic checksum over
a sliding window, and break if they match a magic number.
This would split the file into blocks with an expected size of 2^n.
Then you'd store a fingerprint per chunk.
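
Roughly, as an untested sketch on top of the rabin_slide8() interface
from the attached rabinpoly.c (CHUNK_BITS, CHUNK_MAGIC and the
per-chunk callback are illustrative assumptions, not posted code):

    #include "rabinpoly.h"

    #define CHUNK_BITS  13                 /* expected chunk size 2^13 */
    #define CHUNK_MASK  ((1ULL << CHUNK_BITS) - 1)
    #define CHUNK_MAGIC 0x1234ULL          /* any fixed value will do */

    /* Call handle_chunk() for each content-defined chunk of data. */
    void split_chunks (u_char *data, size_t len,
                       void (*handle_chunk) (u_char *, size_t))
    { size_t start = 0, j;

      rabin_reset ();
      for (j = 0; j < len; j++)
      { u_int64_t hash = rabin_slide8 (data[j]);

        /* Break when the low CHUNK_BITS of the rolling hash hit the
           magic value; requiring a full window inside the chunk also
           enforces a minimum chunk size. */
        if (j + 1 - start >= RABIN_WINDOW_SIZE
            && (hash & CHUNK_MASK) == CHUNK_MAGIC)
        { handle_chunk (data + start, j + 1 - start);
          start = j + 1;
          rabin_reset ();
        }
      }
      if (start < len) handle_chunk (data + start, len - start);
    }

With one fingerprint per chunk, a file that was split out keeps
matching the chunks it took with it, independent of how different the
two total file sizes are.
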
> [...]
> Another place we use "similarity" is to break a file that got
> modified too much.  This is done for two independent purposes.
This could be done directly using the given algorithm.

> [...] Usually rename/copy
> detection tries to find rename/copy into files that _disappear_
> from the result, but with the above sequence, B never
> disappears.  By looking at how dissimilar the preimage and
> postimage of B are, we tell the rename/copy detector that B,
> although it does not disappear, might have been renamed/copied
> from somewhere else.
This could also be cheaply determined by my similarity estimator.
Almost always, you'd have a high similarity score. When there is
a low score, you could verify with a more precise and expensive
algorithm to have a consistent decision on what is considered
a break.
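
Concretely, the pre-filter could look like the untested sketch below
(dist(), MIN and MD_LENGTH are from the attached tool; the 70% cutoff
and the exact_check callback are placeholders for whatever diffcore
already does):

    /* Returns 1 when the pre/post images should be treated as a
       possible break, 0 when they are clearly still the same file.
       exact_check() stands in for the precise, slow similarity test. */
    int maybe_broken (u_char *pre_md, u_char *post_md,
                      int (*exact_check) (void))
    { int d = dist (pre_md, post_md);
      double est = 1.0 - MIN (1.0, (double) d / (MD_LENGTH * 4 - 1));

      if (est >= 0.70)        /* high estimate: no break, no real work */
        return 0;
      return exact_check (); /* only doubtful pairs pay the full price */
    }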

There is a -v option that gives more verbose output, including
estimated and actual average distances from the origin for the
random walks. For random input they'll be very close, but for
input with a lot of repetition the actual average will be far
larger. The ratio can be used as a measure of reliability of
the fingerprint: ratios closer to 1 are better.
> Also we can make commonness matter even more in the similarity
> used to "break" a file than in the rename detector, because if we are
> going to break it, we will not have to worry about the issue of
> showing an annoying diff that removes 100 lines after copying a
> 130-line file.  This implies that the break algorithm needs to
> use two different kinds of similarity, one for breaking and then
> another for deciding how to show the broken pieces as a diff.
>
> Sorry if this write-up does not make much sense.  It ended up
> being a lot less coherent than I hoped it would be.
Regular diff algorithms will always give the most precise result.
What my similarity estimator does is give a probability that
two files have a lot of common substrings. Say you have a
git archive with 10,000 blobs of about 1 MB each, and you want
to determine how to pack this. You clearly can't use diff
programs to solve this, but you can use the estimates.

> Anyway, sometime this week I'll find time to play with your code
> myself.
Thanks, I'm looking forward to your comments.

   -Geert

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2006-04-14 17:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-13  6:26 Fix up diffcore-rename scoring Linus Torvalds
2006-03-13  6:44 ` Linus Torvalds
2006-03-13  6:46 ` Junio C Hamano
2006-03-13  7:09   ` Linus Torvalds
2006-03-13  7:42     ` Junio C Hamano
2006-03-13  7:44     ` Linus Torvalds
2006-03-13 10:43       ` Junio C Hamano
2006-03-13 15:38         ` Linus Torvalds
2006-03-14  0:49           ` Rutger Nijlunsing
2006-03-14  0:55           ` Junio C Hamano
2006-04-06 21:01       ` Geert Bosch
2006-04-11 22:04         ` Junio C Hamano
2006-04-14 17:46           ` Geert Bosch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).