* Fix up diffcore-rename scoring
@ 2006-03-13  6:26 Linus Torvalds
  2006-03-13  6:44 ` Linus Torvalds
  2006-03-13  6:46 ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13 6:26 UTC (permalink / raw)
To: Junio C Hamano, Git Mailing List

The "score" calculation for diffcore-rename was totally broken.

It scaled "score" as

	score = src_copied * MAX_SCORE / dst->size;

which means that you got a 100% similarity score even if src and dest
were different, if just every byte of dst was copied from src, even if
source was much larger than dst (eg we had copied 85% of the bytes, but
_deleted_ the remaining 15%).

That's clearly bogus. We should do the score calculation relative not
to the destination size, but to the max size of the two.

This seems to fix it.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
diff --git a/diffcore-rename.c b/diffcore-rename.c
index ed99fe2..e992698 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -133,7 +133,7 @@ static int estimate_similarity(struct di
 	 * match than anything else; the destination does not even
 	 * call into this function in that case.
 	 */
-	unsigned long delta_size, base_size, src_copied, literal_added;
+	unsigned long max_size, delta_size, base_size, src_copied, literal_added;
 	unsigned long delta_limit;
 	int score;
@@ -144,9 +144,9 @@ static int estimate_similarity(struct di
 	if (!S_ISREG(src->mode) || !S_ISREG(dst->mode))
 		return 0;
-	delta_size = ((src->size < dst->size) ?
-		      (dst->size - src->size) : (src->size - dst->size));
+	max_size = ((src->size > dst->size) ? src->size : dst->size);
 	base_size = ((src->size < dst->size) ? src->size : dst->size);
+	delta_size = max_size - base_size;
 	/* We would not consider edits that change the file size so
 	 * drastically.  delta_size must be smaller than
@@ -174,12 +174,10 @@ static int estimate_similarity(struct di
 	/* How similar are they?
 	 * what percentage of material in dst are from source?
 	 */
-	if (dst->size < src_copied)
-		score = MAX_SCORE;
-	else if (!dst->size)
+	if (!dst->size)
 		score = 0; /* should not happen */
 	else
-		score = src_copied * MAX_SCORE / max_size;
+		score = src_copied * MAX_SCORE / max_size;
 	return score;
 }

^ permalink raw reply related [flat|nested] 13+ messages in thread
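To make the arithmetic concrete, here is a self-contained sketch of the
old and new formulas. It is not part of the patch; MAX_SCORE is assumed
to be 60000 as in diffcore.h of this era, and the sizes follow the
85%-copied example in the message above.

#include <stdio.h>

#define MAX_SCORE 60000	/* assumed value from diffcore.h */

int main(void)
{
	/* dst keeps 850 of src's 1000 bytes and deletes the rest */
	unsigned long src_size = 1000, dst_size = 850, src_copied = 850;
	unsigned long max_size = (src_size > dst_size) ? src_size : dst_size;

	/* old: every byte of dst came from src, so a perfect 60000 */
	printf("old score: %lu\n", src_copied * MAX_SCORE / dst_size);
	/* new: the deleted 15% of src drags the score down to 51000 */
	printf("new score: %lu\n", src_copied * MAX_SCORE / max_size);
	return 0;
}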
* Re: Fix up diffcore-rename scoring
  2006-03-13  6:26 Fix up diffcore-rename scoring Linus Torvalds
@ 2006-03-13  6:44 ` Linus Torvalds
  2006-03-13  6:46 ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13 6:44 UTC (permalink / raw)
To: Junio C Hamano, Git Mailing List

On Sun, 12 Mar 2006, Linus Torvalds wrote:
>
> The "score" calculation for diffcore-rename was totally broken.
>
> It scaled "score" as
>
> 	score = src_copied * MAX_SCORE / dst->size;
>
> which means that you got a 100% similarity score even if src and dest
> were different, if just every byte of dst was copied from src, even if
> source was much larger than dst (eg we had copied 85% of the bytes, but
> _deleted_ the remaining 15%).
>
> That's clearly bogus. We should do the score calculation relative not
> to the destination size, but to the max size of the two.
>
> This seems to fix it.

Btw, interestingly, this seems to actually improve on the rename
detection from your previous one, even though on the face of it, it
should just have made the scores go down.

I'm not quite sure why, but perhaps it gave a bogus high score to some
rename that wasn't very good, allowing the _real_ rename to make itself
seen. Or maybe I made some mistake in testing it.

		Linus

PS. You can still get a "similarity score" of 100 with the fixed
scaling even if the source and the destination were different. That
happens if every byte was marked as "copied" by the similarity
estimator. Which can happen if you just move things around in the file
- the end result is different, but all the bytes are copied from the
source.

At least with the fixed heuristic, that "perfect similarity" score can
_somehow_ be explained. The files are very similar in that they have
the same content, just in a different order ;)

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Fix up diffcore-rename scoring
  2006-03-13  6:26 Fix up diffcore-rename scoring Linus Torvalds
  2006-03-13  6:44 ` Linus Torvalds
@ 2006-03-13  6:46 ` Junio C Hamano
  2006-03-13  7:09 ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2006-03-13 6:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> The "score" calculation for diffcore-rename was totally broken.
>
> It scaled "score" as
>
> 	score = src_copied * MAX_SCORE / dst->size;
>
> which means that you got a 100% similarity score even if src and dest
> were different, if just every byte of dst was copied from src, even if
> source was much larger than dst (eg we had copied 85% of the bytes, but
> _deleted_ the remaining 15%).

Your reading of the code is correct, but that is deliberate.

> 	/* How similar are they?
> 	 * what percentage of material in dst are from source?
> 	 */

I wanted to say in such a case that dst was _really_ derived from the
source. I think using max may make more sense, but I need to convince
myself by looking at the filepairs this change stops detecting as
renames, and the ones it starts detecting as renames.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Fix up diffcore-rename scoring
  2006-03-13  6:46 ` Junio C Hamano
@ 2006-03-13  7:09 ` Linus Torvalds
  2006-03-13  7:42 ` Junio C Hamano
  2006-03-13  7:44 ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13 7:09 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List

On Sun, 12 Mar 2006, Junio C Hamano wrote:
>
> Linus Torvalds <torvalds@osdl.org> writes:
>
> > The "score" calculation for diffcore-rename was totally broken.
> >
> > It scaled "score" as
> >
> > 	score = src_copied * MAX_SCORE / dst->size;
> >
> > which means that you got a 100% similarity score even if src and dest
> > were different, if just every byte of dst was copied from src, even if
> > source was much larger than dst (eg we had copied 85% of the bytes, but
> > _deleted_ the remaining 15%).
>
> Your reading of the code is correct, but that is deliberate.
>
> > 	/* How similar are they?
> > 	 * what percentage of material in dst are from source?
> > 	 */
>
> I wanted to say in such a case that dst was _really_ derived from the
> source. I think using max may make more sense, but I need to convince
> myself by looking at the filepairs this change stops detecting as
> renames, and the ones it starts detecting as renames.

Just compare the result.

Just eye-balling the difference between the rename data from 2.6.12 to
2.6.14, the fixed score actually gets better rename detection. It
actually finds 133 renames (as opposed to 132 for the broken one), and
the renames it finds are more sensible.

For example, the fixed version finds

	drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h

which actually matches the other i2c/chips/ renames, while the broken
one does

	drivers/i2c/chips/lm75.h -> drivers/media/video/rds.h

which just doesn't make any sense at all.

Now, that said, they _both_ find some pretty funky renames. I think
there is probably some serious room for improvement, regardless (or at
least changing the default similarity cut-off to something better ;)

		Linus

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Fix up diffcore-rename scoring
  2006-03-13  7:09 ` Linus Torvalds
@ 2006-03-13  7:42 ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2006-03-13 7:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Just compare the result...
>
> Now, that said, they _both_ find some pretty funky renames. I think
> there is probably some serious room for improvement, regardless (or at
> least changing the default similarity cut-off to something better ;)

Yes. The "compare with larger" one seems to cull the nonsensical ones
found by the "next" one much better.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Fix up diffcore-rename scoring
  2006-03-13  7:09 ` Linus Torvalds
  2006-03-13  7:42 ` Junio C Hamano
@ 2006-03-13  7:44 ` Linus Torvalds
  2006-03-13 10:43 ` Junio C Hamano
  2006-04-06 21:01 ` Geert Bosch
  1 sibling, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13 7:44 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List

On Sun, 12 Mar 2006, Linus Torvalds wrote:
>
> Now, that said, they _both_ find some pretty funky renames. I think
> there is probably some serious room for improvement, regardless (or at
> least changing the default similarity cut-off to something better ;)

I'm afraid that _good_ rename detection really ends up wanting to take
"longest possible sequence" into account, exactly like the full xdelta
does. Instead of doing a fixed-chunk thing and saying that any copy is
equivalent to any other copy.

That's simply not true. It's _much_ better to have one 24-byte copy
than it is to have three 8-byte copies, but the new faster
diffcore-delta.c just can't see that. So one big reason as to why it is
fast in the first place is that it fundamentally just doesn't do a very
good job ;(

It might be that the fast delta thing is a good way to ask "is this
even worth considering", to cut down the O(m*n) rename/copy detection
to something much smaller, and then use xdelta() to actually figure out
what is a good rename and what isn't from a much smaller set of
potential targets.

That would actually allow us to be even _less_ precise. Screw that big
hash-table etc, don't even try to be exact. Just try to be fairly fast,
and then pick the top entries from the similarity array for more
precise diffing if there are multiple choices that look like they might
be possible.

The appended alternate "diffcore-delta.c" doesn't do any of the caching
(ie I wrote it so that it would be easy to change so that the _caller_
keeps "src" constant and iterates over destinations - or the other way
around - and does the hash setup just once per src).

Still, even with the existing setup, it's pretty fast for me (not much
slower than your caching version even though it recalculates everything
every time). And it's not that far off, which tells me that if it was
used as a "first-pass filter", we could afford to do a better job on
the things that it says are likely candidates.

Hmm? It really does bother me how the suggested rename detector finds
stuff that clearly isn't a rename.
		Linus

----
#include "cache.h"
#include "diff.h"
#include "diffcore.h"

#define CHUNK (16)
#define SILLYSIZE (65537)

static int hashnr[SILLYSIZE];

static void setup_hash(void)
{
	memset(hashnr, 0, sizeof(hashnr));
}

static void insert_hash(unsigned int hashval)
{
	hashval = hashval % SILLYSIZE;
	hashnr[hashval]++;
}

static int find_hash(unsigned int hashval)
{
	hashval = hashval % SILLYSIZE;
	if (hashnr[hashval]) {
		hashnr[hashval]--;
		return 1;
	}
	return 0;
}

int diffcore_count_changes(void *src, unsigned long src_size,
			   void *dst, unsigned long dst_size,
			   void **src_count_p,
			   void **dst_count_p,
			   unsigned long delta_limit,
			   unsigned long *src_copied,
			   unsigned long *literal_added)
{
	unsigned long copied = 0;
	unsigned long literal = 0;

	setup_hash();
	while (src_size >= CHUNK) {
		unsigned int hashval = adler32(0, src, CHUNK);
		insert_hash(hashval);
		src += CHUNK;
		src_size -= CHUNK;
	}
	while (dst_size >= CHUNK) {
		unsigned int hashval = adler32(0, dst, CHUNK);
		if (find_hash(hashval)) {
			copied += CHUNK;
			dst += CHUNK;
			dst_size -= CHUNK;
			continue;
		}
		literal++;
		if (literal > delta_limit)
			return -1;
		dst++;
		dst_size--;
	}
	literal += dst_size;
	*src_copied = copied;
	*literal_added = literal;
	return 0;
}

^ permalink raw reply [flat|nested] 13+ messages in thread
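In the spirit of that "first-pass filter" suggestion, the glue could
look something like the following sketch. All names here are
hypothetical, not existing git API: cheap[] would hold precomputed
scores from an estimator like the one above, and the expensive callback
stands in for an xdelta-quality scorer that is only run on survivors.

#include <stddef.h>

/* Two-pass rename pick: cull the O(m*n) candidate space with cheap
 * scores, run the precise scorer only on what clears the cutoff. */
int pick_rename(const int *cheap, size_t ndst, int cutoff,
		int (*expensive)(size_t dst))
{
	size_t i;
	int best = -1, best_score = -1;

	for (i = 0; i < ndst; i++) {
		int s;
		if (cheap[i] < cutoff)
			continue;	/* cheap first pass culls most pairs */
		s = expensive(i);	/* precise score, run on few pairs */
		if (s > best_score) {
			best_score = s;
			best = (int)i;
		}
	}
	return best;	/* index of the best destination, or -1 */
}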
* Re: Fix up diffcore-rename scoring
  2006-03-13  7:44 ` Linus Torvalds
@ 2006-03-13 10:43 ` Junio C Hamano
  2006-03-13 15:38 ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2006-03-13 10:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Instead of doing a fixed-chunk thing and saying that any copy is
> equivalent to any other copy. That's simply not true. It's _much_
> better to have one 24-byte copy than it is to have three 8-byte
> copies, but the new faster diffcore-delta.c just can't see that.

Exactly. You know what? Once we start counting to detect a straight
24-byte copy and try distinguishing it from 3 separate 8-byte copies,
it will eventually lead us to what we have in diff-delta.c anyway. I
avoided counting runs of bytes on purpose because I wanted to see how
far we can go without it.

The primary reason I started the jc/diff topic branch was because we
_might_ want to replace what is in the current diff-delta.c with much
finer-grained comparison code, and when that happens, counting xdelta
output for rename detection purposes would have stopped making sense.
For now we decided to postpone it for performance reasons, but we still
might want to when Nico comes back with a better implementation.

Now, I know the current diff-delta based similarity estimator we have
in "main" seems to do a reasonable if not perfect job, within a
reasonable amount of time. And it does know how to count copying of
consecutive bytes. In the worst case we could just fork the xdelta part
of the code when Nico comes back with improved finer-grained delta, and
we can keep using the current diff-delta code for rename detection.

Knowing we have that fallback position, I wanted to pursue a different
avenue. Distinguishing a straight 24-byte run from three independent
8-byte runs, using a hash to find the offset in the source and actually
doing a maximal string match, is something we already know how to do,
because that is essentially what the current diff-delta code does.

By the way, the reason the diffcore-delta code in "next" does not do
every-eight-bytes hash on the source material is to somewhat alleviate
the problem that comes from not detecting copying of consecutive byte
ranges.

If you have an 8-byte run that is copied from source to destination, we
would give it one point (let's for now forget about false matches
coming from hash collisions). Since the source material is hashed at
every byte offset, if we have a 9-byte run copied from source to
destination, that is awarded two points (for the first 8 bytes we award
one point, and then for the 8-byte sequence starting at the second byte
we award another point; we are talking about an overlapping range).

That way, the code does reward copying consecutive bytes around more
heavily than copying things at random places. At one extreme, if you
copy 7 bytes, throw in a garbage byte, copy another 7 bytes, throw in
another garbage byte, and keep going, you would not get any points.

It's really a funky heuristic, and as you have seen, it sometimes gives
spectacularly phony matches. But in practice, with some tweaking it
seems to do an OK job.

^ permalink raw reply [flat|nested] 13+ messages in thread
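The overlap effect Junio describes can be checked with a toy counter
(hypothetical code, not the actual diffcore-delta implementation;
strstr() stands in for the per-offset source hash lookup): with an
8-byte window, a copied run of L >= 8 bytes scores L - 7 points, so an
8-byte run earns one point and a 9-byte run earns two.

#include <stdio.h>
#include <string.h>

#define W 8	/* window size */

/* Count how many length-W windows of dst, taken at every byte offset,
 * occur somewhere in src. */
static int points(const char *src, const char *dst)
{
	int n = 0;
	size_t i, dlen = strlen(dst);

	for (i = 0; i + W <= dlen; i++) {
		char win[W + 1];
		memcpy(win, dst + i, W);
		win[W] = '\0';
		if (strstr(src, win))
			n++;
	}
	return n;
}

int main(void)
{
	const char *src = "abcdefghi and an unrelated tail";

	printf("%d\n", points(src, "abcdefgh"));	/* 8-byte run: 1 */
	printf("%d\n", points(src, "abcdefghi"));	/* 9-byte run: 2 */
	return 0;
}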
* Re: Fix up diffcore-rename scoring
  2006-03-13 10:43 ` Junio C Hamano
@ 2006-03-13 15:38 ` Linus Torvalds
  2006-03-14  0:49 ` Rutger Nijlunsing
  2006-03-14  0:55 ` Junio C Hamano
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-03-13 15:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git

On Mon, 13 Mar 2006, Junio C Hamano wrote:
>
> By the way, the reason the diffcore-delta code in "next" does
> not do every-eight-bytes hash on the source material is to
> somewhat alleviate the problem that comes from not detecting
> copying of consecutive byte ranges.

Yes. However, there are better ways to do that in practice.

The most effective way that is generally used is to not use a fixed
chunk-size, but use a terminating character, together with a
minimum/maximum chunksize.

There's a pretty natural terminating character that works well for
sources: '\n'.

So the natural way to do similarity detection when most of the code is
line-based is to do the hashing on chunks that follow the rule "minimum
of <n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".

So if you don't see any '\n' at all (or the only such one is less than
<n> bytes into your current window), do the hash over a <2n>-byte chunk
(this takes care of binaries and/or long lines).

This - for source code - allows you to ignore trivial byte offset
things, because you have a character that is used for synchronization.
So you don't need to do hashing at every byte in both files - you end
up doing the hashing only at line boundaries in practice. And it still
_works_ for binary files, although you effectively need bigger
identical chunk-sizes to find similarities (for text-files, it finds
similarities of size <n>, for binaries the similarities need to
effectively be of size 3*n, because you chunk it up at ~2*n, and only
generate the hash at certain offsets in the source binary).

		Linus

^ permalink raw reply [flat|nested] 13+ messages in thread
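A minimal sketch of that boundary rule, assuming a hypothetical helper
named next_chunk() and N standing in for the minimum chunk size <n>
(neither is from git): cut just after a '\n' found between N and 2N
bytes into the window, and fall back to a fixed 2N-byte chunk when no
newline shows up.

#include <stddef.h>

#define N 16	/* assumed minimum chunk size */

/* Return the length of the next chunk starting at buf, given len
 * remaining bytes: at least N, at most 2*N, preferring to end just
 * after a newline. */
size_t next_chunk(const unsigned char *buf, size_t len)
{
	size_t i;

	if (len <= 2 * N)
		return len;		/* short tail: take it whole */
	for (i = N; i < 2 * N; i++)
		if (buf[i - 1] == '\n')
			return i;	/* newline in [N, 2N): cut there */
	return 2 * N;			/* no newline: fixed-size chunk */
}

Because chunks preferentially end right after a newline, an insertion
early in a file does not shift the chunk boundaries of later lines,
which is the synchronization property described above.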
* Re: Fix up diffcore-rename scoring
  2006-03-13 15:38 ` Linus Torvalds
@ 2006-03-14  0:49 ` Rutger Nijlunsing
  2006-03-14  0:55 ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Rutger Nijlunsing @ 2006-03-14 0:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git

[-- Attachment #1: Type: text/plain, Size: 12313 bytes --]

From: Rutger Nijlunsing <rutger@nospam.com>
To: Linus Torvalds <torvalds@osdl.org>
Cc: Junio C Hamano <junkio@cox.net>, git@vger.kernel.org
Bcc:
Subject: Re: Fix up diffcore-rename scoring
Reply-To: git@wingding.demon.nl
In-Reply-To: <Pine.LNX.4.64.0603130727350.3618@g5.osdl.org>
Organization: M38c

On Mon, Mar 13, 2006 at 07:38:53AM -0800, Linus Torvalds wrote:
>
> On Mon, 13 Mar 2006, Junio C Hamano wrote:
> >
> > By the way, the reason the diffcore-delta code in "next" does
> > not do every-eight-bytes hash on the source material is to
> > somewhat alleviate the problem that comes from not detecting
> > copying of consecutive byte ranges.
>
> Yes. However, there are better ways to do that in practice.
>
> The most effective way that is generally used is to not use a fixed
> chunk-size, but use a terminating character, together with a
> minimum/maximum chunksize.
>
> There's a pretty natural terminating character that works well for
> sources: '\n'.
>
> So the natural way to do similarity detection when most of the code is
> line-based is to do the hashing on chunks that follow the rule "minimum
> of <n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".
>
> So if you don't see any '\n' at all (or the only such one is less than
> <n> bytes into your current window), do the hash over a <2n>-byte chunk
> (this takes care of binaries and/or long lines).
>
> This - for source code - allows you to ignore trivial byte offset
> things, because you have a character that is used for synchronization.
> So you don't need to do hashing at every byte in both files - you end
> up doing the hashing only at line boundaries in practice. And it still
> _works_ for binary files, although you effectively need bigger
> identical chunk-sizes to find similarities (for text-files, it finds
> similarities of size <n>, for binaries the similarities need to
> effectively be of size 3*n, because you chunk it up at ~2*n, and only
> generate the hash at certain offsets in the source binary).

This looks like something I did last year as an experiment in the
pre-git times. The idea was to generate a patch-with-renames from two
(large) source trees. Algorithm:

- determine md5sum for each file (same idea as git's SHA1 sum) if
  changed since last run
- only look at md5sums which do not match
- pool files into types, which might depend on extension and/or MIME
  type. This is an optimisation.
- Only compare filepairs _within_ one pool.
- The filepair order in one pool is determined by filename similarity.
  So the pair [include/asm-ppc/ioctl.h, include/asm-powerpc/ioctl.h] is
  inspected before the pair [include/asm-ppc/ioctl.h,
  arch/arm/plat-omap/clock.h].
- For each file, create a hash from String line -> Integer occurrences.
  Similarities are calculated by comparing two hashes.
- Keep as a rename-match all files which:
  - have at most 50% new lines;
  - have at most 25% lines deleted from them.

I ran the code against v2.6.12 and v2.6.14 to be able to compare it
with the current contenders. Hopefully some ideas are harvestable...

Algorithm differences:

- '\n' is used as boundary, independent of line length. This is bad for
  binary files, and maybe even bad for text files.
  So don't harvest :)
- don't look at the intersection percentage, but look at two values:
  - percentage of lines added (default: max. 50%)
  - percentage of lines removed (default: max. 25%)
  This assumes files get bigger during development (at most 50%), and
  not too much code is deleted (at most 25%).
  Disadvantages:
  - Two magic numbers instead of one.
  - It's non-symmetrical. Diff A->B will find different renames from
    diff B->A. This scares me, actually.
- to speed up the detection:
  - don't start comparing files at random. Start comparing files which
    have the same 'names' in them. So when v2.6.12 has a file called
    arch/arm/mach-omap/clock.c, start comparing with files which have
    most words the same. Currently, '-', '.', '_' and '/' are used as
    word separators. Advantage: don't match on the first match just
    above the match-threshold.
  (next heuristics are all optional:)
  - only compare files with the same extension. This splits up all
    files into groups, which makes it much faster. In general, there's
    no reason to compare a .h with a .c file.
  - only compare files with the same MIME type. Same as above, but also
    works for files without extensions (so don't compare README with
    Makefile)

Ok, the result:

$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | wc -l
104    <-- That's bad. We're missing some renames here.

$ shpatch.rb -d linux-2.6.12,linux-2.6.14 | sort -k 1.10
+ 0% -23% arch/arm/configs/omnimeter_defconfig -> arch/arm/configs/collie_defconfig
+ 5% - 9% arch/arm/mach-omap/board-generic.c -> arch/arm/mach-omap1/board-generic.c
+ 0% - 8% arch/arm/mach-omap/board-h2.c -> arch/arm/mach-omap1/board-h2.c
+ 0% - 5% arch/arm/mach-omap/board-h3.c -> arch/arm/mach-omap1/board-h3.c
+ 0% - 3% arch/arm/mach-omap/board-innovator.c -> arch/arm/mach-omap1/board-innovator.c
+ 0% - 9% arch/arm/mach-omap/board-netstar.c -> arch/arm/mach-omap1/board-netstar.c
+ 9% -10% arch/arm/mach-omap/board-osk.c -> arch/arm/mach-omap1/board-osk.c
+ 0% - 6% arch/arm/mach-omap/board-perseus2.c -> arch/arm/mach-omap1/board-perseus2.c
+ 3% - 8% arch/arm/mach-omap/board-voiceblue.c -> arch/arm/mach-omap1/board-voiceblue.c
+ 7% - 4% arch/arm/mach-omap/clock.c -> arch/arm/plat-omap/clock.c
+ 0% - 0% arch/arm/mach-omap/clock.h -> arch/arm/plat-omap/clock.h
+ 0% - 5% arch/arm/mach-omap/common.h -> include/asm-arm/arch-omap/common.h
+ 2% - 1% arch/arm/mach-omap/dma.c -> arch/arm/plat-omap/dma.c
+ 0% - 1% arch/arm/mach-omap/fpga.c -> arch/arm/mach-omap1/fpga.c
+11% -11% arch/arm/mach-omap/gpio.c -> arch/arm/plat-omap/gpio.c
+ 2% - 2% arch/arm/mach-omap/irq.c -> arch/arm/mach-omap1/irq.c
+ 0% - 4% arch/arm/mach-omap/leds.c -> arch/arm/mach-omap1/leds.c
+ 0% - 0% arch/arm/mach-omap/leds-h2p2-debug.c -> arch/arm/mach-omap1/leds-h2p2-debug.c
+ 0% - 0% arch/arm/mach-omap/leds-innovator.c -> arch/arm/mach-omap1/leds-innovator.c
+ 0% - 4% arch/arm/mach-omap/leds-osk.c -> arch/arm/mach-omap1/leds-osk.c
+ 0% -25% arch/arm/mach-omap/Makefile.boot -> arch/arm/mach-omap1/Makefile.boot
+ 1% - 2% arch/arm/mach-omap/mcbsp.c -> arch/arm/plat-omap/mcbsp.c
+ 0% - 6% arch/arm/mach-omap/mux.c -> arch/arm/plat-omap/mux.c
+ 0% - 0% arch/arm/mach-omap/ocpi.c -> arch/arm/plat-omap/ocpi.c
+ 1% -18% arch/arm/mach-omap/pm.c -> arch/arm/plat-omap/pm.c
+ 0% -11% arch/arm/mach-omap/sleep.S -> arch/arm/plat-omap/sleep.S
+ 6% - 4% arch/arm/mach-omap/time.c -> arch/arm/mach-omap1/time.c
+ 0% - 1% arch/arm/mach-omap/usb.c -> arch/arm/plat-omap/usb.c
+ 2% - 1% arch/ia64/sn/include/pci/pcibr_provider.h -> include/asm-ia64/sn/pcibr_provider.h
+ 0% - 2% arch/ia64/sn/include/pci/pic.h
-> include/asm-ia64/sn/pic.h + 0% - 0% arch/ia64/sn/include/pci/tiocp.h -> include/asm-ia64/sn/tiocp.h + 3% -23% arch/m68knommu/platform/68VZ328/de2/config.c -> arch/m68knommu/platform/68VZ328/config.c + 1% -18% arch/mips/configs/osprey_defconfig -> arch/mips/configs/qemu_defconfig + 0% -12% arch/mips/vr41xx/zao-capcella/setup.c -> arch/mips/vr41xx/common/type.c + 0% - 0% arch/ppc64/oprofile/op_impl.h -> include/asm-ppc64/oprofile_impl.h + 3% -23% arch/ppc/configs/ash_defconfig -> arch/ppc64/configs/bpa_defconfig + 2% -21% arch/ppc/configs/beech_defconfig -> arch/ppc/configs/ev64360_defconfig + 5% -20% arch/ppc/configs/cedar_defconfig -> arch/ppc/configs/mpc8548_cds_defconfig + 9% -17% arch/ppc/configs/k2_defconfig -> arch/ppc/configs/bamboo_defconfig + 3% -25% arch/ppc/configs/mcpn765_defconfig -> arch/xtensa/configs/common_defconfig + 2% -23% arch/ppc/configs/oak_defconfig -> arch/frv/defconfig + 3% -16% arch/ppc/configs/SM850_defconfig -> arch/ppc/configs/mpc86x_ads_defconfig + 3% -13% arch/ppc/configs/SPD823TS_defconfig -> arch/ppc/configs/mpc885ads_defconfig +19% -15% arch/um/kernel/tempfile.c -> arch/um/os-Linux/mem.c + 0% - 5% arch/x86_64/kernel/semaphore.c -> lib/semaphore-sleepers.c + 0% - 6% drivers/i2c/chips/adm1021.c -> drivers/hwmon/adm1021.c + 0% - 4% drivers/i2c/chips/adm1025.c -> drivers/hwmon/adm1025.c + 0% -17% drivers/i2c/chips/adm1026.c -> drivers/hwmon/adm1026.c + 0% - 3% drivers/i2c/chips/adm1031.c -> drivers/hwmon/adm1031.c + 0% - 4% drivers/i2c/chips/asb100.c -> drivers/hwmon/asb100.c + 1% - 4% drivers/i2c/chips/ds1621.c -> drivers/hwmon/ds1621.c + 0% - 1% drivers/i2c/chips/fscher.c -> drivers/hwmon/fscher.c + 0% - 2% drivers/i2c/chips/fscpos.c -> drivers/hwmon/fscpos.c + 0% - 2% drivers/i2c/chips/gl518sm.c -> drivers/hwmon/gl518sm.c + 0% - 2% drivers/i2c/chips/gl520sm.c -> drivers/hwmon/gl520sm.c + 3% -19% drivers/i2c/chips/it87.c -> drivers/hwmon/it87.c + 4% -22% drivers/i2c/chips/lm63.c -> drivers/hwmon/lm63.c + 0% - 6% drivers/i2c/chips/lm75.c -> drivers/hwmon/lm75.c + 0% - 2% drivers/i2c/chips/lm75.h -> drivers/hwmon/lm75.h + 0% - 3% drivers/i2c/chips/lm77.c -> drivers/hwmon/lm77.c + 2% - 5% drivers/i2c/chips/lm78.c -> drivers/hwmon/lm78.c + 0% - 3% drivers/i2c/chips/lm80.c -> drivers/hwmon/lm80.c + 2% -21% drivers/i2c/chips/lm83.c -> drivers/hwmon/lm83.c + 0% - 3% drivers/i2c/chips/lm85.c -> drivers/hwmon/lm85.c + 0% - 4% drivers/i2c/chips/lm87.c -> drivers/hwmon/lm87.c + 4% -20% drivers/i2c/chips/lm90.c -> drivers/hwmon/lm90.c + 0% - 3% drivers/i2c/chips/lm92.c -> drivers/hwmon/lm92.c + 0% - 3% drivers/i2c/chips/max1619.c -> drivers/hwmon/max1619.c + 0% - 7% drivers/i2c/chips/sis5595.c -> drivers/hwmon/sis5595.c + 0% -11% drivers/i2c/chips/smsc47b397.c -> drivers/hwmon/smsc47b397.c + 0% - 9% drivers/i2c/chips/smsc47m1.c -> drivers/hwmon/smsc47m1.c + 0% -23% drivers/i2c/chips/via686a.c -> drivers/hwmon/via686a.c + 0% - 4% drivers/i2c/chips/w83627hf.c -> drivers/hwmon/w83627hf.c + 1% - 5% drivers/i2c/chips/w83781d.c -> drivers/hwmon/w83781d.c + 1% - 3% drivers/i2c/chips/w83l785ts.c -> drivers/hwmon/w83l785ts.c +14% -17% drivers/i2c/i2c-sensor-vid.c -> drivers/hwmon/hwmon-vid.c + 0% - 0% drivers/infiniband/include/ib_cache.h -> include/rdma/ib_cache.h + 0% - 3% drivers/infiniband/include/ib_fmr_pool.h -> include/rdma/ib_fmr_pool.h + 9% - 7% drivers/infiniband/include/ib_mad.h -> include/rdma/ib_mad.h + 0% - 0% drivers/infiniband/include/ib_pack.h -> include/rdma/ib_pack.h + 1% - 6% drivers/infiniband/include/ib_sa.h -> include/rdma/ib_sa.h + 0% -11% 
drivers/infiniband/include/ib_smi.h -> include/rdma/ib_smi.h + 3% - 6% drivers/infiniband/include/ib_user_mad.h -> include/rdma/ib_user_mad.h + 4% - 2% drivers/infiniband/include/ib_verbs.h -> include/rdma/ib_verbs.h + 0% -16% include/asm-ppc64/ioctl.h -> include/asm-powerpc/ioctl.h + 0% - 9% include/asm-ppc64/ioctls.h -> include/asm-powerpc/ioctls.h + 5% - 9% include/asm-ppc64/mc146818rtc.h -> include/asm-powerpc/mc146818rtc.h + 0% - 5% include/asm-ppc64/mman.h -> include/asm-powerpc/mman.h + 2% -25% include/asm-ppc64/sembuf.h -> include/asm-powerpc/sembuf.h + 3% -13% include/asm-ppc64/shmbuf.h -> include/asm-powerpc/shmbuf.h + 0% -15% include/asm-ppc64/sockios.h -> include/asm-powerpc/sockios.h + 1% - 5% include/asm-ppc64/topology.h -> include/asm-powerpc/topology.h + 0% -15% include/asm-ppc64/user.h -> include/asm-powerpc/user.h + 0% -21% include/asm-ppc/agp.h -> include/asm-powerpc/agp.h +12% -16% include/asm-ppc/msgbuf.h -> include/asm-xtensa/msgbuf.h + 5% -25% include/asm-ppc/namei.h -> include/asm-powerpc/namei.h + 4% -18% include/asm-ppc/param.h -> include/asm-powerpc/param.h + 0% -13% include/asm-ppc/poll.h -> include/asm-powerpc/poll.h + 0% -24% include/asm-ppc/shmbuf.h -> include/asm-xtensa/shmbuf.h + 1% -17% include/asm-ppc/socket.h -> include/asm-powerpc/socket.h + 0% - 9% include/asm-ppc/string.h -> include/asm-powerpc/string.h + 1% -10% include/asm-ppc/termbits.h -> include/asm-powerpc/termbits.h + 0% - 3% include/asm-ppc/termios.h -> include/asm-powerpc/termios.h + 5% -22% include/asm-ppc/unaligned.h -> include/asm-powerpc/unaligned.h Regards, Rutger. -- Rutger Nijlunsing ---------------------------------- eludias ed dse.nl never attribute to a conspiracy which can be explained by incompetence ---------------------------------------------------------------------- [-- Attachment #2: shpatch.rb --] [-- Type: text/plain, Size: 14758 bytes --] #!/usr/bin/env ruby # Usage: shpatch.rb --help require 'md5' require 'ostruct' require 'optparse' $config = OpenStruct.new $config.command = :PATCH $config.same_base = false $config.same_ext = true $config.same_mime = false $config.changed_content = true $config.max_removed = 25 # 0 .. 100 $config.max_added = 50 $config.verbose = false # Default dirglobs to ignore ignore_globs = [ "BitKeeper", "PENDING", "SCCS", "CVS", "*.state", "*.o", "*.a", "*.so", "*~", "#*#", "*.orig", "*.dll" ] # Option parsing $opts = OptionParser.new $opts.banner = %Q{\ Generate a shellpatch file, or perform the patch in a shellpatch file. A shellpatch file is a patch file which contains shell-commands including 'mv' and 'patch'. Determining the renames uses a lot of heuristics and a brute-force approach; your milage may vary. All trivial file renames are handled by comparing the complete contents. All remaining files (the list of added and removed files) in then searched through to find matching pairs: this is quite costly A cache of md5 sums is kept at the root of the repositories to make finding differences fast. (c)2005 R. Nijlunsing <shpatch@tux.tmfweb.nl> License: GPLv2 Usage: shpatch [options] Defaults options are within [brackets]. 
} $opts.separator("Diff options") $opts.on("-d", "--diff PATH1,PATH2", Array, "Generate a shellpatch of the diff", "between two directories") { |paths| if paths.size != 2 raise Exception.new("Need two directories for --diff") end $config.command = :DIFF $config.paths = paths } $opts.separator("Diff options for heuristics to finding renames with changed content") $opts.on("--[no-]changed-content", "Find renames with changed content [#{$config.changed_content}]" ) { |cc| $config.changed_content = cc } $opts.on("--[no-]same-base", "Rename only to files with same basename [#{$config.same_base}]") { |sb| $config.same_base = sb } $opts.on("--[no-]same-ext", "Rename only to same extention [#{$config.same_ext}]") { |se| $config.same_ext = se } $opts.on("--[no-]same-mime", "Rename only to same mimetype [#{$config.same_mime}]") { |sm| $config.same_mime = sm } $opts.on("--max-removed PERC", String, "Max. percentage of source file which may", "be removed while still being considered", "a rename [#{$config.max_removed}]" ) { |perc| $config.max_removed = perc.to_i } $opts.on("--max-added PERC", String, "Max. percentage of destination file which may", "be added while still being considered", "a rename [#{$config.max_added}]" ) { |perc| $config.max_added = perc.to_i } $opts.separator("Options to add to current patch") $opts.on("--mv SOURCE DEST", String, String, "Adds a rename to the current patch", "and perform the rename") { |path1, path2| $config.command = :MV $config.paths = [path1, path2] } $opts.separator("General options") $opts.on("--[no-]verbose", "-v", "Be more verbose") { |v| $config.verbose = v } $opts.on("--help", "-h", "This usage") { puts $opts; exit 1 } %Q{ Examples: shpatch.rb --diff linux-2.6.8,linux-2.6.9 --max-removed 10 Generate a shellpatch with renames from directories linux-2.6.8 to linux-2.6.9 . At most 10% of a file may be removed between versions, otherwise they are considered different. }.split("\n").each { |line| $opts.separator(line) } begin $opts.parse!(ARGV) rescue Exception puts "#{$opts}\n!!! #{$!}" exit 1 end module Shell # Escape string string so that it is parsed to the string itself # E.g. Shell.escapeString("what's in a name") = "what\'s\ in\ a\ name" # Compare to Regexp.escape def Shell.escape(string) string.gsub(%r{([^-._0-9a-zA-Z/])}i, '\\\\\1') end end # One hunk in the patch class RenameHunk attr_accessor :from, :to # Strings: pathname from and to def initialize(from, to) # puts "# Found a rename: #{Shell.escape(from)} -> #{Shell.escape(to)}" @from = from; @to = to end def command; "mv"; end def to_s; "#{command} #{Shell.escape(@from)} #{Shell.escape(@to)}"; end def execute(repo) File.rename("#{repo.root}/#@from", "#{repo.root}/#@to") end end class DeleteHunk attr_accessor :pathname def initialize(pathname); @pathname = pathname; end def command; "rm"; end def to_s; "#{command} #{Shell.escape(@pathname)}"; end def execute(repo); File.delete("#{repo.root}/#@pathname"); end end class PatchHunk attr_accessor :from, :to, :contents def initialize(repo1, from, repo2, to) @from = from; @to = to end def command; "patch"; end def to_s long_from = Shell.escape((from[0] == ?/ ? "" : repo1.root + "/") + from) long_to = Shell.escape((to[0] == ?/ ? 
"" : repo2.root + "/") + to) puts "# Diffing #{long_from} -> #{long_to}" if $config.verbose @contents = File.popen("diff --unified #{long_from} #{long_to}") { |io| io.read } mark = "_SHPATCHMARK_" # Make mark unique mark += rand(10).to_s while @contents.index(mark) "#{command} <<#{mark}\n#{@contents}#{mark}" end end # A filesystem as backing store class FileSystem SHPATCHSTATE_FILE = ".shpatch.state" SHPATCHSTATE_VERSION_STRING = "shpatch.rb state version 20050418-2" attr_accessor :root attr_accessor :cache_file # String: filename with signatures attr_accessor :signature_cache # From Fixnum inode to Array [mtime, sig] attr_accessor :signature_cache_changed # Boolean # Reads the cache. When not readable in current directory, go # up a level ('..') def read_signatures @signature_cache = {} @signature_cache_changed = false @cache_file = File.expand_path("#@root/#{SHPATCHSTATE_FILE}") cache_file = @cache_file loop { if FileTest.readable?(cache_file) File.open(cache_file, "rb") do |file| version_string = file.readline.chomp if version_string == SHPATCHSTATE_VERSION_STRING begin @signature_cache = Marshal.load(file) puts "# Read signature cache with #{@signature_cache.size} signatures from #{cache_file.inspect}" if $config.verbose @cache_file = cache_file break rescue ArgumentError, EOFError puts "# (error reading state file: rebuilding file...)" if $config.verbose end end end end parent_cache_file = File.expand_path( File.dirname(cache_file) + "/../" + File.basename(cache_file) ) break if parent_cache_file == cache_file cache_file = parent_cache_file } end def initialize(root) raise "#{root.inspect} does not exist" if not File.exists?(root) @root = root read_signatures end def save_signatures # Save all unsaved signature cache return if !@signature_cache_changed puts "# Saving #{@signature_cache.size} signatures..." if $config.verbose pf = @cache_file File.open("#{pf}.new", "wb+") do |file| file.puts SHPATCHSTATE_VERSION_STRING Marshal.dump(@signature_cache, file) File.rename("#{pf}.new", pf) end end # Returns array of [mtime, one-line signature-string] def signature(stat, filename) signature = nil key = [stat.dev, stat.ino] cache = @signature_cache[key] if cache and (cache[0] == stat.mtime) signature = cache[1] else if $config.verbose why = (cache ? "#{(stat.mtime - cache[0]).to_i}s out of date" : "not indexed") puts "# Creating signature for #{filename.inspect} (#{why})" end signature = MD5.new(File.read(filename)).digest @signature_cache[key] = [stat.mtime, signature] @signature_cache_changed = true end signature end def signature_from(prefix, res, from, ignoreRe) Dir.new("#{prefix}#{from}").entries.each { |elem| next if (elem == ".") or (elem == "..") fullname = "#{prefix}#{from}/#{elem}" if not fullname =~ ignoreRe stat = File.stat(fullname) if stat.directory? signature_from(prefix, res, "#{from}/#{elem}", ignoreRe) else rel_filename = "#{from}/#{elem}"[1..-1] res[rel_filename] = signature(stat, fullname) end end } end # Returns all filenames within this filesystem with all signatures def signatures(ignoreRe) res = {} prefix = File.expand_path(@root) signature_from(prefix, res, "", ignoreRe) save_signatures res end def mime_type(filename) path = @root + "/" + filename ($mime_cache ||= {})[path] ||= File.popen("file --mime #{Shell.escape(path)}") { |io| io.read }. 
gsub(%r{^.*:}, "").strip end # Read the contents of a file def read(filename); File.read(@root + "/" + filename); end end patch = [] dir1, dir2 = $config.paths repo1 = FileSystem.new(dir1) repo2 = FileSystem.new(dir2) def re_from_globs(globs) Regexp.new( "(\\A|/)(" + globs.collect { |glob| Regexp.escape(glob).gsub("\\*", "[^/]*") }.join("|") + ")$" ) end ignore_globs += ["BitKeeper/etc/ignore", ".cvsignore"].collect { |a| ["#{dir1}/#{a}", "#{dir2}/#{a}"] }.flatten.find_all { |f| File.exists?(f) }.collect { |f| File.readlines(f).collect { |line| line.chomp } }.flatten ignore_globs = ignore_globs.uniq.sort ignoreRe = re_from_globs(ignore_globs) puts "# Retrieving signatures of #{dir1.inspect}" if $config.verbose file2sig1 = repo1.signatures(ignoreRe) puts "# Retrieving signatures of #{dir2.inspect}" if $config.verbose file2sig2 = repo2.signatures(ignoreRe) files1 = file2sig1.keys.sort files2 = file2sig2.keys.sort common_files = files1 - (files1 - files2) # Different hash, same filename: patch common_files.each { |fname| if file2sig1[fname] != file2sig2[fname] patch << PatchHunk.new(repo1, fname, repo2, fname) end file2sig1.delete(fname) file2sig2.delete(fname) } # Same hash, different filename: rename sig2file1 = file2sig1.invert sig2file2 = file2sig2.invert sigs1 = sig2file1.keys sigs2 = sig2file2.keys common_sigs = sigs1 - (sigs1 - sigs2) common_sigs.each { |sig| from = sig2file1[sig] to = sig2file2[sig] patch << RenameHunk.new(from, to) sig2file1.delete(sig) sig2file2.delete(sig) file2sig1.delete(from) file2sig2.delete(to) } # statistics of contents of a file. Used for quick-compare class FileContentStats attr_accessor :size # Size of file in lines attr_accessor :lines # Hash from String to Fixnum # Counter number of lines removed and added as a percentage # of the total file length. These are a measure for the degree # of matching between the files. def diff_match(other) added = 0 removed = 0 @lines.each_pair { |line, count| delta = other.lines[line] - count if delta > 0 added += delta else removed += -delta end } other.lines.each_pair { |line, count| added += count if not @lines[line] } [added * 100 / other.size, removed * 100 / self.size] end def initialize(repo, path) @lines = Hash.new(0) size = 0 repo.read(path).delete("\0").each_line { |line| @lines[line.intern] += 1 size += 1 } @size = size end def self.cached(repo, path) @@cache ||= {} @@cache[[repo, path]] ||= self.new(repo, path) end end # Categorize a file based on filename and/or contents def pool_type(repo, path) res = [] res << File.basename(path) if $config.same_base res << File.extname(path) if $config.same_ext res << repo.mime_type(path) if $config.same_mime res end # Determine how much a filename looks like another filename # by splitting the filenames into words. Then count the # words which are the same. def path_correlation(path1, path2) comp1 = path1.split(%r{[-._/]}) comp2 = path2.split(%r{[-._/]}) (comp1 - (comp1 - comp2)).size end class Array # The inverse of an array is an hash from contents to index number. 
def inverse; res = {}; each_with_index { |e, idx| res[e] = idx }; res; end end if $config.changed_content files1 = file2sig1.keys.sort files2 = file2sig2.keys.sort all_added_files = files2 - files1 all_removed_files = files1 - files2 pools = {} # Group files into 'pools' all_removed_files.each { |removed_file| (pools[pool_type(repo1, removed_file)] ||= [[], []])[0] << removed_file } all_added_files.each { |added_file| (pools[pool_type(repo2, added_file)] ||= [[], []])[1] << added_file } pools.each_pair { |key, pool| removed_files, added_files = *pool if $config.verbose and not removed_files.empty? and not added_files.empty? puts "# Comparing pool type #{key.inspect} with #{pool[0].size}x#{pool[1].size} filepairs" end # Determine how 'special' or 'specific' a word is. We start with # filenames containing special words. words = {} # Group files by 'words' removed_files.each { |removed_file| removed_file.split(%r{[-._/]+}).uniq.each { |word| words[word] ||= [[], []] words[word][0] << removed_file } } added_files.each { |added_file| added_file.split(%r{[-._/]+}).uniq.each { |word| words[word] ||= [[], []] words[word][1] << added_file } } word_importance = words.keys.find_all { |word| (words[word][0].size * words[word][1].size) > 0 }.sort_by { |word| words[word][0].size * words[word][1].size }.reverse # p word_importance word_importance = word_importance.inverse word_importance.default = 0 removed_files.sort_by { |removed_file| removed_file.split(%r{[-._/]+}).uniq.inject(0) { |s, e| [s, word_importance[e]].max } }.reverse.each { |removed_file| # puts removed_file removed_file_stats = FileContentStats.new(repo1, removed_file) added_files.sort_by { |f| -path_correlation(removed_file, f) }. each { |added_file| added_file_stats = FileContentStats.cached(repo2, added_file) removed_size = removed_file_stats.size added_size = added_file_stats.size min_added = (added_size - removed_size) * 100 / added_size next if min_added > $config.max_added min_removed = (removed_size - added_size) * 100 / removed_size next if min_removed > $config.max_removed # Calculate added & removed percentages added, removed = removed_file_stats.diff_match(added_file_stats) if (added <= $config.max_added) && (removed <= $config.max_removed) # We found a rename-match! puts "+%2i%% -%2i%% #{removed_file} -> #{added_file}" % [added, removed] #if $config.verbose patch << RenameHunk.new(removed_file, added_file) # Don't match again against this added file: added_files -= [added_file] all_added_files -= [added_file] all_removed_files -= [removed_file] patch << PatchHunk.new(repo1, removed_file, repo2, added_file) break end } } } end all_added_files.each { |added_file| patch << PatchHunk.new(repo1, "/dev/null", repo2, added_file) } all_removed_files.each { |removed_file| patch << PatchHunk.new(repo1, removed_file, repo2, "/dev/null") } #patch.each { |hunk| puts hunk.to_s } ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Fix up diffcore-rename scoring
  2006-03-13 15:38 ` Linus Torvalds
  2006-03-14  0:49 ` Rutger Nijlunsing
@ 2006-03-14  0:55 ` Junio C Hamano
  1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2006-03-14 0:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> There's a pretty natural terminating character that works well for
> sources: '\n'.

Good to know that great minds think alike ;-). There is a version that
did this line-oriented hashing, buried in the next branch. I'll see how
well it performs within the context of the current somewhat
restructured code.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Fix up diffcore-rename scoring
  2006-03-13  7:44 ` Linus Torvalds
  2006-03-13 10:43 ` Junio C Hamano
@ 2006-04-06 21:01 ` Geert Bosch
  2006-04-11 22:04 ` Junio C Hamano
  1 sibling, 1 reply; 13+ messages in thread
From: Geert Bosch @ 2006-04-06 21:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 5171 bytes --]

On Mar 13, 2006, at 02:44, Linus Torvalds wrote:
> It might be that the fast delta thing is a good way to ask "is this
> even worth considering", to cut down the O(m*n) rename/copy detection
> to something much smaller, and then use xdelta() to actually figure
> out what is a good rename and what isn't from a much smaller set of
> potential targets.

Here's a possible way to do that first cut. Basically, compute a short
(256-bit) fingerprint for each file, such that the Hamming distance
between two fingerprints is a measure of their similarity. I'll include
a draft write-up below.

My initial implementation seems reasonably fast, and works great for
4000 (decompressed) files (25M) randomly plucked from an old git.git
repository without packs. It works OK for comparing tar archives for
GCC releases, but then it becomes clear that random walks aren't that
random anymore and become dominated by repeated information, such as
tar headers. Speed is about 10MB/sec on my PowerBook, but one could
cache fingerprints so they only need to be computed once.

The nice thing is that one can quickly find similar files using only
the fingerprint (and in practice file size), no filenames: this seems
to fit the git model well.

I'll attach my test implementation below; it uses David Mazieres'
Rabinpoly code and D. Phillips's fls code. Please don't mind my C
coding, it's not my native language. Also, this may have some
Darwinisms, although it should work on Linux too.

   -Geert

Estimating Similarity

For estimating similarity between strings A and B, let SA and SB be the
collection of all substrings with length W of A and B. Similarity now
is defined as the ratio between the size of the intersection and the
size of the union of SA and SB. The length W of these substrings is the
window size, and here is chosen somewhat arbitrarily to be 48. The idea
is to make them not so short that all context is lost (like counting
symbol frequencies), but not so long that a few small changes can
affect a large portion of substrings. Of course, a single symbol change
may affect up to 48 substrings.

Let "&" be the string concatenation operator. If A = S2 & S1 & S2 & S3
& S2, and B = S2 & S3 & S2 & S1 & S2, then if the length of S2 is at
least W - 1, the strings will have the same set of substrings and be
considered equal for purposes of similarity checking. This behavior is
actually welcome, since reordering sufficiently separated pieces of a
document does not make it substantially different.

Instead of computing the ratio of identical substrings directly,
compute a 1-bit hash for each substring and calculate the difference
between the number of zeroes and ones. If the hashes appear random,
this difference follows a binomial distribution. Two files are
considered "likely similar" if their differences have the same sign.

The assumption that the hashes are randomly distributed is not true if
there are many repeated substrings. For most applications, it will be
sufficient to ignore such repetitions (by using a small cache of
recently encountered hashes) as they do not convey much actual
information.
For example, for purposes of finding small deltas between strings, duplicating existing text will not significantly increase the delta. For a string with N substrings, of which K changed, perform a random walk of N steps in 1-dimensional space (see [1]): what is the probability the origin was crossed an odd number of times in the last K steps? As the expected distance is Sqrt (2 * N / Pi), this probability gets progressively smaller for larger N and a given ratio of N and K. For larger files, the result should be quite stable. In order to strengthen this similarity check and be able to quantify the degree of similarity, many independent 1-bit hashes are computed and counted for each string and assembled into a bit vector of 256 bits, called the fingerprint. Each bit of the fingerprint represents the result of independent statistical experiment. For similar strings, corresponding bits are more likely to be the same than for random strings. For efficiency, a 64-bit hash is computed using a irreducible Rabin polynomial of degree 63. The algebraic properties of these allow for efficient calculation over a sliding window of the input. [2] As the cryptographic advantages of randomly generated hash functions are not required, a fixed polynomial has been chosen. This 64-bit hash is expanded to 256 bits by using three bits to select 32 of the 256 bits in the fingerprint to update. So, for every 8-bit character the polynomial needs updating, and 32 counters are incremented or decremented. So, each of the 256 counters represents a random walk that is N / 4, for a string of length N. The similarity of A and B can now be expressed as the Hamming distance between the two bit vectors, divided by the expected distance between two random vectors. This similarity score is a number between 0 and 2, where smaller values mean the strings are more similar, and values of 1 or more mean they are dissimilar. One of the unique properties of this fingerprint is the ability to compare files in different locations by only transmitting their fingerprint. [-- Attachment #2: gsimm.c --] [-- Type: application/octet-stream, Size: 10801 bytes --] #include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <libgen.h> #include <stdio.h> #include <assert.h> #include <math.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include "rabinpoly.h" /* Length of file message digest (MD) in bytes. Longer MD's are better, but increase processing time for diminishing returns. Must be multiple of NUM_HASHES_PER_CHAR / 8, and at least 24 for good results */ #define MD_LENGTH 32 #define MD_BITS (MD_LENGTH * 8) /* Has to be power of two. Since the Rabin hash only has 63 usable bits, the number of hashes is limited to 32. Lower powers of two could be used for speeding up processing of very large files. */ #define NUM_HASHES_PER_CHAR 32 /* For the final counting, do not count each bit individually, but group them. Must be power of two, at most NUM_HASHES_PER_CHAR. However, larger sizes result in higher cache usage. Use 8 bits per group for efficient processing of large files on fast machines with decent caches, or 4 bits for faster processing of small files and for machines with small caches. */ #define GROUP_BITS 4 #define GROUP_COUNTERS (1<<GROUP_BITS) /* The RABIN_WINDOW_SIZE is the size of fingerprint window used by Rabin algorithm. This is not a modifiable parameter. The first RABIN_WINDOW_SIZE - 1 bytes are skipped, in order to ensure fingerprints are good hashes. 
This does somewhat reduce the influence of the first few bytes in the file (they're part of fewer windows, like the last few bytes), but that actually isn't so bad as files often start with fixed content that may bias comparisons. */ /* The MIN_FILE_SIZE indicates the absolute minimal file size that can be processed. As indicated above, the first and last RABIN_WINDOW_SIZE - 1 bytes are skipped. In order to get at least an average of 12 samples per bit in the final message digest, require at least 3 * MD_LENGTH complete windows in the file. */ #define MIN_FILE_SIZE (3 * MD_LENGTH + 2 * (RABIN_WINDOW_SIZE - 1)) /* Limit matching algorithm to files less than 256 MB, so we can use 32 bit integers everywhere without fear of overflow. For larger files we should add logic to mmap the file by piece and accumulate the frequency counts. */ #define MAX_FILE_SIZE (256*1024*1024 - 1) /* Size of cache used to eliminate duplicate substrings. Make small enough to comfortably fit in L1 cache. */ #define DUP_CACHE_SIZE 256 #define MIN(x,y) ((y)<(x) ? (y) : (x)) #define MAX(x,y) ((y)>(x) ? (y) : (x)) typedef struct fileinfo { char *name; size_t length; u_char md[MD_LENGTH]; int match; } File; int flag_verbose = 0; int flag_debug = 0; int flag_warning = 0; char *flag_relative = 0; char cmd[12] = " ..."; char md_strbuf[MD_LENGTH * 2 + 1]; u_char relative_md [MD_LENGTH]; File *file; int file_count; size_t file_bytes; FILE *msgout; char hex[17] = "0123456789abcdef"; double pi = 3.14159265358979323844; int freq[MD_BITS]; u_int64_t freq_dups = 0; void usage() { fprintf (stderr, "usage: %s [-dhvw] [-r fingerprint] file ...\n", cmd); fprintf (stderr, " -d\tdebug output, repeate for more verbosity\n"); fprintf (stderr, " -h\tshow this usage information\n"); fprintf (stderr, " -r\tshow distance relative to fingerprint " "(%u hex digits)\n", MD_LENGTH * 2); fprintf (stderr, " -v\tverbose output, repeat for even more verbosity\n"); fprintf (stderr, " -w\tenable warnings for suspect statistics\n"); exit (1); } int dist (u_char *l, u_char *r) { int j, k; int d = 0; for (j = 0; j < MD_LENGTH; j++) { u_char ch = l[j] ^ r[j]; for (k = 0; k < 8; k++) d += ((ch & (1<<k)) > 0); } return d; } char *md_to_str(u_char *md) { int j; for (j = 0; j < MD_LENGTH; j++) { u_char ch = md[j]; md_strbuf[j*2] = hex[ch >> 4]; md_strbuf[j*2+1] = hex[ch & 0xF]; } md_strbuf[j*2] = 0; return md_strbuf; } u_char *str_to_md(char *str, u_char *md) { int j; if (!md || !str) return 0; bzero (md, MD_LENGTH); for (j = 0; j < MD_LENGTH * 2; j++) { char ch = str[j]; if (ch >= '0' && ch <= '9') { md [j/2] = (md [j/2] << 4) + (ch - '0'); } else { ch |= 32; if (ch < 'a' || ch > 'f') break; md [j/2] = (md[j/2] << 4) + (ch - 'a' + 10); } } return (j != MD_LENGTH * 2 || str[j] != 0) ? 
0 : md; } void freq_to_md(u_char *md) { int j, k; int num = MD_BITS; for (j = 0; j < MD_LENGTH; j++) { u_char ch = 0; for (k = 0; k < 8; k++) ch = 2*ch + (freq[8*j+k] > 0); md[j] = ch; } if (flag_debug) { for (j = 0; j < num; j++) { if (j % 8 == 0) printf ("\n%3u: ", j); printf ("%7i ", freq[j]); } printf ("\n"); } bzero (freq, sizeof(freq)); freq_dups = 0; } void process_data (char *name, u_char *data, unsigned len, u_char *md) { size_t j = 0; u_int32_t ofs; u_int32_t dup_cache[DUP_CACHE_SIZE]; u_int32_t count [MD_BITS * (GROUP_COUNTERS/GROUP_BITS)]; bzero (dup_cache, DUP_CACHE_SIZE * sizeof (u_int32_t)); bzero (count, (MD_BITS * (GROUP_COUNTERS/GROUP_BITS) * sizeof (u_int32_t))); /* Ignore incomplete substrings */ while (j < len && j < RABIN_WINDOW_SIZE) rabin_slide8 (data[j++]); while (j < len) { u_int64_t hash; u_int32_t ofs, sum; u_char idx; int k; hash = rabin_slide8 (data[j++]); /* In order to update a much larger frequency table with only 32 bits of checksum, randomly select a part of the table to update. The selection should only depend on the content of the represented data, and be independent of the bits used for the update. Instead of updating 32 individual counters, process the checksum in MD_BITS / GROUP_BITS groups of GROUP_BITS bits, and count the frequency of each bit pattern. */ idx = (hash >> 32); sum = (u_int32_t) hash; ofs = idx % (MD_BITS / NUM_HASHES_PER_CHAR) * NUM_HASHES_PER_CHAR; idx %= DUP_CACHE_SIZE; if (dup_cache[idx] == sum) { freq_dups++; } else { dup_cache[idx] = sum; for (k = 0; k < NUM_HASHES_PER_CHAR / GROUP_BITS; k++) { count[ofs * GROUP_COUNTERS / GROUP_BITS + (sum % GROUP_COUNTERS)]++; ofs += GROUP_BITS; sum >>= GROUP_BITS; } } } /* Distribute the occurrences of each bit group over the frequency table. */ for (ofs = 0; ofs < MD_BITS; ofs += GROUP_BITS) { int j; for (j = 0; j < GROUP_COUNTERS; j++) { int k; for (k = 0; k < GROUP_BITS; k++) { freq[ofs + k] += ((1<<k) & j) ? count[ofs * GROUP_COUNTERS / GROUP_BITS + j] : -count[ofs * GROUP_COUNTERS / GROUP_BITS + j]; } } } { int j; int num = MD_BITS; int stat_warn = 0; double sum = 0.0; double sumsqr = 0.0; double average, variance, stddev, bits, exp_average, max_average; assert (num >= 2); sum = 0; for (j = 0; j < num; j++) { double f = abs ((double) freq[j]); sum += f; sumsqr += f*f; } variance = (sumsqr - (sum * sum / num)) / (num - 1); average = sum / num; stddev = sqrt (variance); bits = (NUM_HASHES_PER_CHAR * (file[file_count].length - freq_dups)) / (8 * MD_LENGTH); /* Random files, or short files with few repetitions should have average very close to the expected average. Large deviations show there is too much redundancy, or there is another problem with the statistical fundamentals of the algorithm. 
*/ exp_average = sqrt (2 * bits / pi); max_average = 2.0 * pow (2 * bits / pi, 0.6); stat_warn = flag_warning && (average < exp_average * 0.5 || average > max_average); if (stat_warn) { fprintf (stdout, "%s: warning: " "too much redundancy, fingerprint may not be accurate\n", file[file_count].name); } if (flag_verbose > 1 || (flag_verbose && stat_warn)) { printf ("%i frequencies, average %5.1f, std dev %5.1f, %2.1f %% duplicates, " "\"%s\"\n", num, average, stddev, 100.0 * freq_dups / (double) file[file_count].length, file[file_count].name); printf ("%1.0f expected bits per frequency, " "expected average %1.1f, max average %1.1f\n", bits, exp_average, max_average); } } if (md) { rabin_reset(); freq_to_md (md); if (flag_relative) { int d = dist (md, relative_md); double sim = 1.0 - MIN (1.0, (double) (d) / (MD_LENGTH * 4 - 1)); fprintf (stdout, "%s %llu %u %s %u %3.1f\n", md_to_str (md), (long long unsigned) 0, len, name, d, 100.0 * sim); } else { fprintf (stdout, "%s %llu %u %s\n", md_to_str (md), (long long unsigned) 0, len, name); } } } void process_file (char *name) { int fd; struct stat fs; u_char *data; File *fi = file+file_count;; fd = open (name, O_RDONLY, 0); if (fd < 0) { perror (name); exit (2); } if (fstat (fd, &fs)) { perror (name); exit (2); } if (fs.st_size >= MIN_FILE_SIZE && fs.st_size <= MAX_FILE_SIZE) { fi->length = fs.st_size; fi->name = name; data = (u_char *) mmap (0, fs.st_size, PROT_READ, MAP_PRIVATE, fd, 0); if (data == (u_char *) -1) { perror (name); exit (2); } process_data (name, data, fs.st_size, fi->md); munmap (data, fs.st_size); file_bytes += fs.st_size; file_count++; } else if (flag_verbose) { fprintf (stdout, "skipping %s (size %llu)\n", name, fs.st_size); } close (fd); } int main (int argc, char *argv[]) { int ch, j; strncpy (cmd, basename (argv[0]), 8); msgout = stdout; while ((ch = getopt(argc, argv, "dhr:vw")) != -1) { switch (ch) { case 'd': flag_debug++; break; case 'r': if (!optarg) { fprintf (stderr, "%s: missing argument for -r\n", cmd); return 1; } if (str_to_md (optarg, relative_md)) flag_relative = optarg; else { fprintf (stderr, "%s: not a valid fingerprint\n", optarg); return 1; } break; case 'v': flag_verbose++; break; case 'w': flag_warning++; break; default : usage(); return (ch != 'h'); } } argc -= optind; argv += optind; if (argc == 0) usage(); rabin_reset (); if (flag_verbose && flag_relative) { fprintf (stdout, "distances are relative to %s\n", flag_relative); } file = (File *) calloc (argc, sizeof (File)); for (j = 0; j < argc; j++) process_file (argv[j]); if (flag_verbose) { fprintf (stdout, "%li bytes in %i files\n", file_bytes, file_count); } return 0; } [-- Attachment #3: rabinpoly.c --] [-- Type: application/octet-stream, Size: 3648 bytes --] /* * * Copyright (C) 1999 David Mazieres (dm@uun.org) * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License as * published by the Free Software Foundation; either version 2, or (at * your option) any later version. * * This program is distributed in the hope that it will be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. 
[-- Attachment #3: rabinpoly.c --]
[-- Type: application/octet-stream, Size: 3648 bytes --]

/*
 *
 * Copyright (C) 1999 David Mazieres (dm@uun.org)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 *
 */

/* Faster generic_fls */
/* (c) 2002, D.Phillips and Sistina Software */

#include "rabinpoly.h"

#define MSB64 0x8000000000000000ULL

static inline unsigned fls8(unsigned n)
{
	return n & 0xf0 ?
		n & 0xc0 ? (n >> 7) + 7 : (n >> 5) + 5 :
		n & 0x0c ? (n >> 3) + 3 : n - ((n + 1) >> 2);
}

static inline unsigned fls16(unsigned n)
{
	return n & 0xff00 ? fls8(n >> 8) + 8 : fls8(n);
}

static inline unsigned fls32(unsigned n)
{
	return n & 0xffff0000 ? fls16(n >> 16) + 16 : fls16(n);
}

static inline unsigned fls64(unsigned long long n) /* should be u64 */
{
	return n & 0xffffffff00000000ULL ? fls32(n >> 32) + 32 : fls32(n);
}

static u_int64_t polymod (u_int64_t nh, u_int64_t nl, u_int64_t d);
static void polymult (u_int64_t *php, u_int64_t *plp,
		      u_int64_t x, u_int64_t y);
static u_int64_t polymmult (u_int64_t x, u_int64_t y, u_int64_t d);

static u_int64_t poly = 0xb15e234bd3792f63ull;	// Actual polynomial
static u_int64_t T[256];			// Lookup table for mod
static int shift;

u_int64_t append8 (u_int64_t p, u_char m)
{
	return ((p << 8) | m) ^ T[p >> shift];
}

static u_int64_t polymod (u_int64_t nh, u_int64_t nl, u_int64_t d)
{
	int i;
	int k;

	assert (d);
	k = fls64 (d) - 1;
	d <<= 63 - k;

	if (nh) {
		if (nh & MSB64)
			nh ^= d;
		for (i = 62; i >= 0; i--)
			if (nh & 1ULL << i) {
				nh ^= d >> (63 - i);
				nl ^= d << (i + 1);
			}
	}
	for (i = 63; i >= k; i--)
		if (nl & 1ULL << i)
			nl ^= d >> (63 - i);
	return nl;
}

static void polymult (u_int64_t *php, u_int64_t *plp,
		      u_int64_t x, u_int64_t y)
{
	int i;
	u_int64_t ph = 0, pl = 0;

	if (x & 1)
		pl = y;
	for (i = 1; i < 64; i++)
		if (x & (1ULL << i)) {
			ph ^= y >> (64 - i);
			pl ^= y << i;
		}
	if (php)
		*php = ph;
	if (plp)
		*plp = pl;
}

static u_int64_t polymmult (u_int64_t x, u_int64_t y, u_int64_t d)
{
	u_int64_t h, l;

	polymult (&h, &l, x, y);
	return polymod (h, l, d);
}

static int size = RABIN_WINDOW_SIZE;
static u_int64_t fingerprint = 0;
static int bufpos = -1;
static u_int64_t U[256];
static u_char buf[RABIN_WINDOW_SIZE];

void rabin_init ()
{
	u_int64_t sizeshift = 1;
	u_int64_t T1;
	int xshift, i, j;

	assert (poly >= 0x100);
	xshift = fls64 (poly) - 1;
	shift = xshift - 8;

	T1 = polymod (0, 1ULL << xshift, poly);
	for (j = 0; j < 256; j++)
		T[j] = polymmult (j, T1, poly) | ((u_int64_t) j << xshift);

	for (i = 1; i < size; i++)
		sizeshift = append8 (sizeshift, 0);
	for (i = 0; i < 256; i++)
		U[i] = polymmult (i, sizeshift, poly);

	bzero (buf, sizeof (buf));
}

void rabin_reset ()
{
	rabin_init ();
	fingerprint = 0;
	bzero (buf, sizeof (buf));
}

u_int64_t rabin_slide8 (u_char m)
{
	u_char om;

	if (++bufpos >= size)
		bufpos = 0;
	om = buf[bufpos];
	buf[bufpos] = m;
	fingerprint = append8 (fingerprint ^ U[om], m);
	return fingerprint;
}
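/* Usage sketch (hypothetical; not part of the posted attachment):
   after rabin_reset(), each call to rabin_slide8() returns the Rabin
   fingerprint of the most recent RABIN_WINDOW_SIZE bytes, which is
   what gsimm.c samples to build its frequency table. */
#if 0
#include <stdio.h>

int main (void)
{
	static const u_char msg[] =
		"the quick brown fox jumps over the lazy dog";
	u_int64_t fp = 0;
	size_t i;

	rabin_reset ();
	for (i = 0; i < sizeof (msg) - 1; i++)
		fp = rabin_slide8 (msg[i]);

	/* fp covers the trailing window (here the whole message,
	   since it is shorter than RABIN_WINDOW_SIZE bytes). */
	printf ("%016llx\n", (unsigned long long) fp);
	return 0;
}
#endif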
[-- Attachment #4: rabinpoly.h --]
[-- Type: application/octet-stream, Size: 1015 bytes --]

/*
 *
 * Copyright (C) 2000 David Mazieres (dm@uun.org)
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2, or (at
 * your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
 * USA
 *
 * Translated to C and simplified by Geert Bosch (bosch@gnat.com)
 */

#include <assert.h>
#include <strings.h>
#include <sys/types.h>

#ifndef RABIN_WINDOW_SIZE
#define RABIN_WINDOW_SIZE 48
#endif

void rabin_reset();
u_int64_t rabin_slide8(u_char c);

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Fix up diffcore-rename scoring
  2006-04-06 21:01 ` Geert Bosch
@ 2006-04-11 22:04   ` Junio C Hamano
  2006-04-14 17:46     ` Geert Bosch
  0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2006-04-11 22:04 UTC (permalink / raw)
  To: Geert Bosch; +Cc: git

Geert Bosch <bosch@adacore.com> writes:

> On Mar 13, 2006, at 02:44, Linus Torvalds wrote:
>
>> It might be that the fast delta thing is a good way to ask
>> "is this even worth considering", to cut down the O(m*n)
>> rename/copy detection to something much smaller, and then use
>> xdelta() to actually figure out what is a good rename and
>> what isn't from a much smaller set of potential targets.
>
> Here's a possible way to do that first cut.  Basically,
> compute a short (256-bit) fingerprint for each file, such
> that the Hamming distance between two fingerprints is a measure
> for their similarity.  I'll include a draft write up below.

Thanks for starting this.

There are a few things I need to say about the way "similarity"
is _used_ in the current algorithms.

Rename/copy detection outputs "similarity", but I suspect what
the algorithm wants is slightly different from what humans think
of as "similarity".  It is somewhere between "similarity" and
"commonness".  When you are grading a 130-page report a student
submitted, you would want to notice that the last 30 pages are an
almost verbatim copy from somebody else's report.  The student in
question added 100 pages of original content, so maybe this is
not too bad; but if the report were a 30-page one, and the entire
30 pages were borrowed from somebody else's 130-page report, you
would _really_ want to notice.

While reorganizing a program, a nontrivial amount of text is
often removed from an existing file and moved to a newly created
file.  Right now, the similarity score calculation has a
heuristic cap that rejects two files whose sizes are very
different, but to detect and show this kind of file split, the
sizes of the files should matter less.

On the other hand, taking this "commonness matters" to its
extreme is not what we want either.  We are producing "diff", so
if a 30-line new file was created by moving those 30 lines out of
an originally 130-line file (which is now 100 lines long),
showing it as "copy from the-130-line-file" with a diff that
removes 100 lines is usually not what we want.  That is why the
size cap makes sense in the rename similarity estimator.

Another place we use "similarity" is to break a file that got
modified too much.  This is done for two independent purposes.

One is to detect a case like this:

	mv B C
	mv A B
	small-edit B

File B's content is not related to what it had originally, but is
derived from what was originally in A.  Usually rename/copy
detection tries to find renames/copies into files that
_disappear_ from the result, but with the above sequence, B never
disappears.  By looking at how dissimilar the preimage and
postimage of B are, we tell the rename/copy detector that B,
although it does not disappear, might have been renamed/copied
from somewhere else.
Another is to present the final diff output as a complete
rewrite.  When -B (break) is used without -M (rename) or -C
(copy), or when a file that got a lot of editing and got "broken"
turns out to be purely a total rewrite (i.e. not renamed/copied
from somewhere else), we present it as diff output that has only
one hunk, with a bunch of '-' (removal) lines to remove all the
original contents first and then '+' (addition) lines to add all
the new contents.  That is often easier to read than an ordinary
unidiff between two unrelated contents that matches up lines that
happen to be the same.

Empirically, it seems to give better results if the "similarity"
threshold to "break" a file (i.e. to consider that it might have
been renamed/copied from somewhere else) is set lower than the
threshold to show the diff as a complete rewrite patch.

Also, we can make commonness matter even more in the similarity
used to "break" a file than in the rename detector, because if we
are going to break it, we do not have to worry about showing an
annoying diff that removes 100 lines after copying a 130-line
file.  This implies that the break algorithm needs to use two
different kinds of similarity: one for breaking, and another for
deciding how to show the broken pieces as a diff.

Sorry if this write-up does not make much sense.  It ended up
being a lot more incoherent than I hoped it would be.  Anyway,
sometime this week I'll find time to play with your code myself.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: Fix up diffcore-rename scoring
  2006-04-11 22:04 ` Junio C Hamano
@ 2006-04-14 17:46   ` Geert Bosch
  0 siblings, 0 replies; 13+ messages in thread
From: Geert Bosch @ 2006-04-14 17:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Apr 11, 2006, at 18:04, Junio C Hamano wrote:
>> Here's a possible way to do that first cut.  Basically,
>> compute a short (256-bit) fingerprint for each file, such
>> that the Hamming distance between two fingerprints is a measure
>> for their similarity.  I'll include a draft write up below.
>
> Thanks for starting this.
>
> There are a few things I need to say about the way "similarity"
> is _used_ in the current algorithms.
>
> Rename/copy detection outputs "similarity", but I suspect what
> the algorithm wants is slightly different from what humans think
> of as "similarity".  It is somewhere between "similarity" and
> "commonness".  When you are grading a 130-page report a student
> submitted, you would want to notice that the last 30 pages are an
> almost verbatim copy from somebody else's report.  The student in
> question added 100 pages of original content, so maybe this is
> not too bad; but if the report were a 30-page one, and the entire
> 30 pages were borrowed from somebody else's 130-page report, you
> would _really_ want to notice.

There just isn't enough information in a 256-bit fingerprint to
determine whether two strings have a long common substring.  Also,
when the input gets longer, like a few MB, or when the input has
little information content (compresses very well), statistical
bias will reduce reliability.

Still, I used the similarity test on large tar archives, such as
complete GCC releases, and it does give reasonable similarity
estimates.  Unrelated inputs rarely have scores above 5.

potomac% ../gsimm -r d026c470aab28a1086403768a428358f218bba049d47e7d49f8589c2c0baca0c *.tar
 55746560 gcc-2.95.1.tar 123   3.1
 55797760 gcc-2.95.2.tar 112  11.8
 55787520 gcc-2.95.3.tar 112  11.8
 87490560 gcc-3.0.1.tar  112  11.8
 88156160 gcc-3.0.2.tar   78  38.6
 86630400 gcc-3.0.tar     80  37.0
132495360 gcc-3.1.tar       0 100.0

I'm mostly interested in the data storage aspects of git, looking
bottom-up at the blobs stored and deriving information from that.
My similarity estimator allows one to look at thousands of large
checked-in files and quickly identify similar files.  In the above
case, for example, you'd find that it makes sense to store
gcc-3.1.tar as a difference from gcc-3.0.tar.  Doing an actual
diff between these two archives takes a few seconds, while the
fingerprints can be compared in microseconds.

> While reorganizing a program, a nontrivial amount of text is
> often removed from an existing file and moved to a newly created
> file.  Right now, the similarity score calculation has a
> heuristic cap that rejects two files whose sizes are very
> different, but to detect and show this kind of file split, the
> sizes of the files should matter less.

The way to do this is to split a file at content-determined
breakpoints: check the last n bits of a cyclic checksum over a
sliding window, and break if they match a magic number.  This
splits the file into blocks with an expected size of 2^n.  Then
you'd store a fingerprint per chunk; a sketch of that loop
follows.
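(A minimal sketch of such a chunker, using rabin_slide8() from the
attached rabinpoly.c; the magic value and the choice of n = 13 bits,
giving ~8 KB expected chunks, are illustrative assumptions, not part
of the posted code:)

#include <stdio.h>
#include "rabinpoly.h"

#define CHUNK_BITS  13				/* expected chunk size: 2^13 bytes */
#define CHUNK_MASK  ((1ULL << CHUNK_BITS) - 1)
#define CHUNK_MAGIC 0x13ffULL			/* illustrative magic number */

/* Report the content-determined chunk boundaries of data[0..len-1]. */
void chunk_boundaries (u_char *data, size_t len)
{
	size_t j, start = 0;

	rabin_reset ();
	for (j = 0; j < len; j++) {
		u_int64_t hash = rabin_slide8 (data[j]);

		/* Only break once a full window has been seen. */
		if (j + 1 >= RABIN_WINDOW_SIZE
		    && (hash & CHUNK_MASK) == CHUNK_MAGIC) {
			/* Here you would fingerprint data[start..j]. */
			printf ("chunk: bytes %zu..%zu\n", start, j);
			start = j + 1;
		}
	}
	if (start < len)
		printf ("chunk: bytes %zu..%zu\n", start, len - 1);
}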
> [...]
> Another place we use "similarity" is to break a file that got
> modified too much.  This is done for two independent purposes.

This could be done directly using the given algorithm.

> [...] Usually rename/copy
> detection tries to find renames/copies into files that
> _disappear_ from the result, but with the above sequence, B never
> disappears.  By looking at how dissimilar the preimage and
> postimage of B are, we tell the rename/copy detector that B,
> although it does not disappear, might have been renamed/copied
> from somewhere else.

This could also be cheaply determined by my similarity estimator.
Almost always, you'd see a high similarity score.  When there is a
low score, you could verify with a more precise and expensive
algorithm, to get a consistent decision on what is considered a
break.

There is a -v option that gives more verbose output, including
estimated and actual average distances from the origin for the
random walks.  For random input they'll be very close, but for
input with a lot of repetition the actual average will be far
larger.  The ratio can be used as a measure of the reliability of
the fingerprint: ratios closer to 1 are better.

> Also, we can make commonness matter even more in the similarity
> used to "break" a file than in the rename detector, because if we
> are going to break it, we do not have to worry about showing an
> annoying diff that removes 100 lines after copying a 130-line
> file.  This implies that the break algorithm needs to use two
> different kinds of similarity: one for breaking, and another for
> deciding how to show the broken pieces as a diff.
>
> Sorry if this write-up does not make much sense.  It ended up
> being a lot more incoherent than I hoped it would be.

Regular diff algorithms will always give the most precise result.
What my similarity estimator does is give the probability that two
files have a lot of common substrings.  Say you had a git archive
with 10,000 blobs of about 1 MB each, and you wanted to determine
how to pack it.  You clearly can't use diff programs to solve
this, but you can use the estimates.

> Anyway, sometime this week I'll find time to play with your code
> myself.

Thanks, I'm looking forward to your comments.

  -Geert

^ permalink raw reply	[flat|nested] 13+ messages in thread
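(A sketch of the kind of O(n^2) pre-filter described above, comparing
precomputed fingerprints pairwise to find delta-storage candidates;
dist() is the Hamming-distance helper from gsimm.c, and the function
and parameter names here are illustrative assumptions:)

#include <stdio.h>
#include <sys/types.h>

#define MD_LENGTH 32		/* 256-bit fingerprints, as in gsimm.c */

extern int dist (u_char *a, u_char *b);	/* Hamming distance (gsimm.c) */

/* Report all pairs of blobs whose fingerprints differ by at most
   `threshold' bits -- candidates for storing one as a delta
   against the other. */
void find_pack_candidates (u_char md[][MD_LENGTH], int n, int threshold)
{
	int i, j;

	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (dist (md[i], md[j]) <= threshold)
				printf ("candidate pair: %d and %d\n", i, j);
}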
end of thread, other threads:[~2006-04-14 17:47 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-13  6:26 Fix up diffcore-rename scoring Linus Torvalds
2006-03-13  6:44 ` Linus Torvalds
2006-03-13  6:46 ` Junio C Hamano
2006-03-13  7:09 ` Linus Torvalds
2006-03-13  7:42 ` Junio C Hamano
2006-03-13  7:44 ` Linus Torvalds
2006-03-13 10:43 ` Junio C Hamano
2006-03-13 15:38 ` Linus Torvalds
2006-03-14  0:49 ` Rutger Nijlunsing
2006-03-14  0:55 ` Junio C Hamano
2006-04-06 21:01 ` Geert Bosch
2006-04-11 22:04 ` Junio C Hamano
2006-04-14 17:46 ` Geert Bosch