* git-diff-tree -M performance regression in 'next' @ 2006-03-11 17:28 Fredrik Kuivinen 2006-03-12 3:10 ` Junio C Hamano 2006-03-12 12:28 ` Junio C Hamano 0 siblings, 2 replies; 17+ messages in thread From: Fredrik Kuivinen @ 2006-03-11 17:28 UTC (permalink / raw) To: git; +Cc: junkio Hi, I added some time logging code to git-merge-recursive to see exactly what we spend all the time on in merges that involve many changes, such as a merge of a slightly modified v2.6.12 and an unmodified v2.6.15. It turned out that the rename detection took almost 10 minutes on my machine. More specifically, git-diff-tree -r -M --diff-filter=R v2.6.12 v2.6.14 takes ~9 minutes with the current 'next'. With 65416758cd83772f2f3c69f1bd1f501608e64745, which uses the delta code to compute the similarity measure, the above git-diff-tree invocation takes 1.50 minutes. - Fredrik ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-11 17:28 git-diff-tree -M performance regression in 'next' Fredrik Kuivinen @ 2006-03-12 3:10 ` Junio C Hamano 2006-03-12 12:28 ` Junio C Hamano 1 sibling, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2006-03-12 3:10 UTC (permalink / raw) To: Fredrik Kuivinen; +Cc: git, junkio Fredrik Kuivinen <freku045@student.liu.se> writes: > It turned out that the rename detection took almost 10 minutes on my > machine. Yes, that is one of the reasons why it still is in "next", not in "master". The rename-detector change was done primarily to work around the correctness problem the finer-grained delta changes would have introduced. The new delta code would have produced far more copies from the source than the current xdelta code, but the nature of the new copies it would have found was quite different from what we would usually call "file being renamed". Now that we have decided to shelve the finer-grained delta code, I do not see a pressing reason to have the experimental rename detector graduate to "master" until we resolve its performance issues. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-11 17:28 git-diff-tree -M performance regression in 'next' Fredrik Kuivinen 2006-03-12 3:10 ` Junio C Hamano @ 2006-03-12 12:28 ` Junio C Hamano 2006-03-12 17:00 ` Linus Torvalds 1 sibling, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2006-03-12 12:28 UTC (permalink / raw) To: Fredrik Kuivinen; +Cc: git Fredrik Kuivinen <freku045@student.liu.se> writes: > It turned out that the rename detection took almost 10 minutes on my > machine. More specifically, > > git-diff-tree -r -M --diff-filter=R v2.6.12 v2.6.14 > > takes ~9 minutes with the current 'next'. I have some updates to "next" tonight. On my otherwise idle Duron 750 with slow disks, I am getting something like these: 0.99.9m : 130m virtual, 40m resident, (0major+14205minor) 67.62user 0.08system 1:15.95elapsed master : 130m virtual, 40m resident, (0major+12510minor) 66.06user 0.07system 1:10.95elapsed "next" : 150m virtual, 65m resident, (0major+49858minor) 51.41user 0.45system 0:57.55elapsed The results will _not_ match exactly, because the similarity estimation algorithms are different. Judging the differences objectively is a bit hard, but my impression is that the "next" one tends to find more sensible renames. To name a few: * Documentation/dvb/README.dibusb from v2.6.12 seems pretty similar to Documentation/dvb/README.dvb-usb from v2.6.14, and "next" finds them, but "master" does not. * "master" says arch/arm/configs/omnimeter_defconfig was copy-edited to produce arch/arm/configs/collie_defconfig; the diff output shows ~350 new lines and ~270 deleted lines, while these files are 800-900 lines long; "next" rejects them. I think this is a border-line case. * "next" finds Kconfig and Makefile in arch/arm/mach-omap-1/ are derived from arch/arm/mach-omap/; manual inspection of these files makes me feel that decision is sensible. "master" does not find them. 
* Same thing for config.c in arch/m68knommu/platform/68VZ328/; found to be derived from arch/m68knommu/platform/68VZ328/de2/ by "next" but not by "master". * Other examples "next" finds but "master" misses include: arch/um/kernel/process.c arch/um/os-Linux/start_up.c arch/um/kernel/tt/unmap.c arch/um/sys-x86_64/unmap.c drivers/ide/cris/ide-v10.c drivers/ide/cris/ide-cris.c include/asm-ppc/cputime.h include/asm-xtensa/cputime.h include/asm-ppc64/ioctl.h include/asm-xtensa/ioctl.h include/asm-ppc64/ioctls.h include/asm-xtensa/ioctls.h include/asm-ppc64/mman.h include/asm-xtensa/mman.h include/asm-ppc64/poll.h include/asm-xtensa/poll.h include/asm-ppc64/shmbuf.h include/asm-xtensa/shmbuf.h include/asm-ppc64/socket.h include/asm-xtensa/socket.h include/asm-ppc64/termbits.h include/asm-xtensa/termbits.h * The "next" one is not perfect. There are some quite bad choices made by it: include/asm-ppc64/timex.h include/asm-powerpc/bugs.h include/asm-ppc64/iSeries/LparData.h include/linux/i2c-isa.h ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-12 12:28 ` Junio C Hamano @ 2006-03-12 17:00 ` Linus Torvalds 2006-03-12 19:34 ` Junio C Hamano 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2006-03-12 17:00 UTC (permalink / raw) To: Junio C Hamano; +Cc: Fredrik Kuivinen, git On Sun, 12 Mar 2006, Junio C Hamano wrote: > > On my otherwise idle Duron 750 with slow disks, I am getting > something like these: > > 0.99.9m : 130m virtual, 40m resident, (0major+14205minor) > 67.62user 0.08system 1:15.95elapsed > master : 130m virtual, 40m resident, (0major+12510minor) > 66.06user 0.07system 1:10.95elapsed > "next" : 150m virtual, 65m resident, (0major+49858minor) > 51.41user 0.45system 0:57.55elapsed Any way to fix that "4 times as many page misses, and 70% bigger rss" thing? It looks like you're not very careful about your memory use. I realize that git in general wants a lot of memory, but I see that as a failure most of the time. I've got 2GB in most of my machines, but I shouldn't _need_ to have it.. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-12 17:00 ` Linus Torvalds @ 2006-03-12 19:34 ` Junio C Hamano 2006-03-13 0:42 ` Junio C Hamano 0 siblings, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2006-03-12 19:34 UTC (permalink / raw) To: Linus Torvalds; +Cc: Fredrik Kuivinen, git Linus Torvalds <torvalds@osdl.org> writes: > On Sun, 12 Mar 2006, Junio C Hamano wrote: >> >> master : 130m virtual, 40m resident, (0major+12510minor) >> 66.06user 0.07system 1:10.95elapsed >> "next" : 150m virtual, 65m resident, (0major+49858minor) >> 51.41user 0.45system 0.57.55elapsed > > Any way to fix that "4 times as many page misses, and 70% bigger rss?" > thing? It looks like you're not very careful about your memory use. I started from "80 times as many page misses and 5 times bigger rss", shrunk it down to the above after doing more careful memory use, running out of ideas to shrink it more, and pushed it out. So... ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-12 19:34 ` Junio C Hamano @ 2006-03-13 0:42 ` Junio C Hamano 2006-03-13 1:09 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2006-03-13 0:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Fredrik Kuivinen, git > Linus Torvalds <torvalds@osdl.org> writes: > >> On Sun, 12 Mar 2006, Junio C Hamano wrote: >>> >>> master : 130m virtual, 40m resident, (0major+12510minor) >>> 66.06user 0.07system 1:10.95elapsed >>> "next" : 150m virtual, 65m resident, (0major+49858minor) >>> 51.41user 0.45system 0:57.55elapsed >> >> Any way to fix that "4 times as many page misses, and 70% bigger rss?" >> thing? It looks like you're not very careful about your memory use. "this" : 145m virtual, 57m resident, (0major+18855minor) 39.81user 0.28system 0:42.16elapsed 50% more page misses, 45% bigger rss, 40% less usertime. Slowly getting there... -- >8 -- [PATCH] diffcore-delta: make the hash a bit denser. To reduce wasted memory, wait until the hash fills up more densely before we rehash. This reduces the working set size a bit further. Signed-off-by: Junio C Hamano <junkio@cox.net> --- diffcore-delta.c | 13 +++++++++---- diffcore-rename.c | 4 +++- 2 files changed, 12 insertions(+), 5 deletions(-) af0b459589edaa77c51a892dd7dc44329634d253 diff --git a/diffcore-delta.c b/diffcore-delta.c index 471b98f..f8a7518 100644 --- a/diffcore-delta.c +++ b/diffcore-delta.c @@ -25,8 +25,12 @@ */ /* Wild guess at the initial hash size */ -#define INITIAL_HASH_SIZE 10 +#define INITIAL_HASH_SIZE 9 #define HASHBASE 65537 /* next_prime(2^16) */ +/* We leave more room in smaller hash but do not let it + * grow to have unused hole too much. 
+ */ +#define INITIAL_FREE(sz_log2) ((1<<(sz_log2))*(sz_log2-3)/(sz_log2)) struct spanhash { unsigned long hashval; @@ -38,7 +42,8 @@ struct spanhash_top { struct spanhash data[FLEX_ARRAY]; }; -static struct spanhash *spanhash_find(struct spanhash_top *top, unsigned long hashval) +static struct spanhash *spanhash_find(struct spanhash_top *top, + unsigned long hashval) { int sz = 1 << top->alloc_log2; int bucket = hashval & (sz - 1); @@ -62,7 +67,7 @@ static struct spanhash_top *spanhash_reh new = xmalloc(sizeof(*orig) + sizeof(struct spanhash) * sz); new->alloc_log2 = orig->alloc_log2 + 1; - new->free = osz; + new->free = INITIAL_FREE(new->alloc_log2); memset(new->data, 0, sizeof(struct spanhash) * sz); for (i = 0; i < osz; i++) { struct spanhash *o = &(orig->data[i]); @@ -122,7 +127,7 @@ static struct spanhash_top *hash_chars(u i = INITIAL_HASH_SIZE; hash = xmalloc(sizeof(*hash) + sizeof(struct spanhash) * (1<<i)); hash->alloc_log2 = i; - hash->free = (1<<i)/2; + hash->free = INITIAL_FREE(i); memset(hash->data, 0, sizeof(struct spanhash) * (1<<i)); /* an 8-byte shift register made of accum1 and accum2. New diff --git a/diffcore-rename.c b/diffcore-rename.c index b80b432..8380049 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -307,8 +307,10 @@ void diffcore_rename(struct diff_options m->score = estimate_similarity(one, two, minimum_score); } + /* We do not need the text anymore */ free(two->cnt_data); - two->cnt_data = NULL; + free(two->data); + two->data = two->cnt_data = NULL; dst_cnt++; } /* cost matrix sorted by most to least similar pair */ -- 1.2.4.g3dcf ^ permalink raw reply related [flat|nested] 17+ messages in thread
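The effect of the denser policy in the patch above is easy to check numerically. Here is a quick sketch (the macro is copied from the patch; the helper function and its name are mine, purely for illustration): a 2^9-entry table is rehashed at roughly a 2/3 load factor, while a 2^16-entry table is allowed to fill past 81% before growing.

```c
#include <assert.h>

/* Copied from the patch: the number of insertions a table of
 * 2^sz_log2 slots accepts before it is rehashed to twice the size. */
#define INITIAL_FREE(sz_log2) ((1 << (sz_log2)) * ((sz_log2) - 3) / (sz_log2))

/* Illustrative helper: load factor (percent) at which a table of
 * 2^sz_log2 slots triggers a rehash. */
static int rehash_load_percent(int sz_log2)
{
	return INITIAL_FREE(sz_log2) * 100 / (1 << sz_log2);
}
```

Smaller tables keep more slack (fewer collisions while they are still cheap to rebuild), and larger tables run denser (less wasted resident memory), which is the trade-off the commit message describes.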
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 0:42 ` Junio C Hamano @ 2006-03-13 1:09 ` Linus Torvalds 2006-03-13 1:22 ` Junio C Hamano 2006-03-13 1:29 ` Linus Torvalds 0 siblings, 2 replies; 17+ messages in thread From: Linus Torvalds @ 2006-03-13 1:09 UTC (permalink / raw) To: Junio C Hamano; +Cc: Fredrik Kuivinen, git On Sun, 12 Mar 2006, Junio C Hamano wrote: > > To reduce wasted memory, wait until the hash fills up more > densely before we rehash. This reduces the working set size a > bit further. Umm. Why do you rehash at all? Just take the size of the "src" file as the initial hash size. Also, I think it is likely really wasteful to try to actually hash at _each_ character. Instead, let's say that the chunk-size is 8 bytes (like you do now), and let's say that you have a good 32-bit hash of those 8 bytes. What you can do is: - for each 8 bytes in the source, hash those 8 bytes (not every byte) - for each byte in the other file, hash the next 8 bytes. IF it matches a hash in the source with a non-zero count, subtract the count for that hash and move up by _eight_ characters! If it doesn't, add one to the "characters not matched" counter, and move up _one_ character, and try again. At the end of this, you have two counts: the count of characters that you couldn't match in the other file, and the count of 8-byte hash-chunks that you couldn't match in the first one. Use those two counts to decide if it's close or not. Especially for good matches, this should basically cut your work into an eighth of what you do now. Actually, even for bad matches, you cut the first source overhead into one eighth (the second file will obviously do the "update by 1 byte" most of the time). Don't you think that would be as accurate as what you're doing now (it's the same basic notion, after all), and noticeably faster? Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
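Linus's two-count scheme can be sketched in a few lines of C. Everything here is illustrative: the fixed-size count table, the toy chunk hash, and the function names are mine, not from any git code.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 4096			/* toy fixed-size count table */
#define CHUNK 8

/* Toy hash of an 8-byte chunk; any decent mixing function works here. */
static unsigned hash8(const unsigned char *p)
{
	unsigned h = 0;
	for (int i = 0; i < CHUNK; i++)
		h = h * 131 + p[i];
	return h % TABLE_SIZE;
}

/* Returns the number of dst bytes that could not be matched against an
 * 8-byte chunk of src; *src_left gets the count of src chunks that were
 * never claimed by dst. */
static unsigned long count_unmatched(const unsigned char *src, unsigned long ssz,
				     const unsigned char *dst, unsigned long dsz,
				     unsigned long *src_left)
{
	static unsigned count[TABLE_SIZE];
	unsigned long i, unmatched = 0, claimed = 0, chunks = ssz / CHUNK;

	memset(count, 0, sizeof(count));
	for (i = 0; i + CHUNK <= ssz; i += CHUNK)	/* hash src every 8 bytes */
		count[hash8(src + i)]++;

	i = 0;
	while (i + CHUNK <= dsz) {
		unsigned h = hash8(dst + i);
		if (count[h]) {		/* chunk seen in src: consume it, jump 8 */
			count[h]--;
			claimed++;
			i += CHUNK;
		} else {		/* no match: one unmatched byte, slide 1 */
			unmatched++;
			i++;
		}
	}
	unmatched += dsz - i;		/* trailing bytes shorter than a chunk */
	*src_left = chunks - claimed;
	return unmatched;
}
```

A real implementation would have to handle hash collisions and turn the two counts into a similarity score, but the asymmetry Linus describes is visible: the source is hashed once per 8 bytes, and the destination advances 8 bytes at a time whenever it is matching.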
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 1:09 ` Linus Torvalds @ 2006-03-13 1:22 ` Junio C Hamano 2006-03-13 1:39 ` Linus Torvalds 1 sibling, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2006-03-13 1:22 UTC (permalink / raw) To: Linus Torvalds; +Cc: Fredrik Kuivinen, git Linus Torvalds <torvalds@osdl.org> writes: > Umm. Why do you rehash at all? > > Just take the size of the "src" file as the initial hash size. The code uses close to a 16-bit hash and I had a 65k flat array as a hashtable. That one was what you commented on as "4 times as many page misses". Interestingly enough, that kind of flat array representation seems to be too sparse and gives very bad performance behaviour. The improvement I mentioned in the message you are replying to is the result of making it into a smaller (starting at (1<<9) or something like that) linear-overflowing hash. The latter suggestion I need to think about a bit more. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 1:22 ` Junio C Hamano @ 2006-03-13 1:39 ` Linus Torvalds 0 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2006-03-13 1:39 UTC (permalink / raw) To: Junio C Hamano; +Cc: Fredrik Kuivinen, git On Sun, 12 Mar 2006, Junio C Hamano wrote: > > The code uses close to 16-bit hash and I had 65k flat array as a > hashtable. That one was what you commented as "4-times as many > page misses". Ahh. That explains the limited bits in the hash function too. I only looked at the current sources, not at the historic ones. Btw, the page misses may come from the fact that you allocated and re-allocated the flat array all the time. That can be very expensive for big allocations, since most libraries may decide that it's a big enough area that it should be map/unmap'ed in order to give memory back to the system (without realizing that there's another allocation coming soon afterwards of the same size). If you map/unmap, the kernel will end up having to not just use new pages, but obviously also clear them for security reasons. So it ends up sucking on many levels. In contrast, if you just have a 64k-entry array of "int", and allocate it _once_ (instead of once per file-pair) you'll still have to clear it in between file-pairs, but at least you won't have the overhead of mapping/unmapping it. The clearing can still be pretty expensive (64k "int" entries is 256kB, and since most _files_ are just in the ~4-8kB range, you're spending a _lot_ of time just memset'ing). Which is why it's probably a good idea to instead default to having just "filesize / 8" entries, but then you can obviously not use the hash as the index any more. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 1:09 ` Linus Torvalds 2006-03-13 1:22 ` Junio C Hamano @ 2006-03-13 1:29 ` Linus Torvalds 2006-03-13 1:31 ` Linus Torvalds ` (2 more replies) 1 sibling, 3 replies; 17+ messages in thread From: Linus Torvalds @ 2006-03-13 1:29 UTC (permalink / raw) To: Junio C Hamano; +Cc: Fredrik Kuivinen, git On Sun, 12 Mar 2006, Linus Torvalds wrote: > > Also, I think it is likely really wasteful to try to actually hash at > _each_ character. Instead, let's say that the chunk-size is 8 bytes (like > you do now), and let's say that you have a 32-bit good hash of those 8 > bytes. What you can do is: Side note: regardless, your new algorithm clearly does seem faster. However, it worries me a bit that you don't check the source strings, especially since the hash you use seems somewhat suspect (why limit it to essentially just 16 bits? Wouldn't it be best to have the _biggest_ prime that fits in the "hashval" thing, which is at least 32 bits? Also, shouldn't you make that spanhash thing use a "unsigned int" instead of "unsigned long"?) So I would suggest instead the hash function to be: typedef unsigned long long u64; /* Biggest prime in 32 bits */ #define HASHVAL (4294967291u) u64 value = *(u64 *)src; src += 8; hash = value % 4294967291u; which does a 64-bit modulus, but hey, 64-bit hw isn't _that_ uncommon any more, and it's not _excessively_ slow on x86 (gcc doesn't generate good code, but we could actually use the kernel "do_div()" routine for much faster division of 64 % 32 than what gcc can do - since the dividend is 32-bit, you actually only need to do one 32/32 division and one 64/32 division, so the optimized hash function on a good x86 will be just in the teens of cycles for the 64-bit modulus). Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 1:29 ` Linus Torvalds @ 2006-03-13 1:31 ` Linus Torvalds 2006-03-13 2:29 ` Linus Torvalds 2006-03-13 4:14 ` Junio C Hamano 2 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2006-03-13 1:31 UTC (permalink / raw) To: Junio C Hamano; +Cc: Fredrik Kuivinen, git On Sun, 12 Mar 2006, Linus Torvalds wrote: > > u64 value = *(u64 *)src; > src += 8; > hash = value % 4294967291u; Btw, this assumes the "only hash every 8 bytes" at the source, in which case this is ok even on architectures that need aligned reads. For the non-aligned reads, you'd need your "shift the value" approach. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 1:29 ` Linus Torvalds 2006-03-13 1:31 ` Linus Torvalds @ 2006-03-13 2:29 ` Linus Torvalds 2006-03-13 2:53 ` Linus Torvalds 2006-03-13 4:14 ` Junio C Hamano 2 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2006-03-13 2:29 UTC (permalink / raw) To: Junio C Hamano; +Cc: Fredrik Kuivinen, git On Sun, 12 Mar 2006, Linus Torvalds wrote: > > So I would suggest instead the hash function to be: > > typedef unsigned long long u64; > > /* Biggest prime in 32 bits */ > #define HASHVAL (4294967291u) > > > u64 value = *(u64 *)src; > src += 8; > hash = value % 4294967291u; > > which does a 64-bit modulus, but hey, 64-bit hw isn't _that_ uncommon any > more, and it's not _excessively_ slow on x86 (gcc doesn't generate good > code, but we could actually use the kernel "do_div()" routine for much > faster division of 64 % 32 than what gcc can do - since the dividend is > 32-bit, you actually only need to do one 32/32 division and one 64/32 > division, so the optimized hash function on a good x86 will be just in > the teens of cycles for the 64-bit modulus). Actually, on x86, where we can do the 64-by-32 division with a single instruction, this seems to be the best possible code: #define HASH_VAL (4294967291u) static inline unsigned int hash64x32(unsigned long long a) { unsigned int high, low; unsigned int modulus; asm("" :"=a" (low), "=d" (high) :"A" (a)); if (high > HASH_VAL) high -= HASH_VAL; asm("divl %2" :"=a" (low), "=d" (modulus) :"rm" (HASH_VAL), "0" (low), "1" (high)); return modulus; } at least as far as I can think. The magic is the reduction of the high 32 bits: for the general case you want a modulus for that reduction, but since we're dividing with a really big value, the modulus of the high bits really does end up becoming just that simple if (high > HASH_VAL) high -= HASH_VAL; and then we just need to do a single "divl" instruction. 
So if you want a 32-bit hash of an 8-byte area, this should be a pretty good and fast way to calculate it. Of course, maybe just doing an adler32() is simpler/better. But with the above, at least on x86, I suspect you get an even better distribution at a lower cost (of course, different cores do differently well on large divisions, but integer division being pretty important for some codes, modern cores actually tend to be pretty good at it). Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 2:29 ` Linus Torvalds @ 2006-03-13 2:53 ` Linus Torvalds 0 siblings, 0 replies; 17+ messages in thread From: Linus Torvalds @ 2006-03-13 2:53 UTC (permalink / raw) To: Junio C Hamano; +Cc: Fredrik Kuivinen, git On Sun, 12 Mar 2006, Linus Torvalds wrote: > > if (high > HASH_VAL) > high -= HASH_VAL; That should be ">= HASH_VAL", of course. I'm a retard. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
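For reference, the same reduction can be written in portable C (no inline asm) and checked against a plain 64-bit modulus. The function name follows Linus's sketch and the `>=` comparison incorporates his correction; this is only an illustration, not code from git.

```c
#include <assert.h>

#define HASH_VAL 4294967291u	/* biggest prime below 2^32 (= 2^32 - 5) */

/* Portable version of the 64-by-32 reduction: fold the high word
 * modulo HASH_VAL first, so the final 64/32 divide cannot overflow. */
static unsigned int hash64x32(unsigned long long a)
{
	unsigned int high = (unsigned int)(a >> 32);
	unsigned int low = (unsigned int)a;

	if (high >= HASH_VAL)	/* the ">=" from the follow-up fix */
		high -= HASH_VAL;
	return (unsigned int)((((unsigned long long)high << 32) | low) % HASH_VAL);
}
```

The `high -= HASH_VAL` step is what keeps the quotient of the final divide under 2^32: after the reduction the dividend is strictly less than HASH_VAL * 2^32, which is exactly the condition `divl` needs to avoid a divide-error.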
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 1:29 ` Linus Torvalds 2006-03-13 1:31 ` Linus Torvalds 2006-03-13 2:29 ` Linus Torvalds @ 2006-03-13 4:14 ` Junio C Hamano 2006-03-14 2:55 ` Junio C Hamano 2 siblings, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2006-03-13 4:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: Fredrik Kuivinen, git Linus Torvalds <torvalds@osdl.org> writes: > However, it worries me a bit that you don't check the source strings, > especially since the hash you use seems somewhat suspect (why limit it to > essentially just 16 bits? Wouldn't it be best to have the _biggest_ prime > that fits in the "hashval" thing, which is at least 32 bits? Also, > shouldn't you make that spanhash thing use a "unsigned int" instead of > "unsigned long"?) > > So I would suggest instead the hash function to be: > > typedef unsigned long long u64; > > /* Biggest prime in 32 bits */ > #define HASHVAL (4294967291u) Actually what you are seeing is the best compromise I've come up with. There were about a dozen crappier attempts that turned out to be suboptimal, which I threw away. The hashsize is an interesting example of such a compromise between precision and performance. It is true that a full 32-bit hash gives us more precise match detection. If you change the current hash function to divide with (4294967291u), the rename similarity score assigned to some (but not all) paths in the example dataset we have been using (diff between v2.6.12 and v2.6.14) decreases by about 1% -- this comes from the fact that the hashvalue capped to 16 bits gives some false hits that a larger hashbase value can tell apart. Because of it, some paths near the rename detection threshold stop being recognized as renames. However, the runtime of the same dataset increases quite a bit when we do this. 
I think this is because we need to keep more distinct hash values in the hashtable, and the cost of random access into a huge table starts to hurt, unlike the 16-bit capped version whose hash table never needs to grow beyond 65k entries. next 39.77user 0.22system 0:40.78elapsed 0inputs+0outputs (0major+18855minor) 32-bit 66.32user 0.15system 1:07.00elapsed 0inputs+0outputs (0major+21057minor) Admittedly, we should not optimize for one particular test case, but we should start from somewhere. Decreasing the hashsize to 14 bits or less caused noticeable degradation in the quality of renames the algorithm detects and fails to detect, both in the v2.6.12..v2.6.14 test and some tests in the git.git repository. One obvious change we could make is to set the hashsize to 0x1800D (a prime halfway between 16 and 17 bits); currently the hashtable doubles every time the table of the current size fills up sufficiently, but the current hashbase cannot fit in 16 bits, so we _are_ already using a table with 128K entries in some cases. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-13 4:14 ` Junio C Hamano @ 2006-03-14 2:55 ` Junio C Hamano 2006-03-14 3:47 ` Linus Torvalds 0 siblings, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2006-03-14 2:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Fredrik Kuivinen, git While we are hacking away with weird ideas... Here is still a WIP, but an insanely fast one (actually this is a modified version of what once was in next). I haven't verified the sanity of its output fully, but from a cursory look, what it finds looks sensible. The same v2.6.12..v2.6.14 test on my Duron 750: master 64.65user 0.17system 1:05.42elapsed 0inputs+0outputs (0major+12511minor) next 40.69user 0.14system 0:40.98elapsed 0inputs+0outputs (0major+19471minor) "this" 5.59user 0.09system 0:05.68elapsed 0inputs+0outputs (0major+13015minor) The hash used here is heavily optimized for handling text files and nothing else. Actually, it punts on a file that contains a NUL byte. The hash is computed by first skipping sequences of whitespace letters (including LF); upon seeing a non-whitespace, we start hashing, while still ignoring whitespaces, until we hit the next LF (or EOF). Then we store the real number of bytes along with the hash. When we find the matching hash value in the destination, we say that many bytes (including the whitespaces we ignored while hashing) were copied. The patch should apply on top of the current "next". --- diff --git a/diffcore-delta.c b/diffcore-delta.c index 835d82c..0f4866e 100644 --- a/diffcore-delta.c +++ b/diffcore-delta.c @@ -3,25 +3,10 @@ #include "diffcore.h" /* - * Idea here is very simple. - * - * We have total of (sz-N+1) N-byte overlapping sequences in buf whose - * size is sz. If the same N-byte sequence appears in both source and - * destination, we say the byte that starts that sequence is shared - * between them (i.e. copied from source to destination). 
- * - * For each possible N-byte sequence, if the source buffer has more - * instances of it than the destination buffer, that means the - * difference are the number of bytes not copied from source to - * destination. If the counts are the same, everything was copied - * from source to destination. If the destination has more, - * everything was copied, and destination added more. - * - * We are doing an approximation so we do not really have to waste - * memory by actually storing the sequence. We just hash them into - * somewhere around 2^16 hashbuckets and count the occurrences. - * - * The length of the sequence is arbitrarily set to 8 for now. + * Record the hashes for "extended lines" in both source and destination, + * and compare how similar they are. "Extended lines" hash is designed + * to work well on text files -- leading whitespaces and tabs, and consecutive + * LF characters are effectively ignored. */ /* Wild guess at the initial hash size */ @@ -40,8 +25,9 @@ #define HASHBASE 107927 struct spanhash { - unsigned long hashval; - unsigned long cnt; + unsigned long hashval; /* hash for the line */ + unsigned bytes; /* real number of bytes in such a line */ + unsigned long cnt; /* occurrences */ }; struct spanhash_top { int alloc_log2; @@ -87,6 +73,7 @@ static struct spanhash_top *spanhash_reh if (!h->cnt) { h->hashval = o->hashval; h->cnt = o->cnt; + h->bytes = o->bytes; new->free--; break; } @@ -99,7 +86,8 @@ static struct spanhash_top *spanhash_reh } static struct spanhash_top *add_spanhash(struct spanhash_top *top, - unsigned long hashval) + unsigned long hashval, + unsigned bytes) { int bucket, lim; struct spanhash *h; @@ -110,6 +98,7 @@ static struct spanhash_top *add_spanhash h = &(top->data[bucket++]); if (!h->cnt) { h->hashval = hashval; + h->bytes = bytes; h->cnt = 1; top->free--; if (top->free < 0) @@ -117,6 +106,14 @@ static struct spanhash_top *add_spanhash return top; } if (h->hashval == hashval) { + if (h->bytes != bytes) { + /* avg of 
h->cnt instances were h->bytes + * now we are adding bytes + */ + h->bytes = ((h->cnt / 2 + bytes + + h->cnt * h->bytes) / + (h->cnt + 1)); + } h->cnt++; return top; } @@ -125,11 +122,49 @@ static struct spanhash_top *add_spanhash } } -static struct spanhash_top *hash_chars(unsigned char *buf, unsigned long sz) +static unsigned long hash_extended_line(const unsigned char **buf_p, + unsigned long left) +{ + /* An extended line is zero or more whitespace letters (including LF) + * followed by one non whitespace letter followed by zero or more + * non LF, and terminated with by a LF (or EOF). + */ + const unsigned char *bol = *buf_p; + const unsigned char *buf = bol; + unsigned long hashval = 0; + while (left) { + unsigned c = *buf++; + if (!c) + goto binary; + left--; + if (' ' < c) { + hashval = c; + break; + } + } + while (left) { + unsigned c = *buf++; + if (!c) + goto binary; + left--; + if (c == '\n') + break; + if (' ' < c) + hashval = hashval * 11 + c; + } + *buf_p = buf; + return hashval; + + binary: + *buf_p = NULL; + return 0; +} + +static struct spanhash_top *hash_lines(const unsigned char *buf, unsigned long sz) { int i; - unsigned long accum1, accum2, hashval; struct spanhash_top *hash; + const unsigned char *eobuf = buf + sz; i = INITIAL_HASH_SIZE; hash = xmalloc(sizeof(*hash) + sizeof(struct spanhash) * (1<<i)); @@ -137,19 +172,14 @@ static struct spanhash_top *hash_chars(u hash->free = INITIAL_FREE(i); memset(hash->data, 0, sizeof(struct spanhash) * (1<<i)); - /* an 8-byte shift register made of accum1 and accum2. 
New - * bytes come at LSB of accum2, and shifted up to accum1 - */ - for (i = accum1 = accum2 = 0; i < 7; i++, sz--) { - accum1 = (accum1 << 8) | (accum2 >> 24); - accum2 = (accum2 << 8) | *buf++; - } - while (sz) { - accum1 = (accum1 << 8) | (accum2 >> 24); - accum2 = (accum2 << 8) | *buf++; - hashval = (accum1 + accum2 * 0x61) % HASHBASE; - hash = add_spanhash(hash, hashval); - sz--; + while (buf < eobuf) { + const unsigned char *ptr = buf; + unsigned long hashval = hash_extended_line(&buf, eobuf-ptr); + if (!buf) { + free(hash); + return NULL; + } + hash = add_spanhash(hash, hashval, buf-ptr); } return hash; } @@ -166,21 +196,18 @@ int diffcore_count_changes(void *src, un struct spanhash_top *src_count, *dst_count; unsigned long sc, la; - if (src_size < 8 || dst_size < 8) - return -1; - src_count = dst_count = NULL; if (src_count_p) src_count = *src_count_p; if (!src_count) { - src_count = hash_chars(src, src_size); + src_count = hash_lines(src, src_size); if (src_count_p) *src_count_p = src_count; } if (dst_count_p) dst_count = *dst_count_p; if (!dst_count) { - dst_count = hash_chars(dst, dst_size); + dst_count = hash_lines(dst, dst_size); if (dst_count_p) *dst_count_p = dst_count; } @@ -193,9 +220,9 @@ int diffcore_count_changes(void *src, un unsigned dst_cnt, src_cnt; if (!s->cnt) continue; - src_cnt = s->cnt; d = spanhash_find(dst_count, s->hashval); - dst_cnt = d ? d->cnt : 0; + src_cnt = s->cnt * s->bytes; + dst_cnt = d ? (d->cnt * d->bytes) : 0; if (src_cnt < dst_cnt) { la += dst_cnt - src_cnt; sc += src_cnt; ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next' 2006-03-14 2:55 ` Junio C Hamano @ 2006-03-14 3:47 ` Linus Torvalds 2006-03-14 10:26 ` Junio C Hamano 0 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2006-03-14 3:47 UTC (permalink / raw) To: Junio C Hamano; +Cc: Fredrik Kuivinen, git On Mon, 13 Mar 2006, Junio C Hamano wrote: > > Here is still a WIP, but an insanely fast one (actually this is > a modified version of what once was in next). I haven't > verified the sanity of its output fully, but from a cursory look, > what it finds looks sensible. The same v2.6.12..v2.6.14 test on > my Duron 750: Heh. I did something similar, except I wanted mine to work with binary data too. Not that I know how _well_ it works, but assuming you have _some_ '\n' characters to fix up offset mismatches, it might do something. Mine is a bit less hacky than yours, I believe. It doesn't skip whitespace; instead it just maintains a rolling 64-bit number, where each character shifts it left by 7 and then adds in the new character value (overflow in 32 bits just ignored). Then it uses your old hash function, except it hides the length in the top byte. It breaks the hashing on '\n' or on hitting a 64-byte sequence, whichever comes first. It's fast and stupid, but doesn't seem to do any worse than your old one. The speed comes from the fact that it only does the hash comparisons at the "block boundaries", not at every byte. Anyway, I don't think something like this is really any good for rename detection, but it might be good for deciding whether to do a real delta. 
		Linus

----
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 835d82c..4c6e512 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -127,7 +127,7 @@ static struct spanhash_top *add_spanhash
 static struct spanhash_top *hash_chars(unsigned char *buf, unsigned long sz)
 {
-	int i;
+	int i, n;
 	unsigned long accum1, accum2, hashval;
 	struct spanhash_top *hash;
 
@@ -137,19 +137,21 @@ static struct spanhash_top *hash_chars(u
 	hash->free = INITIAL_FREE(i);
 	memset(hash->data, 0, sizeof(struct spanhash) * (1<<i));
 
-	/* an 8-byte shift register made of accum1 and accum2.  New
-	 * bytes come at LSB of accum2, and shifted up to accum1
-	 */
-	for (i = accum1 = accum2 = 0; i < 7; i++, sz--) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
-	}
+	n = 0;
+	accum1 = accum2 = 0;
 	while (sz) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
+		unsigned long c = *buf++;
+		sz--;
+		accum1 = (accum1 << 7) | (accum2 >> 25);
+		accum2 = (accum2 << 7) | (accum1 >> 25);
+		accum1 += c;
+		if (++n < 64 && c != '\n')
+			continue;
 		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
+		hashval |= (n << 24);
 		hash = add_spanhash(hash, hashval);
-		sz--;
+		n = 0;
+		accum1 = accum2 = 0;
 	}
 	return hash;
 }
@@ -166,9 +168,6 @@ int diffcore_count_changes(void *src, un
 	struct spanhash_top *src_count, *dst_count;
 	unsigned long sc, la;
 
-	if (src_size < 8 || dst_size < 8)
-		return -1;
-
 	src_count = dst_count = NULL;
 	if (src_count_p)
 		src_count = *src_count_p;
@@ -196,6 +195,8 @@ int diffcore_count_changes(void *src, un
 		src_cnt = s->cnt;
 		d = spanhash_find(dst_count, s->hashval);
 		dst_cnt = d ? d->cnt : 0;
+		dst_cnt *= (d->hashval >> 24);
+		src_cnt *= (d->hashval >> 24);
 		if (src_cnt < dst_cnt) {
 			la += dst_cnt - src_cnt;
 			sc += src_cnt;

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: git-diff-tree -M performance regression in 'next'
  2006-03-14  3:47 ` Linus Torvalds
@ 2006-03-14 10:26 ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-03-14 10:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git

Linus Torvalds <torvalds@osdl.org> writes:

> Mine is a bit less hacky than yours, I believe. It doesn't skip
> whitespace, instead it just maintains a rolling 64-bit number, where each
> character shifts it left by 7 and then adds in the new character value
> (overflow in 32 bits just ignored).

That rolling register is a good idea.

The "whitespace hack" was done to recognize certain kinds of changes
that commonly appear in source code.  For example, it will still
recognize content copies after you re-indent your code, add an
"if (...) {" and "} else { ... }" around an existing code block, or
add extra blank lines.

It is still an inadequate hack.  If you comment out a code block by
adding "#if 0" and "#endif" around it, it notices the surviving lines,
but if instead you comment out the block by prefixing "//" in front of
every line in it, neither your 64-byte-or-EOL algorithm nor my extended
line algorithm would notice the content copy anymore.

Anyway, I did a bit of comparison, and it appears that the whitespace
thing does not make much difference in practice.

> It's fast and stupid, but doesn't seem to do any worse than your old one.

Comparing "next" with your 64-byte-or-EOL and my "extended line"
algorithms on the v2.6.12..v2.6.14 test case shows:

                                  64-or-EOL   extended line
  renames identically detected        108           110
  matched differently                   2             2
  finds what "next" misses              4             4
  misses what "next" finds             23            21

What they find seems reasonable.  What they reject is sometimes
debatable.  For example, the similarity between these two files does
not seem to be noticed by either:
  v2.6.12/drivers/media/dvb/dibusb/dvb-dibusb-firmware.c
  v2.6.14/drivers/media/dvb/dvb-usb/dvb-usb-firmware.c

The "next" algorithm gives this pair a 60% score, while these two
algorithms give it 45% or so.  But they both reject this bogus
"rename" the "next" algorithm finds:

  v2.6.12/drivers/char/drm/gamma_drv.c
  v2.6.14/drivers/char/drm/via_verifier.h

("next" 51% vs 37-40% with these algorithms.)

> Anyway, I don't think something like this is really any good for rename
> detection, but it might be good for deciding whether to do a real delta.

Both algorithms seem to have non-negligible false negative rates, but
their false positive rates are reasonably low.  So we could use one of
them as a pre-filter and run the real delta only on pairs that these
quick and dirty algorithms say are too different.

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads: [~2006-03-14 10:26 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-03-11 17:28 git-diff-tree -M performance regression in 'next' Fredrik Kuivinen
2006-03-12  3:10 ` Junio C Hamano
2006-03-12 12:28 ` Junio C Hamano
2006-03-12 17:00   ` Linus Torvalds
2006-03-12 19:34     ` Junio C Hamano
2006-03-13  0:42       ` Junio C Hamano
2006-03-13  1:09         ` Linus Torvalds
2006-03-13  1:22           ` Junio C Hamano
2006-03-13  1:39             ` Linus Torvalds
2006-03-13  1:29           ` Linus Torvalds
2006-03-13  1:31           ` Linus Torvalds
2006-03-13  2:29             ` Linus Torvalds
2006-03-13  2:53               ` Linus Torvalds
2006-03-13  4:14                 ` Junio C Hamano
2006-03-14  2:55                   ` Junio C Hamano
2006-03-14  3:47                     ` Linus Torvalds
2006-03-14 10:26                       ` Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).