git.vger.kernel.org archive mirror
* git-diff-tree -M performance regression in 'next'
@ 2006-03-11 17:28 Fredrik Kuivinen
  2006-03-12  3:10 ` Junio C Hamano
  2006-03-12 12:28 ` Junio C Hamano
  0 siblings, 2 replies; 17+ messages in thread
From: Fredrik Kuivinen @ 2006-03-11 17:28 UTC (permalink / raw)
  To: git; +Cc: junkio

Hi,

I added some time logging code to git-merge-recursive to see exactly
what we spend all the time on in merges which involve many changes,
such as a merge of a slightly modified v2.6.12 and an unmodified
v2.6.15.

It turned out that the rename detection took almost 10 minutes on my
machine. More specifically,

   git-diff-tree -r -M --diff-filter=R v2.6.12 v2.6.14

takes ~9 minutes with the current 'next'.

With 65416758cd83772f2f3c69f1bd1f501608e64745, which uses the delta
code to compute the similarity measure, the above git-diff-tree
invocation takes 1.50 minutes.

- Fredrik

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: git-diff-tree -M performance regression in 'next'
  2006-03-11 17:28 git-diff-tree -M performance regression in 'next' Fredrik Kuivinen
@ 2006-03-12  3:10 ` Junio C Hamano
  2006-03-12 12:28 ` Junio C Hamano
  1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-03-12  3:10 UTC (permalink / raw)
  To: Fredrik Kuivinen; +Cc: git, junkio

Fredrik Kuivinen <freku045@student.liu.se> writes:

> It turned out that the rename detection took almost 10 minutes on my
> machine.

Yes, that is one of the reasons why it still is in "next", not
in "master".

The rename-detector change was done primarily to work around the
correctness problem the finer-grained delta changes would have
introduced.  The new delta code would have produced far more
copies from the source than the current xdelta code, but the
nature of the new copies it would have found was quite different
from what we would usually call "file being renamed".  Now that we
have decided to shelve the finer-grained delta code, I do not
see a pressing reason to have the experimental rename detector
graduate to "master" until we resolve its performance issues.


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-11 17:28 git-diff-tree -M performance regression in 'next' Fredrik Kuivinen
  2006-03-12  3:10 ` Junio C Hamano
@ 2006-03-12 12:28 ` Junio C Hamano
  2006-03-12 17:00   ` Linus Torvalds
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2006-03-12 12:28 UTC (permalink / raw)
  To: Fredrik Kuivinen; +Cc: git

Fredrik Kuivinen <freku045@student.liu.se> writes:

> It turned out that the rename detection took almost 10 minutes on my
> machine. More specifically,
>
>    git-diff-tree -r -M --diff-filter=R v2.6.12 v2.6.14
>
> takes ~9 minutes with the current 'next'.

I have some updates to "next" tonight.

On my otherwise idle Duron 750 with slow disks, I am getting
something like this:

0.99.9m : 130m virtual, 40m resident, (0major+14205minor)
          67.62user 0.08system 1:15.95elapsed
master  : 130m virtual, 40m resident, (0major+12510minor)
          66.06user 0.07system 1:10.95elapsed
"next"  : 150m virtual, 65m resident, (0major+49858minor)
          51.41user 0.45system 0:57.55elapsed

The results will _not_ match exactly, because the similarity
estimation algorithms are different.

Judging the differences objectively is a bit hard, but my
impression is that the "next" one tends to find more sensible
renames.  To name a few:

* Documentation/dvb/README.dibusb from v2.6.12 seems pretty
  similar to Documentation/dvb/README.dvb-usb from v2.6.14, and
  "next" finds them, but "master" does not.

* "master" says arch/arm/configs/omnimeter_defconfig was
  copy-edited to produce arch/arm/configs/collie_defconfig; the
  diff output shows ~350 new lines and ~270 deleted lines, while
  these files are 800-900 lines long; "next" rejects them.  I
  think this is a borderline case.

* "next" finds Kconfig and Makefile in arch/arm/mach-omap-1/ are
  derived from arch/arm/mach-omap/; manual inspection of these
  files makes me feel that decision is sensible.  "master" does
  not find them.

* Same thing for config.c in arch/m68knommu/platform/68VZ328/;
  found to be derived from arch/m68knommu/platform/68VZ328/de2/ by
  "next" but not by "master".

* Other examples "next" finds but "master" misses include:

  arch/um/kernel/process.c	arch/um/os-Linux/start_up.c
  arch/um/kernel/tt/unmap.c	arch/um/sys-x86_64/unmap.c
  drivers/ide/cris/ide-v10.c	drivers/ide/cris/ide-cris.c
  include/asm-ppc/cputime.h	include/asm-xtensa/cputime.h
  include/asm-ppc64/ioctl.h	include/asm-xtensa/ioctl.h
  include/asm-ppc64/ioctls.h	include/asm-xtensa/ioctls.h
  include/asm-ppc64/mman.h	include/asm-xtensa/mman.h
  include/asm-ppc64/poll.h	include/asm-xtensa/poll.h
  include/asm-ppc64/shmbuf.h	include/asm-xtensa/shmbuf.h
  include/asm-ppc64/socket.h	include/asm-xtensa/socket.h
  include/asm-ppc64/termbits.h	include/asm-xtensa/termbits.h

* The "next" one is not perfect.  There are some quite bad
  choices made by it:

  include/asm-ppc64/timex.h	include/asm-powerpc/bugs.h
  include/asm-ppc64/iSeries/LparData.h	include/linux/i2c-isa.h


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-12 12:28 ` Junio C Hamano
@ 2006-03-12 17:00   ` Linus Torvalds
  2006-03-12 19:34     ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2006-03-12 17:00 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git



On Sun, 12 Mar 2006, Junio C Hamano wrote:
> 
> On my otherwise idle Duron 750 with slow disks, I am getting
> something like this:
> 
> 0.99.9m : 130m virtual, 40m resident, (0major+14205minor)
>           67.62user 0.08system 1:15.95elapsed
> master  : 130m virtual, 40m resident, (0major+12510minor)
>           66.06user 0.07system 1:10.95elapsed
> "next"  : 150m virtual, 65m resident, (0major+49858minor)
>           51.41user 0.45system 0:57.55elapsed

Any way to fix that "4 times as many page misses, and 70% bigger rss"
thing?  It looks like you're not very careful about your memory use.

I realize that git in general wants a lot of memory, but I see that as a 
failure most of the time. I've got 2GB in most of my machines, but I 
shouldn't _need_ to have it..

			Linus


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-12 17:00   ` Linus Torvalds
@ 2006-03-12 19:34     ` Junio C Hamano
  2006-03-13  0:42       ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2006-03-12 19:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git

Linus Torvalds <torvalds@osdl.org> writes:

> On Sun, 12 Mar 2006, Junio C Hamano wrote:
>> 
>> master  : 130m virtual, 40m resident, (0major+12510minor)
>>           66.06user 0.07system 1:10.95elapsed
>> "next"  : 150m virtual, 65m resident, (0major+49858minor)
>           51.41user 0.45system 0:57.55elapsed
>
> Any way to fix that "4 times as many page misses, and 70% bigger rss"
> thing?  It looks like you're not very careful about your memory use.

I started from "80 times as many page misses and 5 times bigger
rss", shrank it down to the above by being more careful with
memory use, ran out of ideas to shrink it further, and pushed
it out.  So...


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-12 19:34     ` Junio C Hamano
@ 2006-03-13  0:42       ` Junio C Hamano
  2006-03-13  1:09         ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2006-03-13  0:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git

> Linus Torvalds <torvalds@osdl.org> writes:
>
>> On Sun, 12 Mar 2006, Junio C Hamano wrote:
>>> 
>>> master  : 130m virtual, 40m resident, (0major+12510minor)
>>>           66.06user 0.07system 1:10.95elapsed
>>> "next"  : 150m virtual, 65m resident, (0major+49858minor)
>>           51.41user 0.45system 0:57.55elapsed
>>
>> Any way to fix that "4 times as many page misses, and 70% bigger rss"
>> thing?  It looks like you're not very careful about your memory use.

"this"  : 145m virtual, 57m resident, (0major+18855minor)
          39.81user 0.28system 0:42.16elapsed

50% more page misses, 45% bigger rss, 65% less usertime.
Slowly getting there...

-- >8 --
[PATCH] diffcore-delta: make the hash a bit denser.

To reduce wasted memory, wait until the hash fills up more
densely before we rehash.  This reduces the working set size a
bit further.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 diffcore-delta.c  |   13 +++++++++----
 diffcore-rename.c |    4 +++-
 2 files changed, 12 insertions(+), 5 deletions(-)

af0b459589edaa77c51a892dd7dc44329634d253
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 471b98f..f8a7518 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -25,8 +25,12 @@
  */
 
 /* Wild guess at the initial hash size */
-#define INITIAL_HASH_SIZE 10
+#define INITIAL_HASH_SIZE 9
 #define HASHBASE 65537 /* next_prime(2^16) */
+/* Leave more room in a smaller hash, but do not let a larger
+ * one grow with too much unused space.
+ */
+#define INITIAL_FREE(sz_log2) ((1<<(sz_log2))*(sz_log2-3)/(sz_log2))
 
 struct spanhash {
 	unsigned long hashval;
@@ -38,7 +42,8 @@ struct spanhash_top {
 	struct spanhash data[FLEX_ARRAY];
 };
 
-static struct spanhash *spanhash_find(struct spanhash_top *top, unsigned long hashval)
+static struct spanhash *spanhash_find(struct spanhash_top *top,
+				      unsigned long hashval)
 {
 	int sz = 1 << top->alloc_log2;
 	int bucket = hashval & (sz - 1);
@@ -62,7 +67,7 @@ static struct spanhash_top *spanhash_reh
 
 	new = xmalloc(sizeof(*orig) + sizeof(struct spanhash) * sz);
 	new->alloc_log2 = orig->alloc_log2 + 1;
-	new->free = osz;
+	new->free = INITIAL_FREE(new->alloc_log2);
 	memset(new->data, 0, sizeof(struct spanhash) * sz);
 	for (i = 0; i < osz; i++) {
 		struct spanhash *o = &(orig->data[i]);
@@ -122,7 +127,7 @@ static struct spanhash_top *hash_chars(u
 	i = INITIAL_HASH_SIZE;
 	hash = xmalloc(sizeof(*hash) + sizeof(struct spanhash) * (1<<i));
 	hash->alloc_log2 = i;
-	hash->free = (1<<i)/2;
+	hash->free = INITIAL_FREE(i);
 	memset(hash->data, 0, sizeof(struct spanhash) * (1<<i));
 
 	/* an 8-byte shift register made of accum1 and accum2.  New
diff --git a/diffcore-rename.c b/diffcore-rename.c
index b80b432..8380049 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -307,8 +307,10 @@ void diffcore_rename(struct diff_options
 			m->score = estimate_similarity(one, two,
 						       minimum_score);
 		}
+		/* We do not need the text anymore */
 		free(two->cnt_data);
-		two->cnt_data = NULL;
+		free(two->data);
+		two->data = two->cnt_data = NULL;
 		dst_cnt++;
 	}
 	/* cost matrix sorted by most to least similar pair */
-- 
1.2.4.g3dcf


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  0:42       ` Junio C Hamano
@ 2006-03-13  1:09         ` Linus Torvalds
  2006-03-13  1:22           ` Junio C Hamano
  2006-03-13  1:29           ` Linus Torvalds
  0 siblings, 2 replies; 17+ messages in thread
From: Linus Torvalds @ 2006-03-13  1:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git



On Sun, 12 Mar 2006, Junio C Hamano wrote:
> 
> To reduce wasted memory, wait until the hash fills up more
> densely before we rehash.  This reduces the working set size a
> bit further.

Umm. Why do you rehash at all?

Just take the size of the "src" file as the initial hash size. 

Also, I think it is likely really wasteful to try to actually hash at 
_each_ character. Instead, let's say that the chunk-size is 8 bytes (like 
you do now), and let's say that you have a 32-bit good hash of those 8 
bytes. What you can do is:

 - for each 8 bytes in the source, hash those 8 bytes (not every byte)
 - for each byte in the other file, hash 8 next bytes. IF it matches a 
   hash in the source with a non-zero count, subtract the count for that 
   hash and move up by _eight_ characters! If it doesn't, add one to 
   "characters not matched" counter, and move up _one_ character, and try 
   again.

At the end of this, you have two counts: the count of characters that you 
couldn't match in the other file, and the count of 8-byte hash-chunks that 
you couldn't match in the first one. Use those two counts to decide if 
it's close or not.

Especially for good matches, this should basically cut your work into an
eighth of what you do now.

Actually, even for bad matches, you cut the first source overhead to one
eighth (the second file will obviously do the "update by 1 byte" most of
the time).

Don't you think that would be as accurate as what you're doing now (it's 
the same basic notion, after all), and noticeably faster?
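A minimal C sketch of the chunk-matching scheme described above (the names, the toy table size, and the placeholder hash are all illustrative choices of this sketch, not from any git code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy open-addressing multiset of 8-byte chunk values; a sketch,
 * sized for small inputs only. */
#define TABLE_SIZE 1024

struct chunkset {
	uint64_t key[TABLE_SIZE];
	int cnt[TABLE_SIZE];
};

static uint64_t chunk_at(const unsigned char *p)
{
	uint64_t v = 0;
	int i;
	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

static int *chunk_slot(struct chunkset *s, uint64_t key)
{
	unsigned b = (unsigned)(key % 1000003u) & (TABLE_SIZE - 1);
	while (s->cnt[b] && s->key[b] != key)
		b = (b + 1) & (TABLE_SIZE - 1);
	s->key[b] = key;
	return &s->cnt[b];
}

/* Count dst bytes that cannot be matched by an 8-byte chunk of src:
 * hash src only at 8-byte boundaries; walk dst a byte at a time,
 * jumping 8 bytes whenever a chunk matches. */
long count_unmatched(const unsigned char *src, size_t src_len,
		     const unsigned char *dst, size_t dst_len)
{
	static struct chunkset set;
	long unmatched = 0;
	size_t i;

	memset(&set, 0, sizeof(set));
	for (i = 0; i + 8 <= src_len; i += 8)
		(*chunk_slot(&set, chunk_at(src + i)))++;

	i = 0;
	while (i + 8 <= dst_len) {
		int *c = chunk_slot(&set, chunk_at(dst + i));
		if (*c > 0) {		/* seen in src: consume it, skip 8 */
			(*c)--;
			i += 8;
		} else {		/* no match: 1 unmatched byte */
			unmatched++;
			i++;
		}
	}
	return unmatched + (long)(dst_len - i);	/* short tail */
}
```

For identical buffers the unmatched count is zero; prepending a few bytes to the destination costs one unmatched byte per shifted position until the chunks realign.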

			Linus


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  1:09         ` Linus Torvalds
@ 2006-03-13  1:22           ` Junio C Hamano
  2006-03-13  1:39             ` Linus Torvalds
  2006-03-13  1:29           ` Linus Torvalds
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2006-03-13  1:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git

Linus Torvalds <torvalds@osdl.org> writes:

> Umm. Why do you rehash at all?
>
> Just take the size of the "src" file as the initial hash size. 

The code uses close to a 16-bit hash, and I had a 65k-entry flat
array as a hashtable.  That is what you commented on as "4 times
as many page misses".

Interestingly enough, that kind of flat array representation
seems to be too sparse and gives very bad performance behaviour.

The improvement I mentioned in the message you are replying to
is the result of making it into a smaller (starting at (1<<9) or
so) linear-overflowing hash.

The latter suggestion I need to think about a bit more.


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  1:09         ` Linus Torvalds
  2006-03-13  1:22           ` Junio C Hamano
@ 2006-03-13  1:29           ` Linus Torvalds
  2006-03-13  1:31             ` Linus Torvalds
                               ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Linus Torvalds @ 2006-03-13  1:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git



On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> Also, I think it is likely really wasteful to try to actually hash at 
> _each_ character. Instead, let's say that the chunk-size is 8 bytes (like 
> you do now), and let's say that you have a 32-bit good hash of those 8 
> bytes. What you can do is:

Side note: regardless, your new algorithm clearly does seem faster.

However, it worries me a bit that you don't check the source strings, 
especially since the hash you use seems somewhat suspect (why limit it to 
essentially just 16 bits? Wouldn't it be best to have the _biggest_ prime 
that fits in the "hashval" thing, which is at least 32 bits? Also, 
shouldn't you make that spanhash thing use a "unsigned int" instead of 
"unsigned long"?)

So I would suggest instead the hash function to be:

	typedef unsigned long long u64;

	/* Biggest prime in 32 bits */
	#define HASHVAL (4294967291u)


	u64 value = *(u64 *)src;
	src += 8;
	hash = value % 4294967291u;

which does a 64-bit modulus, but hey, 64-bit hw isn't _that_ uncommon any 
more, and it's not _excessively_ slow on x86 (gcc doesn't generate good 
code, but we could actually use the kernel "do_div()" routine for much 
faster division of 64 % 32 than what gcc can do - since the dividend is 
32-bit, you actually only need to do one 32/32 division and one 64/32 
division, so the optimized hash function on a good x86 will be just in 
the teens of cycles for the 64-bit modulus).

		Linus


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  1:29           ` Linus Torvalds
@ 2006-03-13  1:31             ` Linus Torvalds
  2006-03-13  2:29             ` Linus Torvalds
  2006-03-13  4:14             ` Junio C Hamano
  2 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2006-03-13  1:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git



On Sun, 12 Mar 2006, Linus Torvalds wrote:
>
> 	u64 value = *(u64 *)src;
> 	src += 8;
> 	hash = value % 4294967291u;

Btw, this assumes the "only hash every 8 bytes" at the source, in which 
case this is ok even on architectures that need aligned reads. For the 
non-aligned reads, you'd need your "shift the value" approach.

		Linus


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  1:22           ` Junio C Hamano
@ 2006-03-13  1:39             ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2006-03-13  1:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git



On Sun, 12 Mar 2006, Junio C Hamano wrote:
> 
> The code uses close to a 16-bit hash, and I had a 65k-entry flat
> array as a hashtable.  That is what you commented on as "4 times
> as many page misses".

Ahh. That explains the limited bits in the hash function too. I only 
looked at the current sources, not at the historic ones.

Btw, the page misses may come from the fact that you allocated and 
re-allocated the flat array all the time. That can be very expensive for 
big allocations, since most libraries may decide that it's a big enough 
area that it should be map/unmap'ed in order to give memory back to the 
system (without realizing that there's another allocation coming soon 
afterwards of the same size).

If you map/unmap, the kernel will end up having to not just use new pages, 
but obviously also clear them for security reasons. So it ends up sucking 
on many levels. In contrast, if you just have a 64k-entry array of "int", 
and allocate it _once_ (instead of once per file-pair) you'll still have 
to clear it in between file-pairs, but at least you won't have the 
overhead of mapping/unmapping it.

The clearing can still be pretty expensive (64k "int" entries is 256kB, 
and since most _files_ are just in the ~4-8kB range, you're spending a 
_lot_ of time just memset'ing). Which is why it's probably a good idea to 
instead default to having just "filesize / 8" entries, but then you can 
obviously not use the hash as the index any more.
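As a sketch, the allocate-once pattern described above looks something like this (the names and the 64k sizing are illustrative, not from git's sources):

```c
#include <stdlib.h>
#include <string.h>

#define HASH_ENTRIES 65536	/* one counter per 16-bit hash value */

static unsigned int *counts;

/* Allocate the flat table once for the whole rename pass instead of
 * once per file pair, so the C library never unmaps the pages and the
 * kernel never has to fault in and clear fresh ones. */
void counts_init(void)
{
	if (!counts)
		counts = malloc(HASH_ENTRIES * sizeof(*counts));
}

/* Still costs a 256kB memset per file pair -- the remaining expense
 * that motivates sizing the table by filesize / 8 instead. */
void counts_reset(void)
{
	memset(counts, 0, HASH_ENTRIES * sizeof(*counts));
}
```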

		Linus


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  1:29           ` Linus Torvalds
  2006-03-13  1:31             ` Linus Torvalds
@ 2006-03-13  2:29             ` Linus Torvalds
  2006-03-13  2:53               ` Linus Torvalds
  2006-03-13  4:14             ` Junio C Hamano
  2 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2006-03-13  2:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git



On Sun, 12 Mar 2006, Linus Torvalds wrote:
> 
> So I would suggest instead the hash function to be:
> 
> 	typedef unsigned long long u64;
> 
> 	/* Biggest prime in 32 bits */
> 	#define HASHVAL (4294967291u)
> 
> 
> 	u64 value = *(u64 *)src;
> 	src += 8;
> 	hash = value % 4294967291u;
> 
> which does a 64-bit modulus, but hey, 64-bit hw isn't _that_ uncommon any 
> more, and it's not _excessively_ slow on x86 (gcc doesn't generate good 
> code, but we could actually use the kernel "do_div()" routine for much 
> faster division of 64 % 32 than what gcc can do - since the dividend is 
> 32-bit, you actually only need to do one 32/32 division and one 64/32 
> division, so the optimized hash function on a good x86 will be just in 
> the teens of cycles for the 64-bit modulus).

Actually, on x86, where we can do the 64-by-32 division with a single 
instruction, this seems to be the best possible code:

	#define HASH_VAL (4294967291u)

	static inline unsigned int hash64x32(unsigned long long a)
	{
		unsigned int high, low;
		unsigned int modulus;

		asm(""
			:"=a" (low), "=d" (high)
			:"A" (a));
		if (high > HASH_VAL)
			high -= HASH_VAL;
		asm("divl %2"
			:"=a" (low), "=d" (modulus)
			:"rm" (HASH_VAL), "0" (low), "1" (high));
		return modulus;
	}

at least as far as I can think.

The magic is the reduction of the high 32 bits: for the general case you
want a modulus for that reduction, but since we're dividing with a
really big value, the modulus of the high bits really does end up
becoming just that simple

	if (high > HASH_VAL)
		high -= HASH_VAL;

and then we just need to do a single "divl" instruction.

So if you want a 32-bit hash of an 8-byte area, this should be a pretty
good and fast way to calculate it. 

Of course, maybe just doing an adler32() is simpler/better.  But with
the above, at least on x86, I suspect you get an even better
distribution at a lower cost (of course, different cores do differently
well on large divisions, but integer division being pretty important for
some codes, modern cores actually tend to be pretty good at it). 
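A portable C sketch of the same computation, handy for checking the asm version against; the function name follows the snippet above, and the ">=" comparison anticipates the correction made later in the thread:

```c
#include <stdint.h>

#define HASH_VAL 4294967291u	/* 2^32 - 5, largest prime below 2^32 */

/* Portable version of the inline-asm idea: subtracting HASH_VAL from
 * the high word subtracts HASH_VAL * 2^32 from the value, which is
 * 0 mod HASH_VAL, so the residue is unchanged -- and afterwards the
 * 64/32 division has a quotient that fits in 32 bits (on x86 that
 * would be the single "divl"). */
unsigned int hash64x32(uint64_t a)
{
	uint32_t high = (uint32_t)(a >> 32);
	uint32_t low = (uint32_t)a;

	if (high >= HASH_VAL)
		high -= HASH_VAL;
	return (unsigned int)((((uint64_t)high << 32) | low) % HASH_VAL);
}
```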

			Linus


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  2:29             ` Linus Torvalds
@ 2006-03-13  2:53               ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2006-03-13  2:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git



On Sun, 12 Mar 2006, Linus Torvalds wrote:
>
> 		if (high > HASH_VAL)
> 			high -= HASH_VAL;

That should be ">= HASH_VAL", of course. I'm a retard. 

		Linus


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  1:29           ` Linus Torvalds
  2006-03-13  1:31             ` Linus Torvalds
  2006-03-13  2:29             ` Linus Torvalds
@ 2006-03-13  4:14             ` Junio C Hamano
  2006-03-14  2:55               ` Junio C Hamano
  2 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2006-03-13  4:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git

Linus Torvalds <torvalds@osdl.org> writes:

> However, it worries me a bit that you don't check the source strings, 
> especially since the hash you use seems somewhat suspect (why limit it to 
> essentially just 16 bits? Wouldn't it be best to have the _biggest_ prime 
> that fits in the "hashval" thing, which is at least 32 bits? Also, 
> shouldn't you make that spanhash thing use a "unsigned int" instead of 
> "unsigned long"?)
>
> So I would suggest instead the hash function to be:
>
> 	typedef unsigned long long u64;
>
> 	/* Biggest prime in 32 bits */
> 	#define HASHVAL (4294967291u)

Actually what you are seeing is the best compromise I've come up
with.  There were about a dozen crappier variants that turned
out to be suboptimal, which I threw away.

The hashsize is an interesting example of such a compromise
between precision and performance.  It is true that a full
32-bit hash gives us more precise match detection.  If you
change the current hash function to divide by (4294967291u), the
rename similarity score assigned to some (but not all) paths in
the example dataset we have been using (diff between v2.6.12 and
v2.6.14) decreases by about 1% -- this comes from the fact that
the hash value capped to 16 bits gives some false hits that a
larger hashbase value can tell apart.  Because of it, some paths
near the rename detection threshold stop being recognized as
renames.

However, the runtime of the same dataset increases quite a bit
when we do this.  I think this is because we need to keep more
different hash values in the hashtable and the cost to randomly
access into a huge table starts to hurt, unlike the 16-bit
capped version whose hash table never needs to grow beyond 65k
entries.

        next    39.77user 0.22system 0:40.78elapsed
                0inputs+0outputs (0major+18855minor)

        32-bit  66.32user 0.15system 1:07.00elapsed
                0inputs+0outputs (0major+21057minor)

Admittedly, we should not optimize for one particular test case,
but we should start from somewhere.

Decreasing the hashsize to 14 bits or less caused noticeable
degradation in the quality of renames the algorithm detects and
misses, both in the v2.6.12..v2.6.14 test and in some tests in
the git.git repository.

One obvious change we could make is to set the hashsize to
0x1800D (a prime halfway between 16 and 17 bits); currently the
hashtable doubles every time the table of the current size fills
up sufficiently, but the current hashbase cannot fit in 16 bits,
so we _are_ already using a table with 128K entries in some
cases.


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-13  4:14             ` Junio C Hamano
@ 2006-03-14  2:55               ` Junio C Hamano
  2006-03-14  3:47                 ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2006-03-14  2:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git

While we are hacking away with weird ideas...

Here is a still-WIP but insanely fast one (actually this is
a modified version of what once was in next).  I haven't
fully verified the sanity of its output, but from a cursory look
the renames it finds look sensible.  The same v2.6.12..v2.6.14
test on my Duron 750:

        master  64.65user 0.17system 1:05.42elapsed
                0inputs+0outputs (0major+12511minor)

        next    40.69user 0.14system 0:40.98elapsed
                0inputs+0outputs (0major+19471minor)

        "this"  5.59user 0.09system 0:05.68elapsed
                0inputs+0outputs (0major+13015minor)

The hash used here is heavily optimized for handling text files
and nothing else.  Actually, it punts on a file that contains a
NUL byte.  The hash is computed by first skipping sequences of
whitespace characters (including LF); upon seeing a
non-whitespace character, we start hashing, still ignoring
whitespace, until we hit the next LF (or EOF).  Then we store
the real number of bytes along with the hash.

When we find the matching hash value in the destination, we say
that many bytes (including the whitespace we ignored while
hashing) were copied.
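A self-contained sketch of that extended-line hash, mirroring the patch below but with the binary (NUL) punt left out for brevity:

```c
#include <stddef.h>

/* One "extended line": skip leading whitespace (anything <= ' ',
 * including LF), then hash the bytes up to the next LF or end of
 * buffer, ignoring embedded whitespace.  *len_p gets the raw number
 * of bytes consumed, whitespace included.  (The real patch also
 * punts to binary handling on a NUL byte; omitted here.) */
unsigned long hash_extended_line(const unsigned char *buf, size_t left,
				 size_t *len_p)
{
	const unsigned char *p = buf, *end = buf + left;
	unsigned long hashval = 0;

	while (p < end && *p <= ' ')
		p++;
	for (; p < end && *p != '\n'; p++)
		if (*p > ' ')
			hashval = hashval * 11 + *p;
	if (p < end)
		p++;	/* consume the terminating LF */
	*len_p = p - buf;
	return hashval;
}
```

Note how "  foo\n" and "f o o\n" hash identically while still reporting their real byte counts, which is exactly what lets the similarity estimate charge the whitespace to the copied total.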

The patch should apply on top of the current "next".

---

diff --git a/diffcore-delta.c b/diffcore-delta.c
index 835d82c..0f4866e 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -3,25 +3,10 @@
 #include "diffcore.h"
 
 /*
- * Idea here is very simple.
- *
- * We have total of (sz-N+1) N-byte overlapping sequences in buf whose
- * size is sz.  If the same N-byte sequence appears in both source and
- * destination, we say the byte that starts that sequence is shared
- * between them (i.e. copied from source to destination).
- *
- * For each possible N-byte sequence, if the source buffer has more
- * instances of it than the destination buffer, that means the
- * difference are the number of bytes not copied from source to
- * destination.  If the counts are the same, everything was copied
- * from source to destination.  If the destination has more,
- * everything was copied, and destination added more.
- *
- * We are doing an approximation so we do not really have to waste
- * memory by actually storing the sequence.  We just hash them into
- * somewhere around 2^16 hashbuckets and count the occurrences.
- *
- * The length of the sequence is arbitrarily set to 8 for now.
+ * Record the hashes for "extended lines" in both source and destination,
+ * and compare how similar they are.  "Extended lines" hash is designed
+ * to work well on text files -- leading whitespaces and tabs, and consecutive
+ * LF characters are effectively ignored.
  */
 
 /* Wild guess at the initial hash size */
@@ -40,8 +25,9 @@
 #define HASHBASE 107927
 
 struct spanhash {
-	unsigned long hashval;
-	unsigned long cnt;
+	unsigned long hashval; /* hash for the line */
+	unsigned bytes; /* real number of bytes in such a line */
+	unsigned long cnt; /* occurrences */
 };
 struct spanhash_top {
 	int alloc_log2;
@@ -87,6 +73,7 @@ static struct spanhash_top *spanhash_reh
 			if (!h->cnt) {
 				h->hashval = o->hashval;
 				h->cnt = o->cnt;
+				h->bytes = o->bytes;
 				new->free--;
 				break;
 			}
@@ -99,7 +86,8 @@ static struct spanhash_top *spanhash_reh
 }
 
 static struct spanhash_top *add_spanhash(struct spanhash_top *top,
-					 unsigned long hashval)
+					 unsigned long hashval,
+					 unsigned bytes)
 {
 	int bucket, lim;
 	struct spanhash *h;
@@ -110,6 +98,7 @@ static struct spanhash_top *add_spanhash
 		h = &(top->data[bucket++]);
 		if (!h->cnt) {
 			h->hashval = hashval;
+			h->bytes = bytes;
 			h->cnt = 1;
 			top->free--;
 			if (top->free < 0)
@@ -117,6 +106,14 @@ static struct spanhash_top *add_spanhash
 			return top;
 		}
 		if (h->hashval == hashval) {
+			if (h->bytes != bytes) {
+				/* avg of h->cnt instances were h->bytes
+				 * now we are adding bytes
+				 */
+				h->bytes = ((h->cnt / 2 + bytes +
+					     h->cnt * h->bytes) /
+					    (h->cnt + 1));
+			}
 			h->cnt++;
 			return top;
 		}
@@ -125,11 +122,49 @@ static struct spanhash_top *add_spanhash
 	}
 }
 
-static struct spanhash_top *hash_chars(unsigned char *buf, unsigned long sz)
+static unsigned long hash_extended_line(const unsigned char **buf_p,
+					unsigned long left)
+{
+	/* An extended line is zero or more whitespace characters (including
+	 * LF) followed by one non-whitespace character, followed by zero or
+	 * more non-LF bytes, terminated by an LF (or EOF).
+	 */
+	const unsigned char *bol = *buf_p;
+	const unsigned char *buf = bol;
+	unsigned long hashval = 0;
+	while (left) {
+		unsigned c = *buf++;
+		if (!c)
+			goto binary;
+		left--;
+		if (' ' < c) {
+			hashval = c;
+			break;
+		}
+	}
+	while (left) {
+		unsigned c = *buf++;
+		if (!c)
+			goto binary;
+		left--;
+		if (c == '\n')
+			break;
+		if (' ' < c)
+			hashval = hashval * 11 + c;
+	}
+	*buf_p = buf;
+	return hashval;
+
+ binary:
+	*buf_p = NULL;
+	return 0;
+}
+
+static struct spanhash_top *hash_lines(const unsigned char *buf, unsigned long sz)
 {
 	int i;
-	unsigned long accum1, accum2, hashval;
 	struct spanhash_top *hash;
+	const unsigned char *eobuf = buf + sz;
 
 	i = INITIAL_HASH_SIZE;
 	hash = xmalloc(sizeof(*hash) + sizeof(struct spanhash) * (1<<i));
@@ -137,19 +172,14 @@ static struct spanhash_top *hash_chars(u
 	hash->free = INITIAL_FREE(i);
 	memset(hash->data, 0, sizeof(struct spanhash) * (1<<i));
 
-	/* an 8-byte shift register made of accum1 and accum2.  New
-	 * bytes come at LSB of accum2, and shifted up to accum1
-	 */
-	for (i = accum1 = accum2 = 0; i < 7; i++, sz--) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
-	}
-	while (sz) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
-		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
-		hash = add_spanhash(hash, hashval);
-		sz--;
+	while (buf < eobuf) {
+		const unsigned char *ptr = buf;
+		unsigned long hashval = hash_extended_line(&buf, eobuf-ptr);
+		if (!buf) {
+			free(hash);
+			return NULL;
+		}
+		hash = add_spanhash(hash, hashval, buf-ptr);
 	}
 	return hash;
 }
@@ -166,21 +196,18 @@ int diffcore_count_changes(void *src, un
 	struct spanhash_top *src_count, *dst_count;
 	unsigned long sc, la;
 
-	if (src_size < 8 || dst_size < 8)
-		return -1;
-
 	src_count = dst_count = NULL;
 	if (src_count_p)
 		src_count = *src_count_p;
 	if (!src_count) {
-		src_count = hash_chars(src, src_size);
+		src_count = hash_lines(src, src_size);
 		if (src_count_p)
 			*src_count_p = src_count;
 	}
 	if (dst_count_p)
 		dst_count = *dst_count_p;
 	if (!dst_count) {
-		dst_count = hash_chars(dst, dst_size);
+		dst_count = hash_lines(dst, dst_size);
 		if (dst_count_p)
 			*dst_count_p = dst_count;
 	}
@@ -193,9 +220,9 @@ int diffcore_count_changes(void *src, un
 		unsigned dst_cnt, src_cnt;
 		if (!s->cnt)
 			continue;
-		src_cnt = s->cnt;
 		d = spanhash_find(dst_count, s->hashval);
-		dst_cnt = d ? d->cnt : 0;
+		src_cnt = s->cnt * s->bytes;
+		dst_cnt = d ? (d->cnt * d->bytes) : 0;
 		if (src_cnt < dst_cnt) {
 			la += dst_cnt - src_cnt;
 			sc += src_cnt;


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-14  2:55               ` Junio C Hamano
@ 2006-03-14  3:47                 ` Linus Torvalds
  2006-03-14 10:26                   ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2006-03-14  3:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, git



On Mon, 13 Mar 2006, Junio C Hamano wrote:
> 
> Here is still a WIP but an insanely fast one (actually this is
> a modified version of what once was in next).  I haven't
> verified the sanity of its output fully, but from a cursory look
> what are found look sensible.  The same v2.6.12..v2.6.14 test on
> my Duron 750:

Heh. I did something similar, except I wanted mine to work with binary 
data too. Not that I know how _well_ it works, but assuming you have 
_some_ '\n' characters to fix up offset mismatches, it might do something.

Mine is a bit less hacky than yours, I believe. It doesn't skip 
whitespace, instead it just maintains a rolling 64-bit number, where each 
character shifts it left by 7 and then adds in the new character value 
(overflow in 32 bits just ignored).

Then it uses your old hash function, except it hides the length in the top 
byte.

It breaks the hashing on '\n' or on hitting a 64-byte sequence, whichever 
comes first.

It's fast and stupid, but doesn't seem to do any worse than your old one. 
The speed comes from the fact that it only does the hash comparisons at 
the "block boundaries", not at every byte.

Anyway, I don't think something like this is really any good for rename 
detection, but it might be good for deciding whether to do a real delta.

		Linus

----
diff --git a/diffcore-delta.c b/diffcore-delta.c
index 835d82c..4c6e512 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -127,7 +127,7 @@ static struct spanhash_top *add_spanhash
 
 static struct spanhash_top *hash_chars(unsigned char *buf, unsigned long sz)
 {
-	int i;
+	int i, n;
 	unsigned long accum1, accum2, hashval;
 	struct spanhash_top *hash;
 
@@ -137,19 +137,21 @@ static struct spanhash_top *hash_chars(u
 	hash->free = INITIAL_FREE(i);
 	memset(hash->data, 0, sizeof(struct spanhash) * (1<<i));
 
-	/* an 8-byte shift register made of accum1 and accum2.  New
-	 * bytes come at LSB of accum2, and shifted up to accum1
-	 */
-	for (i = accum1 = accum2 = 0; i < 7; i++, sz--) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
-	}
+	n = 0;
+	accum1 = accum2 = 0;
 	while (sz) {
-		accum1 = (accum1 << 8) | (accum2 >> 24);
-		accum2 = (accum2 << 8) | *buf++;
+		unsigned long c = *buf++;
+		sz--;
+		accum1 = (accum1 << 7) | (accum2 >> 25);
+		accum2 = (accum2 << 7) | (accum1 >> 25);
+		accum1 += c;
+		if (++n < 64 && c != '\n')
+			continue;
 		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
+		hashval |= (n << 24);
 		hash = add_spanhash(hash, hashval);
-		sz--;
+		n = 0;
+		accum1 = accum2 = 0;
 	}
 	return hash;
 }
@@ -166,9 +168,6 @@ int diffcore_count_changes(void *src, un
 	struct spanhash_top *src_count, *dst_count;
 	unsigned long sc, la;
 
-	if (src_size < 8 || dst_size < 8)
-		return -1;
-
 	src_count = dst_count = NULL;
 	if (src_count_p)
 		src_count = *src_count_p;
@@ -196,6 +195,8 @@ int diffcore_count_changes(void *src, un
 		src_cnt = s->cnt;
 		d = spanhash_find(dst_count, s->hashval);
 		dst_cnt = d ? d->cnt : 0;
+		dst_cnt *= d ? (d->hashval >> 24) : 1;
+		src_cnt *= (s->hashval >> 24);
 		if (src_cnt < dst_cnt) {
 			la += dst_cnt - src_cnt;
 			sc += src_cnt;


* Re: git-diff-tree -M performance regression in 'next'
  2006-03-14  3:47                 ` Linus Torvalds
@ 2006-03-14 10:26                   ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-03-14 10:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, git

Linus Torvalds <torvalds@osdl.org> writes:

> Mine is a bit less hacky than yours, I believe. It doesn't skip 
> whitespace, instead it just maintains a rolling 64-bit number, where each 
> character shifts it left by 7 and then adds in the new character value 
> (overflow in 32 bits just ignored).

That rolling register is a good idea.  The "whitespace hack" was
done to recognize certain kind of changes that commonly appear
in source code.  For example, it will still recognize content
copies after you re-indent your code, or add an "if (...) {" and
"} else { ... }" around an existing code block, or add extra
blank lines.

It is still an inadequate hack.  If you comment out a code block
by adding "#if 0" and "#endif" around it, it notices the
surviving lines, but if instead you comment out a block by
prefixing "//" in front of every line in the block, neither your
64-byte-or-EOL nor my extended-line algorithm would notice the
content copy anymore.

Anyway, I did a bit of comparison and it appears that the
whitespace thing does not make much difference in practice.

> It's fast and stupid, but doesn't seem to do any worse than your old one. 

Comparing the "next" with your 64-byte-or-EOL and "extended
line" on the v2.6.12..v2.6.14 test case shows:

				64-or-EOL	extended line
renames identically detected	108		110
matched differently		2		2
finds what "next" misses	4		4
misses what "next" finds	23		21

What they find seem reasonable.  What they reject are sometimes
debatable.  For example, similarity between these two files does
not seem to be noticed by either.

        v2.6.12/drivers/media/dvb/dibusb/dvb-dibusb-firmware.c
        v2.6.14/drivers/media/dvb/dvb-usb/dvb-usb-firmware.c

The "next" algorithm gives a 60% score to this pair, while these
two give 45% or so.

But they both reject this bogus "rename" that the "next"
algorithm finds:

	v2.6.12/drivers/char/drm/gamma_drv.c
	v2.6.14/drivers/char/drm/via_verifier.h

("next" 51% vs 37-40% with these algorithms). 

> Anyway, I don't think something like this is really any good for rename 
> detection, but it might be good for deciding whether to do a real delta.

Both algorithms seem to have non-negligible false-negative
rates, but their false-positive rates are reasonably low.  So we
could use these as a pre-filter and use real delta on pairs that
these quick and dirty algorithms say are too different.

