git-revert is a memory hog

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* git-revert is a memory hog
@ 2008-01-27 17:27 Adrian Bunk
  2008-01-27 17:38 ` Shawn O. Pearce
  2008-01-28  5:59 ` Jeff King
  0 siblings, 2 replies; 24+ messages in thread
From: Adrian Bunk @ 2008-01-27 17:27 UTC (permalink / raw)
  To: git

I'm not sure whether this is already known, but when recently working 
for some time from a computer with "only" 512 MB RAM I ran into the huge 
memory usage of git-revert when it tries to revert old commits.

Example (in Linus' kernel tree with git 1.5.3.8):

<--  snip  -->

$ git-revert d19fbe8a7
Auto-merged drivers/input/input.c
CONFLICT (content): Merge conflict in drivers/input/input.c
Auto-merged include/linux/input.h
CONFLICT (content): Merge conflict in include/linux/input.h
Automatic revert failed.  After resolving the conflicts,
mark the corrected paths with 'git add <paths>' and commit the result.
$ 

<--  snip  -->

In top you can see that this took > 800 MB of RAM !

I don't know how easy it would be to implement, but shouldn't git-revert 
be able to be as fast and less memory consuming as
  git-show d19fbe8a7 | patch -p1 -R
?

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-27 17:27 git-revert is a memory hog Adrian Bunk
@ 2008-01-27 17:38 ` Shawn O. Pearce
  2008-01-28  6:01   ` Jeff King
  2008-01-28  5:59 ` Jeff King
  1 sibling, 1 reply; 24+ messages in thread
From: Shawn O. Pearce @ 2008-01-27 17:38 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: git

Adrian Bunk <bunk@kernel.org> wrote:
> I'm not sure whether this is already known, but when recently working 
> for some time from a computer with "only" 512 MB RAM I ran into the huge 
> memory usage of git-revert when it tries to revert old commits.
> 
> Example (in Linus' kernel tree with git 1.5.3.8):
> 
> $ git-revert d19fbe8a7
...
> In top you can see that this took > 800 MB of RAM !
> 
> I don't know how easy it would be to implement, but shouldn't git-revert 
> be able to be as fast and less memory consuming as
>   git-show d19fbe8a7 | patch -p1 -R

Its more like:

	git-diff-tree -M d19fbe8a7^ d19fbe8a7 | git-apply -R --index

In other words its doing rename detection.  Its possible that the
rename detector fired for added/removed paths and it took some
significant amount of memory to figure out what was renamed.
Maybe we hung onto stuff for too long, but it has to do rename
detection and that takes memory to store the matrix and file
contents.  Once memory is allocated by git we don't give it back
to the kernel, even if we free'd it internally.

If you really are tight on memory and have to do a revert you can
use the above, but minus the -M, to setup the change, but you may
run into trouble if renames were involved.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-27 17:38 ` Shawn O. Pearce
@ 2008-01-28  6:01   ` Jeff King
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2008-01-28  6:01 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Adrian Bunk, git

On Sun, Jan 27, 2008 at 12:38:08PM -0500, Shawn O. Pearce wrote:

> >   git-show d19fbe8a7 | patch -p1 -R
> 
> Its more like:
> 
> 	git-diff-tree -M d19fbe8a7^ d19fbe8a7 | git-apply -R --index
> 
> In other words its doing rename detection.  Its possible that the
> rename detector fired for added/removed paths and it took some
> significant amount of memory to figure out what was renamed.

It's doubtful. There are only two files in that diff, and neither is a
candidate for rename detection.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-27 17:27 git-revert is a memory hog Adrian Bunk
  2008-01-27 17:38 ` Shawn O. Pearce
@ 2008-01-28  5:59 ` Jeff King
  2008-01-29 21:51   ` Linus Torvalds
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff King @ 2008-01-28  5:59 UTC (permalink / raw)
  To: Adrian Bunk; +Cc: git

On Sun, Jan 27, 2008 at 07:27:48PM +0200, Adrian Bunk wrote:

> <--  snip  -->
> 
> $ git-revert d19fbe8a7
> Auto-merged drivers/input/input.c
> CONFLICT (content): Merge conflict in drivers/input/input.c
> Auto-merged include/linux/input.h
> CONFLICT (content): Merge conflict in include/linux/input.h
> Automatic revert failed.  After resolving the conflicts,
> mark the corrected paths with 'git add <paths>' and commit the result.
> $ 
> 
> <--  snip  -->
> 
> In top you can see that this took > 800 MB of RAM !

I tried to reproduce this, but my peak heap allocation was only around
20MB. Is your repository fully packed? Not packed at all? Can you use
valgrind/massif to figure out where the memory is going?

> I don't know how easy it would be to implement, but shouldn't git-revert 
> be able to be as fast and less memory consuming as
>   git-show d19fbe8a7 | patch -p1 -R

In your case, the patch doesn't apply cleanly, so we end up doing a
3-way merge (in my tests, it is git-merge-recursive which ends up taking
up the memory).

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-28  5:59 ` Jeff King
@ 2008-01-29 21:51   ` Linus Torvalds
  2008-01-29 22:15     ` Junio C Hamano
  2008-01-29 22:20     ` Jeff King
  0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2008-01-29 21:51 UTC (permalink / raw)
  To: Jeff King; +Cc: Adrian Bunk, git

On Mon, 28 Jan 2008, Jeff King wrote:
> 
> I tried to reproduce this, but my peak heap allocation was only around
> 20MB. Is your repository fully packed? Not packed at all? Can you use
> valgrind/massif to figure out where the memory is going?

I definitely can reproduce it, it's horrid.

This is from "top" fairly late in the game, but with the thing not even 
done yet. Current git, pretty much fully (and fairly aggressively) packed 
current kernel repo, and using "diff.renamelmit=0".

	4751 torvalds  20   0  852m 446m  47m R   72 22.4   2:46.58 git-merge-recur

It finally finished with time reporting:

	208.15user 3.50system 4:01.50elapsed 87%CPU (0avgtext+0avgdata 0maxresident)k
	238736inputs+4544outputs (8261major+280971minor)pagefaults 0swaps

where those 280971 minor page faults are what largely indicates how much 
memory it used (the technical term for that number is "metric buttload of 
memory").

But I'm in Melbourne right now on my laptop,and probably won't be able to 
debug this much. 

> In your case, the patch doesn't apply cleanly, so we end up doing a
> 3-way merge (in my tests, it is git-merge-recursive which ends up taking
> up the memory).

It is indeed git-merge-recursive. It just shouldn't take that much memory.

		Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 21:51   ` Linus Torvalds
@ 2008-01-29 22:15     ` Junio C Hamano
  2008-01-29 22:20     ` Jeff King
  1 sibling, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2008-01-29 22:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Adrian Bunk, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 28 Jan 2008, Jeff King wrote:
>> 
>> I tried to reproduce this, but my peak heap allocation was only around
>> 20MB. Is your repository fully packed? Not packed at all? Can you use
>> valgrind/massif to figure out where the memory is going?
>
> I definitely can reproduce it, it's horrid.
>
> This is from "top" fairly late in the game, but with the thing not even 
> done yet. Current git, pretty much fully (and fairly aggressively) packed 
> current kernel repo, and using "diff.renamelmit=0".
>
> 	4751 torvalds  20   0  852m 446m  47m R   72 22.4   2:46.58 git-merge-recur
>
> It finally finished with time reporting:
>
> 	208.15user 3.50system 4:01.50elapsed 87%CPU (0avgtext+0avgdata 0maxresident)k
> 	238736inputs+4544outputs (8261major+280971minor)pagefaults 0swaps
>
> where those 280971 minor page faults are what largely indicates how much 
> memory it used (the technical term for that number is "metric buttload of 
> memory").
>
> But I'm in Melbourne right now on my laptop,and probably won't be able to 
> debug this much. 
>
>> In your case, the patch doesn't apply cleanly, so we end up doing a
>> 3-way merge (in my tests, it is git-merge-recursive which ends up taking
>> up the memory).
>
> It is indeed git-merge-recursive. It just shouldn't take that much memory.

Hmmmmm.  Obviously this depends on where you start your revert
from, but that is not what I am getting.

: gitster linux-2.6/test; git reset --hard
HEAD is now at 0ba6c33... Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.25
: gitster linux-2.6/test; /usr/bin/time git-revert d19fbe8a7
Auto-merged drivers/input/input.c
CONFLICT (content): Merge conflict in drivers/input/input.c
Auto-merged include/linux/input.h
CONFLICT (content): Merge conflict in include/linux/input.h
Automatic revert failed.  After resolving the conflicts,
mark the corrected paths with 'git add <paths>' or 'git rm <paths>' and commit the result.
Command exited with non-zero status 1
1.08user 0.06system 0:01.14elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+19867minor)pagefaults 0swaps

Now, a possible alternative (that produces an identical result)
is not much better, because it ends up using merge-recursive as
its core.

: gitster linux-2.6/test; git reset --hard                                      HEAD is now at 0ba6c33... Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.25
: gitster linux-2.6/test; rm -fr .dotest
: gitster linux-2.6/test; /usr/bin/time sh -c 'git format-patch -R --binary --stdout -1 d19fbe8a7 | git am -3'
Applying Input: prepare to sysfs integration
error: patch failed: drivers/input/input.c:27
error: drivers/input/input.c: patch does not apply
error: patch failed: include/linux/input.h:12
error: include/linux/input.h: patch does not apply
Using index info to reconstruct a base tree...
Falling back to patching base and 3-way merge...
Auto-merged drivers/input/input.c
CONFLICT (content): Merge conflict in drivers/input/input.c
Auto-merged include/linux/input.h
CONFLICT (content): Merge conflict in include/linux/input.h
Failed to merge in the changes.
Patch failed at 0001.
When you have resolved this problem run "git-am -3 --resolved".
If you would prefer to skip this patch, instead run "git-am -3 --skip".
Command exited with non-zero status 1
0.52user 0.25system 0:00.75elapsed 103%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41208minor)pagefaults 0swaps

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 21:51   ` Linus Torvalds
  2008-01-29 22:15     ` Junio C Hamano
@ 2008-01-29 22:20     ` Jeff King
  2008-01-29 22:30       ` Jeff King
  2008-01-29 22:36       ` Junio C Hamano
  1 sibling, 2 replies; 24+ messages in thread
From: Jeff King @ 2008-01-29 22:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Adrian Bunk, git

On Wed, Jan 30, 2008 at 08:51:09AM +1100, Linus Torvalds wrote:

> I definitely can reproduce it, it's horrid.
> 
> This is from "top" fairly late in the game, but with the thing not even 
> done yet. Current git, pretty much fully (and fairly aggressively) packed 
> current kernel repo, and using "diff.renamelmit=0".

Hrm, setting diff.renamelimit to 0 lets me reproduce (I thought I tried
it before, but clearly not...).

The culprit seems to be diffcore-rename.c:476:

  mx = xmalloc(sizeof(*mx) * num_create * num_src);

Where that ends up allocating about 450M. I think this is exactly the
sort of case that renamelimit was introduced to address.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:20     ` Jeff King
@ 2008-01-29 22:30       ` Jeff King
  2008-01-29 22:36       ` Junio C Hamano
  1 sibling, 0 replies; 24+ messages in thread
From: Jeff King @ 2008-01-29 22:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Adrian Bunk, git

On Tue, Jan 29, 2008 at 05:20:07PM -0500, Jeff King wrote:

> The culprit seems to be diffcore-rename.c:476:
> 
>   mx = xmalloc(sizeof(*mx) * num_create * num_src);
> 
> Where that ends up allocating about 450M. I think this is exactly the
> sort of case that renamelimit was introduced to address.

BTW, a much easier way to see the problem is with:

  git diff -M -l0 d19fbe8a76

It's just a _really_ old commit, and the rename detection has to work on
a large number of files.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:20     ` Jeff King
  2008-01-29 22:30       ` Jeff King
@ 2008-01-29 22:36       ` Junio C Hamano
  2008-01-29 22:45         ` Jeff King
                           ` (3 more replies)
  1 sibling, 4 replies; 24+ messages in thread
From: Junio C Hamano @ 2008-01-29 22:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Adrian Bunk, git

Jeff King <peff@peff.net> writes:

> On Wed, Jan 30, 2008 at 08:51:09AM +1100, Linus Torvalds wrote:
>
>> I definitely can reproduce it, it's horrid.
>> 
>> This is from "top" fairly late in the game, but with the thing not even 
>> done yet. Current git, pretty much fully (and fairly aggressively) packed 
>> current kernel repo, and using "diff.renamelmit=0".
>
> Hrm, setting diff.renamelimit to 0 lets me reproduce (I thought I tried
> it before, but clearly not...).

Hmph.  But I wonder why this part does not trigger, even when
you have renamelimit set to 0.

	/*
	 * This basically does a test for the rename matrix not
	 * growing larger than a "rename_limit" square matrix, ie:
	 *
	 *    rename_dst_nr * rename_src_nr > rename_limit * rename_limit
	 *
	 * but handles the potential overflow case specially (and we
	 * assume at least 32-bit integers)
	 */
	if (rename_limit <= 0 || rename_limit > 32767)
		rename_limit = 32767;
	if (rename_dst_nr > rename_limit && rename_src_nr > rename_limit)
		goto cleanup;
	if (rename_dst_nr * rename_src_nr > rename_limit * rename_limit)
		goto cleanup;

I wonder if the second one for the overflow avoidance should be
using || instead of &&, though.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:36       ` Junio C Hamano
@ 2008-01-29 22:45         ` Jeff King
  2008-01-29 22:51           ` Jeff King
  2008-01-29 22:49         ` Linus Torvalds
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Jeff King @ 2008-01-29 22:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Adrian Bunk, git

On Tue, Jan 29, 2008 at 02:36:24PM -0800, Junio C Hamano wrote:

> Hmph.  But I wonder why this part does not trigger, even when
> you have renamelimit set to 0.
> [...]
> 	if (rename_limit <= 0 || rename_limit > 32767)
> 		rename_limit = 32767;

It does trigger; we set the limit to the obscenely high 32767. My matrix
was something like 8000x3500.

> 	if (rename_dst_nr > rename_limit && rename_src_nr > rename_limit)
> 		goto cleanup;
> 	if (rename_dst_nr * rename_src_nr > rename_limit * rename_limit)
> 		goto cleanup;
> 
> I wonder if the second one for the overflow avoidance should be
> using || instead of &&, though.

Hrm, yes, I think it can still overflow. (e.g., a 2 by 2^32-1
situation). But changing it to || isn't right, either; you would
disallow 1 by 101, which is quite do-able (and the normal case for -C
-C, I would think).

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:45         ` Jeff King
@ 2008-01-29 22:51           ` Jeff King
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2008-01-29 22:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Adrian Bunk, git

On Tue, Jan 29, 2008 at 05:45:58PM -0500, Jeff King wrote:

> > 	if (rename_dst_nr > rename_limit && rename_src_nr > rename_limit)
> > 		goto cleanup;
> 
> Hrm, yes, I think it can still overflow. (e.g., a 2 by 2^32-1
> situation). But changing it to || isn't right, either; you would

I think the correct solution is just:

  /* check for overflow of square */
  if (rename_dst_nr > ULONG_MAX / rename_src_nr)
          goto cleanup;

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:36       ` Junio C Hamano
  2008-01-29 22:45         ` Jeff King
@ 2008-01-29 22:49         ` Linus Torvalds
  2008-01-29 22:54           ` Jeff King
  2008-01-29 22:53         ` Junio C Hamano
  2008-01-29 23:19         ` Junio C Hamano
  3 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2008-01-29 22:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Adrian Bunk, git

On Tue, 29 Jan 2008, Junio C Hamano wrote:
> 
> I wonder if the second one for the overflow avoidance should be
> using || instead of &&, though.

No, we want to be able to handle the case where there is (for example) 
just one removed file, but lots of new ones. That's not expensive at all. 
So we don't want to require that *both* the counts for removed and new 
files are low, we really want to check that we don't have too many 
combinations together.

But  the

        if (rename_limit <= 0 || rename_limit > 32767)
                rename_limit = 32767;

which is there purely to avoid overflow in 32-bit multiplication should 
probably be changed to be more reasonable. We'll never want to try to do a 
matrix that is really 32k * 32k in size, even if we can calculate its size 
;)

So maybe we should just make that hard limit more reasonable. 100x100 was 
too small, but a 1000x1000 matrix might be acceptable.

Or, better yet (which was what I was hoping for originally), we'd just 
make the inexact rename detection be linear-size/time rather than O(m*n). 
But those patches never really came together, so we do need to limit it 
more aggressively.

		Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:49         ` Linus Torvalds
@ 2008-01-29 22:54           ` Jeff King
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2008-01-29 22:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Adrian Bunk, git

On Wed, Jan 30, 2008 at 09:49:44AM +1100, Linus Torvalds wrote:

> But  the
> 
>         if (rename_limit <= 0 || rename_limit > 32767)
>                 rename_limit = 32767;
> 
> which is there purely to avoid overflow in 32-bit multiplication should 

Ah, right, that first conditional handles the overflow. Then what is the
second one doing? If they are both larger than the rename limit, then
won't there square by definition be larger than the square of the rename
limit? I.e., can't we just get rid of the second conditional?

> Or, better yet (which was what I was hoping for originally), we'd just 
> make the inexact rename detection be linear-size/time rather than O(m*n). 
> But those patches never really came together, so we do need to limit it 
> more aggressively.

I had trouble getting the memory usage to a reasonable level. The hash
tables were just getting enormous.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:36       ` Junio C Hamano
  2008-01-29 22:45         ` Jeff King
  2008-01-29 22:49         ` Linus Torvalds
@ 2008-01-29 22:53         ` Junio C Hamano
  2008-01-29 22:57           ` Jeff King
  2008-01-29 23:19         ` Junio C Hamano
  3 siblings, 1 reply; 24+ messages in thread
From: Junio C Hamano @ 2008-01-29 22:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Adrian Bunk, git

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> Hrm, setting diff.renamelimit to 0 lets me reproduce (I thought I tried
>> it before, but clearly not...).
>
> Hmph.  But I wonder why this part does not trigger, even when
> you have renamelimit set to 0.
>
> 	/*
> 	 * This basically does a test for the rename matrix not
> 	 * growing larger than a "rename_limit" square matrix, ie:
> 	 *
> 	 *    rename_dst_nr * rename_src_nr > rename_limit * rename_limit
> 	 *
> 	 * but handles the potential overflow case specially (and we
> 	 * assume at least 32-bit integers)
> 	 */
> 	if (rename_limit <= 0 || rename_limit > 32767)
> 		rename_limit = 32767;
> 	if (rename_dst_nr > rename_limit && rename_src_nr > rename_limit)
> 		goto cleanup;
> 	if (rename_dst_nr * rename_src_nr > rename_limit * rename_limit)
> 		goto cleanup;
>
> I wonder if the second one for the overflow avoidance should be
> using || instead of &&, though.

Reverting d19fbe8a7 means coming up with a 3-way merge between
d19fbe8a7^ and master as if d19fbe8a7 is their common ancestor.

"git diff --name-status d19fbe8a7 d19fbe8a7^" shows only two
paths changed.

"git diff --name-status d19fbe8a7 master" shows 8558 new paths
and 3756 deleted paths, which makes 32m paths pairs, that is
still lower than 32767 squared.

If your int is 64-bit, struct diff_score which is 4-int is
16-byte long.  32m * 16 = 501,801,600.  That seems to match your
450MB observation well.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:53         ` Junio C Hamano
@ 2008-01-29 22:57           ` Jeff King
  0 siblings, 0 replies; 24+ messages in thread
From: Jeff King @ 2008-01-29 22:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Adrian Bunk, git

On Tue, Jan 29, 2008 at 02:53:12PM -0800, Junio C Hamano wrote:

> If your int is 64-bit, struct diff_score which is 4-int is
> 16-byte long.  32m * 16 = 501,801,600.  That seems to match your
> 450MB observation well.

Yes, except for s/64-bit/32-bit/.

-Peff

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 22:36       ` Junio C Hamano
                           ` (2 preceding siblings ...)
  2008-01-29 22:53         ` Junio C Hamano
@ 2008-01-29 23:19         ` Junio C Hamano
  2008-01-29 23:50           ` Junio C Hamano
  2008-01-30  4:40           ` [PATCH] Optimize rename detection for a huge diff Junio C Hamano
  3 siblings, 2 replies; 24+ messages in thread
From: Junio C Hamano @ 2008-01-29 23:19 UTC (permalink / raw)
  To: Linus Torvalds, Jeff King; +Cc: Adrian Bunk, git

Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> On Wed, Jan 30, 2008 at 08:51:09AM +1100, Linus Torvalds wrote:
>>
>>> I definitely can reproduce it, it's horrid.
>>> 
>>> This is from "top" fairly late in the game, but with the thing not even 
>>> done yet. Current git, pretty much fully (and fairly aggressively) packed 
>>> current kernel repo, and using "diff.renamelmit=0".
>>
>> Hrm, setting diff.renamelimit to 0 lets me reproduce (I thought I tried
>> it before, but clearly not...).

How about this fairly obvious patch?

We used to record, for N=src deleted paths and M=dst created
paths, (N x M) "struct diff_score" that records how similar each
of the pair is, and picked the <src,dst> pairs that gives the
best match first, and then went on to worse matches.  This
sorting is so that two destinations are both found to be similar
to a single source, we can process the more similar one first,
and when processing the second one, it can notice "Ah, the
source I was planning to say I am a copy of is already taken by
somebody else" and continue on to match himself with another
source with a lessor match (this matters to a change introduced
between 1.5.3.X series and 1.5.4-rc, that lets the code to favor
unused matches first and then falls back to using already used
matches).

This instead allocates and keeps only M records in core.  For
each dst, we compute similarlity with all sources (so the number
of similarity estimate computations we do is still N x M), but
we keep the best src for each dst.  This is essentially to save
memory drastically by giving up to come up with better pairing.

I guess we could keep a handful best candidates per dst, instead
of just one, to further improve on this approach, and such a
change should be fairly straightforward.

---
 diffcore-rename.c |   46 +++++++++++++++++++++++-----------------------
 1 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3d37725..7e8fdcd 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -473,42 +473,42 @@ void diffcore_rename(struct diff_options *options)
 	if (num_create * num_src > rename_limit * rename_limit)
 		goto cleanup;
 
-	mx = xmalloc(sizeof(*mx) * num_create * num_src);
+
+	mx = xmalloc(sizeof(*mx) * num_create);
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		int base = dst_cnt * num_src;
 		struct diff_filespec *two = rename_dst[i].two;
+		struct diff_score *m;
+
 		if (rename_dst[i].pair)
 			continue; /* dealt with exact match already. */
+
+		m = &mx[dst_cnt];
+		m->dst = -1;
 		for (j = 0; j < rename_src_nr; j++) {
 			struct diff_filespec *one = rename_src[j].one;
-			struct diff_score *m = &mx[base+j];
-			m->src = j;
-			m->dst = i;
-			m->score = estimate_similarity(one, two,
-						       minimum_score);
-			m->name_score = basename_same(one, two);
+			int score, name_score;
+			score = estimate_similarity(one, two,
+						    minimum_score);
+			name_score = basename_same(one, two);
+			if (m->dst < 0 ||
+			    score > m->score ||
+			    (score == m->score &&
+			     name_score > m->name_score)) {
+				m->score = score;
+				m->name_score = name_score;
+				m->src = j;
+				m->dst = i;
+			}
 			diff_free_filespec_blob(one);
 		}
 		/* We do not need the text anymore */
 		diff_free_filespec_blob(two);
 		dst_cnt++;
 	}
+
 	/* cost matrix sorted by most to least similar pair */
-	qsort(mx, num_create * num_src, sizeof(*mx), score_compare);
-	for (i = 0; i < num_create * num_src; i++) {
-		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
-		struct diff_filespec *src;
-		if (dst->pair)
-			continue; /* already done, either exact or fuzzy. */
-		if (mx[i].score < minimum_score)
-			break; /* there is no more usable pair. */
-		src = rename_src[mx[i].src].one;
-		if (src->rename_used)
-			continue;
-		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
-		rename_count++;
-	}
-	for (i = 0; i < num_create * num_src; i++) {
+	qsort(mx, num_create, sizeof(*mx), score_compare);
+	for (i = 0; i < num_create; i++) {
 		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
 		if (dst->pair)
 			continue; /* already done, either exact or fuzzy. */

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: git-revert is a memory hog
  2008-01-29 23:19         ` Junio C Hamano
@ 2008-01-29 23:50           ` Junio C Hamano
  2008-01-30  4:40           ` [PATCH] Optimize rename detection for a huge diff Junio C Hamano
  1 sibling, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2008-01-29 23:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Adrian Bunk, git

Junio C Hamano <gitster@pobox.com> writes:

> This instead allocates and keeps only M records in core.  For
> each dst, we compute similarlity with all sources (so the number
> of similarity estimate computations we do is still N x M), but
> we keep the best src for each dst.  This is essentially to save
> memory drastically by giving up to come up with better pairing.
>
> I guess we could keep a handful best candidates per dst, instead
> of just one, to further improve on this approach, and such a
> change should be fairly straightforward.

An obvious side note to this patch is that if we are going to
limit us to only 1 source candidate per destination, we do not
even have to allocate.  We can just do similarity one-by-one for
each destination, and pair up with the best source as we go.

I did not code it that way, primarily because that would
permanently close the door to extend it back to keep multiple
candidates per dst, so that later ones that gets processed can
notice what happened to earlier ones.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] Optimize rename detection for a huge diff
  2008-01-29 23:19         ` Junio C Hamano
  2008-01-29 23:50           ` Junio C Hamano
@ 2008-01-30  4:40           ` Junio C Hamano
  2008-01-30  6:57             ` Luke Lu
  2008-02-13  9:53             ` Junio C Hamano
  1 sibling, 2 replies; 24+ messages in thread
From: Junio C Hamano @ 2008-01-30  4:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Adrian Bunk, git

When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.

This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lessor match.  This matters to a change introduced between
1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
matches first and then falls back to using already used
matches.

This instead allocates and keeps only a handful rename source
candidates per new files in the postimage.  I.e. it makes the
memory requirement from O(N x M) to O(M).

For each dst, we compute similarlity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep handful best src candidates for each dst.

I've run

    git diff -l0 -M --name-status v2.6.$i v2.6.$(($i+1))

with and without patch for i=15..23.  There is one pair between
v2.6.20 and v2.6.21 that gets different matching from this
version.

    R093	include/asm-arm/apm.h	include/linux/apm-emulation.h
    R093	include/asm-mips/apm.h	include/linux/apm-emulation.h

Without the patch, apm-emulation.h is found to be from asm-arm/arm.h;
with the patch, it is found to be from asm-mips/apm.h.  

But the difference between the ARM and MIPS versions is quite
small, so I do not think we need to even call this an regression:

    diff --git a/v2.6.20:include/asm-arm/apm.h b/v2.6.20:include/asm-mips/apm.h
    index d09113b..4b99ffc 100644
    --- a/v2.6.20:include/asm-arm/apm.h
    +++ b/v2.6.20:include/asm-mips/apm.h
    @@ -10,8 +10,8 @@
      *
      *
      */
    -#ifndef ARM_ASM_SA1100_APM_H
    -#define ARM_ASM_SA1100_APM_H
    +#ifndef MIPS_ASM_SA1100_APM_H
    +#define MIPS_ASM_SA1100_APM_H

     #include <linux/apm_bios.h>

And the renamed diff is like this.

    diff --git a/v2.6.20:include/asm-arm/apm.h b/v2.6.21:include/linux/apm-emulation.h
    index d09113b..e6d8003 100644
    --- a/v2.6.20:include/asm-arm/apm.h
    +++ b/v2.6.21:include/linux/apm-emulation.h
    @@ -7,11 +7,9 @@
      * based on arch/arm/kernel/apm.c
      * factor out the information needed by architectures to provide
      * apm status
    - *
    - *
      */
    -#ifndef ARM_ASM_SA1100_APM_H
    -#define ARM_ASM_SA1100_APM_H
    +#ifndef __LINUX_APM_EMULATION_H
    +#define __LINUX_APM_EMULATION_H

     #include <linux/apm_bios.h>

    @@ -61,4 +59,4 @@ extern void (*apm_get_power_status)(struct apm_power_info *);
      */
     void apm_queue_event(apm_event_t event);

    -#endif
    +#endif /* __LINUX_APM_EMULATION_H */

By looking at the above two diffs, I would say picking between
ARM and MIPS is equally good and the behaviour difference should
not matter in practice.

Rename-detecting diff without limit between v2.6.20 and v2.6.24
produces 991 renames and 57 copies, with or without the patch
(the resulting pairs are somewhat deferent, but I've looked at a
few and the differences looked fairly small, like the above
header files).  The output from /usr/bin/time are:

  (without patch)
  28.84user 0.33system 0:29.35elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (19major+90874minor)pagefaults 0swaps

  (with patch)
  24.81user 0.12system 0:24.93elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+36553minor)pagefaults 0swaps

Signed-off-by: Junio C Hamano <gitster@pobox.com>

---

 Junio C Hamano <gitster@pobox.com> writes:

 > How about this fairly obvious patch?

 This replaces the previous one by implementing "keep the best N
 candidates per each destination" as I hinted in my earlier
 message.  I suspect that we may want to optimize the O(N x M)
 part by stopping after finding enough number of good enough
 candidates for a given destination.

 diffcore-rename.c |   77 +++++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 57 insertions(+), 20 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3d37725..90d06f0 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -223,6 +223,12 @@ static int score_compare(const void *a_, const void *b_)
 {
 	const struct diff_score *a = a_, *b = b_;
 
+	/* sink the unused ones to the bottom */
+	if (a->dst < 0)
+		return (0 <= b->dst);
+	else if (b->dst < 0)
+		return -1;
+
 	if (a->score == b->score)
 		return b->name_score - a->name_score;
 
@@ -387,6 +393,22 @@ static int find_exact_renames(void)
 	return i;
 }
 
+#define NUM_CANDIDATE_PER_DST 8
+static void record_if_better(struct diff_score m[], struct diff_score *o)
+{
+	int i, worst;
+
+	/* find the worst one */
+	worst = 0;
+	for (i = 1; i < NUM_CANDIDATE_PER_DST; i++)
+		if (score_compare(&m[i], &m[worst]) > 0)
+			worst = i;
+
+	/* is it better than the worst one? */
+	if (score_compare(&m[worst], o) > 0)
+		m[worst] = *o;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -473,50 +495,65 @@ void diffcore_rename(struct diff_options *options)
 	if (num_create * num_src > rename_limit * rename_limit)
 		goto cleanup;
 
-	mx = xmalloc(sizeof(*mx) * num_create * num_src);
+	mx = xcalloc(num_create * NUM_CANDIDATE_PER_DST, sizeof(*mx));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		int base = dst_cnt * num_src;
 		struct diff_filespec *two = rename_dst[i].two;
+		struct diff_score *m;
+
 		if (rename_dst[i].pair)
 			continue; /* dealt with exact match already. */
+
+		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
+		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
+			m[j].dst = -1;
+
 		for (j = 0; j < rename_src_nr; j++) {
 			struct diff_filespec *one = rename_src[j].one;
-			struct diff_score *m = &mx[base+j];
-			m->src = j;
-			m->dst = i;
-			m->score = estimate_similarity(one, two,
-						       minimum_score);
-			m->name_score = basename_same(one, two);
+			struct diff_score this_src;
+			this_src.score = estimate_similarity(one, two,
+							     minimum_score);
+			this_src.name_score = basename_same(one, two);
+			this_src.dst = i;
+			this_src.src = j;
+			record_if_better(m, &this_src);
 			diff_free_filespec_blob(one);
 		}
 		/* We do not need the text anymore */
 		diff_free_filespec_blob(two);
 		dst_cnt++;
 	}
+
 	/* cost matrix sorted by most to least similar pair */
-	qsort(mx, num_create * num_src, sizeof(*mx), score_compare);
-	for (i = 0; i < num_create * num_src; i++) {
-		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
-		struct diff_filespec *src;
+	qsort(mx, dst_cnt * NUM_CANDIDATE_PER_DST, sizeof(*mx), score_compare);
+
+	for (i = 0; i < dst_cnt * NUM_CANDIDATE_PER_DST; i++) {
+		struct diff_rename_dst *dst;
+
+		if ((mx[i].dst < 0) ||
+		    (mx[i].score < minimum_score))
+			break; /* there is no more usable pair. */
+		dst = &rename_dst[mx[i].dst];
 		if (dst->pair)
 			continue; /* already done, either exact or fuzzy. */
-		if (mx[i].score < minimum_score)
-			break; /* there is no more usable pair. */
-		src = rename_src[mx[i].src].one;
-		if (src->rename_used)
+		if (rename_src[mx[i].src].one->rename_used)
 			continue;
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		rename_count++;
 	}
-	for (i = 0; i < num_create * num_src; i++) {
-		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
+
+	for (i = 0; i < dst_cnt * NUM_CANDIDATE_PER_DST; i++) {
+		struct diff_rename_dst *dst;
+
+		if ((mx[i].dst < 0) ||
+		    (mx[i].score < minimum_score))
+			break; /* there is no more usable pair. */
+		dst = &rename_dst[mx[i].dst];
 		if (dst->pair)
 			continue; /* already done, either exact or fuzzy. */
-		if (mx[i].score < minimum_score)
-			break; /* there is no more usable pair. */
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		rename_count++;
 	}
+
 	free(mx);
 
  cleanup:

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] Optimize rename detection for a huge diff
  2008-01-30  4:40           ` [PATCH] Optimize rename detection for a huge diff Junio C Hamano
@ 2008-01-30  6:57             ` Luke Lu
  2008-01-30  7:24               ` Luke Lu
  2008-02-13  9:53             ` Junio C Hamano
  1 sibling, 1 reply; 24+ messages in thread
From: Luke Lu @ 2008-01-30  6:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Jeff King, Adrian Bunk, git

On Jan 29, 2008, at 8:40 PM, Junio C Hamano wrote:
> When there are N deleted paths and M created paths, we used to
> allocate (N x M) "struct diff_score" that record how similar
> each of the pair is, and picked the <src,dst> pair that gives
> the best match first, and then went on to process worse matches.
>
> This sorting is done so that when two new files in the postimage
> that are similar to the same file deleted from the preimage, we
> can process the more similar one first, and when processing the
> second one, it can notice "Ah, the source I was planning to say
> I am a copy of is already taken by somebody else" and continue
> on to match itself with another file in the preimage with a
> lessor match.  This matters to a change introduced between
> 1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
> matches first and then falls back to using already used
> matches.
>
> This instead allocates and keeps only a handful rename source
> candidates per new files in the postimage.  I.e. it makes the
> memory requirement from O(N x M) to O(M).
>
> For each dst, we compute similarlity with all sources (i.e. the
> number of similarity estimate computations is still O(N x M)),
> but we keep handful best src candidates for each dst.

I can think of cases where you'll throw away better candidates this  
way. How about using a priority queue of size max(N, M)?

I don't know about the details of the current algorithm but it seems  
to me that using a naive Rabin Karp fingerprinting approach would not  
use too much memory: say L is total number of bytes of created files  
and the fingerprint size S and hash size of 4 bytes. To keep track of  
M files You only need to keep 8(additional 4 bytes as an index to the  
file names)*(L/S + M(for filenames)) plus some overhead for the hash  
table in memory. One pass through D (number of bytes of deleted  
files) you can get the NxM scores. The score is defined as Wf * Mf +  
Wt, where Wf is the weight for fingerprinting match and Wt is the  
weight for title match score; Mf is the fingerprint match score =  
(number of matching fingerprints)/(number of fingerprints of original  
(deleted) file). Wf and Wt can be tuned to boost exact basename match.

By pushing the scores into a priority queue you'll get the final top  
(max(N, M) = K) scores in the end. The computation complexity is  
really O(D+L+(MxN)logK) and memory requirement O(L)

You can compute the entire linux source tree renaming (24K files and  
total 260MB uncompressed) this way using only about 92MB of memory in  
18 seconds (limited by hash lookup speed, assuming 15M lookups per  
second based on my past experience).

__Luke

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Optimize rename detection for a huge diff
  2008-01-30  6:57             ` Luke Lu
@ 2008-01-30  7:24               ` Luke Lu
  0 siblings, 0 replies; 24+ messages in thread
From: Luke Lu @ 2008-01-30  7:24 UTC (permalink / raw)
  To: Luke Lu; +Cc: Junio C Hamano, Linus Torvalds, Jeff King, Adrian Bunk, git

On Jan 29, 2008, at 10:57 PM, Luke Lu wrote:
> On Jan 29, 2008, at 8:40 PM, Junio C Hamano wrote:
>> When there are N deleted paths and M created paths, we used to
>> allocate (N x M) "struct diff_score" that record how similar
>> each of the pair is, and picked the <src,dst> pair that gives
>> the best match first, and then went on to process worse matches.
>>
>> This sorting is done so that when two new files in the postimage
>> that are similar to the same file deleted from the preimage, we
>> can process the more similar one first, and when processing the
>> second one, it can notice "Ah, the source I was planning to say
>> I am a copy of is already taken by somebody else" and continue
>> on to match itself with another file in the preimage with a
>> lessor match.  This matters to a change introduced between
>> 1.5.3.X series and 1.5.4-rc, that lets the code to favor unused
>> matches first and then falls back to using already used
>> matches.
>>
>> This instead allocates and keeps only a handful rename source
>> candidates per new files in the postimage.  I.e. it makes the
>> memory requirement from O(N x M) to O(M).
>>
>> For each dst, we compute similarlity with all sources (i.e. the
>> number of similarity estimate computations is still O(N x M)),
>> but we keep handful best src candidates for each dst.
>
> I can think of cases where you'll throw away better candidates this  
> way. How about using a priority queue of size max(N, M)?
>
> I don't know about the details of the current algorithm but it  
> seems to me that using a naive Rabin Karp fingerprinting approach  
> would not use too much memory: say L is total number of bytes of  
> created files and the fingerprint size S and hash size of 4 bytes.  
> To keep track of M files You only need to keep 8(additional 4 bytes  
> as an index to the file names)*(L/S + M(for filenames)) plus some  
> overhead for the hash table in memory. One pass through D (number  
> of bytes of deleted files) you can get the NxM scores. The score is  
> defined as Wf * Mf + Wt, where Wf is the weight for fingerprinting  
> match and Wt is the weight for title match score; Mf is the  
> fingerprint match score = (number of matching fingerprints)/(number  
> of fingerprints of original (deleted) file). Wf and Wt can be tuned  
> to boost exact basename match.
>
> By pushing the scores into a priority queue you'll get the final  
> top (max(N, M) = K) scores in the end. The computation complexity  
> is really O(D+L+(MxN)logK) and memory requirement O(L)
>
> You can compute the entire linux source tree renaming (24K files  
> and total 260MB uncompressed) this way using only about 92MB of  
> memory in 18 seconds (limited by hash lookup speed, assuming 15M  
> lookups per second based on my past experience).

The estimate is based on fingerprint size of 64 bytes and a 2.4GHz  
C2D class Intel CPU, YMMV. One can trade off the accuracy for less  
memory by using larger fingerprint size and vice versa.

__Luke

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Optimize rename detection for a huge diff
  2008-01-30  4:40           ` [PATCH] Optimize rename detection for a huge diff Junio C Hamano
  2008-01-30  6:57             ` Luke Lu
@ 2008-02-13  9:53             ` Junio C Hamano
  2008-02-13 10:19               ` David Kastrup
  2008-02-14  3:00               ` Junio C Hamano
  1 sibling, 2 replies; 24+ messages in thread
From: Junio C Hamano @ 2008-02-13  9:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Adrian Bunk, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 28 Jan 2008, Jeff King wrote:
>> 
>> I tried to reproduce this, but my peak heap allocation was only around
>> 20MB. Is your repository fully packed? Not packed at all? Can you use
>> valgrind/massif to figure out where the memory is going?
>
> I definitely can reproduce it, it's horrid.
>
> This is from "top" fairly late in the game, but with the thing not even 
> done yet. Current git, pretty much fully (and fairly aggressively) packed 
> current kernel repo, and using "diff.renamelmit=0".
>
> 	4751 torvalds  20   0  852m 446m  47m R   72 22.4   2:46.58 git-merge-recur
>
> It finally finished with time reporting:
>
> 	208.15user 3.50system 4:01.50elapsed 87%CPU (0avgtext+0avgdata 0maxresident)k
> 	238736inputs+4544outputs (8261major+280971minor)pagefaults 0swaps
>
> where those 280971 minor page faults are what largely indicates how much 
> memory it used (the technical term for that number is "metric buttload of 
> memory").

With a bit of tweak, now I am getting these numbers to the
rename detection that used to spend 800MB (the peak I observed
was somewhere around 430MB).

(after patch, in the kernel repository, master at 96b5a46)
$ /usr/bin/time git-diff -M -l0 --name-status d19fbe8a7 master >/var/tmp/3
157.20user 1.03system 2:38.72elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (32major+237315minor)pagefaults 0swaps

$ /usr/bin/time git-diff -M -l0 --name-status d19fbe8a7 master >/var/tmp/4
174.00user 2.73system 3:09.55elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (6106major+459314minor)pagefaults 0swaps

So it is not that much of an improvement, but it seems to help
somewhat.

The first hunk is about shrinking the diff_score structure;
before the patch, it was O(NxM) where N and M are number of
rename source and destination candidates, but after the patch it
is now O(M), so this shrinkage should not matter, but score is
capped to MAX_SCORE (60000) and name_score is actually 0 or 1.
We cannot make it 1-bit unsigned bitfield as there is a qsort
comparison callback that does (b->name_score - a->name_score).

We would need to see where the remaining 400MB is going and try
to shrink it, but this would be an improvement so I'll soon be
moving this to 'next'.

---

 diffcore-rename.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 90d06f0..99953e7 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -112,8 +112,8 @@ static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
 struct diff_score {
 	int src; /* index in rename_src */
 	int dst; /* index in rename_dst */
-	int score;
-	int name_score;
+	unsigned short score;
+	short name_score;
 };
 
 static int estimate_similarity(struct diff_filespec *src,
@@ -393,7 +393,7 @@ static int find_exact_renames(void)
 	return i;
 }
 
-#define NUM_CANDIDATE_PER_DST 8
+#define NUM_CANDIDATE_PER_DST 4
 static void record_if_better(struct diff_score m[], struct diff_score *o)
 {
 	int i, worst;

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] Optimize rename detection for a huge diff
  2008-02-13  9:53             ` Junio C Hamano
@ 2008-02-13 10:19               ` David Kastrup
  2008-02-13 10:23                 ` Junio C Hamano
  2008-02-14  3:00               ` Junio C Hamano
  1 sibling, 1 reply; 24+ messages in thread
From: David Kastrup @ 2008-02-13 10:19 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:


> The first hunk is about shrinking the diff_score structure;
> before the patch, it was O(NxM) where N and M are number of
> rename source and destination candidates, but after the patch it
> is now O(M), so this shrinkage should not matter, but score is
> capped to MAX_SCORE (60000) and name_score is actually 0 or 1.
> We cannot make it 1-bit unsigned bitfield as there is a qsort
> comparison callback that does (b->name_score - a->name_score).

Hm?  Can't that be changed into

(b->name_score > a->name_score ? 1 
 : b->name_score < a->name_score ? -1 : 0)

or something?  Or perhaps just

  ((int)b->name_score - (int)a->name_score)

or so?

It sounds like a factor 2 is in question here, and that would seem like
an easy fix?

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Optimize rename detection for a huge diff
  2008-02-13 10:19               ` David Kastrup
@ 2008-02-13 10:23                 ` Junio C Hamano
  0 siblings, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2008-02-13 10:23 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

David Kastrup <dak@gnu.org> writes:

> Junio C Hamano <gitster@pobox.com> writes:
> ...
>> We cannot make it 1-bit unsigned bitfield as there is a qsort
>> comparison callback that does (b->name_score - a->name_score).
>
> Hm?  Can't that be changed into ...

Sorry, my "we cannot" was misleading.  We do not have anything
else that we need to store that is 15-bit, so even if we can, it
won't gain us anything.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Optimize rename detection for a huge diff
  2008-02-13  9:53             ` Junio C Hamano
  2008-02-13 10:19               ` David Kastrup
@ 2008-02-14  3:00               ` Junio C Hamano
  1 sibling, 0 replies; 24+ messages in thread
From: Junio C Hamano @ 2008-02-14  3:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jeff King, Adrian Bunk, git

Junio C Hamano <gitster@pobox.com> writes:

> With a bit of tweak, now I am getting these numbers to the
> rename detection that used to spend 800MB (the peak I observed
> was somewhere around 430MB).
>
> (after patch, in the kernel repository, master at 96b5a46)
> $ /usr/bin/time git-diff -M -l0 --name-status d19fbe8a7 master >/var/tmp/3
> 157.20user 1.03system 2:38.72elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (32major+237315minor)pagefaults 0swaps
>
> (before patch, same diff)
> $ /usr/bin/time git-diff -M -l0 --name-status d19fbe8a7 master >/var/tmp/4
> 174.00user 2.73system 3:09.55elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (6106major+459314minor)pagefaults 0swaps
>
> So it is not that much of an improvement, but it seems to help
> somewhat.
> ...
> We would need to see where the remaining 400MB is going and try
> to shrink it, but this would be an improvement so I'll soon be
> moving this to 'next'.

I noticed that massif reported diff_queue() very high in the
list, so came up with this patch on top of the previous ones.

162.75user 1.92system 2:45.22elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+215490minor)pagefaults 0swaps

Still not much of an improvement (10%, judging from the minor
fault counts?), and it risks crashing if there is an unconvered
free() that frees a diff_filepair that was obtained from the
bulk allocator.

---
 alloc.c               |   33 ++++++++++++++++++++
 builtin-diff-tree.c   |    2 +-
 builtin-rev-parse.c   |    2 +-
 builtin-send-pack.c   |    2 +-
 builtin-show-branch.c |    2 +-
 cache.h               |    4 ++
 commit.c              |   14 ++++----
 diff.c                |    4 +-
 diffcore-break.c      |   12 ++++---
 diffcore-rename.c     |   80 +++++++++++++++++++++++++++++++++++-------------
 diffcore.h            |    4 ++
 http-push.c           |    2 +-
 revision.c            |    6 ++--
 upload-pack.c         |    2 +-
 14 files changed, 124 insertions(+), 45 deletions(-)

diff --git a/alloc.c b/alloc.c
index 216c23a..08c2591 100644
--- a/alloc.c
+++ b/alloc.c
@@ -15,6 +15,8 @@
 #include "tree.h"
 #include "commit.h"
 #include "tag.h"
+#include "diff.h"
+#include "diffcore.h"
 
 #define BLOCKING 1024
 
@@ -51,6 +53,35 @@ DEFINE_ALLOCATOR(commit, struct commit)
 DEFINE_ALLOCATOR(tag, struct tag)
 DEFINE_ALLOCATOR(object, union any_object)
 
+#define DEFINE_FREEABLE_ALLOCATOR(name, type)			\
+DEFINE_ALLOCATOR(name, type)					\
+union freeable_##name {						\
+	union freeable_##name *next;				\
+	type body;						\
+};								\
+static union freeable_##name *name##_freepool;			\
+type *alloc_freeable_##name##_node(void)			\
+{								\
+	if (name##_freepool) {					\
+		union freeable_##name *one = name##_freepool;	\
+		name##_freepool = one->next;			\
+		memset(one, 0, sizeof(*one));			\
+		return &(one->body);				\
+	}							\
+	return alloc_##name##_node();				\
+}								\
+void free_##name##_node(type *it_)				\
+{								\
+	union freeable_##name *it = (union freeable_##name *)it_; \
+	if (it) {						\
+		it->next = name##_freepool;			\
+		name##_freepool = it;				\
+	}							\
+}
+
+DEFINE_FREEABLE_ALLOCATOR(commit_list, struct commit_list)
+DEFINE_FREEABLE_ALLOCATOR(diff_filepair, struct diff_filepair)
+
 #ifdef NO_C99_FORMAT
 #define SZ_FMT "%u"
 #else
@@ -73,4 +104,6 @@ void alloc_report(void)
 	REPORT(tree);
 	REPORT(commit);
 	REPORT(tag);
+	REPORT(commit_list);
 }
+
diff --git a/builtin-diff-tree.c b/builtin-diff-tree.c
index 832797f..04f93f5 100644
--- a/builtin-diff-tree.c
+++ b/builtin-diff-tree.c
@@ -36,7 +36,7 @@ static int diff_tree_stdin(char *line)
 		/* Free the real parent list */
 		for (parents = commit->parents; parents; ) {
 			struct commit_list *tmp = parents->next;
-			free(parents);
+			free_commit_list_node(parents);
 			parents = tmp;
 		}
 		commit->parents = NULL;
diff --git a/builtin-rev-parse.c b/builtin-rev-parse.c
index b9af1a5..580ec51 100644
--- a/builtin-rev-parse.c
+++ b/builtin-rev-parse.c
@@ -227,7 +227,7 @@ static int try_difference(const char *arg)
 				struct commit_list *n = exclude->next;
 				show_rev(REVERSED,
 					 exclude->item->object.sha1,NULL);
-				free(exclude);
+				free_commit_list_node(exclude);
 				exclude = n;
 			}
 		}
diff --git a/builtin-send-pack.c b/builtin-send-pack.c
index 8afb1d0..6a83258 100644
--- a/builtin-send-pack.c
+++ b/builtin-send-pack.c
@@ -82,7 +82,7 @@ static void unmark_and_free(struct commit_list *list, unsigned int mark)
 		struct commit_list *temp = list;
 		temp->item->object.flags &= ~mark;
 		list = temp->next;
-		free(temp);
+		free_commit_list_node(temp);
 	}
 }
 
diff --git a/builtin-show-branch.c b/builtin-show-branch.c
index 019abd3..0d7cf89 100644
--- a/builtin-show-branch.c
+++ b/builtin-show-branch.c
@@ -38,7 +38,7 @@ static struct commit *pop_one_commit(struct commit_list **list_p)
 	list = *list_p;
 	commit = list->item;
 	*list_p = list->next;
-	free(list);
+	free_commit_list_node(list);
 	return commit;
 }
 
diff --git a/cache.h b/cache.h
index 3867ba7..196a0d7 100644
--- a/cache.h
+++ b/cache.h
@@ -667,6 +667,10 @@ extern void *alloc_tree_node(void);
 extern void *alloc_commit_node(void);
 extern void *alloc_tag_node(void);
 extern void *alloc_object_node(void);
+
+extern struct commit_list *alloc_freeable_commit_list_node(void);
+extern void free_commit_list_node(struct commit_list *);
+
 extern void alloc_report(void);
 
 /* trace.c */
diff --git a/commit.c b/commit.c
index 8b8fb04..6696968 100644
--- a/commit.c
+++ b/commit.c
@@ -333,7 +333,7 @@ int parse_commit(struct commit *item)
 
 struct commit_list *commit_list_insert(struct commit *item, struct commit_list **list_p)
 {
-	struct commit_list *new_list = xmalloc(sizeof(struct commit_list));
+	struct commit_list *new_list = alloc_freeable_commit_list_node();
 	new_list->item = item;
 	new_list->next = *list_p;
 	*list_p = new_list;
@@ -345,11 +345,11 @@ void free_commit_list(struct commit_list *list)
 	while (list) {
 		struct commit_list *temp = list;
 		list = temp->next;
-		free(temp);
+		free_commit_list_node(temp);
 	}
 }
 
-struct commit_list * insert_by_date(struct commit *item, struct commit_list **list)
+struct commit_list *insert_by_date(struct commit *item, struct commit_list **list)
 {
 	struct commit_list **pp = list;
 	struct commit_list *p;
@@ -381,7 +381,7 @@ struct commit *pop_most_recent_commit(struct commit_list **list,
 	struct commit_list *old = *list;
 
 	*list = (*list)->next;
-	free(old);
+	free_commit_list_node(old);
 
 	while (parents) {
 		struct commit *commit = parents->item;
@@ -423,7 +423,7 @@ struct commit *pop_commit(struct commit_list **stack)
 
 	if (top) {
 		*stack = top->next;
-		free(top);
+		free_commit_list_node(top);
 	}
 	return item;
 }
@@ -568,7 +568,7 @@ static struct commit_list *merge_bases(struct commit *one, struct commit *two)
 
 		commit = list->item;
 		n = list->next;
-		free(list);
+		free_commit_list_node(list);
 		list = n;
 
 		flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
@@ -599,7 +599,7 @@ static struct commit_list *merge_bases(struct commit *one, struct commit *two)
 		struct commit_list *n = list->next;
 		if (!(list->item->object.flags & STALE))
 			insert_by_date(list->item, &result);
-		free(list);
+		free_commit_list_node(list);
 		list = n;
 	}
 	return result;
diff --git a/diff.c b/diff.c
index cd8bc4d..63ac8db 100644
--- a/diff.c
+++ b/diff.c
@@ -2433,7 +2433,7 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *queue,
 				 struct diff_filespec *one,
 				 struct diff_filespec *two)
 {
-	struct diff_filepair *dp = xcalloc(1, sizeof(*dp));
+	struct diff_filepair *dp = alloc_freeable_diff_filepair_node();
 	dp->one = one;
 	dp->two = two;
 	if (queue)
@@ -2445,7 +2445,7 @@ void diff_free_filepair(struct diff_filepair *p)
 {
 	free_filespec(p->one);
 	free_filespec(p->two);
-	free(p);
+	free_diff_filepair_node(p);
 }
 
 /* This is different from find_unique_abbrev() in that
diff --git a/diffcore-break.c b/diffcore-break.c
index 31cdcfe..debd26d 100644
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -205,9 +205,11 @@ void diffcore_break(int break_score)
 				dp->score = score;
 				dp->broken_pair = 1;
 
-				free(p); /* not diff_free_filepair(), we are
-					  * reusing one and two here.
-					  */
+				/*
+				 * not diff_free_filepair(), we are
+				 * reusing one and two here.
+				 */
+				free_diff_filepair_node(p);
 				continue;
 			}
 		}
@@ -243,8 +245,8 @@ static void merge_broken(struct diff_filepair *p,
 	dp->score = p->score;
 	diff_free_filespec_data(d->two);
 	diff_free_filespec_data(c->one);
-	free(d);
-	free(c);
+	free_diff_filepair_node(d);
+	free_diff_filepair_node(c);
 }
 
 void diffcore_merge_broken(void)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 3d37725..5974362 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -112,8 +112,8 @@ static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
 struct diff_score {
 	int src; /* index in rename_src */
 	int dst; /* index in rename_dst */
-	int score;
-	int name_score;
+	unsigned short score;
+	short name_score;
 };
 
 static int estimate_similarity(struct diff_filespec *src,
@@ -223,6 +223,12 @@ static int score_compare(const void *a_, const void *b_)
 {
 	const struct diff_score *a = a_, *b = b_;
 
+	/* sink the unused ones to the bottom */
+	if (a->dst < 0)
+		return (0 <= b->dst);
+	else if (b->dst < 0)
+		return -1;
+
 	if (a->score == b->score)
 		return b->name_score - a->name_score;
 
@@ -387,6 +393,22 @@ static int find_exact_renames(void)
 	return i;
 }
 
+#define NUM_CANDIDATE_PER_DST 4
+static void record_if_better(struct diff_score m[], struct diff_score *o)
+{
+	int i, worst;
+
+	/* find the worst one */
+	worst = 0;
+	for (i = 1; i < NUM_CANDIDATE_PER_DST; i++)
+		if (score_compare(&m[i], &m[worst]) > 0)
+			worst = i;
+
+	/* is it better than the worst one? */
+	if (score_compare(&m[worst], o) > 0)
+		m[worst] = *o;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -473,47 +495,61 @@ void diffcore_rename(struct diff_options *options)
 	if (num_create * num_src > rename_limit * rename_limit)
 		goto cleanup;
 
-	mx = xmalloc(sizeof(*mx) * num_create * num_src);
+	mx = xcalloc(num_create * NUM_CANDIDATE_PER_DST, sizeof(*mx));
 	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
-		int base = dst_cnt * num_src;
 		struct diff_filespec *two = rename_dst[i].two;
+		struct diff_score *m;
+
 		if (rename_dst[i].pair)
 			continue; /* dealt with exact match already. */
+
+		m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST];
+		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
+			m[j].dst = -1;
+
 		for (j = 0; j < rename_src_nr; j++) {
 			struct diff_filespec *one = rename_src[j].one;
-			struct diff_score *m = &mx[base+j];
-			m->src = j;
-			m->dst = i;
-			m->score = estimate_similarity(one, two,
-						       minimum_score);
-			m->name_score = basename_same(one, two);
+			struct diff_score this_src;
+			this_src.score = estimate_similarity(one, two,
+							     minimum_score);
+			this_src.name_score = basename_same(one, two);
+			this_src.dst = i;
+			this_src.src = j;
+			record_if_better(m, &this_src);
 			diff_free_filespec_blob(one);
 		}
 		/* We do not need the text anymore */
 		diff_free_filespec_blob(two);
 		dst_cnt++;
 	}
+
 	/* cost matrix sorted by most to least similar pair */
-	qsort(mx, num_create * num_src, sizeof(*mx), score_compare);
-	for (i = 0; i < num_create * num_src; i++) {
-		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
-		struct diff_filespec *src;
+	qsort(mx, dst_cnt * NUM_CANDIDATE_PER_DST, sizeof(*mx), score_compare);
+
+	for (i = 0; i < dst_cnt * NUM_CANDIDATE_PER_DST; i++) {
+		struct diff_rename_dst *dst;
+
+		if ((mx[i].dst < 0) ||
+		    (mx[i].score < minimum_score))
+			break; /* there is no more usable pair. */
+		dst = &rename_dst[mx[i].dst];
 		if (dst->pair)
 			continue; /* already done, either exact or fuzzy. */
-		if (mx[i].score < minimum_score)
-			break; /* there is no more usable pair. */
-		src = rename_src[mx[i].src].one;
-		if (src->rename_used)
+		if (rename_src[mx[i].src].one->rename_used)
 			continue;
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		rename_count++;
 	}
-	for (i = 0; i < num_create * num_src; i++) {
-		struct diff_rename_dst *dst = &rename_dst[mx[i].dst];
+
+	for (i = 0; i < dst_cnt * NUM_CANDIDATE_PER_DST; i++) {
+		struct diff_rename_dst *dst;
+
+		if ((mx[i].dst < 0) ||
+		    (mx[i].score < minimum_score))
+			break; /* there is no more usable pair. */
+		dst = &rename_dst[mx[i].dst];
 		if (dst->pair)
 			continue; /* already done, either exact or fuzzy. */
-		if (mx[i].score < minimum_score)
-			break; /* there is no more usable pair. */
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		rename_count++;
 	}
diff --git a/diffcore.h b/diffcore.h
index cc96c20..00aef4c 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -118,4 +118,8 @@ extern int diffcore_count_changes(struct diff_filespec *src,
 				  unsigned long *src_copied,
 				  unsigned long *literal_added);
 
+/* alloc.c */
+extern struct diff_filepair *alloc_freeable_diff_filepair_node(void);
+extern void free_diff_filepair_node(struct diff_filepair *);
+
 #endif
diff --git a/http-push.c b/http-push.c
index b2b410d..314141f 100644
--- a/http-push.c
+++ b/http-push.c
@@ -1817,7 +1817,7 @@ static void unmark_and_free(struct commit_list *list, unsigned int mark)
 		struct commit_list *temp = list;
 		temp->item->object.flags &= ~mark;
 		list = temp->next;
-		free(temp);
+		free_commit_list_node(temp);
 	}
 }
 
diff --git a/revision.c b/revision.c
index 6e85aaa..df9b062 100644
--- a/revision.c
+++ b/revision.c
@@ -571,7 +571,7 @@ static int limit_list(struct rev_info *revs)
 		show_early_output_fn_t show;
 
 		list = list->next;
-		free(entry);
+		free_commit_list_node(entry);
 
 		if (revs->max_age != -1 && (commit->date < revs->max_age))
 			obj->flags |= UNINTERESTING;
@@ -752,7 +752,7 @@ static void prepare_show_merge(struct rev_info *revs)
 	while (bases) {
 		struct commit *it = bases->item;
 		struct commit_list *n = bases->next;
-		free(bases);
+		free_commit_list_node(bases);
 		bases = n;
 		it->object.flags |= UNINTERESTING;
 		add_pending_object(revs, &it->object, "(merge-base)");
@@ -1472,7 +1472,7 @@ static struct commit *get_revision_1(struct rev_info *revs)
 		struct commit *commit = entry->item;
 
 		revs->commits = entry->next;
-		free(entry);
+		free_commit_list_node(entry);
 
 		if (revs->reflog_info)
 			fake_reflog_parent(revs->reflog_info, commit);
diff --git a/upload-pack.c b/upload-pack.c
index 51e3ec4..86a4e7e 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -331,7 +331,7 @@ static int reachable(struct commit *want)
 	while (work) {
 		struct commit_list *list = work->next;
 		struct commit *commit = work->item;
-		free(work);
+		free_commit_list_node(work);
 		work = list;
 
 		if (commit->object.flags & THEY_HAVE) {

^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-02-14  3:00 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-27 17:27 git-revert is a memory hog Adrian Bunk
2008-01-27 17:38 ` Shawn O. Pearce
2008-01-28  6:01   ` Jeff King
2008-01-28  5:59 ` Jeff King
2008-01-29 21:51   ` Linus Torvalds
2008-01-29 22:15     ` Junio C Hamano
2008-01-29 22:20     ` Jeff King
2008-01-29 22:30       ` Jeff King
2008-01-29 22:36       ` Junio C Hamano
2008-01-29 22:45         ` Jeff King
2008-01-29 22:51           ` Jeff King
2008-01-29 22:49         ` Linus Torvalds
2008-01-29 22:54           ` Jeff King
2008-01-29 22:53         ` Junio C Hamano
2008-01-29 22:57           ` Jeff King
2008-01-29 23:19         ` Junio C Hamano
2008-01-29 23:50           ` Junio C Hamano
2008-01-30  4:40           ` [PATCH] Optimize rename detection for a huge diff Junio C Hamano
2008-01-30  6:57             ` Luke Lu
2008-01-30  7:24               ` Luke Lu
2008-02-13  9:53             ` Junio C Hamano
2008-02-13 10:19               ` David Kastrup
2008-02-13 10:23                 ` Junio C Hamano
2008-02-14  3:00               ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).