diffcore-rename performance mode

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* diffcore-rename performance mode
@ 2007-09-18  8:23 Jeff King
  2007-09-18  8:49 ` Junio C Hamano
  2007-09-18 22:12 ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2007-09-18  8:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

[this is a resend of some comments I made deep in another thread on
rename limiting; I wanted to get your comments, Junio, but I was afraid
you didn't see it buried in the mess. If you have already read it and
have nothing to say, just tell me to shut up.

I was able to get 1000% speedup on a (perhaps pathological) diff, and I
suspect there may be more speedups possible in the spanhash lookups. So
I think it's worth pursuing.]

Hmm. Actually, doing some profiling, it looks like about 75% of the time
is spent not in the O(n^2) comparison, but in generating the hash
fingerprints of each file.

There seems to be a serious performance problem in diffcore-rename.
There is infrastructure to cache the "cnt_data" member of each filespec,
but it never gets used because we immediately free the filespec data
after use. Oops.

With this patch:

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6bde439..531a844 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -362,10 +362,7 @@ void diffcore_rename(struct diff_options *options)
 			m->score = estimate_similarity(one, two,
 						       minimum_score);
 			m->name_score = basename_same(one, two);
-			diff_free_filespec_data(one);
 		}
-		/* We do not need the text anymore */
-		diff_free_filespec_data(two);
 		dst_cnt++;
 	}
 	/* cost matrix sorted by most to least similar pair */

My 20-minute diff becomes a 2-minute diff. The downside is that the
memory usage is much increased (for obvious reasons, it should increase
by the dataset size, since we are keeping pointers to the data around --
in my case, around 1G extra).  However, keeping around _just_ the
cnt_data caused only about 100M of extra memory consumption (and gave
the same performance boost).

Of course, 2 minutes for a git-status is still way too slow, so there we
might still need a limiter. It also looks like 57% of my time is spent
in spanhash_find, and another 29% in diffcore_count_changes.

The spanhash data structure is a bit confusing. At first, it looked like
we were doing a linear search for a matching hash, but it's not quite,
since we seem to start at some magic spot based on the hashval we're
looking up.

But it seems to be an array of (hash, count) pairs for each file. Is
there a reason not to use a hash table mapping hash -> count? That would
make insertion and lookup O(1), presumably at the cost of a bit more
memory (since each filespec would have the full table).

Junio, can you shed some light on that decision?

-Peff

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-18  8:23 diffcore-rename performance mode Jeff King
@ 2007-09-18  8:49 ` Junio C Hamano
  2007-09-18  8:54   ` Jeff King
  2007-09-18 22:12 ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2007-09-18  8:49 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> There seems to be a serious performance problem in diffcore-rename.
> There is infrastructure to cache the "cnt_data" member of each filespec,
> but it never gets used because we immediately free the filespec data
> after use. Oops.
>
> With this patch:
> ...
> My 20-minute diff becomes a 2-minute diff. The downside is that the
> memory usage is much increased (for obvious reasons, it should increase
> by the dataset size, since we are keeping pointers to the data around --
> in my case, around 1G extra).

Yes, these early freeing of filespec_data() were introducd later
specifically to address the memory usage issue.

> However, keeping around _just_ the
> cnt_data caused only about 100M of extra memory consumption (and gave
> the same performance boost).

That would be an interesting and relatively low-hanging optimization.

> The spanhash data structure is a bit confusing. At first, it looked like
> we were doing a linear search for a matching hash, but it's not quite,
> since we seem to start at some magic spot based on the hashval we're
> looking up.

I think it was just a hash table with linear overflow (if your
spot is occupied by somebody else, you look for the next
available vacant spot -- works only if you do not ever delete
items from the table) but sorry, I do not recall the rationale
for picking that data structure.  I vaguely recall I did some
measurement between that and the usual "an array that is indexed
with a hash value that holds heads of linked lists" and pointer
chasing appeared quite cache-unfriendly to the point that it
actually degraded performance, but did not try very hard to
optimize it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-18  8:49 ` Junio C Hamano
@ 2007-09-18  8:54   ` Jeff King
  2007-09-18  8:58     ` Junio C Hamano
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jeff King @ 2007-09-18  8:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Sep 18, 2007 at 01:49:50AM -0700, Junio C Hamano wrote:

> > However, keeping around _just_ the
> > cnt_data caused only about 100M of extra memory consumption (and gave
> > the same performance boost).
> 
> That would be an interesting and relatively low-hanging optimization.

OK, I will work up a patch. Is it worth making it configurable? Since it
is a space-time tradeoff, if you are tight on memory, it might actually
hurt performance. However, I have only looked at the numbers for my
massive data set...I can produce memory usage numbers for the kernel,
too.

> I think it was just a hash table with linear overflow (if your
> spot is occupied by somebody else, you look for the next
> available vacant spot -- works only if you do not ever delete
> items from the table) but sorry, I do not recall the rationale
> for picking that data structure.  I vaguely recall I did some
> measurement between that and the usual "an array that is indexed
> with a hash value that holds heads of linked lists" and pointer
> chasing appeared quite cache-unfriendly to the point that it
> actually degraded performance, but did not try very hard to
> optimize it.

I thought we were holding counts of hashes, in which case there _is_ no
overflow. We only care if you hit the hash fingerprint or not. But
perhaps I am mistaken...I will have to look more closely at the code.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-18  8:54   ` Jeff King
@ 2007-09-18  8:58     ` Junio C Hamano
  2007-09-18  9:01       ` Jeff King
  2007-09-18 11:20     ` Johannes Schindelin
  2007-09-25 16:38     ` Jeff King
  2 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2007-09-18  8:58 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> I thought we were holding counts of hashes, in which case there _is_ no
> overflow.

The raw hashval (the fingerprint recorded in struct spanhash) is
further reduced and used as an index into spahash_top.data[].
So more than one hashval can try to sit in the same slot in
spanhash_top.data[] array.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-18  8:58     ` Junio C Hamano
@ 2007-09-18  9:01       ` Jeff King
  2007-09-18  9:17         ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2007-09-18  9:01 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Tue, Sep 18, 2007 at 01:58:17AM -0700, Junio C Hamano wrote:

> > I thought we were holding counts of hashes, in which case there _is_ no
> > overflow.
> 
> The raw hashval (the fingerprint recorded in struct spanhash) is
> further reduced and used as an index into spahash_top.data[].
> So more than one hashval can try to sit in the same slot in
> spanhash_top.data[] array.

Right, that's sort of what I was hinting at in the original message. Can
we just make the hash table big enough to use the fingerprint hashes
directly? It's going to use a bit more memory, but lookups should be
very fast. I'll try to experiment and get some numbers.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-18  9:01       ` Jeff King
@ 2007-09-18  9:17         ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2007-09-18  9:17 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> On Tue, Sep 18, 2007 at 01:58:17AM -0700, Junio C Hamano wrote:
>
>> > I thought we were holding counts of hashes, in which case there _is_ no
>> > overflow.
>> 
>> The raw hashval (the fingerprint recorded in struct spanhash) is
>> further reduced and used as an index into spahash_top.data[].
>> So more than one hashval can try to sit in the same slot in
>> spanhash_top.data[] array.
>
> Right, that's sort of what I was hinting at in the original message. Can
> we just make the hash table big enough to use the fingerprint hashes
> directly? It's going to use a bit more memory, but lookups should be
> very fast. I'll try to experiment and get some numbers.

Thanks -- I vaguely recall large hash was disastrous for me
(trashed cache), but that was on a different hardware, different
time.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-18  8:54   ` Jeff King
  2007-09-18  8:58     ` Junio C Hamano
@ 2007-09-18 11:20     ` Johannes Schindelin
  2007-09-25 16:38     ` Jeff King
  2 siblings, 0 replies; 13+ messages in thread
From: Johannes Schindelin @ 2007-09-18 11:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git

Hi,

On Tue, 18 Sep 2007, Jeff King wrote:

> On Tue, Sep 18, 2007 at 01:49:50AM -0700, Junio C Hamano wrote:
> 
> > > However, keeping around _just_ the cnt_data caused only about 100M 
> > > of extra memory consumption (and gave the same performance boost).
> > 
> > That would be an interesting and relatively low-hanging optimization.
> 
> OK, I will work up a patch. Is it worth making it configurable?

Yes, I think it would be worth making it configurable... For example, one 
of the machines I work on has only 128M.

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-18  8:54   ` Jeff King
  2007-09-18  8:58     ` Junio C Hamano
  2007-09-18 11:20     ` Johannes Schindelin
@ 2007-09-25 16:38     ` Jeff King
  2007-09-25 19:06       ` Jeff King
  2 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2007-09-25 16:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Tue, Sep 18, 2007 at 04:54:13AM -0400, Jeff King wrote:

> > > However, keeping around _just_ the
> > > cnt_data caused only about 100M of extra memory consumption (and gave
> > > the same performance boost).
> > 
> > That would be an interesting and relatively low-hanging optimization.
> 
> I can produce memory usage numbers for the kernel, too.

And here are some kernel numbers. I measured performance of this script
in the linux-2.6 repository:

#!/bin/sh

last=
git-tag | grep -v -- - | while read tag; do
  if test -n "$last"; then
    echo Diffing $last..$tag
    git-diff --raw -M -l0 $last $tag >/dev/null
  fi
  last=$tag
done

under the assumption that diffing between major revisions would give a
good medium of diffs that would be large enough to show the n^2 rename
behavior, but still small enough to be close to "everyday" usage.

I measured three different approaches:
  1. stock 'next' (stock)
  2. removing entirely the calls to diff_free_filespec_data (nofree)
  3. changing those free calls to free everything except cnt_data (somefree)

And I measured two things:
  1. user CPU time to complete
  2. peak memory usage

All numbers are warm-cache, and typical cases after multiple runs.

                 | stock | nofree | somefree
-----------------|---------------------------
user time (s)    | 76.78 | 16.96  | 46.26
peak memory (Kb) | 52300 | 66796  | 59156

The raw 'time' output is below:

stock:
76.78user 3.35system 1:20.72elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+166733minor)pagefaults 0swaps

nofree:
16.96user 1.46system 0:18.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+185353minor)pagefaults 0swaps

somefree:
46.26user 1.54system 0:47.94elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+178819minor)pagefaults 0swaps

So this is definitely worth pursuing, as it yields massive speedups even
for regular repositories. And even the 'nofree' case only costs us 14M
of extra memory (although it is a 27% increase, this just isn't that
memory-hungry an endeavour for the sizes of changes we're talking
about). And as Linus noted, now that we have a default rename limit,
you're not likely to hit an explosion of memory usage.

What is most confusing is why the 'somefree' case performs so badly,
since we should just be using the cnt_data. I'll see if gprof can shed
any light on that. It would be nice to use it instead, since it will
have much better memory usage in the face of large blobs (e.g., my
pathological case that started this whole thread).

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-25 16:38     ` Jeff King
@ 2007-09-25 19:06       ` Jeff King
  2007-09-25 19:10         ` Andreas Ericsson
  2007-09-25 19:32         ` David Kastrup
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2007-09-25 19:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Tue, Sep 25, 2007 at 12:38:43PM -0400, Jeff King wrote:

>[...]
>
> What is most confusing is why the 'somefree' case performs so badly,
> since we should just be using the cnt_data. I'll see if gprof can shed

OK, I found the problem. estimate_similarity calls
diff_populate_filespec each time, even if we already have the cnt_data,
which leads to recomputing the blob contents from deltas. Oops.

Fixing this, the correct numbers are:

                 | stock | nofree | old somefree | fixed somefree
-----------------|-----------------------------------------------
user time (s)    | 76.78 | 16.96  | 46.26        | 16.99
peak memory (Kb) | 52300 | 66796  | 59156        | 57328

So now we're at a 4.5x speedup for about 10% extra memory usage. Patch
will follow.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-25 19:06       ` Jeff King
@ 2007-09-25 19:10         ` Andreas Ericsson
  2007-09-25 19:32         ` David Kastrup
  1 sibling, 0 replies; 13+ messages in thread
From: Andreas Ericsson @ 2007-09-25 19:10 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King wrote:
> On Tue, Sep 25, 2007 at 12:38:43PM -0400, Jeff King wrote:
> 
>> [...]
>>
>> What is most confusing is why the 'somefree' case performs so badly,
>> since we should just be using the cnt_data. I'll see if gprof can shed
> 
> OK, I found the problem. estimate_similarity calls
> diff_populate_filespec each time, even if we already have the cnt_data,
> which leads to recomputing the blob contents from deltas. Oops.
> 
> Fixing this, the correct numbers are:
> 
>                  | stock | nofree | old somefree | fixed somefree
> -----------------|-----------------------------------------------
> user time (s)    | 76.78 | 16.96  | 46.26        | 16.99
> peak memory (Kb) | 52300 | 66796  | 59156        | 57328
> 
> So now we're at a 4.5x speedup for about 10% extra memory usage. Patch
> will follow.
> 

Nice work :)

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-25 19:06       ` Jeff King
  2007-09-25 19:10         ` Andreas Ericsson
@ 2007-09-25 19:32         ` David Kastrup
  2007-09-25 19:52           ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: David Kastrup @ 2007-09-25 19:32 UTC (permalink / raw)
  To: git

Jeff King <peff@peff.net> writes:

> On Tue, Sep 25, 2007 at 12:38:43PM -0400, Jeff King wrote:
>
>>[...]
>>
>> What is most confusing is why the 'somefree' case performs so badly,
>> since we should just be using the cnt_data. I'll see if gprof can shed
>
> OK, I found the problem. estimate_similarity calls
> diff_populate_filespec each time, even if we already have the cnt_data,
> which leads to recomputing the blob contents from deltas. Oops.
>
> Fixing this, the correct numbers are:
>
>                  | stock | nofree | old somefree | fixed somefree
> -----------------|-----------------------------------------------
> user time (s)    | 76.78 | 16.96  | 46.26        | 16.99
> peak memory (Kb) | 52300 | 66796  | 59156        | 57328
>
> So now we're at a 4.5x speedup for about 10% extra memory usage. Patch
> will follow.

Sounds good except when we happen to just hit the "memory working set
exceeds physical memory" sweet spot.  But the odds are much better
than for "nofree".

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-25 19:32         ` David Kastrup
@ 2007-09-25 19:52           ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2007-09-25 19:52 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

On Tue, Sep 25, 2007 at 09:32:31PM +0200, David Kastrup wrote:

> >                  | stock | nofree | old somefree | fixed somefree
> > -----------------|-----------------------------------------------
> > user time (s)    | 76.78 | 16.96  | 46.26        | 16.99
> > peak memory (Kb) | 52300 | 66796  | 59156        | 57328
> >
> > So now we're at a 4.5x speedup for about 10% extra memory usage. Patch
> > will follow.
> 
> Sounds good except when we happen to just hit the "memory working set
> exceeds physical memory" sweet spot.  But the odds are much better
> than for "nofree".

Of course, there is the possibility that it is the 10% that pushes you
into swap (or kills your ability to cache the entire pack in RAM).
However, this is probably not a big deal for two reasons:
  1. this is a 50M process handling the linux-2.6 repository. Other
     operations such as repacking already consume significantly more
     memory (git-repack -a allocates 290M on the same repo).
  2. I specifically turned off rename limiting (-l0) to do rename
     detection on these large-ish diffs.  If you have a machine which is
     on the cusp of swapping, then don't do that. Jumping from -l100
     (the default) to -l0 in my tests is _already_ moving you from 28M
     to 52M, a much larger jump.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: diffcore-rename performance mode
  2007-09-18  8:23 diffcore-rename performance mode Jeff King
  2007-09-18  8:49 ` Junio C Hamano
@ 2007-09-18 22:12 ` Linus Torvalds
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2007-09-18 22:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git



On Tue, 18 Sep 2007, Jeff King wrote:
> 
> With this patch:
> 
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index 6bde439..531a844 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -362,10 +362,7 @@ void diffcore_rename(struct diff_options *options)
>  			m->score = estimate_similarity(one, two,
>  						       minimum_score);
>  			m->name_score = basename_same(one, two);
> -			diff_free_filespec_data(one);
>  		}
> -		/* We do not need the text anymore */
> -		diff_free_filespec_data(two);
>  		dst_cnt++;
>  	}
>  	/* cost matrix sorted by most to least similar pair */
> 
> My 20-minute diff becomes a 2-minute diff.

I think this is a reasonable patch, but only now *after* we limit the 
rename matrix.

The whole reason for freeing the filespec data was to avoid exploding the 
memory usage, but now that we aggressively limit the number of renames in 
flight anyway, I think we can probably remove the code that frees the file 
data in between all the rename similarity analysis.

			Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-09-25 19:52 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-18  8:23 diffcore-rename performance mode Jeff King
2007-09-18  8:49 ` Junio C Hamano
2007-09-18  8:54   ` Jeff King
2007-09-18  8:58     ` Junio C Hamano
2007-09-18  9:01       ` Jeff King
2007-09-18  9:17         ` Junio C Hamano
2007-09-18 11:20     ` Johannes Schindelin
2007-09-25 16:38     ` Jeff King
2007-09-25 19:06       ` Jeff King
2007-09-25 19:10         ` Andreas Ericsson
2007-09-25 19:32         ` David Kastrup
2007-09-25 19:52           ` Jeff King
2007-09-18 22:12 ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).