[PATCH 0/2] diffcore-break optimizations

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] diffcore-break optimizations
@ 2009-11-16 15:53 Jeff King
  2009-11-16 15:56 ` [PATCH 1/2] diffcore-break: free filespec data as we go Jeff King
  2009-11-16 16:02 ` [PATCH 2/2] diffcore-break: save cnt_data for other phases Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Jeff King @ 2009-11-16 15:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On one of my more ridiculously gigantic repositories, I recently tried
to make a commit that ran git out of memory while trying to commit. The
repository has about 3 gigabytes of data, and I made a small-ish change
to every file. Pathological, yes, but I think we can do better than
chugging for 5 minutes and dying.

The culprit turned out to be memory usage in diffcore-break, which is on
by default for "git status" (and for the "git commit" template message).
It wants to have every changed blob in memory at once, which is just
silly.

The patches are:

  [1/2]: diffcore-break: free filespec data as we go

  This addresses the memory consumption issue. If you have enough
  memory, it doesn't actually yield a speed improvement, but nor does it
  show any slowdown for practical workloads.

  There is a theoretical slowdown when doing -B -M, because the rename
  phase has to re-fetch the blobs from the object store. However, I
  wasn't able to measure any slowdown for real-world cases (like "git
  log --summary -M -B >/dev/null" on git.git).

  I did manage to produce the slowdown on a pathological case: ten
  20-megabyte files, each copied with a slight modification to another
  file, and then replaced with totally different contents (so each one
  will be broken and then trigger an inexact rename). That diff went
  from 16s to 17s.

  But I improved that and more with the next optimization.

  [2/2]: diffcore-break: save cnt_data for other phases

  We already do this in rename detection, and since they use the same
  data format, there is little reason not to do so. My pathological case
  above went from 17s down to 12s. I wasn't able to detect any speedup
  or slowdown for sane cases.

  So I doubt anybody will even notice this, but I think since we can
  address pathological cases, we might as well (and as you will see, the
  code change is quite small).

All of that being said, I was able to do my commit, but I still had to
wait five minutes for it to chug through 3G of data. :) I am tempted to
add a "quick" mode to git-commit, but perhaps such a ridiculous case is
rare enough not to worry about. I worked around it by writing my commit
message separately and using "git commit -F".

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] diffcore-break: free filespec data as we go
  2009-11-16 15:53 [PATCH 0/2] diffcore-break optimizations Jeff King
@ 2009-11-16 15:56 ` Jeff King
  2009-11-16 16:02 ` [PATCH 2/2] diffcore-break: save cnt_data for other phases Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2009-11-16 15:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

As we look at each changed file and consider breaking it, we
load the blob data and make a decision about whether to
break, which is independent of any other blobs that might
have changed. However, we keep the data in memory while we
consider breaking all of the other files. Which means that
both versions of every file you are diffing are in memory at
the same time.

This patch instead frees the blob data as we finish with
each file pair, leading to much lower memory usage.

Signed-off-by: Jeff King <peff@peff.net>
---
As I noted in the cover letter, this can actually slow things down
slightly for some pathological cases, but I think the reduced memory
consumption is worth it. I couldn't come up with a real-world case where
it made any difference to the speed.

One other thing where _thought_ I might cause a slowdown was in fetching
the blobs for doing patch generation. But it turns out we drop the blobs
already after the diffcore merge, so they weren't living that long
anyway.

 diffcore-break.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/diffcore-break.c b/diffcore-break.c
index d7097bb..15562e4 100644
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -204,12 +204,16 @@ void diffcore_break(int break_score)
 				dp->score = score;
 				dp->broken_pair = 1;

+				diff_free_filespec_data(p->one);
+				diff_free_filespec_data(p->two);
 				free(p); /* not diff_free_filepair(), we are
 					  * reusing one and two here.
 					  */
 				continue;
 			}
 		}
+		diff_free_filespec_data(p->one);
+		diff_free_filespec_data(p->two);
 		diff_q(&outq, p);
 	}
 	free(q->queue);
-- 
1.6.5.2.187.g29317

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] diffcore-break: save cnt_data for other phases
  2009-11-16 15:53 [PATCH 0/2] diffcore-break optimizations Jeff King
  2009-11-16 15:56 ` [PATCH 1/2] diffcore-break: free filespec data as we go Jeff King
@ 2009-11-16 16:02 ` Jeff King
  2009-11-16 21:20   ` Junio C Hamano
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2009-11-16 16:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

The "break" phase works by counting changes between two
blobs with the same path. We do this by splitting the file
into chunks (or lines for text oriented files) and then
keeping a count of chunk hashes.

The "rename" phase counts changes between blobs at two
different paths. However, it uses the exact same set of
chunk hashes (which are immutable for a given sha1).

The rename phase can therefore use the same hash data as
break. Unfortunately, we were throwing this data away after
computing it in the break phase. This patch instead attaches
it to the filespec and lets it live through the rename
phase, working under the assumption that most of the time
that breaks are being computed, renames will be too.

We only do this optimization for files which have actually
been broken, as those ones will be candidates for rename
detection (and it is a time-space tradeoff, so we don't want
to waste space keeping useless data).

Signed-off-by: Jeff King <peff@peff.net>
---
Note the two assumptions above:

  1. If we are running break, we will probably run rename detection.

  2. We are not looking for inexact copies, which could re-use data even
     for non-broken pairs.

We could test both of those assumptions by peeking at the other
diff_options that have been set up, but that would involve some code
restructuring. I'm also not sure it is worth it.

For (1), if we don't do rename detection, the next thing we will do is
the diffcore merge, which will then free the data anyway. So we are
really just carrying around the extra cnt_data through the break, and it
is much smaller than the actual blob data.

For (2), keep in mind that this is just a heuristic. We may not have any
good rename candidates anyway, or we may find an exact rename. So it is
just a guess as to which data might be useful. Keeping all the data for
all pairs, broken or not, may be pushing us too far along the time-space
tradeoff.

 diffcore-break.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/diffcore-break.c b/diffcore-break.c
index 15562e4..3a7b60a 100644
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -69,7 +69,7 @@ static int should_break(struct diff_filespec *src,
 		return 0; /* we do not break too small filepair */

 	if (diffcore_count_changes(src, dst,
-				   NULL, NULL,
+				   &src->cnt_data, &dst->cnt_data,
 				   0,
 				   &src_copied, &literal_added))
 		return 0;
@@ -204,8 +204,8 @@ void diffcore_break(int break_score)
 				dp->score = score;
 				dp->broken_pair = 1;

-				diff_free_filespec_data(p->one);
-				diff_free_filespec_data(p->two);
+				diff_free_filespec_blob(p->one);
+				diff_free_filespec_blob(p->two);
 				free(p); /* not diff_free_filepair(), we are
 					  * reusing one and two here.
 					  */
-- 
1.6.5.2.187.g29317

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] diffcore-break: save cnt_data for other phases
  2009-11-16 16:02 ` [PATCH 2/2] diffcore-break: save cnt_data for other phases Jeff King
@ 2009-11-16 21:20   ` Junio C Hamano
  2009-11-19 15:22     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2009-11-16 21:20 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> The "break" phase works by counting changes between two
> blobs with the same path. We do this by splitting the file
> into chunks (or lines for text oriented files) and then
> keeping a count of chunk hashes.
>
> The "rename" phase counts changes between blobs at two
> different paths. However, it uses the exact same set of
> chunk hashes (which are immutable for a given sha1).
>
> The rename phase can therefore use the same hash data as
> break. Unfortunately, we were throwing this data away after
> computing it in the break phase. This patch instead attaches
> it to the filespec and lets it live through the rename
> phase, working under the assumption that most of the time
> that breaks are being computed, renames will be too.

The change looks correct, but I initially got worried about your change
interacts badly with this part in estimate_similarity() around line 176:

	if (!src->cnt_data && diff_populate_filespec(src, 0))
		return 0;
	if (!dst->cnt_data && diff_populate_filespec(dst, 0))
		return 0;

I think the reason why it (not your patch but the original code) felt a
bit brittle is because the above if() statements know too much about the
internal of d-c-c (namely, it never looks at src->data when src->cnt_data
is already available, so it is safe to leave src->data NULL).

The same logic suggests that the loop to build the matrix in
diffcore_rename() could free two->data at the end of the innermost loop.

We currently loop this way (around ll. 505-530):

	for each two (i.e. rename destination candidate):
        	for each one (i.e. rename source candidate):
                	compute and record how similar one and two are
			free one->data
		free two->data

After computing how similar "one" and something else is only once, we have
one->cnt_data so we won't need one->data (the same fact is exploited by
your patch for optimization), and that is why freeing one->data in the
innermost loop does not result in constant re-reading of the same blob
data when we iterate more than one rename destination.  But the same logic
applies to "two" and we should be able to free the blob data early to
reduce the memory pressure.

I dunno if it would give us measurable performance benefit, though.

Here is the idea on top of your patch, but I think it can be applied
independently.

 diffcore-rename.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 63ac998..875ca81 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -523,10 +523,13 @@ void diffcore_rename(struct diff_options *options)
 			this_src.dst = i;
 			this_src.src = j;
 			record_if_better(m, &this_src);
+			/*
+			 * Once we run estimate_similarity, 
+			 * We do not need the text anymore.
+			 */
 			diff_free_filespec_blob(one);
+			diff_free_filespec_blob(two);
 		}
-		/* We do not need the text anymore */
-		diff_free_filespec_blob(two);
 		dst_cnt++;
 	}

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] diffcore-break: save cnt_data for other phases
  2009-11-16 21:20   ` Junio C Hamano
@ 2009-11-19 15:22     ` Jeff King
  2009-11-20  7:32       ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2009-11-19 15:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Nov 16, 2009 at 01:20:00PM -0800, Junio C Hamano wrote:

> The change looks correct, but I initially got worried about your change
> interacts badly with this part in estimate_similarity() around line 176:
> 
> 	if (!src->cnt_data && diff_populate_filespec(src, 0))
> 		return 0;
> 	if (!dst->cnt_data && diff_populate_filespec(dst, 0))
> 		return 0;
> 
> I think the reason why it (not your patch but the original code) felt a
> bit brittle is because the above if() statements know too much about the
> internal of d-c-c (namely, it never looks at src->data when src->cnt_data
> is already available, so it is safe to leave src->data NULL).

No, I am counting on those lines to actually optimize out the count
later on. But I agree it is a bit brittle. Probably a better design
would have been for diffcore_count_changes to simply call a new
"diff_filespec_hash", which would assign to the cnt_data member and
perform any optimizations, just as we do already with
diff_populate_filespec and the actual blob contents.

I don't know if it is worth refactoring though. This is a pretty static
part of the code these days, and what is there now is working.

> The same logic suggests that the loop to build the matrix in
> diffcore_rename() could free two->data at the end of the innermost loop.
> 
> We currently loop this way (around ll. 505-530):
> 
> 	for each two (i.e. rename destination candidate):
>         	for each one (i.e. rename source candidate):
>                 	compute and record how similar one and two are
> 			free one->data
> 		free two->data
> 
> After computing how similar "one" and something else is only once, we have
> one->cnt_data so we won't need one->data (the same fact is exploited by
> your patch for optimization), and that is why freeing one->data in the
> innermost loop does not result in constant re-reading of the same blob
> data when we iterate more than one rename destination.  But the same logic
> applies to "two" and we should be able to free the blob data early to
> reduce the memory pressure.
> 
> I dunno if it would give us measurable performance benefit, though.
> 
> Here is the idea on top of your patch, but I think it can be applied
> independently.

With your patch, I shaved 0.2 seconds off of a 46-second rename
detection (one of my pathological "rename and tweak the tags on a bunch
of photos" commits). To be fair, though, that commit is now quite old,
so some of that CPU time was spent assembling delta chains.

It should also relieve memory pressure for large repositories, though
that is not nearly as big an issue for renames as for breaks, as we tend
to have many fewer candidates.

I think it is worth applying.  Though perhaps the free-ing should simply
happen in estimate_similarity, which is the function that actually
populates the filespec.

Also, I don't see where we ever actually free the cnt_data. After we run
the whole O(n^2) loop, we should be able to drop cnt_data entirely. I
think in practice it doesn't matter all that much, since it isn't that
big, and afterwards we just generate the patch and exit (unless we are
doing diffcore-break, too, which will actually free all of the data).

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] diffcore-break: save cnt_data for other phases
  2009-11-19 15:22     ` Jeff King
@ 2009-11-20  7:32       ` Junio C Hamano
  0 siblings, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2009-11-20  7:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> I don't know if it is worth refactoring though.

I would rather not touch it.

> Also, I don't see where we ever actually free the cnt_data. After we run
> the whole O(n^2) loop, we should be able to drop cnt_data entirely. I
> think in practice it doesn't matter all that much, since it isn't that
> big, and afterwards we just generate the patch and exit (unless we are
> doing diffcore-break, too, which will actually free all of the data).

Originally, each phase of the diffcore pipeline was designed to be
independent from other phases, and keeping things in core was done as an
optimization for smallish trees so that later phases in the diffcore
pipeline can reuse prepopulated data.  If this were year 2006, I would
have said that keeping it in core would be better because we could add new
phases after it later, even though there is nobody who uses cnt_data after
diffcore-rename in the diffcore pipeline right now,

But it seems that our performance is more than adequate without such
keeping-things-around for smallish trees anyway, and it is wiser to
optimize for heavier workload.  More importantly, nobody seems to be
planning new phases in the pipeline, so I agree it is a good idea to drop
cnt_data once the rename detection phase is done with it.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2009-11-20  7:32 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-11-16 15:53 [PATCH 0/2] diffcore-break optimizations Jeff King
2009-11-16 15:56 ` [PATCH 1/2] diffcore-break: free filespec data as we go Jeff King
2009-11-16 16:02 ` [PATCH 2/2] diffcore-break: save cnt_data for other phases Jeff King
2009-11-16 21:20   ` Junio C Hamano
2009-11-19 15:22     ` Jeff King
2009-11-20  7:32       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).