From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: Nguyen Thai Ngoc Duy <pclouds@gmail.com>, git@vger.kernel.org
Subject: Re: [WIP PATCH] Manual rename correction
Date: Tue, 31 Jul 2012 20:42:38 -0400 [thread overview]
Message-ID: <20120801004238.GA15428@sigill.intra.peff.net> (raw)
In-Reply-To: <7vfw87isx1.fsf@alter.siamese.dyndns.org>
On Tue, Jul 31, 2012 at 01:20:42PM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > A much better hint is to annotate pairs of sha1s, to say "do not bother
> > doing inexact rename correlation on this pair; I promise that they have
> > value N".
>
> Surely. And I suspect that the patch to the current codebase to do
> so would be much less impact if you go that way.
Yes. You may remember I wrote a generic caching subsystem last summer
when we were talking about caching commit generations. Plugging in a new
map type to map sha1 pairs to 32-bit integers was pretty simple, and
that gives the basis for a rename cache.
It's fairly unimpressive on git.git. My best-of-five for "git log
--format=%H --raw -M" went from 5.83s to 5.74s, which is pretty much
within the run-to-run noise. The resulting cache was 155K.
However, it's easy to come up with much more pathological cases. I have
a really ugly rename-and-tweak-tags commit on my photo repository, and
those blobs are relatively big. My timings for "git show" on that were:
before: 49.724s
after, run 1: 54.904s
after, run 2: 0.117s
Which is pretty darn nice. The resulting cache is 5.3M (the repository
itself is in the gigabytes, but that's not really relevant; the cache
will obviously scale with the number of paths, not with the size of the
blobs).
It would also work for copies, too, of course. Here are the results of
"git log --format=%H --raw -M -C -C" on git.git:
before: 1m35s
after, run 1: 39.7s
after, run 2: 39.5s
So it does make much more of a difference for copies, which is obvious;
git is doing a lot more work for us to cache. At the same time, our
cache is much bigger: 32M. Yikes.
My cache is fairly naive, in that it literally stores 44 bytes of
<src_sha1, dst_sha1, score> for each entry. At the cost of more
complexity, you could store each src_sha1 once, followed by a set of
<dst_sha1, score> pairs. I also didn't take any special care to avoid
duplicates of <X, Y> and <Y, X> (since presumably these renames would be
commutative). I'm not sure it is necessary, though; I think the copy
machinery already suppresses this when entries are in both source and
destination lists.
So I don't know. It can definitely speed up some operations, but at the
cost of a non-trivial cache on disk. I'll spare you all of the generic
caching infrastructure, but the actual patch to rename looks like this
(just to give a sense of where the hooks go):
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 216a7a4..db70878 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -6,6 +6,7 @@
#include "diffcore.h"
#include "hash.h"
#include "progress.h"
+#include "metadata-cache.h"
/* Table of rename/copy destinations */
@@ -137,7 +138,8 @@ static int estimate_similarity(struct diff_filespec *src,
*/
unsigned long max_size, delta_size, base_size, src_copied, literal_added;
unsigned long delta_limit;
- int score;
+ uint32_t score;
+ struct sha1pair pair;
/* We deal only with regular files. Symlink renames are handled
* only when they are exact matches --- in other words, no edits
@@ -175,6 +177,11 @@ static int estimate_similarity(struct diff_filespec *src,
if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
return 0;
+ hashcpy(pair.one, src->sha1);
+ hashcpy(pair.two, dst->sha1);
+ if (rename_cache_get(&pair, &score))
+ return score;
+
if (!src->cnt_data && diff_populate_filespec(src, 0))
return 0;
if (!dst->cnt_data && diff_populate_filespec(dst, 0))
@@ -195,6 +202,7 @@ static int estimate_similarity(struct diff_filespec *src,
score = 0; /* should not happen */
else
score = (int)(src_copied * MAX_SCORE / max_size);
+ rename_cache_set(&pair, &score);
return score;
}
-Peff
next prev parent reply other threads:[~2012-08-01 0:42 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-07-31 14:15 [WIP PATCH] Manual rename correction Nguyen Thai Ngoc Duy
2012-07-31 16:32 ` Junio C Hamano
2012-07-31 19:23 ` Jeff King
2012-07-31 20:20 ` Junio C Hamano
2012-08-01 0:42 ` Jeff King [this message]
2012-08-01 6:01 ` Junio C Hamano
2012-08-01 21:54 ` Jeff King
2012-08-01 22:10 ` Junio C Hamano
2012-08-02 22:37 ` Jeff King
2012-08-02 22:51 ` Junio C Hamano
2012-08-02 22:58 ` Jeff King
2012-08-02 5:33 ` Junio C Hamano
2012-08-01 1:10 ` Nguyen Thai Ngoc Duy
2012-08-01 2:01 ` Jeff King
2012-08-01 4:36 ` Nguyen Thai Ngoc Duy
2012-08-01 6:09 ` Junio C Hamano
2012-08-01 6:34 ` Nguyen Thai Ngoc Duy
2012-08-01 21:32 ` Jeff King
2012-08-01 21:27 ` Jeff King
2012-08-02 12:08 ` Nguyen Thai Ngoc Duy
2012-08-02 22:41 ` Jeff King
2012-08-04 17:09 ` [PATCH 0/8] caching rename results Jeff King
2012-08-04 17:10 ` [PATCH 1/8] implement generic key/value map Jeff King
2012-08-04 22:58 ` Junio C Hamano
2012-08-06 20:35 ` Jeff King
2012-08-04 17:10 ` [PATCH 2/8] map: add helper functions for objects as keys Jeff King
2012-08-04 17:11 ` [PATCH 3/8] fast-export: use object to uint32 map instead of "decorate" Jeff King
2012-08-04 17:11 ` [PATCH 4/8] decorate: use "map" for the underlying implementation Jeff King
2012-08-04 17:11 ` [PATCH 5/8] map: implement persistent maps Jeff King
2012-08-04 17:11 ` [PATCH 6/8] implement metadata cache subsystem Jeff King
2012-08-04 22:49 ` Junio C Hamano
2012-08-06 20:31 ` Jeff King
2012-08-06 20:38 ` Jeff King
2012-08-04 17:12 ` [PATCH 7/8] implement rename cache Jeff King
2012-08-04 17:14 ` [PATCH 8/8] diff: optionally use " Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120801004238.GA15428@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=pclouds@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).