From: Jeff King <peff@peff.net>
To: John Keeping <john@keeping.me.uk>
Cc: git@vger.kernel.org, Kevin Bracey <kevin@bracey.fi>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: [RFC/PATCH] patch-ids: check modified paths before calculating diff
Date: Wed, 29 May 2013 02:20:07 -0400 [thread overview]
Message-ID: <20130529062007.GA11955@sigill.intra.peff.net> (raw)
In-Reply-To: <a7088a74742b71a01423f3ddc1a6c7fd89474ed8.1368969438.git.john@keeping.me.uk>
On Sun, May 19, 2013 at 02:17:35PM +0100, John Keeping wrote:
> When using "git cherry" or "git log --cherry-pick" we often have a small
> number of commits on one side and a large number on the other. In
> revision.c::cherry_pick_list we store the patch IDs for the small side
> before comparing the large side to this.
>
> In this case, it is likely that only a small number of paths are touched
> by the commits on the smaller side of the range and by storing these we
> can ignore many commits on the other side before unpacking blobs and
> diffing them.
There are two observations that make your scheme work:
1. For something like --cherry-pick, we do not even care about the
actual patch-ids, only whether there are matches from the left and
right sides.
2. Comparing the sets of paths touched by two commits is often
sufficient to realize they do not have the same patch-ids.
But I think those observations mean we can do even better than
calculating the real patch-id for the small side at all.
Imagine we have both "loose" and "strict" patch-ids, where the loose one
is much faster to compute, and has that property that a match _might_
mean we have the same strict patch-id, but a non-match means we
definitely do not have the same strict patch-id.
I think such a loose patch-id could just be a hash of the filenames that
were changed by the patch (e.g., the first 32-bits of the sha1 of the
concatenated filenames). Computing that should be about as expensive as
a tree-diff. Per observation 2 above, if two commits do not have the
same loose id, we know that they cannot possibly have the same strict
id.
Then we can forget about the smaller-side and bigger-side entirely, and
just do something like:
1. Make a sorted list (or hash table) of loose ids for one side.
2. For each commit on the other side, calculate its loose id and look
that up in the sorted list. If no hits, we know that there is no
match. For any hits, lazily calculate (and cache) the strict patch
id for both sides and compare as usual.
In the best case, we compute no patch-ids at all. And even for the
average case, I'd expect our lazy calculation to only have to compute a
handful of ids.
-Peff
next prev parent reply other threads:[~2013-05-29 6:20 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-19 13:17 [RFC/PATCH] patch-ids: check modified paths before calculating diff John Keeping
2013-05-19 19:36 ` Jonathan Nieder
2013-05-19 22:02 ` John Keeping
2013-05-20 6:36 ` Jonathan Nieder
2013-05-20 8:16 ` John Keeping
2013-05-20 4:46 ` Junio C Hamano
2013-05-29 6:20 ` Jeff King [this message]
2013-05-29 7:22 ` Jeff King
2013-05-29 7:50 ` Jeff King
2013-05-29 18:08 ` Junio C Hamano
2013-05-29 18:36 ` Jeff King
2013-05-29 18:48 ` Junio C Hamano
2013-05-29 21:30 ` John Keeping
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130529062007.GA11955@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=john@keeping.me.uk \
--cc=kevin@bracey.fi \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).