From: Jeff King <peff@peff.net>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: git@vger.kernel.org, Andy C <andychup@gmail.com>,
Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH/RFC 0/3] faster inexact rename handling
Date: Tue, 30 Oct 2007 09:43:55 -0400
Message-ID: <20071030134355.GA21342@coredump.intra.peff.net>
In-Reply-To: <alpine.LFD.0.999.0710292156580.30120@woody.linux-foundation.org>
On Mon, Oct 29, 2007 at 10:06:11PM -0700, Linus Torvalds wrote:
> Have you compared the results? IOW, does it find the *same* renames?
From my limited testing, it generally finds the same pairs. However,
there are a number of renames that it _doesn't_ find, because they are
composed of "uninteresting" lines, dropping them below the minimum
score. Try (in git.git):
    git-show --raw -M -l0 :/'Big tool rename'
with the old and new code. Pairs like Documentation/git-add-script.txt
-> Documentation/git-add.txt are not found, because the file is composed
almost entirely of boilerplate.
Moving the size normalization into the similarity engine should probably
fix that, and will let us compare old and new results more accurately.
I'll try to work on that.
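
To make the boilerplate problem concrete, here is a toy sketch in Python. It is not the scoring code from the patch: the `score` function, the `boring` set, and the 0.5 cutoff are all assumptions made up for illustration. It reads "size normalization" as dividing the number of shared interesting lines by a file size, and compares dividing by the full size with dividing by the interesting size only.

```python
MINIMUM_SCORE = 0.5  # hypothetical cutoff, chosen only for this example


def score(src_lines, dst_lines, boring, normalize_by_interesting):
    """Toy similarity: shared interesting lines divided by a size.

    `boring` is the set of lines considered too common to count.
    """
    src = {line for line in src_lines if line not in boring}
    dst = {line for line in dst_lines if line not in boring}
    shared = len(src & dst)
    if normalize_by_interesting:
        # Size counts only the lines the engine actually looked at.
        size = max(len(src), len(dst))
    else:
        # Size counts every line, including ignored boilerplate.
        size = max(len(set(src_lines)), len(set(dst_lines)))
    return shared / size if size else 0.0


boilerplate = ["NAME", "SYNOPSIS", "DESCRIPTION", "Author", "GIT"]
old_file = boilerplate + ["add files to the index"]
new_file = boilerplate + ["add files to the index"]

raw = score(old_file, new_file, set(boilerplate), False)
fixed = score(old_file, new_file, set(boilerplate), True)
print(raw < MINIMUM_SCORE, fixed >= MINIMUM_SCORE)
```

In this toy, two identical files fall below the cutoff when the ignored boilerplate still counts toward the size, and pass once the size is measured over the interesting lines only.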
> I'm a bit worried about the fact that you just pick a single (arbitrary)
> src/dst per fingerprint. Yes, it should be limited, but that seems to be a
> bit too *extremely* limited. But if it gives the same results in practice,
> maybe nobody cares?
Yes, I have not convinced myself yet that it's the right approach (but
it seemed like a good place to try first, for simplicity and speed). As
I noted, this approach seems to be a bit memory-hungry on large inputs, so I am
a bit concerned about increasing the size of the fingerprint_entry
structure. However, Andy's sampling approach might help fix that.
The current code also doesn't bother marking overflow, so common lines
get attributed to some random file (actually, worse than random: if a
bunch of files have the same common lines, _all_ of the lines will go to
the last file, which means we subtly favor renames from the end of the
input list). So probably it should be tested as-is, with an "overflow,
this line is too common to be interesting" bit, and with a small-ish
limit (I had at one point tried 5, but the implementation was naive and
too memory-hungry).
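
A minimal sketch of the "small limit plus overflow bit" idea, in Python rather than C for brevity. It is an assumption-laden illustration, not the code from the series: `FingerprintEntry` here is a hypothetical class that only borrows its name from the fingerprint_entry structure mentioned above, whole lines stand in for fingerprints, and `build_table`, `shared_line_counts`, and `LIMIT` are invented names.

```python
from collections import Counter

LIMIT = 5  # hypothetical cap on files remembered per fingerprint


class FingerprintEntry:
    """Files containing one fingerprint, capped at a small limit."""

    def __init__(self):
        self.files = []
        self.overflow = False  # set once the line is "too common"


def build_table(files, limit=LIMIT):
    """Map each line to the files containing it, marking overflow."""
    table = {}
    for name, lines in files.items():
        for line in set(lines):
            entry = table.setdefault(line, FingerprintEntry())
            if entry.overflow:
                continue
            if len(entry.files) >= limit:
                # Too many files share this line: stop tracking it
                # instead of crediting it to whichever file came last.
                entry.overflow = True
                entry.files.clear()
            else:
                entry.files.append(name)
    return table


def shared_line_counts(sources, dests, limit=LIMIT):
    """Count shared non-overflowed lines per (source, dest) pair."""
    table = build_table(sources, limit)
    counts = Counter()
    for dst_name, lines in dests.items():
        for line in set(lines):
            entry = table.get(line)
            if entry is None or entry.overflow:
                continue
            for src_name in entry.files:
                counts[(src_name, dst_name)] += 1
    return counts


sources = {
    "a.txt": ["common", "unique a"],
    "b.txt": ["common", "unique b"],
    "c.txt": ["common"],
}
dests = {"a2.txt": ["common", "unique a"]}
table = build_table(sources, limit=2)
counts = shared_line_counts(sources, dests, limit=2)
```

With a limit of 2, the line shared by all three sources overflows and contributes to no pair, so input order no longer decides which file gets the credit. Clearing the file list on overflow also bounds memory per entry, which is the concern with the naive version.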
-Peff