From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Johannes Sixt <J.Sixt@eudaptics.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] diffcore-rename: favour identical basenames
Date: Fri, 22 Jun 2007 11:39:46 +0100 (BST)
Message-ID: <Pine.LNX.4.64.0706221122200.4059@racer.site>
In-Reply-To: <467B777D.C47BFE0E@eudaptics.com>
Hi,
On Fri, 22 Jun 2007, Johannes Sixt wrote:
> Johannes Schindelin wrote:
> > The dangerous thing is that the score can get negative now.
> > ...
> > + score = (int)(src_copied * MAX_SCORE / max_size)
> > + - levenshtein(src->path, dst->path);
>
> Does that also mean that you can't ever have a rename with a score of
> 100%?
>
> (I haven't studied the algorithms and assume that levenshtein(a,b) == 0
> only if a==b, and that without the -levenshtein(...) the score can grow
> to 100%.)
There is a different code path for identical contents. So yes, you can
still hit 100%, but it is now much, much harder to hit a score close to
100% [*1*].
The obviously correct way to do this is to have a subscore, and use it
_strictly_ only when the score is identical.
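To illustrate what "strictly only when the score is identical" means, here is a minimal sketch of a qsort() comparator with a tie-breaker. The struct and its field names (scored_pair, subscore) are made up for this example and are not the ones in diffcore-rename.c:

```c
#include <stdlib.h>

/* Hypothetical record for one rename candidate; not git's real struct. */
struct scored_pair {
	int score;     /* content similarity, higher is better */
	int subscore;  /* e.g. path edit distance, lower is better */
};

/* qsort() comparator that puts the best candidate first. */
static int scored_pair_cmp(const void *a_, const void *b_)
{
	const struct scored_pair *a = a_, *b = b_;

	if (a->score != b->score)
		return a->score > b->score ? -1 : 1;
	/* The scores tie: only now is the subscore consulted. */
	if (a->subscore != b->subscore)
		return a->subscore < b->subscore ? -1 : 1;
	return 0;
}
```

With this ordering a closer filename can never outrank a candidate whose contents are more similar, which is what subtracting the distance from the score does not guarantee.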
I see two ways to do this properly:
- introduce a name_distance struct member, just below the score. This
  means that estimate_similarity has to "return" two values instead of
  one, and score_compare gets a bit more complex, too. Or
- change the score to unsigned long, and shift the score to higher bits,
  adding a constant minus the Levenshtein distance. It is safe to assume
  that the filenames are shorter than 16384 bytes (PATH_MAX is actually
  much smaller than that), and even if two filenames of that length are
  completely different, the distance can not be larger than twice that
  number, i.e. 16384 deletions + 16384 insertions. Therefore, you could
  pick 32768 as that constant.
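The second variant could look roughly like the sketch below. The helper names (pack_score, unpack_score) and the 16-bit shift are assumptions made for this example. Only the constant 32768 comes from the text above:

```c
#define NAME_BITS 16        /* assumed shift; 32768 fits below bit 16 */
#define NAME_BIAS 32768UL   /* the constant suggested above */

/* Hypothetical helper: content score in the high bits, name bonus below. */
static unsigned long pack_score(unsigned long score,
				unsigned long name_distance)
{
	if (name_distance > NAME_BIAS)
		name_distance = NAME_BIAS;  /* clamp so the bonus stays >= 0 */
	return (score << NAME_BITS) + (NAME_BIAS - name_distance);
}

/* Recover the plain content score from a packed value. */
static unsigned long unpack_score(unsigned long packed)
{
	return packed >> NAME_BITS;
}
```

Since the bonus always stays below 1 << NAME_BITS, comparing two packed values orders by content score first and by name distance only on a tie. With a score of at most 60000, the packed value is below 2^32, so it also fits a 32-bit unsigned long.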
However, I find both solutions ugly. Besides, I am not interested in the
feature myself; only the implementation of Levenshtein was interesting,
and I thought I would just post the code here. So I did only the minimal
stuff on top of the interesting part to make it sort of work.
If somebody wants to pick up the ball, be my guest, because I am out of
that game.
Ciao,
Dscho
Footnote:
*1* Actually, it is not _that_ bad. The score is not a value between 0 and
100, IOW it is _not_ what you see in the output of "diff -M". It is an
unsigned short between 0 and MAX_SCORE, which is defined in
diffcore.h as 60000.0.
The Levenshtein distance between two filenames cannot be larger than
the sum of their lengths, so it should be relatively safe. That is, if
you don't have such insanely long paths as e.g. egit. But even there,
the paths share most of their directories, and therefore the distances
should be much, much smaller in real life.
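For a feeling of the magnitudes: on a scale of 0 to 60000, one percent is 600 points, so subtracting a distance of d lowers the percentage by d/600. The sketch below is a textbook two-row Levenshtein with unit costs, not the implementation from the patch (which may weight the operations differently). The name edit_distance is made up for this example:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Textbook Levenshtein distance with unit costs for insertion,
 * deletion and substitution.  Returns -1 if allocation fails.
 */
static int edit_distance(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b), i, j;
	int *prev = malloc((lb + 1) * sizeof(int));
	int *cur = malloc((lb + 1) * sizeof(int));
	int result;

	if (!prev || !cur) {
		free(prev);
		free(cur);
		return -1;
	}

	/* Turning the empty prefix of a into b[0..j) takes j insertions. */
	for (j = 0; j <= lb; j++)
		prev[j] = (int)j;

	for (i = 1; i <= la; i++) {
		int *tmp;

		cur[0] = (int)i;  /* i deletions to reach the empty string */
		for (j = 1; j <= lb; j++) {
			int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
			int del = prev[j] + 1;
			int ins = cur[j - 1] + 1;
			int best = sub < del ? sub : del;

			cur[j] = best < ins ? best : ins;
		}
		tmp = prev;
		prev = cur;
		cur = tmp;
	}

	result = prev[lb];
	free(prev);
	free(cur);
	return result;
}
```

Two paths that differ only in one directory letter, such as "a/b/Makefile" and "a/c/Makefile", are at distance 1 under these costs, which is negligible on the 60000 scale.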