* Basename matching during rename/copy detection
@ 2007-06-21 3:06 Shawn O. Pearce
2007-06-21 3:13 ` Junio C Hamano
2007-06-21 3:42 ` Linus Torvalds
0 siblings, 2 replies; 44+ messages in thread
From: Shawn O. Pearce @ 2007-06-21 3:06 UTC (permalink / raw)
To: git; +Cc: govindsalinas
So Govind Salinas has found an interesting case in the rename
detection code:
$ git clone git://repo.or.cz/Widgit.git
$ git diff -M --raw -r 192e^ 192e | grep .resx
:100755 000000 4c8ab79... 0000000... D Form1.resx
:100755 100755 9e70146... 9e70146... R100 CommitViewer.resx UI/CommitViewer.resx
:100755 100755 90929fd... b40ff98... C091 RepoManager.resx UI/Form1.resx
:100755 100755 90929fd... 90929fd... C100 PreferencesEditor.resx UI/PreferencesEditor.resx
:100755 100755 90929fd... 90929fd... R100 PreferencesEditor.resx UI/RepoManager.resx
:100755 100755 90929fd... 8535007... R097 RepoManager.resx UI/RepoTreeView.resx
In this case several files had identical old images, and some
kept that old image during the rename. Unfortunately because of
the ordering of the files in the tree Git has decided to "rename"
the PreferencesEditor.resx file to UI/RepoManager.resx, rather than
renaming RepoManager.resx to UI/RepoManager.resx. Go Git.
I'm wondering if we shouldn't play the game of trying to match
delete/add pairs up by not only similarity, but also by path
basename. In the case above its exactly what Govind thought should
happen; he moved the file from one directory to another, and didn't
even change its content during the move. But Git decided "better"
to use a totally different file in the "rename".
--
Shawn.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 3:06 Basename matching during rename/copy detection Shawn O. Pearce
@ 2007-06-21 3:13 ` Junio C Hamano
2007-06-21 8:00 ` Andy Parkins
2007-06-21 3:42 ` Linus Torvalds
1 sibling, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-06-21 3:13 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: git, govindsalinas
"Shawn O. Pearce" <spearce@spearce.org> writes:
> So Govind Salinas has found an interesting case in the rename
> detection code:
>
> $ git clone git://repo.or.cz/Widgit.git
> $ git diff -M --raw -r 192e^ 192e | grep .resx
> :100755 000000 4c8ab79... 0000000... D Form1.resx
> :100755 100755 9e70146... 9e70146... R100 CommitViewer.resx UI/CommitViewer.resx
> :100755 100755 90929fd... b40ff98... C091 RepoManager.resx UI/Form1.resx
> :100755 100755 90929fd... 90929fd... C100 PreferencesEditor.resx UI/PreferencesEditor.resx
> :100755 100755 90929fd... 90929fd... R100 PreferencesEditor.resx UI/RepoManager.resx
> :100755 100755 90929fd... 8535007... R097 RepoManager.resx UI/RepoTreeView.resx
>
> In this case several files had identical old images, and some
> kept that old image during the rename. Unfortunately because of
> the ordering of the files in the tree Git has decided to "rename"
> the PreferencesEditor.resx file to UI/RepoManager.resx, rather than
> renaming RepoManager.resx to UI/RepoManager.resx. Go Git.
>
> I'm wondering if we shouldn't play the game of trying to match
> delete/add pairs up by not only similarity, but also by path
> basename. In the case above its exactly what Govind thought should
> happen; he moved the file from one directory to another, and didn't
> even change its content during the move. But Git decided "better"
> to use a totally different file in the "rename".
Actually, git did not decide anything, and certainly not better.
Having many "identical files" in the preimage is just stupid to
begin with (if you know they are identical, why are you storing
copies, instead of your build procedure to reuse the same file),
so the algorithm did not bother finding a better match among
"equals".
I am not opposed to a patch that says "Ok, these two preimages
have identical similarity score, *AND* indeed the preimages have
the same contents --- we tiebreak them with other heuristics to
help stupid projects better". And I can see basename similarity
one of the useful heuristics you could use.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 3:06 Basename matching during rename/copy detection Shawn O. Pearce
2007-06-21 3:13 ` Junio C Hamano
@ 2007-06-21 3:42 ` Linus Torvalds
2007-06-21 11:52 ` [PATCH] diffcore-rename: favour identical basenames Johannes Schindelin
1 sibling, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2007-06-21 3:42 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: git, govindsalinas
On Wed, 20 Jun 2007, Shawn O. Pearce wrote:
>
> I'm wondering if we shouldn't play the game of trying to match
> delete/add pairs up by not only similarity, but also by path
> basename.
I think we should just consider the basename as an "added
similarity bonus".
IOW, we currently sort purely by data similarity, but how about just
adding a small increment for "same base name".
We could make it actually use the similarity of the filename itself as the
basis for the increment, which would be even better, but the trivial thing
is to do something like
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -186,8 +186,11 @@ static int estimate_similarity(struct diff_filespec *src,
*/
if (!dst->size)
score = 0; /* should not happen */
- else
+ else {
score = (int)(src_copied * MAX_SCORE / max_size);
+ if (basename_same(src, dst))
+ score++;
+ }
return score;
}
and just implement that "basename_same()" function.
Or something.
I do agree that the filename logically can and probably _should_ count
towards the "similarity". The filename _is_ part of the data in the global
notion of "content", after all. It's the "index" to the data.
Linus
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 3:13 ` Junio C Hamano
@ 2007-06-21 8:00 ` Andy Parkins
2007-06-21 8:07 ` Junio C Hamano
0 siblings, 1 reply; 44+ messages in thread
From: Andy Parkins @ 2007-06-21 8:00 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano, Shawn O. Pearce, govindsalinas
On Thursday 2007 June 21, Junio C Hamano wrote:
> Having many "identical files" in the preimage is just stupid to
> begin with (if you know they are identical, why are you storing
> copies, instead of your build procedure to reuse the same file),
> so the algorithm did not bother finding a better match among
> "equals".
That's a really poor argument; it's not git's place to impose restrictions on
what is stored in it.
What if it's not a build environment at all but a home directory that's being
stored - should no one be allowed to store copies of files because
it's "stupid"? What if it's a collection of images that all started out the
same, but have gradually had detail added (which is actually what I do in my
GUI programs for toolbar images)? What about files that are used as flags,
and are all identically empty.
None of those seems like an abuse of a VCS to me. In fact, I'd say it's one
of git's strengths that a duplicate file in the working tree doesn't take up
any extra space in the repository.
Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 8:00 ` Andy Parkins
@ 2007-06-21 8:07 ` Junio C Hamano
2007-06-21 9:50 ` Andy Parkins
0 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-06-21 8:07 UTC (permalink / raw)
To: Andy Parkins; +Cc: git, Shawn O. Pearce, govindsalinas
Andy Parkins <andyparkins@gmail.com> writes:
> On Thursday 2007 June 21, Junio C Hamano wrote:
>
>> Having many "identical files" in the preimage is just stupid to
>> begin with (if you know they are identical, why are you storing
>> copies, instead of your build procedure to reuse the same file),
>> so the algorithm did not bother finding a better match among
>> "equals".
>
> That's a really poor argument; it's not git's place to impose restrictions on
> what is stored in it.
It's not even an argument, nor an attempt to justify it. It was
just an explanation of historical fact "It did not bother".
Please re-read the final part of the message, which you omitted
from your quote.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 8:07 ` Junio C Hamano
@ 2007-06-21 9:50 ` Andy Parkins
2007-06-21 11:52 ` Johannes Schindelin
0 siblings, 1 reply; 44+ messages in thread
From: Andy Parkins @ 2007-06-21 9:50 UTC (permalink / raw)
To: git; +Cc: Junio C Hamano
On Thursday 2007 June 21, Junio C Hamano wrote:
> It's not even an argument, nor an attempt to justify it. It was
> just an explanation of historical fact "It did not bother".
> Please re-read the final part of the message, which you omitted
> from your quote.
I omitted it because I didn't object to that part :-)
I appreciated (as always) your practicality in that what you proposed would
let people keep their copies. What I was objecting to was the idea that any
repository with duplicate files was "stupid".
Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH] diffcore-rename: favour identical basenames
2007-06-21 3:42 ` Linus Torvalds
@ 2007-06-21 11:52 ` Johannes Schindelin
2007-06-21 13:19 ` Jeff King
2007-06-23 5:44 ` [PATCH] diffcore-rename: favour identical basenames Junio C Hamano
0 siblings, 2 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 11:52 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Shawn O. Pearce, git, govindsalinas, gitster
When there are several candidates for a rename source, and one of them
has an identical basename to the rename target, take that one.
Noticed by Govind Salinas, posted by Shawn O. Pearce, partial patch
by Linus Torvalds.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
On Wed, 20 Jun 2007, Linus Torvalds wrote:
> I think we should just consider the basename as an "added
> similarity bonus".
>
> IOW, we currently sort purely by data similarity, but how about
> just adding a small increment for "same base name".
>
> [patch suggestion snipped, since it is identical what is below]
How 'bout this?
diffcore-rename.c | 33 ++++++++++++++++++++++++++++++++-
t/t4001-diff-rename.sh | 13 +++++++++++++
2 files changed, 45 insertions(+), 1 deletions(-)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 93c40d9..79c984c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -119,6 +119,21 @@ static int is_exact_match(struct diff_filespec *src,
return 0;
}
+static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
+{
+ int src_len = strlen(src->path), dst_len = strlen(dst->path);
+ while (src_len && dst_len) {
+ char c1 = src->path[--src_len];
+ char c2 = dst->path[--dst_len];
+ if (c1 != c2)
+ return 0;
+ if (c1 == '/')
+ return 1;
+ }
+ return (!src_len || src->path[src_len - 1] == '/') &&
+ (!dst_len || dst->path[dst_len - 1] == '/');
+}
+
struct diff_score {
int src; /* index in rename_src */
int dst; /* index in rename_dst */
@@ -186,8 +201,11 @@ static int estimate_similarity(struct diff_filespec *src,
*/
if (!dst->size)
score = 0; /* should not happen */
- else
+ else {
score = (int)(src_copied * MAX_SCORE / max_size);
+ if (basename_same(src, dst))
+ score++;
+ }
return score;
}
@@ -295,9 +313,22 @@ void diffcore_rename(struct diff_options *options)
if (rename_dst[i].pair)
continue; /* dealt with an earlier round */
for (j = 0; j < rename_src_nr; j++) {
+ int k;
struct diff_filespec *one = rename_src[j].one;
if (!is_exact_match(one, two, contents_too))
continue;
+
+ /* see if there is a basename match, too */
+ for (k = j; k < rename_src_nr; k++) {
+ one = rename_src[k].one;
+ if (basename_same(one, two) &&
+ is_exact_match(one, two,
+ contents_too)) {
+ j = k;
+ break;
+ }
+ }
+
record_rename_pair(i, j, (int)MAX_SCORE);
rename_count++;
break; /* we are done with this entry */
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index 2e3c20d..90c085f 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -64,4 +64,17 @@ test_expect_success \
'validate the output.' \
'compare_diff_patch current expected'
+test_expect_success 'favour same basenames over different ones' '
+ cp path1 another-path &&
+ git add another-path &&
+ git commit -m 1 &&
+ git rm path1 &&
+ mkdir subdir &&
+ git mv another-path subdir/path1 &&
+ git runstatus | grep "renamed: .*path1 -> subdir/path1"'
+
+test_expect_success 'favour same basenames even with minor differences' '
+ git show HEAD:path1 | sed "s/15/16/" > subdir/path1 &&
+ git runstatus | grep "renamed: .*path1 -> subdir/path1"'
+
test_done
--
1.5.2.2.2822.g027a6-dirty
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 9:50 ` Andy Parkins
@ 2007-06-21 11:52 ` Johannes Schindelin
2007-06-21 12:44 ` Andy Parkins
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 11:52 UTC (permalink / raw)
To: Andy Parkins; +Cc: git, Junio C Hamano
Hi,
On Thu, 21 Jun 2007, Andy Parkins wrote:
> On Thursday 2007 June 21, Junio C Hamano wrote:
>
> > It's not even an argument, nor an attempt to justify it. It was just
> > an explanation of historical fact "It did not bother". Please re-read
> > the final part of the message, which you omitted from your quote.
>
> I appreciated (as always) your practicality in that what you proposed
> would let people keep their copies. What I was objecting to was the
> idea that any repository with duplicate files was "stupid".
FWIW I find it stupid, too.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 11:52 ` Johannes Schindelin
@ 2007-06-21 12:44 ` Andy Parkins
2007-06-21 12:53 ` Matthieu Moy
2007-06-21 13:22 ` Johannes Schindelin
0 siblings, 2 replies; 44+ messages in thread
From: Andy Parkins @ 2007-06-21 12:44 UTC (permalink / raw)
To: git; +Cc: Johannes Schindelin, Junio C Hamano
On Thursday 2007 June 21, Johannes Schindelin wrote:
> > would let people keep their copies. What I was objecting to was the
> > idea that any repository with duplicate files was "stupid".
>
> FWIW I find it stupid, too.
Thanks very much. Okay, as I've been put in the position of defending this,
let me give you the use case that has cropped up for me to do this stupid
thing.
I've got a GUI program with a load of tool buttons. Each of those buttons
will, in the final product, be unique images. When I write the program, I
want to be able to refer to each of the button images in the correct place.
e.g.
setImage( NewButton, "path/to/new-button.png" );
setImage( OpenButton, "path/to/open-button.png" );
setImage( SaveButton, "path/to/save-button.png" );
Unfortunately, I can't draw. So, I open up gimp, draw a big red X and save it
as new-button.png. Then I copy that file to open-button.png and
save-button.png, knowing that at some point in the future, someone will come
and replace those red-X images with something appropriate.
All those images now go in the repository. Symbolic links are not an option,
as it's got to be checkable out on Windows.
Tell me what part of that is stupid?
Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 12:44 ` Andy Parkins
@ 2007-06-21 12:53 ` Matthieu Moy
2007-06-21 13:10 ` Jeff King
2007-06-21 13:18 ` Johannes Schindelin
2007-06-21 13:22 ` Johannes Schindelin
1 sibling, 2 replies; 44+ messages in thread
From: Matthieu Moy @ 2007-06-21 12:53 UTC (permalink / raw)
To: Andy Parkins; +Cc: git, Johannes Schindelin, Junio C Hamano
Andy Parkins <andyparkins@gmail.com> writes:
> On Thursday 2007 June 21, Johannes Schindelin wrote:
>
>> > would let people keep their copies. What I was objecting to was the
>> > idea that any repository with duplicate files was "stupid".
>>
>> FWIW I find it stupid, too.
>
> Thanks very much. Okay, as I've been put in the position of defending this,
> let me give you the use case that has cropped up for me to do this stupid
> thing.
Well, why look so far to find an example of people having identical
files in their tree?
$ cd git
$ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | wc -l
973
$ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | sort | uniq | wc -l
964
$
--
Matthieu
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 12:53 ` Matthieu Moy
@ 2007-06-21 13:10 ` Jeff King
2007-06-21 13:18 ` Johannes Schindelin
1 sibling, 0 replies; 44+ messages in thread
From: Jeff King @ 2007-06-21 13:10 UTC (permalink / raw)
To: git; +Cc: Andy Parkins, Johannes Schindelin, Junio C Hamano
On Thu, Jun 21, 2007 at 02:53:32PM +0200, Matthieu Moy wrote:
> Well, why look so far to find an example of people having identical
> files in their tree?
>
> $ cd git
> $ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | wc -l
> 973
> $ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | sort | uniq | wc -l
> 964
md5? What is this, CVS? How about:
git-ls-files -s | cut -d' ' -f2 | sort | uniq -d | wc -l
Your pipeline will also list files in the working directory, which can
inflate the number of duplicates (note that git-foo.sh and git-foo will
have the same content).
-Peff
PS Please don't take this to mean I think duplicate files are stupid; I
think they can be quite useful. I just wanted to nitpick your shell
command. :)
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 12:53 ` Matthieu Moy
2007-06-21 13:10 ` Jeff King
@ 2007-06-21 13:18 ` Johannes Schindelin
2007-06-21 13:25 ` Matthieu Moy
1 sibling, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 13:18 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Andy Parkins, git, Junio C Hamano
Hi,
On Thu, 21 Jun 2007, Matthieu Moy wrote:
> Well, why look so far to find an example of people having identical
> files in their tree?
>
> $ cd git
> $ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | wc -l
> 973
> $ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | sort | uniq | wc -l
> 964
> $
Have you checked the files? They are all some blobs in the test scripts.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-21 11:52 ` [PATCH] diffcore-rename: favour identical basenames Johannes Schindelin
@ 2007-06-21 13:19 ` Jeff King
2007-06-21 14:03 ` Johannes Schindelin
` (2 more replies)
2007-06-23 5:44 ` [PATCH] diffcore-rename: favour identical basenames Junio C Hamano
1 sibling, 3 replies; 44+ messages in thread
From: Jeff King @ 2007-06-21 13:19 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster
On Thu, Jun 21, 2007 at 12:52:11PM +0100, Johannes Schindelin wrote:
> When there are several candidates for a rename source, and one of them
> has an identical basename to the rename target, take that one.
That's a reasonable heuristic, but it unfortunately won't match simple
things like:
i386_widget.c -> arch/i386/widget.c
You really don't care about "is this a good match" as much as providing
an order to potential matches. I think something like a Levenshtein
distance between the full pathnames would give good results, and would
cover almost every situation that the basename heuristic would (there
are a few exceptions, like getting "file.c" from either "file2.c" or
"foo/file.c", but that seems kind of pathological).
Sorry to post without a patch, but I don't have time right this second.
I'll add it to the end of my (ever-growing) todo list if you think it's
a good idea and don't do it yourself. :)
-Peff
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 12:44 ` Andy Parkins
2007-06-21 12:53 ` Matthieu Moy
@ 2007-06-21 13:22 ` Johannes Schindelin
1 sibling, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 13:22 UTC (permalink / raw)
To: Andy Parkins; +Cc: git, Junio C Hamano
Hi,
On Thu, 21 Jun 2007, Andy Parkins wrote:
> I open up gimp, draw a big red X and save it as new-button.png. Then I
> copy that file to open-button.png and save-button.png, knowing that at
> some point in the future, someone will come and replace those red-X
> images with something appropriate.
So you have a couple of identical files in your repo, which are
placeholders. That is quite different from what I criticised of being
stupid.
Well, I would not have checked in the files in your place, but only one:
dumb-red-X.png
Then, my Makefile would have checked for the existence of, say,
my-wonderful-ok-button.png, and if it does not exist yet, copy
dumb-red-X.png to it.
Now, when somebody comes along, paining the prettiest ok button I ever
saw, I copy that over the copy of dumb-red-X.png, and check it in.
It has the further bonus that I know exactly which buttons I have to find
a suck^Wgifted artist for.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 13:18 ` Johannes Schindelin
@ 2007-06-21 13:25 ` Matthieu Moy
2007-06-21 13:52 ` Johannes Schindelin
0 siblings, 1 reply; 44+ messages in thread
From: Matthieu Moy @ 2007-06-21 13:25 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Andy Parkins, git, Junio C Hamano
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> Have you checked the files? They are all some blobs in the test scripts.
Yes, but how does it make any difference? You still want git to manage
them properly, don't you?
--
Matthieu
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 13:25 ` Matthieu Moy
@ 2007-06-21 13:52 ` Johannes Schindelin
2007-06-21 15:37 ` Steven Grimm
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 13:52 UTC (permalink / raw)
To: Matthieu Moy; +Cc: Andy Parkins, git, Junio C Hamano
Hi,
On Thu, 21 Jun 2007, Matthieu Moy wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > Have you checked the files? They are all some blobs in the test scripts.
>
> Yes, but how does it make any difference? You still want git to manage
> them properly, don't you?
Yes. And Git explicitely allows what I call stupid. And yes, those
_identical_ files in the test suit should probably all be folded into
single files, and the places where they are used should reference _that_
single instance.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-21 13:19 ` Jeff King
@ 2007-06-21 14:03 ` Johannes Schindelin
2007-06-21 16:20 ` Linus Torvalds
2007-06-22 1:14 ` Johannes Schindelin
2 siblings, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 14:03 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster
Hi,
On Thu, 21 Jun 2007, Jeff King wrote:
> On Thu, Jun 21, 2007 at 12:52:11PM +0100, Johannes Schindelin wrote:
>
> > When there are several candidates for a rename source, and one of them
> > has an identical basename to the rename target, take that one.
>
> That's a reasonable heuristic, but it unfortunately won't match simple
> things like:
>
> i386_widget.c -> arch/i386/widget.c
That's right. But every heuristic falls down eventually. Personally, I
think basename_same() is good enough, even if the technical challenge to
implement a small enough Levenshtein, which still respects directory
boundaries somehow (and not just throws them away).
Besides, Levenshtein would introduce a ranking, not a boolean value like
basename_same(). And that complicates the code.
All in all, I'd say Levenshtein is not worth the _result_.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 13:52 ` Johannes Schindelin
@ 2007-06-21 15:37 ` Steven Grimm
2007-06-21 15:53 ` Johannes Schindelin
0 siblings, 1 reply; 44+ messages in thread
From: Steven Grimm @ 2007-06-21 15:37 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Matthieu Moy, Andy Parkins, git, Junio C Hamano
Johannes Schindelin wrote:
> Yes. And Git explicitely allows what I call stupid. And yes, those
> _identical_ files in the test suit should probably all be folded into
> single files, and the places where they are used should reference _that_
> single instance.
>
Two files that are identical in the current revision have not
necessarily been identical from the beginning. Doing what you suggest
will cause you to lose the history of all but one of those files.
Files can absolutely become identical in the real world. I know that for
a fact because it happened to me just this week (see my "Directory
renames" message from a few days ago.) Are you seriously suggesting that
every time I unpack an update from a third party, I should go through it
and see if they have changed any files such that the contents now match
another file in my repository, and if so, I should remove all but one of
the copies from my repository and have a build system create it instead?
Then undo that work when I unpack another update and the files are no
longer identical?
Well, no, I know you're not suggesting that, but it's the logical
conclusion of the "it's stupid to ever have duplicate files" philosophy.
While that approach certainly makes life easier for the version control
system, it doesn't exactly make life easier for the *developer*, which
is kind of the whole point of why we're here.
-Steve
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 15:37 ` Steven Grimm
@ 2007-06-21 15:53 ` Johannes Schindelin
2007-06-21 16:57 ` Steven Grimm
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 15:53 UTC (permalink / raw)
To: Steven Grimm; +Cc: Matthieu Moy, Andy Parkins, git, Junio C Hamano
Hi,
On Thu, 21 Jun 2007, Steven Grimm wrote:
> Johannes Schindelin wrote:
> > Yes. And Git explicitely allows what I call stupid. And yes, those
> > _identical_ files in the test suit should probably all be folded into
> > single files, and the places where they are used should reference _that_
> > single instance.
> >
>
> Two files that are identical in the current revision have not necessarily
> been identical from the beginning. Doing what you suggest will cause you to
> lose the history of all but one of those files.
>
> Files can absolutely become identical in the real world. I know that for a
> fact because it happened to me just this week (see my "Directory renames"
> message from a few days ago.)
No, that message did not convince me. It was way too short on the side of
facts.
And no, I do not think that two unrelated files can get exactly the same
content.
Be that as may, even _if_ there were such a case, I'd still try to reuse
the same file in the working directory. Just because Git can deal
efficiently with millions of identical files does not mean that a working
directory can, or worse, human developers.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-21 13:19 ` Jeff King
2007-06-21 14:03 ` Johannes Schindelin
@ 2007-06-21 16:20 ` Linus Torvalds
2007-06-21 17:52 ` Junio C Hamano
2007-06-22 15:19 ` Andy Parkins
2007-06-22 1:14 ` Johannes Schindelin
2 siblings, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2007-06-21 16:20 UTC (permalink / raw)
To: Jeff King
Cc: Johannes Schindelin, Shawn O. Pearce, git, govindsalinas, gitster
On Thu, 21 Jun 2007, Jeff King wrote:
> On Thu, Jun 21, 2007 at 12:52:11PM +0100, Johannes Schindelin wrote:
>
> > When there are several candidates for a rename source, and one of them
> > has an identical basename to the rename target, take that one.
>
> That's a reasonable heuristic, but it unfortunately won't match simple
> things like:
>
> i386_widget.c -> arch/i386/widget.c
We'e also had things like
arch/i386/kernel/pci-pc.c -> arch/i386/kernel/pci/common.c
so it's not always the ending of a file that is unchanged, but you still
often have some "similarity" of the name (ie the "pci" substring is still
common there).
So I agree that we can be even better about the heuristics. I don't know
how much it *matters* in practice.
I do agree with the people who argue that you simply shouldn't depend on
these kinds of things, and if you have identical files, and move them
around, you really are getting behaviour that doesn't matter.
The files are *identical* for christ sake! Following their history, it
doesn't matter *which* base you follow, since regardless, they've come to
the same point!
So in that sense, the current git behaviour is actually perfectly fine.
At the same time, I'll argue from a totally theoretical point that the
"filename" is obviously part of the data in the tree, and as such, a
similarity comparison that takes only the data into account is a bit
limited. So while I don't think a user should really care, I also think
that keeping the filename as part of the similarity analysis is actually
a perfectly logical and valid thing to do withing the git policy of
"content is king".
The filename *is* part of the content, and it's doubly so when you think
about a rename or copy operation, where the whole point of the exercise is
as much about the filename as about the data inside the file.
Linus
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Basename matching during rename/copy detection
2007-06-21 15:53 ` Johannes Schindelin
@ 2007-06-21 16:57 ` Steven Grimm
0 siblings, 0 replies; 44+ messages in thread
From: Steven Grimm @ 2007-06-21 16:57 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Matthieu Moy, Andy Parkins, git, Junio C Hamano
Johannes Schindelin wrote:
> No, that message did not convince me. It was way too short on the side of
> facts.
>
Short of posting multiple historical versions of the third-party source
code in question, I'm not sure what I can do to convince you. And I'd
rather not violate the license agreement on that code. I would have
thought, though, that the fact that I supplied a detailed, reproducible
test case with obviously broken behavior would itself have been pretty
convincing.
The fact that not all projects contain any short files, or any files
whose contents have ever been identical, does not cause git's behavior
in that test case to be correct. "It's broken and unfixable" is one
thing; "It's broken and we don't care" is another; and "It's broken and
we care but it's not at the top of anyone's priority list to fix" is
something else again. All of those are fine, but "If it's broken, you
are stupid" and "If it's broken, it's a sign your project isn't real"
are not.
Or, to take another tack on this entirely, it is not the proper function
of a version control system to dictate the contents of the projects
under its control. It should take whatever we humans throw at it and
reproduce those contents faithfully with coherent, non-jumbled history.
It should do so even if what we're throwing at it is completely stupid.
By the way, I'll toss out one more example of legitimate duplicate
files, though admittedly one where you might not care so much about
history jumbling: if you have a project that makes use of two GPL
libraries or utilities whose source you want to keep locally, e.g.
because you are making local modifications, you will have two copies of
the GNU "COPYING" file. Neither one produced by a build system (or at
least, not by *your* build system) and you are not permitted by the
terms of the GPL to publish a copy of either piece of software without a
verbatim copy of its license -- it says so right in section 1 of the GPL
(the "keep intact" wording.) Removing one of those copies and expecting
a build system to reconstruct it after someone clones your repository
would arguably be a violation of the GPL.
-Steve
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-21 16:20 ` Linus Torvalds
@ 2007-06-21 17:52 ` Junio C Hamano
2007-06-21 18:24 ` Linus Torvalds
2007-06-22 15:19 ` Andy Parkins
1 sibling, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-06-21 17:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jeff King, Johannes Schindelin, Shawn O. Pearce, git,
govindsalinas, gitster
Linus Torvalds <torvalds@linux-foundation.org> writes:
> We'e also had things like
>
> arch/i386/kernel/pci-pc.c -> arch/i386/kernel/pci/common.c
>
> so it's not always the ending of a file that is unchanged, but you still
> often have some "similarity" of the name (ie the "pci" substring is still
> common there).
This is not an example to draw very useful conclusions, is it?
The heuristics to say '-pc => common' is a more likely rename
than '-obscure-arch => common' heavily depends on human
intelligence in the context of a particular project, the kernel,
where there are rules such as "peripherals are tested most
widely on PC architectures, so assume that the vendors might
have tested their stuff only on PCs".
But I do agree that not limiting to basename has values.
Taking example from the "I cannot draw so here is a red big X",
it is quite possible that two red big X's are replaced with
properly rendered icons, while their format modified, like so:
images/ok-button.gif => images/buttons/ok.png
images/cancel-button.gif => images/buttons/cancel.png
This suggests that we might be able to look around to see what
other rename src/target candidate files there are, so that we
can figure out if there is a common pattern (i.e. in the above
example, "patsubst images/%-button.gif,images/buttons/%.png" is
what is going on). If we find such a pattern, we can base the
assignment of "basename similarity bonus" on that pattern.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-21 17:52 ` Junio C Hamano
@ 2007-06-21 18:24 ` Linus Torvalds
0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2007-06-21 18:24 UTC (permalink / raw)
To: Junio C Hamano
Cc: Jeff King, Johannes Schindelin, Shawn O. Pearce, git,
govindsalinas
On Thu, 21 Jun 2007, Junio C Hamano wrote:
>
> This is not an example to draw very useful conclusions, is it?
>
> The heuristics to say '-pc => common' is a more likely rename
> than '-obscure-arch => common' heavily depends on human
> intelligence in the context
Oh, absolutely.
I'm just saying that *if* you see two equally weighed content moves, if
you then prefer the one that has more in common with the name, that's
likely the right choice.
In the actual example I gave, there was no ambiguity: the file contents
were very obvious. But let's sat that you happened to have an example of
two files with 100% identical content that moved, and you had the files
-arch/i386/kernel/pci-pc.c
-arch/alpha/kernel/pci-pc.c
+arch/i386/kernel/pci/common.c
+arch/alpha/kernel/pci/common.c
to match up, how would you do it? Again: they're all identical files: we
can obviously agree that two files got renamed, but what is the pairing.
I'd suggest that if you do it by matching up the similarity of the
filenames (not necessarily "exact same basename"), you'd actually catch
it. In this case, they all have "pci" in them, but the "alpha" similarity
would make you select the right one.
Similarly, in some other cases, the "pci" might be the thing they have in
common, and might be the thing that decides that "oh, those two filenames
look like they might be more of a better pair".
And yes, all of this would trigger only if the file data content match is
non-conclusive. The file data is *more* important, but that doesn't mean
that the file name similarity is *totally* unimportant either.
Linus
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-21 13:19 ` Jeff King
2007-06-21 14:03 ` Johannes Schindelin
2007-06-21 16:20 ` Linus Torvalds
@ 2007-06-22 1:14 ` Johannes Schindelin
2007-06-22 5:41 ` Jeff King
2007-06-22 7:17 ` Johannes Sixt
2 siblings, 2 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 1:14 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster
Hi,
On Thu, 21 Jun 2007, Jeff King wrote:
> I think something like a Levenshtein distance between the full pathnames
> would give good results, and would cover almost every situation that the
> basename heuristic would (there are a few exceptions, like getting
> "file.c" from either "file2.c" or "foo/file.c", but that seems kind of
> pathological).
Well, now you only have to test if it makes sense:
-- snipsnap --
[PATCH] diffcore-rename: replace basename_same() heuristics by Levenshtein
Instead of insisting on identical basenames, try the levenshtein
distance.
Basically, if there are multiple rename source candidates, take the
one with the smallest Levenshtein distance.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
The dangerous thing is that the score can get negative now.
Makefile | 4 ++--
diffcore-rename.c | 42 +++++++++++++++---------------------------
levenshtein.c | 39 +++++++++++++++++++++++++++++++++++++++
levenshtein.h | 6 ++++++
4 files changed, 62 insertions(+), 29 deletions(-)
create mode 100644 levenshtein.c
create mode 100644 levenshtein.h
diff --git a/Makefile b/Makefile
index 74b69fb..e015833 100644
--- a/Makefile
+++ b/Makefile
@@ -303,12 +303,12 @@ LIB_H = \
run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
- mailmap.h remote.h
+ mailmap.h remote.h levenshtein.h
DIFF_OBJS = \
diff.o diff-lib.o diffcore-break.o diffcore-order.o \
diffcore-pickaxe.o diffcore-rename.o tree-diff.o combine-diff.o \
- diffcore-delta.o log-tree.o
+ diffcore-delta.o log-tree.o levenshtein.o
LIB_OBJS = \
blob.o commit.o connect.o csum-file.o cache-tree.o base85.o \
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 79c984c..41448c9 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,6 +4,7 @@
#include "cache.h"
#include "diff.h"
#include "diffcore.h"
+#include "levenshtein.h"
/* Table of rename/copy destinations */
@@ -119,21 +120,6 @@ static int is_exact_match(struct diff_filespec *src,
return 0;
}
-static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
-{
- int src_len = strlen(src->path), dst_len = strlen(dst->path);
- while (src_len && dst_len) {
- char c1 = src->path[--src_len];
- char c2 = dst->path[--dst_len];
- if (c1 != c2)
- return 0;
- if (c1 == '/')
- return 1;
- }
- return (!src_len || src->path[src_len - 1] == '/') &&
- (!dst_len || dst->path[dst_len - 1] == '/');
-}
-
struct diff_score {
int src; /* index in rename_src */
int dst; /* index in rename_dst */
@@ -201,11 +187,9 @@ static int estimate_similarity(struct diff_filespec *src,
*/
if (!dst->size)
score = 0; /* should not happen */
- else {
- score = (int)(src_copied * MAX_SCORE / max_size);
- if (basename_same(src, dst))
- score++;
- }
+ else
+ score = (int)(src_copied * MAX_SCORE / max_size)
+ - levenshtein(src->path, dst->path);
return score;
}
@@ -313,20 +297,24 @@ void diffcore_rename(struct diff_options *options)
if (rename_dst[i].pair)
continue; /* dealt with an earlier round */
for (j = 0; j < rename_src_nr; j++) {
- int k;
+ int k, distance;
struct diff_filespec *one = rename_src[j].one;
if (!is_exact_match(one, two, contents_too))
continue;
+ distance = levenshtein(one->path, two->path);
/* see if there is a basename match, too */
for (k = j; k < rename_src_nr; k++) {
+ int d2;
one = rename_src[k].one;
- if (basename_same(one, two) &&
- is_exact_match(one, two,
- contents_too)) {
- j = k;
- break;
- }
+ if (!is_exact_match(one, two,
+ contents_too))
+ continue;
+ d2 = levenshtein(one->path, two->path);
+ if (d2 > distance)
+ continue;
+ distance = d2;
+ j = k;
}
record_rename_pair(i, j, (int)MAX_SCORE);
diff --git a/levenshtein.c b/levenshtein.c
new file mode 100644
index 0000000..80ef860
--- /dev/null
+++ b/levenshtein.c
@@ -0,0 +1,39 @@
+#include "cache.h"
+#include "levenshtein.h"
+
+int levenshtein(const char *string1, const char *string2)
+{
+ int len1 = strlen(string1), len2 = strlen(string2);
+ int *row1 = xmalloc(sizeof(int) * (len2 + 1));
+ int *row2 = xmalloc(sizeof(int) * (len2 + 1));
+ int i, j;
+
+ for (j = 1; j <= len2; j++)
+ row1[j] = j;
+ for (i = 0; i < len1; i++) {
+ int *dummy;
+
+ row2[0] = i + 1;
+ for (j = 0; j < len2; j++) {
+ /* substitution */
+ row2[j + 1] = row1[j] + (string1[i] != string2[j]);
+ /* insertion */
+ if (row2[j + 1] > row1[j + 1] + 1)
+ row2[j + 1] = row1[j + 1] + 1;
+ /* deletion */
+ if (row2[j + 1] > row2[j] + 1)
+ row2[j + 1] = row2[j] + 1;
+ }
+
+ dummy = row1;
+ row1 = row2;
+ row2 = dummy;
+ }
+
+ i = row1[len2];
+ free(row1);
+ free(row2);
+
+ return i;
+}
+
diff --git a/levenshtein.h b/levenshtein.h
new file mode 100644
index 0000000..74a6626
--- /dev/null
+++ b/levenshtein.h
@@ -0,0 +1,6 @@
+#ifndef LEVENSHTEIN_H
+#define LEVENSHTEIN_H
+
+int levenshtein(const char *string1, const char *string2);
+
+#endif
--
1.5.2.2.2822.g027a6-dirty
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-22 1:14 ` Johannes Schindelin
@ 2007-06-22 5:41 ` Jeff King
2007-06-22 10:22 ` Johannes Schindelin
2007-06-22 7:17 ` Johannes Sixt
1 sibling, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-06-22 5:41 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster
On Fri, Jun 22, 2007 at 02:14:43AM +0100, Johannes Schindelin wrote:
> @@ -313,20 +297,24 @@ void diffcore_rename(struct diff_options *options)
> if (rename_dst[i].pair)
> continue; /* dealt with an earlier round */
> for (j = 0; j < rename_src_nr; j++) {
> - int k;
> + int k, distance;
> struct diff_filespec *one = rename_src[j].one;
> if (!is_exact_match(one, two, contents_too))
> continue;
>
> + distance = levenshtein(one->path, two->path);
> /* see if there is a basename match, too */
> for (k = j; k < rename_src_nr; k++) {
This loop can start at k = j+1, since otherwise we are just checking
rename_src[j] against itself.
> +int levenshtein(const char *string1, const char *string2)
> +{
> + int len1 = strlen(string1), len2 = strlen(string2);
> + int *row1 = xmalloc(sizeof(int) * (len2 + 1));
> + int *row2 = xmalloc(sizeof(int) * (len2 + 1));
> + int i, j;
> +
> + for (j = 1; j <= len2; j++)
> + row1[j] = j;
This loop must start at j=0, not j=1; otherwise you have an undefined
value in row1[0], which gets read when setting row2[1], and you get
a totally meaningless distance (I got -1209667248 on my test case!).
-Peff
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-22 1:14 ` Johannes Schindelin
2007-06-22 5:41 ` Jeff King
@ 2007-06-22 7:17 ` Johannes Sixt
2007-06-22 10:39 ` Johannes Schindelin
1 sibling, 1 reply; 44+ messages in thread
From: Johannes Sixt @ 2007-06-22 7:17 UTC (permalink / raw)
To: git
Johannes Schindelin wrote:
> The dangerous thing is that the score can get negative now.
> ...
> + score = (int)(src_copied * MAX_SCORE / max_size)
> + - levenshtein(src->path, dst->path);
Does that also mean that you can't ever have a rename with a score of
100%?
(I haven't studied the algorithms and assume that levenshtein(a,b) == 0
only if a==b, and that without the -levenshtein(...) the score can grow
to 100%.)
-- Hannes
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-22 5:41 ` Jeff King
@ 2007-06-22 10:22 ` Johannes Schindelin
0 siblings, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 10:22 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster
Hi,
On Fri, 22 Jun 2007, Jeff King wrote:
> On Fri, Jun 22, 2007 at 02:14:43AM +0100, Johannes Schindelin wrote:
>
> > @@ -313,20 +297,24 @@ void diffcore_rename(struct diff_options *options)
> > if (rename_dst[i].pair)
> > continue; /* dealt with an earlier round */
> > for (j = 0; j < rename_src_nr; j++) {
> > - int k;
> > + int k, distance;
> > struct diff_filespec *one = rename_src[j].one;
> > if (!is_exact_match(one, two, contents_too))
> > continue;
> >
> > + distance = levenshtein(one->path, two->path);
> > /* see if there is a basename match, too */
> > for (k = j; k < rename_src_nr; k++) {
>
> This loop can start at k = j+1, since otherwise we are just checking
> rename_src[j] against itself.
Right.
> > +int levenshtein(const char *string1, const char *string2)
> > +{
> > + int len1 = strlen(string1), len2 = strlen(string2);
> > + int *row1 = xmalloc(sizeof(int) * (len2 + 1));
> > + int *row2 = xmalloc(sizeof(int) * (len2 + 1));
> > + int i, j;
> > +
> > + for (j = 1; j <= len2; j++)
> > + row1[j] = j;
>
> This loop must start at j=0, not j=1; otherwise you have an undefined
> value in row1[0], which gets read when setting row2[1], and you get
> a totally meaningless distance (I got -1209667248 on my test case!).
Sorry for that. I originally had an xcalloc in there, and did not look at
that loop afterwards.
And I completely forgot that on my laptop (on which I did this patch), I
had forgotten to add
ALL_CFLAGS += -DXMALLOC_POISON=1
to config.mak.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-22 7:17 ` Johannes Sixt
@ 2007-06-22 10:39 ` Johannes Schindelin
2007-06-22 10:52 ` 100% (was: [PATCH] diffcore-rename: favour identical basenames) David Kastrup
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 10:39 UTC (permalink / raw)
To: Johannes Sixt; +Cc: git
Hi,
On Fri, 22 Jun 2007, Johannes Sixt wrote:
> Johannes Schindelin wrote:
> > The dangerous thing is that the score can get negative now.
> > ...
> > + score = (int)(src_copied * MAX_SCORE / max_size)
> > + - levenshtein(src->path, dst->path);
>
> Does that also mean that you can't ever have a rename with a score of
> 100%?
>
> (I haven't studied the algorithms and assume that levenshtein(a,b) == 0
> only if a==b, and that without the -levenshtein(...) the score can grow
> to 100%.)
There is a different code path for identical contents. So yes, you can
still hit 100%, but it is now much, much harder to hit a score close to
100% [*1*].
The obviously correct way to do this is to have a subscore, and use it
_strictly_ only when the score is identical.
I see two ways to do this properly:
- introduce a name_distance struct member, just below the score. This
means that estimate_similarity has to "return" two values instead of
one, and score_compare gets a bit more complex, too. Or
- change the score to unsigned long, and shift the score to higher bits,
adding a constant minus the Levenshtein distance. It is safe to assume
that the filenames are shorter than 16384 bytes (PATH_MAX is actually
much smaller than that), and even if two filenames of that length are
completely different, the distance can not be larger than twice that
number, i.e. 16384 deletions + 16384 insertions. Therefore, you could
pick 32768 as that constant.
However, I find both solutions ugly. Besides, I am not interested in the
feature myself, only the implementation of Levenshtein was interesting,
and I thought I just post the code here. So I did only the minimal stuff
on top of the interesting one to make it sort of work.
If somebody wants to pick up the ball, be my guest, because I am out of
that game.
Ciao,
Dscho
Footnote:
*1* Actually, it is not _that_ bad. The score is not a value between 0 and
100, IOW it is _not_ what you see in the output of "diff -M". It is an
unsigned short between 0 and MAX_SCORE, which is defined in
diffcore.h as 60000.0.
The Levenshtein distance between two filenames cannot be larger than
the sum of their lengths, so it should be relatively safe. That is, if
you don't have such insanely long paths as e.g. egit. But even there,
the paths share most of their directories, and therefore the distances
should be much, much smaller in real life.
^ permalink raw reply [flat|nested] 44+ messages in thread
* 100% (was: [PATCH] diffcore-rename: favour identical basenames)
2007-06-22 10:39 ` Johannes Schindelin
@ 2007-06-22 10:52 ` David Kastrup
2007-06-22 12:49 ` Johannes Schindelin
0 siblings, 1 reply; 44+ messages in thread
From: David Kastrup @ 2007-06-22 10:52 UTC (permalink / raw)
To: git
> Footnote:
>
> *1* Actually, it is not _that_ bad. The score is not a value between 0 and
> 100, IOW it is _not_ what you see in the output of "diff -M". It is an
> unsigned short between 0 and MAX_SCORE, which is defined in
> diffcore.h as 60000.0.
>
> The Levenshtein distance between two filenames cannot be larger than
> the sum of their lengths, so it should be relatively safe. That is, if
> you don't have such insanely long paths as e.g. egit. But even there,
> the paths share most of their directories, and therefore the distances
> should be much, much smaller in real life.
As a note aside: would it be possible to always round downwards when
computing similarities or converting between them?
I very much would like to see the 100% figure reserved for identity.
This is particularly relevant when interpreting the output of git-diff
--name-status with regard to R100, C100 and similar flags.
--
David Kastrup
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100% (was: [PATCH] diffcore-rename: favour identical basenames)
2007-06-22 10:52 ` 100% (was: [PATCH] diffcore-rename: favour identical basenames) David Kastrup
@ 2007-06-22 12:49 ` Johannes Schindelin
[not found] ` <86abusi1fw.fsf@lola.quinscape.zz>
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 12:49 UTC (permalink / raw)
To: David Kastrup; +Cc: git
Hi,
On Fri, 22 Jun 2007, David Kastrup wrote:
> As a note aside: would it be possible to always round downwards when
> computing similarities or converting between them?
I'd rather not. This would be counterintuitive. People expect rounded
values.
> I very much would like to see the 100% figure reserved for identity.
> This is particularly relevant when interpreting the output of git-diff
> --name-status with regard to R100, C100 and similar flags.
You should never depend on the output of --name-status if you're
interested in identifying identical files, but on the object names.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-21 16:20 ` Linus Torvalds
2007-06-21 17:52 ` Junio C Hamano
@ 2007-06-22 15:19 ` Andy Parkins
2007-06-22 15:28 ` Johannes Schindelin
1 sibling, 1 reply; 44+ messages in thread
From: Andy Parkins @ 2007-06-22 15:19 UTC (permalink / raw)
To: git
Cc: Linus Torvalds, Jeff King, Johannes Schindelin, Shawn O. Pearce,
govindsalinas, gitster
On Thursday 2007 June 21, Linus Torvalds wrote:
> The files are *identical* for christ sake! Following their history, it
> doesn't matter *which* base you follow, since regardless, they've come to
> the same point!
>
> So in that sense, the current git behaviour is actually perfectly fine.
Perhaps not. (Please don't read this as meaning I disagree with your
favour-the-identical-filename patch at all - in fact I think that would
address the case I give below).
What if two files with different filenames and content converge at some point
in history, then diverge again? If git is tracking renames merely by content
and picks the wrong one, then the history of fileA suddenly becomes the
history of fileB.
Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-22 15:19 ` Andy Parkins
@ 2007-06-22 15:28 ` Johannes Schindelin
2007-06-22 17:51 ` Aidan Van Dyk
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 15:28 UTC (permalink / raw)
To: Andy Parkins
Cc: git, Linus Torvalds, Jeff King, Shawn O. Pearce, govindsalinas,
gitster
Hi,
On Fri, 22 Jun 2007, Andy Parkins wrote:
> What if two files with different filenames and content converge at some
> point in history, then diverge again? If git is tracking renames merely
> by content and picks the wrong one, then the history of fileA suddenly
> becomes the history of fileB.
This is becoming highly ethereal. Like "I could imagine that some day in
future, some person could devise a device, that might allow you to do
something that I can not explain, because I have not even thought of it".
IOW show me a reasonable example, and we'll talk business.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-22 15:28 ` Johannes Schindelin
@ 2007-06-22 17:51 ` Aidan Van Dyk
0 siblings, 0 replies; 44+ messages in thread
From: Aidan Van Dyk @ 2007-06-22 17:51 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Andy Parkins, git, Linus Torvalds, Jeff King, Shawn O. Pearce,
govindsalinas, gitster
[-- Attachment #1: Type: text/plain, Size: 1984 bytes --]
* Johannes Schindelin <Johannes.Schindelin@gmx.de> [070622 13:34]:
> Hi,
>
> On Fri, 22 Jun 2007, Andy Parkins wrote:
>
> > What if two files with different filenames and content converge at some
> > point in history, then diverge again? If git is tracking renames merely
> > by content and picks the wrong one, then the history of fileA suddenly
> > becomes the history of fileB.
>
> This is becoming highly ethereal. Like "I could imagine that some day in
> future, some person could devise a device, that might allow you to do
> something that I can not explain, because I have not even thought of it".
>
> IOW show me a reasonable example, and we'll talk business.
The one time the "content-only" rename tracking bit me was the
after a merge, resulting in conflicts that were un-nessesary:
-*-*-*-*-A-B-C-D
\
*-E-*
At A, there were 2 files:
dir1/foo
dir2/foo
They were template files that happened to be the same in 2 themes.
In E, "foo" was renamed to "foo-bar" in all the template directories.
Git detected this not as 2 renames, but as:
dir1/foo-bar renamed from dir1/foo
dir2/foo-bar copied from dir1/foo
dir2/foo deleted
Meanwhile, work was happening in B, C, and D, changing foo in both
templates identically.
When the branch with E was merged back into ABCD, there was a merge
conflict with dir2/foo being deleted in one branch, and editit in the
other.
In this case, the simple "basename" comparison wouldn't have even been
enough.
But the merge was easy enough (because no edits were made in the E
branch to those files, just the renames) that I could resolve it easily.
I don't know if preventing this easy-to-fix merge conflict is worth the
necessary "likeness of names" necessary to avoid it...
a.
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
[not found] ` <86abusi1fw.fsf@lola.quinscape.zz>
@ 2007-06-23 1:31 ` Johannes Schindelin
2007-06-23 10:18 ` 100% René Scharfe
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 1:31 UTC (permalink / raw)
To: David Kastrup; +Cc: git
Hi,
On Fri, 22 Jun 2007, David Kastrup wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > On Fri, 22 Jun 2007, David Kastrup wrote:
> >
> >> As a note aside: would it be possible to always round downwards when
> >> computing similarities or converting between them?
> >
> > I'd rather not. This would be counterintuitive. People expect rounded
> > values.
>
> Which people?
Me, for one. Thank you very much.
> The people I know will expect "100% identical" or even "100.0%
> identical" to mean identical, period. They will be quite surprised to
> hear that "99.95%" is supposed to be included.
Granted, 100.0% means as close as you can get to "completely" with 4
digits. But if you have an integer, you better use the complete range,
rather than arbitrarily make one number more important than others.
For if you see an integer, you usually assume a rounded value. If you
don't, you're hopeless.
> Also, for any kind of decision made upon percentages, it is much more
> relevant to be able to draw a line at 50% rather than at 49.5%.
I do see too many people in my day job who take the numbers they see for
absolute truths, so I cannot take that statement seriously, sorry.
> Could you name a _single_ use case where rounding down could cause an
> actual problem or even inconvenience for people?
Could you name a _single_ use case where it does not?
I mean, honestly, really. Really, really, really. A number is only a weak
_indicator_, and an integer even more so, for what is _really_ going on.
> >> I very much would like to see the 100% figure reserved for identity.
> >> This is particularly relevant when interpreting the output of
> >> git-diff --name-status with regard to R100, C100 and similar flags.
> >
> > You should never depend on the output of --name-status if you're
> > interested in identifying identical files, but on the object names.
>
> Which is rather inconvenient.
Frankly, I am getting bored.
This argument crops up ever so often. "If you did that, _I_ could be more
lazy, and the _hell_ with other people who expect otherwise!".
No, really.
> I _know_ that one can't rely on the output of --name-status right now.
And I _know_ that you can't rely on integer numbers. Or _any_ number which
is not _completely_ precise.
Really, I am getting bored with this discussion.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] diffcore-rename: favour identical basenames
2007-06-21 11:52 ` [PATCH] diffcore-rename: favour identical basenames Johannes Schindelin
2007-06-21 13:19 ` Jeff King
@ 2007-06-23 5:44 ` Junio C Hamano
1 sibling, 0 replies; 44+ messages in thread
From: Junio C Hamano @ 2007-06-23 5:44 UTC (permalink / raw)
To: Johannes Schindelin
Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> When there are several candidates for a rename source, and one of them
> has an identical basename to the rename target, take that one.
>
> Noticed by Govind Salinas, posted by Shawn O. Pearce, partial patch
> by Linus Torvalds.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>
> On Wed, 20 Jun 2007, Linus Torvalds wrote:
>
> > I think we should just consider the basename as an "added
> > similarity bonus".
Thanks, I obviously agree with both of you.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 1:31 ` 100% Johannes Schindelin
@ 2007-06-23 10:18 ` René Scharfe
2007-06-23 10:56 ` 100% Johannes Schindelin
0 siblings, 1 reply; 44+ messages in thread
From: René Scharfe @ 2007-06-23 10:18 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: David Kastrup, git
Johannes Schindelin schrieb:
> On Fri, 22 Jun 2007, David Kastrup wrote:
>> The people I know will expect "100% identical" or even "100.0%
>> identical" to mean identical, period. They will be quite surprised to
>> hear that "99.95%" is supposed to be included.
>
> Granted, 100.0% means as close as you can get to "completely" with 4
> digits. But if you have an integer, you better use the complete range,
> rather than arbitrarily make one number more important than others.
>
> For if you see an integer, you usually assume a rounded value. If you
> don't, you're hopeless.
Why hopeless? It's a useful convention to define "100%" as "complete
(not rounded)". See it this way: 50% of the time, a given percent value
will be shown as one point less than it's "true" value, but you gain the
ability to indicate full completeness. And that's an interesting piece
of information. The price is small given that the needed accuracy is
more in the range of 10 percent points (I assume).
It's more a question of how to make sure everybody knows what the
numbers mean -- but that's why we have a directory named
"Documentation". :-D And even a person that hasn't read the docs is
unlikely to really get harmed by inexact percentages, right?
René
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 10:18 ` 100% René Scharfe
@ 2007-06-23 10:56 ` Johannes Schindelin
2007-06-23 11:41 ` 100% René Scharfe
2007-06-23 19:33 ` 100% Junio C Hamano
0 siblings, 2 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 10:56 UTC (permalink / raw)
To: René Scharfe; +Cc: David Kastrup, git
[-- Attachment #1: Type: TEXT/PLAIN, Size: 975 bytes --]
Hi,
On Sat, 23 Jun 2007, René Scharfe wrote:
> Johannes Schindelin schrieb:
> > On Fri, 22 Jun 2007, David Kastrup wrote:
> >> The people I know will expect "100% identical" or even "100.0%
> >> identical" to mean identical, period. They will be quite surprised to
> >> hear that "99.95%" is supposed to be included.
> >
> > Granted, 100.0% means as close as you can get to "completely" with 4
> > digits. But if you have an integer, you better use the complete range,
> > rather than arbitrarily make one number more important than others.
> >
> > For if you see an integer, you usually assume a rounded value. If you
> > don't, you're hopeless.
>
> Why hopeless? It's a useful convention to define "100%" as "complete
> (not rounded)".
By the same reasoning, you could say "never round down to 0%, because I
want to know when there is no similarity".
You cannot be exact when you have to cut off fractions, so why try for
_exactly_ one number?
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 10:56 ` 100% Johannes Schindelin
@ 2007-06-23 11:41 ` René Scharfe
2007-06-23 12:00 ` 100% Johannes Schindelin
2007-06-23 19:33 ` 100% Junio C Hamano
1 sibling, 1 reply; 44+ messages in thread
From: René Scharfe @ 2007-06-23 11:41 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: David Kastrup, git
Johannes Schindelin schrieb:
> Hi,
>
> On Sat, 23 Jun 2007, René Scharfe wrote:
>
>> Johannes Schindelin schrieb:
>>> On Fri, 22 Jun 2007, David Kastrup wrote:
>>>> The people I know will expect "100% identical" or even "100.0%
>>>> identical" to mean identical, period. They will be quite surprised to
>>>> hear that "99.95%" is supposed to be included.
>>> Granted, 100.0% means as close as you can get to "completely" with 4
>>> digits. But if you have an integer, you better use the complete range,
>>> rather than arbitrarily make one number more important than others.
>>>
>>> For if you see an integer, you usually assume a rounded value. If you
>>> don't, you're hopeless.
>> Why hopeless? It's a useful convention to define "100%" as "complete
>> (not rounded)".
>
> By the same reasoning, you could say "never round down to 0%, because I
> want to know when there is no similarity".
>
> You cannot be exact when you have to cut off fractions, so why try for
> _exactly_ one number?
Because completeness is special. If just one bit was available, I'd use
it to indicate equality. That's what the authors of cmp(1) did, too. :)
And 0% is not special, at least not in a useful way that I can think of.
I.e. there is no practical difference between "no two lines match" and
"one percent of the lines match". If you're really interested in
similarities with an index below 10% then you'd better work with
absolute numbers instead of rounded percentages.
If someone came around with an interest in those cases with exactly 0%
similarity, then we might need to decide between rounding up or down.
But even in that hypothetical situation I think "equality" is still more
interesting a data point than "really everything differs".
René
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 11:41 ` 100% René Scharfe
@ 2007-06-23 12:00 ` Johannes Schindelin
2007-06-23 12:11 ` 100% René Scharfe
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 12:00 UTC (permalink / raw)
To: René Scharfe; +Cc: David Kastrup, git
[-- Attachment #1: Type: TEXT/PLAIN, Size: 464 bytes --]
Hi,
On Sat, 23 Jun 2007, René Scharfe wrote:
> Johannes Schindelin schrieb:
>
> > By the same reasoning, you could say "never round down to 0%, because
> > I want to know when there is no similarity".
> >
> > You cannot be exact when you have to cut off fractions, so why try for
> > _exactly_ one number?
>
> Because completeness is special.
I am not convinced. My vote is still for the _common_ practice of just
rounding. IOW keep it as is.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 12:00 ` 100% Johannes Schindelin
@ 2007-06-23 12:11 ` René Scharfe
2007-06-23 12:21 ` 100% Johannes Schindelin
0 siblings, 1 reply; 44+ messages in thread
From: René Scharfe @ 2007-06-23 12:11 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: David Kastrup, git
Johannes Schindelin schrieb:
> Hi,
>
> On Sat, 23 Jun 2007, René Scharfe wrote:
>
>> Johannes Schindelin schrieb:
>>
>>> By the same reasoning, you could say "never round down to 0%, because
>>> I want to know when there is no similarity".
>>>
>>> You cannot be exact when you have to cut off fractions, so why try for
>>> _exactly_ one number?
>> Because completeness is special.
>
> I am not convinced. My vote is still for the _common_ practice of just
> rounding. IOW keep it as is.
As I already hinted at, the common result of comparing two files, as
done by e.g. cmp(1), is one bit that indicates equality. This
information is lost when using up/down rounding, but it is retained when
rounding down. It's _not_ common to be unable to determine equality
from the result of a file compare.
René
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 12:11 ` 100% René Scharfe
@ 2007-06-23 12:21 ` Johannes Schindelin
2007-06-24 22:23 ` 100% René Scharfe
0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 12:21 UTC (permalink / raw)
To: René Scharfe; +Cc: David Kastrup, git
[-- Attachment #1: Type: TEXT/PLAIN, Size: 922 bytes --]
Hi,
On Sat, 23 Jun 2007, René Scharfe wrote:
> As I already hinted at, the common result of comparing two files, as
> done by e.g. cmp(1), is one bit that indicates equality. This
> information is lost when using up/down rounding, but it is retained when
> rounding down. It's _not_ common to be unable to determine equality
> from the result of a file compare.
And as _I_ already hinted, this does not matter. The whole purpose to have
a number here instead of a bit is to have a larger range. In practice, I
bet that the 100% are really uninteresting. At least here, they are.
For example, if you move a Java class from one package into another, you
have to change the package name in the file. Guess what, I am perfectly
okay if the rename detector says "100% similarity" here. Because if it is
closer to 100% than to 99%, dammit, I want to see 100%, not 99%.
Nuff said about this subject.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 10:56 ` 100% Johannes Schindelin
2007-06-23 11:41 ` 100% René Scharfe
@ 2007-06-23 19:33 ` Junio C Hamano
2007-06-23 20:41 ` 100% Johannes Schindelin
1 sibling, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-06-23 19:33 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: René Scharfe, David Kastrup, git
Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> By the same reasoning, you could say "never round down to 0%, because I
> want to know when there is no similarity".
>
> You cannot be exact when you have to cut off fractions, so why try for
> _exactly_ one number?
R0 or C0 would not happen in real life, so 0% is a moot issue.
However, wasn't that you who did follow that "certain numbers
are special" logic in diffstat?
You advocated "diff --stat" should draw at least one +/- for a
patch that adds/removes lines. And I (and others) agreed
because zero is special in the context of that application.
I think reserving R100 to mean "identical byte sequences" has
value, when people look at --name-status output, in the context
of "similarity index".
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 19:33 ` 100% Junio C Hamano
@ 2007-06-23 20:41 ` Johannes Schindelin
0 siblings, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 20:41 UTC (permalink / raw)
To: Junio C Hamano; +Cc: René Scharfe, David Kastrup, git
Hi,
On Sat, 23 Jun 2007, Junio C Hamano wrote:
> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> > By the same reasoning, you could say "never round down to 0%, because I
> > want to know when there is no similarity".
> >
> > You cannot be exact when you have to cut off fractions, so why try for
> > _exactly_ one number?
>
> R0 or C0 would not happen in real life, so 0% is a moot issue.
It is, but not when you look at the formula.
> However, wasn't that you who did follow that "certain numbers
> are special" logic in diffstat?
>
> You advocated "diff --stat" should draw at least one +/- for a
> patch that adds/removes lines. And I (and others) agreed
> because zero is special in the context of that application.
Actually, it was not me, but I implemented the version that we have now.
I was reasonably scared that a non-linear diffstat would end up in git,
therefore I wrote a linear one.
The important thing to not here is that the diffstat as-is makes _better_
use of the limited scale that is available.
And as you pointed out, the low end of the scale is not really
interesting. The interesting parts are those around 100%. By rounding down
you make less use of the available scale.
> I think reserving R100 to mean "identical byte sequences" has value,
> when people look at --name-status output, in the context of "similarity
> index".
Ah, whatever. You do what you want.
Yes, this interpretation has value. No, it is not the only one that has
value. I am much more used to rounding, since at the end of the day it
makes better use of the scale, it is commonly used, and therefore _I_
expect it (and no, I will not read the documentation when I expect to know
what it means).
But hey, I don't care any more. AFAIAC you can change it from rounding to
rounding down, and next year to rounding-up. We could even have an
algorithm which rounds down only in odd years, and I still would not care.
Ciao,
Dscho
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: 100%
2007-06-23 12:21 ` 100% Johannes Schindelin
@ 2007-06-24 22:23 ` René Scharfe
0 siblings, 0 replies; 44+ messages in thread
From: René Scharfe @ 2007-06-24 22:23 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: David Kastrup, git
Johannes Schindelin schrieb:
> Hi,
>
> On Sat, 23 Jun 2007, René Scharfe wrote:
>
>> As I already hinted at, the common result of comparing two files, as
>> done by e.g. cmp(1), is one bit that indicates equality. This
>> information is lost when using up/down rounding, but it is retained when
>> rounding down. It's _not_ common to be unable to determine equality
>> from the result of a file compare.
>
> And as _I_ already hinted, this does not matter. The whole purpose to have
> a number here instead of a bit is to have a larger range. In practice, I
> bet that the 100% are really uninteresting. At least here, they are.
You would lose your bet since both David and me expressed interest in
that pure 100% thing.
Rounding down instead of up/down doesn't affect the size of neither the
input nor the output range. It affects the boundary of the input range,
(-0.499 .. 100.499 versus 0.000 .. 100.999), but I can't find a problem
with that.
> For example, if you move a Java class from one package into another, you
> have to change the package name in the file. Guess what, I am perfectly
> okay if the rename detector says "100% similarity" here. Because if it is
> closer to 100% than to 99%, dammit, I want to see 100%, not 99%.
That uses a side effect of rounding and won't work for small files. And
of course (if the file is large enough) there could be other changes
"hidden" in a similarity index value of 100% that was rounded up.
> Nuff said about this subject.
Yes, let's advance this topic to the coding stage.
René
^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2007-06-24 22:23 UTC | newest]
Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-21 3:06 Basename matching during rename/copy detection Shawn O. Pearce
2007-06-21 3:13 ` Junio C Hamano
2007-06-21 8:00 ` Andy Parkins
2007-06-21 8:07 ` Junio C Hamano
2007-06-21 9:50 ` Andy Parkins
2007-06-21 11:52 ` Johannes Schindelin
2007-06-21 12:44 ` Andy Parkins
2007-06-21 12:53 ` Matthieu Moy
2007-06-21 13:10 ` Jeff King
2007-06-21 13:18 ` Johannes Schindelin
2007-06-21 13:25 ` Matthieu Moy
2007-06-21 13:52 ` Johannes Schindelin
2007-06-21 15:37 ` Steven Grimm
2007-06-21 15:53 ` Johannes Schindelin
2007-06-21 16:57 ` Steven Grimm
2007-06-21 13:22 ` Johannes Schindelin
2007-06-21 3:42 ` Linus Torvalds
2007-06-21 11:52 ` [PATCH] diffcore-rename: favour identical basenames Johannes Schindelin
2007-06-21 13:19 ` Jeff King
2007-06-21 14:03 ` Johannes Schindelin
2007-06-21 16:20 ` Linus Torvalds
2007-06-21 17:52 ` Junio C Hamano
2007-06-21 18:24 ` Linus Torvalds
2007-06-22 15:19 ` Andy Parkins
2007-06-22 15:28 ` Johannes Schindelin
2007-06-22 17:51 ` Aidan Van Dyk
2007-06-22 1:14 ` Johannes Schindelin
2007-06-22 5:41 ` Jeff King
2007-06-22 10:22 ` Johannes Schindelin
2007-06-22 7:17 ` Johannes Sixt
2007-06-22 10:39 ` Johannes Schindelin
2007-06-22 10:52 ` 100% (was: [PATCH] diffcore-rename: favour identical basenames) David Kastrup
2007-06-22 12:49 ` Johannes Schindelin
[not found] ` <86abusi1fw.fsf@lola.quinscape.zz>
2007-06-23 1:31 ` 100% Johannes Schindelin
2007-06-23 10:18 ` 100% René Scharfe
2007-06-23 10:56 ` 100% Johannes Schindelin
2007-06-23 11:41 ` 100% René Scharfe
2007-06-23 12:00 ` 100% Johannes Schindelin
2007-06-23 12:11 ` 100% René Scharfe
2007-06-23 12:21 ` 100% Johannes Schindelin
2007-06-24 22:23 ` 100% René Scharfe
2007-06-23 19:33 ` 100% Junio C Hamano
2007-06-23 20:41 ` 100% Johannes Schindelin
2007-06-23 5:44 ` [PATCH] diffcore-rename: favour identical basenames Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).