Basename matching during rename/copy detection

* Basename matching during rename/copy detection
@ 2007-06-21  3:06 Shawn O. Pearce
  2007-06-21  3:13 ` Junio C Hamano
  2007-06-21  3:42 ` Linus Torvalds
  0 siblings, 2 replies; 44+ messages in thread
From: Shawn O. Pearce @ 2007-06-21  3:06 UTC (permalink / raw)
  To: git; +Cc: govindsalinas

So Govind Salinas has found an interesting case in the rename
detection code:

  $ git clone git://repo.or.cz/Widgit.git
  $ git diff -M --raw -r 192e^ 192e | grep .resx
  :100755 000000 4c8ab79... 0000000... D  Form1.resx
  :100755 100755 9e70146... 9e70146... R100       CommitViewer.resx       UI/CommitViewer.resx
  :100755 100755 90929fd... b40ff98... C091       RepoManager.resx        UI/Form1.resx
  :100755 100755 90929fd... 90929fd... C100       PreferencesEditor.resx  UI/PreferencesEditor.resx
  :100755 100755 90929fd... 90929fd... R100       PreferencesEditor.resx  UI/RepoManager.resx
  :100755 100755 90929fd... 8535007... R097       RepoManager.resx        UI/RepoTreeView.resx

In this case several files had identical old images, and some
kept that old image during the rename.  Unfortunately because of
the ordering of the files in the tree Git has decided to "rename"
the PreferencesEditor.resx file to UI/RepoManager.resx, rather than
renaming RepoManager.resx to UI/RepoManager.resx.  Go Git.

I'm wondering if we shouldn't play the game of trying to match
delete/add pairs up by not only similarity, but also by path
basename.  In the case above it's exactly what Govind thought should
happen; he moved the file from one directory to another, and didn't
even change its content during the move.  But Git decided "better"
to use a totally different file in the "rename".

-- 
Shawn.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Basename matching during rename/copy detection
  2007-06-21  3:06 Basename matching during rename/copy detection Shawn O. Pearce
@ 2007-06-21  3:13 ` Junio C Hamano
  2007-06-21  8:00   ` Andy Parkins
  2007-06-21  3:42 ` Linus Torvalds
  1 sibling, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-06-21  3:13 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git, govindsalinas

"Shawn O. Pearce" <spearce@spearce.org> writes:

> So Govind Salinas has found an interesting case in the rename
> detection code:
>
>   $ git clone git://repo.or.cz/Widgit.git
>   $ git diff -M --raw -r 192e^ 192e | grep .resx
>   :100755 000000 4c8ab79... 0000000... D  Form1.resx
>   :100755 100755 9e70146... 9e70146... R100       CommitViewer.resx       UI/CommitViewer.resx
>   :100755 100755 90929fd... b40ff98... C091       RepoManager.resx        UI/Form1.resx
>   :100755 100755 90929fd... 90929fd... C100       PreferencesEditor.resx  UI/PreferencesEditor.resx
>   :100755 100755 90929fd... 90929fd... R100       PreferencesEditor.resx  UI/RepoManager.resx
>   :100755 100755 90929fd... 8535007... R097       RepoManager.resx        UI/RepoTreeView.resx
>
> In this case several files had identical old images, and some
> kept that old image during the rename.  Unfortunately because of
> the ordering of the files in the tree Git has decided to "rename"
> the PreferencesEditor.resx file to UI/RepoManager.resx, rather than
> renaming RepoManager.resx to UI/RepoManager.resx.  Go Git.
>
> I'm wondering if we shouldn't play the game of trying to match
> delete/add pairs up by not only similarity, but also by path
> basename.  In the case above it's exactly what Govind thought should
> happen; he moved the file from one directory to another, and didn't
> even change its content during the move.  But Git decided "better"
> to use a totally different file in the "rename".

Actually, git did not decide anything, and certainly not better.

Having many "identical files" in the preimage is just stupid to
begin with (if you know they are identical, why are you storing
copies, instead of having your build procedure reuse the same file),
so the algorithm did not bother finding a better match among
"equals".

I am not opposed to a patch that says "Ok, these two preimages
have identical similarity score, *AND* indeed the preimages have
the same contents --- we tiebreak them with other heuristics to
help stupid projects better".  And I can see basename similarity
as one of the useful heuristics you could use.
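
A minimal sketch of such a tiebreak, assuming a simplified candidate
record rather than git's own structures. The names `struct candidate`,
`base_of()` and `pick_source()` are hypothetical and do not exist in
diffcore-rename.c. The highest score still wins; the basename is only
consulted when two sources have the same score and the same blob.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate record: source path, similarity score, blob id. */
struct candidate {
	const char *path;
	int score;
	const char *blob;
};

/* Return the part of path after the last '/', or path itself. */
const char *base_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Return the index of the preferred source for dst_path, or -1 if n is 0.
 * A higher score always wins.  Among sources with equal score and equal
 * blob id, one whose basename matches dst_path replaces one that does not.
 */
int pick_source(const struct candidate *cand, size_t n, const char *dst_path)
{
	int best = -1;
	int best_same = 0;
	size_t i;

	assert(dst_path != NULL);
	for (i = 0; i < n; i++) {
		int same = !strcmp(base_of(cand[i].path), base_of(dst_path));

		if (best < 0 ||
		    cand[i].score > cand[best].score ||
		    (cand[i].score == cand[best].score &&
		     !strcmp(cand[i].blob, cand[best].blob) &&
		     same && !best_same)) {
			best = (int)i;
			best_same = same;
		}
	}
	return best;
}
```

With PreferencesEditor.resx and RepoManager.resx both at the same score
and blob, a destination of UI/RepoManager.resx would pick RepoManager.resx
under this sketch, regardless of the order of the sources.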


* Re: Basename matching during rename/copy detection
  2007-06-21  3:06 Basename matching during rename/copy detection Shawn O. Pearce
  2007-06-21  3:13 ` Junio C Hamano
@ 2007-06-21  3:42 ` Linus Torvalds
  2007-06-21 11:52   ` [PATCH] diffcore-rename: favour identical basenames Johannes Schindelin
  1 sibling, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2007-06-21  3:42 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: git, govindsalinas



On Wed, 20 Jun 2007, Shawn O. Pearce wrote:
> 
> I'm wondering if we shouldn't play the game of trying to match
> delete/add pairs up by not only similarity, but also by path
> basename.

I think we should just consider the basename as an "added 
similarity bonus".

IOW, we currently sort purely by data similarity, but how about just 
adding a small increment for "same base name".

We could make it actually use the similarity of the filename itself as the 
basis for the increment, which would be even better, but the trivial thing 
is to do something like

	--- a/diffcore-rename.c
	+++ b/diffcore-rename.c
	@@ -186,8 +186,11 @@ static int estimate_similarity(struct diff_filespec *src,
	 	 */
	 	if (!dst->size)
	 		score = 0; /* should not happen */
	-	else
	+	else {
	 		score = (int)(src_copied * MAX_SCORE / max_size);
	+		if (basename_same(src, dst))
	+			score++;
	+	}
	 	return score;
	 }
 
and just implement that "basename_same()" function.

Or something.

I do agree that the filename logically can and probably _should_ count 
towards the "similarity". The filename _is_ part of the data in the global 
notion of "content", after all. It's the "index" to the data.

		Linus
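
A minimal sketch of what a "basename_same()" plus bonus could look like,
assuming it operates on plain path strings instead of the
struct diff_filespec used in diffcore-rename.c. The helpers
`path_basename()` and `score_with_basename_bonus()` are hypothetical
names introduced only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Return the part of path after the last '/', or path itself. */
const char *path_basename(const char *path)
{
	const char *slash;

	assert(path != NULL);
	slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Return 1 if both paths end in the same file name, 0 otherwise. */
int basename_same(const char *src_path, const char *dst_path)
{
	return strcmp(path_basename(src_path), path_basename(dst_path)) == 0;
}

/* Hypothetical wrapper: add one point to a content score on a match. */
int score_with_basename_bonus(int content_score,
			      const char *src_path, const char *dst_path)
{
	return content_score + (basename_same(src_path, dst_path) ? 1 : 0);
}
```

Because the bonus is a single point, it only changes the ordering of
candidates whose content scores are otherwise equal or one apart.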


* Re: Basename matching during rename/copy detection
  2007-06-21  3:13 ` Junio C Hamano
@ 2007-06-21  8:00   ` Andy Parkins
  2007-06-21  8:07     ` Junio C Hamano
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Parkins @ 2007-06-21  8:00 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Shawn O. Pearce, govindsalinas

On Thursday 2007 June 21, Junio C Hamano wrote:

> Having many "identical files" in the preimage is just stupid to
> begin with (if you know they are identical, why are you storing
> copies, instead of having your build procedure reuse the same file),
> so the algorithm did not bother finding a better match among
> "equals".

That's a really poor argument; it's not git's place to impose restrictions on 
what is stored in it.

What if it's not a build environment at all but a home directory that's being 
stored - should no one be allowed to store copies of files because 
it's "stupid"?  What if it's a collection of images that all started out the 
same, but have gradually had detail added (which is actually what I do in my 
GUI programs for toolbar images)?  What about files that are used as flags, 
and are all identically empty?

None of those seems like an abuse of a VCS to me.  In fact, I'd say it's one 
of git's strengths that a duplicate file in the working tree doesn't take up 
any extra space in the repository.

Andy

-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com


* Re: Basename matching during rename/copy detection
  2007-06-21  8:00   ` Andy Parkins
@ 2007-06-21  8:07     ` Junio C Hamano
  2007-06-21  9:50       ` Andy Parkins
  0 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-06-21  8:07 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git, Shawn O. Pearce, govindsalinas

Andy Parkins <andyparkins@gmail.com> writes:

> On Thursday 2007 June 21, Junio C Hamano wrote:
>
>> Having many "identical files" in the preimage is just stupid to
>> begin with (if you know they are identical, why are you storing
>> copies, instead of having your build procedure reuse the same file),
>> so the algorithm did not bother finding a better match among
>> "equals".
>
> That's a really poor argument; it's not git's place to impose restrictions on 
> what is stored in it.

It's not even an argument, nor an attempt to justify it.  It was
just an explanation of historical fact "It did not bother".
Please re-read the final part of the message, which you omitted
from your quote.


* Re: Basename matching during rename/copy detection
  2007-06-21  8:07     ` Junio C Hamano
@ 2007-06-21  9:50       ` Andy Parkins
  2007-06-21 11:52         ` Johannes Schindelin
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Parkins @ 2007-06-21  9:50 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

On Thursday 2007 June 21, Junio C Hamano wrote:

> It's not even an argument, nor an attempt to justify it.  It was
> just an explanation of historical fact "It did not bother".
> Please re-read the final part of the message, which you omitted
> from your quote.

I omitted it because I didn't object to that part :-)

I appreciated (as always) your practicality in that what you proposed would 
let people keep their copies.  What I was objecting to was the idea that any 
repository with duplicate files was "stupid".


Andy
-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com


* [PATCH] diffcore-rename: favour identical basenames
  2007-06-21  3:42 ` Linus Torvalds
@ 2007-06-21 11:52   ` Johannes Schindelin
  2007-06-21 13:19     ` Jeff King
  2007-06-23  5:44     ` [PATCH] diffcore-rename: favour identical basenames Junio C Hamano
  0 siblings, 2 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 11:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Shawn O. Pearce, git, govindsalinas, gitster


When there are several candidates for a rename source, and one of them
has an identical basename to the rename target, take that one.

Noticed by Govind Salinas, posted by Shawn O. Pearce, partial patch
by Linus Torvalds.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	On Wed, 20 Jun 2007, Linus Torvalds wrote:

	> I think we should just consider the basename as an "added 
	> similarity  bonus".
	> 
	> IOW, we currently sort purely by data similarity, but how about 
	> just adding a small increment for "same base name".
	> 
	> [patch suggestion snipped, since it is identical to what is below]

	How 'bout this?

 diffcore-rename.c      |   33 ++++++++++++++++++++++++++++++++-
 t/t4001-diff-rename.sh |   13 +++++++++++++
 2 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 93c40d9..79c984c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -119,6 +119,21 @@ static int is_exact_match(struct diff_filespec *src,
 	return 0;
 }
 
+static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
+{
+	int src_len = strlen(src->path), dst_len = strlen(dst->path);
+	while (src_len && dst_len) {
+		char c1 = src->path[--src_len];
+		char c2 = dst->path[--dst_len];
+		if (c1 != c2)
+			return 0;
+		if (c1 == '/')
+			return 1;
+	}
+	return (!src_len || src->path[src_len - 1] == '/') &&
+		(!dst_len || dst->path[dst_len - 1] == '/');
+}
+
 struct diff_score {
 	int src; /* index in rename_src */
 	int dst; /* index in rename_dst */
@@ -186,8 +201,11 @@ static int estimate_similarity(struct diff_filespec *src,
 	 */
 	if (!dst->size)
 		score = 0; /* should not happen */
-	else
+	else {
 		score = (int)(src_copied * MAX_SCORE / max_size);
+		if (basename_same(src, dst))
+			score++;
+	}
 	return score;
 }
 
@@ -295,9 +313,22 @@ void diffcore_rename(struct diff_options *options)
 			if (rename_dst[i].pair)
 				continue; /* dealt with an earlier round */
 			for (j = 0; j < rename_src_nr; j++) {
+				int k;
 				struct diff_filespec *one = rename_src[j].one;
 				if (!is_exact_match(one, two, contents_too))
 					continue;
+
+				/* see if there is a basename match, too */
+				for (k = j; k < rename_src_nr; k++) {
+					one = rename_src[k].one;
+					if (basename_same(one, two) &&
+						is_exact_match(one, two,
+							contents_too)) {
+						j = k;
+						break;
+					}
+				}
+
 				record_rename_pair(i, j, (int)MAX_SCORE);
 				rename_count++;
 				break; /* we are done with this entry */
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index 2e3c20d..90c085f 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -64,4 +64,17 @@ test_expect_success \
     'validate the output.' \
     'compare_diff_patch current expected'
 
+test_expect_success 'favour same basenames over different ones' '
+	cp path1 another-path &&
+	git add another-path &&
+	git commit -m 1 &&
+	git rm path1 &&
+	mkdir subdir &&
+	git mv another-path subdir/path1 &&
+	git runstatus | grep "renamed: .*path1 -> subdir/path1"'
+
+test_expect_success  'favour same basenames even with minor differences' '
+	git show HEAD:path1 | sed "s/15/16/" > subdir/path1 &&
+	git runstatus | grep "renamed: .*path1 -> subdir/path1"'
+
 test_done
-- 
1.5.2.2.2822.g027a6-dirty


* Re: Basename matching during rename/copy detection
  2007-06-21  9:50       ` Andy Parkins
@ 2007-06-21 11:52         ` Johannes Schindelin
  2007-06-21 12:44           ` Andy Parkins
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 11:52 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git, Junio C Hamano

Hi,

On Thu, 21 Jun 2007, Andy Parkins wrote:

> On Thursday 2007 June 21, Junio C Hamano wrote:
> 
> > It's not even an argument, nor an attempt to justify it.  It was just 
> > an explanation of historical fact "It did not bother". Please re-read 
> > the final part of the message, which you omitted from your quote.
> 
> I appreciated (as always) your practicality in that what you proposed 
> would let people keep their copies.  What I was objecting to was the 
> idea that any repository with duplicate files was "stupid".

FWIW I find it stupid, too.

Ciao,
Dscho


* Re: Basename matching during rename/copy detection
  2007-06-21 11:52         ` Johannes Schindelin
@ 2007-06-21 12:44           ` Andy Parkins
  2007-06-21 12:53             ` Matthieu Moy
  2007-06-21 13:22             ` Johannes Schindelin
  0 siblings, 2 replies; 44+ messages in thread
From: Andy Parkins @ 2007-06-21 12:44 UTC (permalink / raw)
  To: git; +Cc: Johannes Schindelin, Junio C Hamano

On Thursday 2007 June 21, Johannes Schindelin wrote:

> > would let people keep their copies.  What I was objecting to was the
> > idea that any repository with duplicate files was "stupid".
>
> FWIW I find it stupid, too.

Thanks very much.  Okay, as I've been put in the position of defending this, 
let me give you the use case that has cropped up for me to do this stupid 
thing.

I've got a GUI program with a load of tool buttons.  Each of those buttons 
will, in the final product, be unique images.  When I write the program, I 
want to be able to refer to each of the button images in the correct place.  
e.g.

 setImage( NewButton, "path/to/new-button.png" );
 setImage( OpenButton, "path/to/open-button.png" );
 setImage( SaveButton, "path/to/save-button.png" );

Unfortunately, I can't draw.  So, I open up gimp, draw a big red X and save it 
as new-button.png.  Then I copy that file to open-button.png and 
save-button.png, knowing that at some point in the future, someone will come 
and replace those red-X images with something appropriate.

All those images now go in the repository.  Symbolic links are not an option, 
as it's got to be checkable out on Windows.

Tell me what part of that is stupid?

Andy
-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com


* Re: Basename matching during rename/copy detection
  2007-06-21 12:44           ` Andy Parkins
@ 2007-06-21 12:53             ` Matthieu Moy
  2007-06-21 13:10               ` Jeff King
  2007-06-21 13:18               ` Johannes Schindelin
  2007-06-21 13:22             ` Johannes Schindelin
  1 sibling, 2 replies; 44+ messages in thread
From: Matthieu Moy @ 2007-06-21 12:53 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git, Johannes Schindelin, Junio C Hamano

Andy Parkins <andyparkins@gmail.com> writes:

> On Thursday 2007 June 21, Johannes Schindelin wrote:
>
>> > would let people keep their copies.  What I was objecting to was the
>> > idea that any repository with duplicate files was "stupid".
>>
>> FWIW I find it stupid, too.
>
> Thanks very much.  Okay, as I've been put in the position of defending this, 
> let me give you the use case that has cropped up for me to do this stupid 
> thing.

Well, why look so far to find an example of people having identical
files in their tree?

$ cd git
$ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | wc -l              
973
$ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | sort | uniq | wc -l
964
$ 

-- 
Matthieu


* Re: Basename matching during rename/copy detection
  2007-06-21 12:53             ` Matthieu Moy
@ 2007-06-21 13:10               ` Jeff King
  2007-06-21 13:18               ` Johannes Schindelin
  1 sibling, 0 replies; 44+ messages in thread
From: Jeff King @ 2007-06-21 13:10 UTC (permalink / raw)
  To: git; +Cc: Andy Parkins, Johannes Schindelin, Junio C Hamano

On Thu, Jun 21, 2007 at 02:53:32PM +0200, Matthieu Moy wrote:

> Well, why look so far to find an example of people having identical
> files in their tree?
> 
> $ cd git
> $ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | wc -l              
> 973
> $ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | sort | uniq | wc -l
> 964

md5? What is this, CVS? How about:

git-ls-files -s | cut -d' ' -f2 | sort | uniq -d | wc -l

Your pipeline will also list files in the working directory, which can
inflate the number of duplicates (note that git-foo.sh and git-foo will
have the same content).

-Peff

PS Please don't take this to mean I think duplicate files are stupid; I
think they can be quite useful. I just wanted to nitpick your shell
command. :)


* Re: Basename matching during rename/copy detection
  2007-06-21 12:53             ` Matthieu Moy
  2007-06-21 13:10               ` Jeff King
@ 2007-06-21 13:18               ` Johannes Schindelin
  2007-06-21 13:25                 ` Matthieu Moy
  1 sibling, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 13:18 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Andy Parkins, git, Junio C Hamano

Hi,

On Thu, 21 Jun 2007, Matthieu Moy wrote:

> Well, why look so far to find an example of people having identical
> files in their tree?
> 
> $ cd git
> $ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | wc -l              
> 973
> $ git-ls-files -z | xargs -0 md5sum | cut -f 1 -d ' ' | sort | uniq | wc -l
> 964
> $ 

Have you checked the files? They are all some blobs in the test scripts. 

Ciao,
Dscho


* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-21 11:52   ` [PATCH] diffcore-rename: favour identical basenames Johannes Schindelin
@ 2007-06-21 13:19     ` Jeff King
  2007-06-21 14:03       ` Johannes Schindelin
                         ` (2 more replies)
  2007-06-23  5:44     ` [PATCH] diffcore-rename: favour identical basenames Junio C Hamano
  1 sibling, 3 replies; 44+ messages in thread
From: Jeff King @ 2007-06-21 13:19 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster

On Thu, Jun 21, 2007 at 12:52:11PM +0100, Johannes Schindelin wrote:

> When there are several candidates for a rename source, and one of them
> has an identical basename to the rename target, take that one.

That's a reasonable heuristic, but it unfortunately won't match simple
things like:

  i386_widget.c -> arch/i386/widget.c

You really don't care about "is this a good match" as much as providing
an order to potential matches. I think something like a Levenshtein
distance between the full pathnames would give good results, and would
cover almost every situation that the basename heuristic would (there
are a few exceptions, like getting "file.c" from either "file2.c" or
"foo/file.c", but that seems kind of pathological).

Sorry to post without a patch, but I don't have time right this second.
I'll add it to the end of my (ever-growing) todo list if you think it's
a good idea and don't do it yourself. :)

-Peff
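
A minimal sketch of the distance being suggested here, assuming plain
unit costs for insertion, deletion and substitution and no special
treatment of '/' as a directory boundary. The function name
`levenshtein()` is an illustrative choice, not something that exists in
git at this point.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Levenshtein edit distance between strings a and b, with unit cost for
 * insertion, deletion and substitution.  Uses a single row of the
 * dynamic-programming table.  Returns -1 if memory allocation fails.
 */
int levenshtein(const char *a, const char *b)
{
	size_t la, lb, i, j;
	int *row, result;

	assert(a != NULL && b != NULL);
	la = strlen(a);
	lb = strlen(b);
	row = malloc((lb + 1) * sizeof(*row));
	if (!row)
		return -1;

	/* Distance from the empty prefix of a to each prefix of b. */
	for (j = 0; j <= lb; j++)
		row[j] = (int)j;

	for (i = 1; i <= la; i++) {
		int diag = row[0];	/* previous row, column j - 1 */

		row[0] = (int)i;
		for (j = 1; j <= lb; j++) {
			int up = row[j];	/* previous row, column j */
			int sub = diag + (a[i - 1] != b[j - 1]);
			int del = up + 1;
			int ins = row[j - 1] + 1;
			int best = sub < del ? sub : del;

			row[j] = best < ins ? best : ins;
			diag = up;
		}
	}
	result = row[lb];
	free(row);
	return result;
}
```

Under these assumptions "file2.c" is at distance 1 from "file.c" while
"foo/file.c" is at distance 4, which is the exception mentioned above:
a pure full-path distance would rank "file2.c" as the closer name.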


* Re: Basename matching during rename/copy detection
  2007-06-21 12:44           ` Andy Parkins
  2007-06-21 12:53             ` Matthieu Moy
@ 2007-06-21 13:22             ` Johannes Schindelin
  1 sibling, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 13:22 UTC (permalink / raw)
  To: Andy Parkins; +Cc: git, Junio C Hamano

Hi,

On Thu, 21 Jun 2007, Andy Parkins wrote:

> I open up gimp, draw a big red X and save it as new-button.png.  Then I 
> copy that file to open-button.png and save-button.png, knowing that at 
> some point in the future, someone will come and replace those red-X 
> images with something appropriate.

So you have a couple of identical files in your repo, which are 
placeholders. That is quite different from what I criticised as being 
stupid.

Well, I would not have checked in the files in your place, but only one:

	dumb-red-X.png

Then, my Makefile would have checked for the existence of, say, 
my-wonderful-ok-button.png, and if it does not exist yet, copy 
dumb-red-X.png to it.

Now, when somebody comes along, painting the prettiest ok button I ever 
saw, I copy that over the copy of dumb-red-X.png, and check it in.

It has the further bonus that I know exactly which buttons I have to find 
a suck^Wgifted artist for.

Ciao,
Dscho


* Re: Basename matching during rename/copy detection
  2007-06-21 13:18               ` Johannes Schindelin
@ 2007-06-21 13:25                 ` Matthieu Moy
  2007-06-21 13:52                   ` Johannes Schindelin
  0 siblings, 1 reply; 44+ messages in thread
From: Matthieu Moy @ 2007-06-21 13:25 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Andy Parkins, git, Junio C Hamano

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> Have you checked the files? They are all some blobs in the test scripts. 

Yes, but how does it make any difference? You still want git to manage
them properly, don't you?

-- 
Matthieu


* Re: Basename matching during rename/copy detection
  2007-06-21 13:25                 ` Matthieu Moy
@ 2007-06-21 13:52                   ` Johannes Schindelin
  2007-06-21 15:37                     ` Steven Grimm
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 13:52 UTC (permalink / raw)
  To: Matthieu Moy; +Cc: Andy Parkins, git, Junio C Hamano

Hi,

On Thu, 21 Jun 2007, Matthieu Moy wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > Have you checked the files? They are all some blobs in the test scripts. 
> 
> Yes, but how does it make any difference? You still want git to manage
> them properly, don't you?

Yes. And Git explicitly allows what I call stupid. And yes, those 
_identical_ files in the test suite should probably all be folded into 
single files, and the places where they are used should reference _that_ 
single instance.

Ciao,
Dscho


* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-21 13:19     ` Jeff King
@ 2007-06-21 14:03       ` Johannes Schindelin
  2007-06-21 16:20       ` Linus Torvalds
  2007-06-22  1:14       ` Johannes Schindelin
  2 siblings, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 14:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster

Hi,

On Thu, 21 Jun 2007, Jeff King wrote:

> On Thu, Jun 21, 2007 at 12:52:11PM +0100, Johannes Schindelin wrote:
> 
> > When there are several candidates for a rename source, and one of them
> > has an identical basename to the rename target, take that one.
> 
> That's a reasonable heuristic, but it unfortunately won't match simple
> things like:
> 
>   i386_widget.c -> arch/i386/widget.c

That's right. But every heuristic falls down eventually. Personally, I 
think basename_same() is good enough, even if the technical challenge of 
implementing a small enough Levenshtein, which still respects directory 
boundaries somehow (and does not just throw them away), is interesting.

Besides, Levenshtein would introduce a ranking, not a boolean value like 
basename_same(). And that complicates the code.

All in all, I'd say Levenshtein is not worth the _result_.

Ciao,
Dscho


* Re: Basename matching during rename/copy detection
  2007-06-21 13:52                   ` Johannes Schindelin
@ 2007-06-21 15:37                     ` Steven Grimm
  2007-06-21 15:53                       ` Johannes Schindelin
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Grimm @ 2007-06-21 15:37 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Matthieu Moy, Andy Parkins, git, Junio C Hamano

Johannes Schindelin wrote:
> Yes. And Git explicitly allows what I call stupid. And yes, those 
> _identical_ files in the test suite should probably all be folded into 
> single files, and the places where they are used should reference _that_ 
> single instance.
>   

Two files that are identical in the current revision have not 
necessarily been identical from the beginning. Doing what you suggest 
will cause you to lose the history of all but one of those files.

Files can absolutely become identical in the real world. I know that for 
a fact because it happened to me just this week (see my "Directory 
renames" message from a few days ago.) Are you seriously suggesting that 
every time I unpack an update from a third party, I should go through it 
and see if they have changed any files such that the contents now match 
another file in my repository, and if so, I should remove all but one of 
the copies from my repository and have a build system create it instead? 
Then undo that work when I unpack another update and the files are no 
longer identical?

Well, no, I know you're not suggesting that, but it's the logical 
conclusion of the "it's stupid to ever have duplicate files" philosophy. 
While that approach certainly makes life easier for the version control 
system, it doesn't exactly make life easier for the *developer*, which 
is kind of the whole point of why we're here.

-Steve


* Re: Basename matching during rename/copy detection
  2007-06-21 15:37                     ` Steven Grimm
@ 2007-06-21 15:53                       ` Johannes Schindelin
  2007-06-21 16:57                         ` Steven Grimm
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-21 15:53 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Matthieu Moy, Andy Parkins, git, Junio C Hamano

Hi,

On Thu, 21 Jun 2007, Steven Grimm wrote:

> Johannes Schindelin wrote:
> > Yes. And Git explicitly allows what I call stupid. And yes, those
> > _identical_ files in the test suite should probably all be folded into
> > single files, and the places where they are used should reference _that_
> > single instance.
> >   
> 
> Two files that are identical in the current revision have not necessarily
> been identical from the beginning. Doing what you suggest will cause you to
> lose the history of all but one of those files.
> 
> Files can absolutely become identical in the real world. I know that for a
> fact because it happened to me just this week (see my "Directory renames"
> message from a few days ago.)

No, that message did not convince me. It was way too short on the side of 
facts.

And no, I do not think that two unrelated files can get exactly the same 
content.

Be that as it may, even _if_ there were such a case, I'd still try to reuse 
the same file in the working directory. Just because Git can deal 
efficiently with millions of identical files does not mean that a working 
directory can, or worse, human developers.

Ciao,
Dscho


* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-21 13:19     ` Jeff King
  2007-06-21 14:03       ` Johannes Schindelin
@ 2007-06-21 16:20       ` Linus Torvalds
  2007-06-21 17:52         ` Junio C Hamano
  2007-06-22 15:19         ` Andy Parkins
  2007-06-22  1:14       ` Johannes Schindelin
  2 siblings, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2007-06-21 16:20 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Shawn O. Pearce, git, govindsalinas, gitster

On Thu, 21 Jun 2007, Jeff King wrote:

> On Thu, Jun 21, 2007 at 12:52:11PM +0100, Johannes Schindelin wrote:
> 
> > When there are several candidates for a rename source, and one of them
> > has an identical basename to the rename target, take that one.
> 
> That's a reasonable heuristic, but it unfortunately won't match simple
> things like:
> 
>   i386_widget.c -> arch/i386/widget.c

We've also had things like

	arch/i386/kernel/pci-pc.c -> arch/i386/kernel/pci/common.c

so it's not always the ending of a file that is unchanged, but you still 
often have some "similarity" of the name (ie the "pci" substring is still 
common there).

So I agree that we can be even better about the heuristics. I don't know 
how much it *matters* in practice.

I do agree with the people who argue that you simply shouldn't depend on 
these kinds of things, and if you have identical files, and move them 
around, you really are getting behaviour that doesn't matter.

The files are *identical* for christ sake! Following their history, it 
doesn't matter *which* base you follow, since regardless, they've come to 
the same point!

So in that sense, the current git behaviour is actually perfectly fine.

At the same time, I'll argue from a totally theoretical point that the 
"filename" is obviously part of the data in the tree, and as such, a 
similarity comparison that takes only the data into account is a bit 
limited. So while I don't think a user should really care, I also think 
that keeping the filename as part of the similarity analysis is actually 
a perfectly logical and valid thing to do within the git policy of 
"content is king".

The filename *is* part of the content, and it's doubly so when you think 
about a rename or copy operation, where the whole point of the exercise is 
as much about the filename as about the data inside the file.

			Linus


* Re: Basename matching during rename/copy detection
  2007-06-21 15:53                       ` Johannes Schindelin
@ 2007-06-21 16:57                         ` Steven Grimm
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Grimm @ 2007-06-21 16:57 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Matthieu Moy, Andy Parkins, git, Junio C Hamano

Johannes Schindelin wrote:
> No, that message did not convince me. It was way too short on the side of 
> facts.
>   

Short of posting multiple historical versions of the third-party source 
code in question, I'm not sure what I can do to convince you. And I'd 
rather not violate the license agreement on that code. I would have 
thought, though, that the fact that I supplied a detailed, reproducible 
test case with obviously broken behavior would itself have been pretty 
convincing.

The fact that not all projects contain any short files, or any files 
whose contents have ever been identical, does not cause git's behavior 
in that test case to be correct. "It's broken and unfixable" is one 
thing; "It's broken and we don't care" is another; and "It's broken and 
we care but it's not at the top of anyone's priority list to fix" is 
something else again. All of those are fine, but "If it's broken, you 
are stupid" and "If it's broken, it's a sign your project isn't real" 
are not.

Or, to take another tack on this entirely, it is not the proper function 
of a version control system to dictate the contents of the projects 
under its control. It should take whatever we humans throw at it and 
reproduce those contents faithfully with coherent, non-jumbled history. 
It should do so even if what we're throwing at it is completely stupid.

By the way, I'll toss out one more example of legitimate duplicate 
files, though admittedly one where you might not care so much about 
history jumbling: if you have a project that makes use of two GPL 
libraries or utilities whose source you want to keep locally, e.g. 
because you are making local modifications, you will have two copies of 
the GNU "COPYING" file. Neither one produced by a build system (or at 
least, not by *your* build system) and you are not permitted by the 
terms of the GPL to publish a copy of either piece of software without a 
verbatim copy of its license -- it says so right in section 1 of the GPL 
(the "keep intact" wording.) Removing one of those copies and expecting 
a build system to reconstruct it after someone clones your repository 
would arguably be a violation of the GPL.

-Steve

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-21 16:20       ` Linus Torvalds
@ 2007-06-21 17:52         ` Junio C Hamano
  2007-06-21 18:24           ` Linus Torvalds
  2007-06-22 15:19         ` Andy Parkins
  1 sibling, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-06-21 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff King, Johannes Schindelin, Shawn O. Pearce, git,
	govindsalinas, gitster

Linus Torvalds <torvalds@linux-foundation.org> writes:

> We've also had things like
>
> 	arch/i386/kernel/pci-pc.c -> arch/i386/kernel/pci/common.c
>
> so it's not always the ending of a file that is unchanged, but you still 
> often have some "similarity" of the name (ie the "pci" substring is still 
> common there).

This is not an example to draw very useful conclusions, is it?

The heuristics to say '-pc => common' is a more likely rename
than '-obscure-arch => common' heavily depends on human
intelligence in the context of a particular project, the kernel,
where there are rules such as "peripherals are tested most
widely on PC architectures, so assume that the vendors might
have tested their stuff only on PCs".

But I do agree that not limiting to basename has value.
Taking example from the "I cannot draw so here is a red big X",
it is quite possible that two red big X's are replaced with
properly rendered icons, while their format modified, like so:

    images/ok-button.gif => images/buttons/ok.png
    images/cancel-button.gif => images/buttons/cancel.png

This suggests that we might be able to look around to see what
other rename src/target candidate files there are, so that we
can figure out if there is a common pattern (i.e. in the above
example, "patsubst images/%-button.gif,images/buttons/%.png" is
what is going on).  If we find such a pattern, we can base the
assignment of "basename similarity bonus" on that pattern.
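
A rough sketch of the checking half of that idea, assuming the
pattern pair has already been guessed somehow (discovering it from
the candidate list is the hard part and is not attempted here).  The
helpers match_stem() and follows_pattern() are made-up names for
illustration, not existing diffcore functions:

```c
#include <string.h>

/*
 * Hypothetical helper: if "path" matches "pattern" (which must contain
 * one '%' wildcard), copy the part matched by '%' into "stem" and
 * return 1.  Return 0 on mismatch or if the stem does not fit.
 */
static int match_stem(const char *pattern, const char *path,
		      char *stem, size_t stem_size)
{
	const char *pct = strchr(pattern, '%');
	size_t pre, suf, len = strlen(path);

	if (!pct)
		return 0;
	pre = pct - pattern;		/* literal text before '%' */
	suf = strlen(pct + 1);		/* literal text after '%' */
	if (len < pre + suf)
		return 0;
	if (strncmp(path, pattern, pre) ||
	    strcmp(path + len - suf, pct + 1))
		return 0;
	if (len - pre - suf >= stem_size)
		return 0;
	memcpy(stem, path + pre, len - pre - suf);
	stem[len - pre - suf] = '\0';
	return 1;
}

/*
 * Return 1 if src -> dst is an instance of src_pat -> dst_pat,
 * i.e. both paths match their pattern with the same stem.
 */
static int follows_pattern(const char *src_pat, const char *dst_pat,
			   const char *src, const char *dst)
{
	char s1[256], s2[256];

	return match_stem(src_pat, src, s1, sizeof(s1)) &&
	       match_stem(dst_pat, dst, s2, sizeof(s2)) &&
	       !strcmp(s1, s2);
}
```

With the patterns from the example, ok-button.gif pairs with
buttons/ok.png but not with buttons/cancel.png, so a pair that
follows the pattern could be given the bonus.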

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-21 17:52         ` Junio C Hamano
@ 2007-06-21 18:24           ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2007-06-21 18:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Johannes Schindelin, Shawn O. Pearce, git,
	govindsalinas

On Thu, 21 Jun 2007, Junio C Hamano wrote:
> 
> This is not an example to draw very useful conclusions, is it?
> 
> The heuristics to say '-pc => common' is a more likely rename
> than '-obscure-arch => common' heavily depends on human
> intelligence in the context

Oh, absolutely.

I'm just saying that *if* you see two equally weighted content moves, if 
you then prefer the one that has more in common with the name, that's 
likely the right choice. 

In the actual example I gave, there was no ambiguity: the file contents 
were very obvious. But let's say that you happened to have an example of 
two files with 100% identical content that moved, and you had the files

	-arch/i386/kernel/pci-pc.c
	-arch/alpha/kernel/pci-pc.c
	+arch/i386/kernel/pci/common.c
	+arch/alpha/kernel/pci/common.c

to match up, how would you do it? Again: they're all identical files: we 
can obviously agree that two files got renamed, but what is the pairing?

I'd suggest that if you do it by matching up the similarity of the 
filenames (not necessarily "exact same basename"), you'd actually catch 
it. In this case, they all have "pci" in them, but the "alpha" similarity 
would make you select the right one.

Similarly, in some other cases, the "pci" might be the thing they have in 
common, and might be the thing that decides that "oh, those two filenames 
look like they might be more of a better pair".

And yes, all of this would trigger only if the file data content match is 
non-conclusive. The file data is *more* important, but that doesn't mean 
that the file name similarity is *totally* unimportant either.
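
As a sketch only: one very crude way to measure "has more in common
with the name" is the length of the common prefix plus the common
suffix of the two paths.  That particular measure is an assumption
made for illustration (nothing in this thread settles on it), and
name_similarity() and best_source() are hypothetical names.  It would
only be consulted among candidates whose content match is equally
good:

```c
#include <string.h>

/* Crude name similarity: common prefix length plus common suffix length. */
static int name_similarity(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);
	size_t min = la < lb ? la : lb;
	size_t pre = 0, suf = 0;

	while (pre < min && a[pre] == b[pre])
		pre++;
	/* do not let the suffix overlap the prefix already counted */
	while (suf < min - pre && a[la - 1 - suf] == b[lb - 1 - suf])
		suf++;
	return (int)(pre + suf);
}

/*
 * Among n candidate sources, all assumed to match dst equally well
 * by content, return the index of the one whose path is most similar
 * to dst.  Ties go to the earliest candidate.
 */
static int best_source(const char **src, int n, const char *dst)
{
	int i, best = -1, best_score = -1;

	for (i = 0; i < n; i++) {
		int score = name_similarity(src[i], dst);
		if (score > best_score) {
			best_score = score;
			best = i;
		}
	}
	return best;
}
```

On the four paths above this picks the alpha source for the alpha
destination and the i386 source for the i386 destination, because the
"arch/alpha/kernel/pci" or "arch/i386/kernel/pci" prefix dominates.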

			Linus

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-21 13:19     ` Jeff King
  2007-06-21 14:03       ` Johannes Schindelin
  2007-06-21 16:20       ` Linus Torvalds
@ 2007-06-22  1:14       ` Johannes Schindelin
  2007-06-22  5:41         ` Jeff King
  2007-06-22  7:17         ` Johannes Sixt
  2 siblings, 2 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22  1:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster

Hi,

On Thu, 21 Jun 2007, Jeff King wrote:

> I think something like a Levenshtein distance between the full pathnames 
> would give good results, and would cover almost every situation that the 
> basename heuristic would (there are a few exceptions, like getting 
> "file.c" from either "file2.c" or "foo/file.c", but that seems kind of 
> pathological).

Well, now you only have to test if it makes sense:

-- snipsnap --
[PATCH] diffcore-rename: replace basename_same() heuristics by Levenshtein

Instead of insisting on identical basenames, try the levenshtein
distance.

Basically, if there are multiple rename source candidates, take the
one with the smallest Levenshtein distance.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	The dangerous thing is that the score can get negative now.

 Makefile          |    4 ++--
 diffcore-rename.c |   42 +++++++++++++++---------------------------
 levenshtein.c     |   39 +++++++++++++++++++++++++++++++++++++++
 levenshtein.h     |    6 ++++++
 4 files changed, 62 insertions(+), 29 deletions(-)
 create mode 100644 levenshtein.c
 create mode 100644 levenshtein.h

diff --git a/Makefile b/Makefile
index 74b69fb..e015833 100644
--- a/Makefile
+++ b/Makefile
@@ -303,12 +303,12 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h
+	mailmap.h remote.h levenshtein.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
 	diffcore-pickaxe.o diffcore-rename.o tree-diff.o combine-diff.o \
-	diffcore-delta.o log-tree.o
+	diffcore-delta.o log-tree.o levenshtein.o
 
 LIB_OBJS = \
 	blob.o commit.o connect.o csum-file.o cache-tree.o base85.o \
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 79c984c..41448c9 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -4,6 +4,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "levenshtein.h"
 
 /* Table of rename/copy destinations */
 
@@ -119,21 +120,6 @@ static int is_exact_match(struct diff_filespec *src,
 	return 0;
 }
 
-static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
-{
-	int src_len = strlen(src->path), dst_len = strlen(dst->path);
-	while (src_len && dst_len) {
-		char c1 = src->path[--src_len];
-		char c2 = dst->path[--dst_len];
-		if (c1 != c2)
-			return 0;
-		if (c1 == '/')
-			return 1;
-	}
-	return (!src_len || src->path[src_len - 1] == '/') &&
-		(!dst_len || dst->path[dst_len - 1] == '/');
-}
-
 struct diff_score {
 	int src; /* index in rename_src */
 	int dst; /* index in rename_dst */
@@ -201,11 +187,9 @@ static int estimate_similarity(struct diff_filespec *src,
 	 */
 	if (!dst->size)
 		score = 0; /* should not happen */
-	else {
-		score = (int)(src_copied * MAX_SCORE / max_size);
-		if (basename_same(src, dst))
-			score++;
-	}
+	else
+		score = (int)(src_copied * MAX_SCORE / max_size)
+			- levenshtein(src->path, dst->path);
 	return score;
 }
 
@@ -313,20 +297,24 @@ void diffcore_rename(struct diff_options *options)
 			if (rename_dst[i].pair)
 				continue; /* dealt with an earlier round */
 			for (j = 0; j < rename_src_nr; j++) {
-				int k;
+				int k, distance;
 				struct diff_filespec *one = rename_src[j].one;
 				if (!is_exact_match(one, two, contents_too))
 					continue;
 
+				distance = levenshtein(one->path, two->path);
 				/* see if there is a basename match, too */
 				for (k = j; k < rename_src_nr; k++) {
+					int d2;
 					one = rename_src[k].one;
-					if (basename_same(one, two) &&
-						is_exact_match(one, two,
-							contents_too)) {
-						j = k;
-						break;
-					}
+					if (!is_exact_match(one, two,
+								contents_too))
+						continue;
+					d2 = levenshtein(one->path, two->path);
+					if (d2 > distance)
+						continue;
+					distance = d2;
+					j = k;
 				}
 
 				record_rename_pair(i, j, (int)MAX_SCORE);
diff --git a/levenshtein.c b/levenshtein.c
new file mode 100644
index 0000000..80ef860
--- /dev/null
+++ b/levenshtein.c
@@ -0,0 +1,39 @@
+#include "cache.h"
+#include "levenshtein.h"
+
+int levenshtein(const char *string1, const char *string2)
+{
+	int len1 = strlen(string1), len2 = strlen(string2);
+	int *row1 = xmalloc(sizeof(int) * (len2 + 1));
+	int *row2 = xmalloc(sizeof(int) * (len2 + 1));
+	int i, j;
+
+	for (j = 1; j <= len2; j++)
+		row1[j] = j;
+	for (i = 0; i < len1; i++) {
+		int *dummy;
+
+		row2[0] = i + 1;
+		for (j = 0; j < len2; j++) {
+			/* substitution */
+			row2[j + 1] = row1[j] + (string1[i] != string2[j]);
+			/* insertion */
+			if (row2[j + 1] > row1[j + 1] + 1)
+				row2[j + 1] = row1[j + 1] + 1;
+			/* deletion */
+			if (row2[j + 1] > row2[j] + 1)
+				row2[j + 1] = row2[j] + 1;
+		}
+
+		dummy = row1;
+		row1 = row2;
+		row2 = dummy;
+	}
+
+	i = row1[len2];
+	free(row1);
+	free(row2);
+
+	return i;
+}
+
diff --git a/levenshtein.h b/levenshtein.h
new file mode 100644
index 0000000..74a6626
--- /dev/null
+++ b/levenshtein.h
@@ -0,0 +1,6 @@
+#ifndef LEVENSHTEIN_H
+#define LEVENSHTEIN_H
+
+int levenshtein(const char *string1, const char *string2);
+
+#endif
-- 
1.5.2.2.2822.g027a6-dirty

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-22  1:14       ` Johannes Schindelin
@ 2007-06-22  5:41         ` Jeff King
  2007-06-22 10:22           ` Johannes Schindelin
  2007-06-22  7:17         ` Johannes Sixt
  1 sibling, 1 reply; 44+ messages in thread
From: Jeff King @ 2007-06-22  5:41 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster

On Fri, Jun 22, 2007 at 02:14:43AM +0100, Johannes Schindelin wrote:

> @@ -313,20 +297,24 @@ void diffcore_rename(struct diff_options *options)
>  			if (rename_dst[i].pair)
>  				continue; /* dealt with an earlier round */
>  			for (j = 0; j < rename_src_nr; j++) {
> -				int k;
> +				int k, distance;
>  				struct diff_filespec *one = rename_src[j].one;
>  				if (!is_exact_match(one, two, contents_too))
>  					continue;
>  
> +				distance = levenshtein(one->path, two->path);
>  				/* see if there is a basename match, too */
>  				for (k = j; k < rename_src_nr; k++) {

This loop can start at k = j+1, since otherwise we are just checking
rename_src[j] against itself.

> +int levenshtein(const char *string1, const char *string2)
> +{
> +	int len1 = strlen(string1), len2 = strlen(string2);
> +	int *row1 = xmalloc(sizeof(int) * (len2 + 1));
> +	int *row2 = xmalloc(sizeof(int) * (len2 + 1));
> +	int i, j;
> +
> +	for (j = 1; j <= len2; j++)
> +		row1[j] = j;

This loop must start at j=0, not j=1; otherwise you have an undefined
value in row1[0], which gets read when setting row2[1], and you get
a totally meaningless distance (I got -1209667248 on my test case!).
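
For reference, a standalone sketch of the function with the first
row initialised from j=0.  It uses plain malloc() instead of git's
xmalloc() so that it compiles outside the git tree, and returns -1 on
allocation failure; both of those choices are assumptions of this
sketch, not part of the posted patch:

```c
#include <stdlib.h>
#include <string.h>

int levenshtein(const char *string1, const char *string2)
{
	int len1 = strlen(string1), len2 = strlen(string2);
	int *row1 = malloc(sizeof(int) * (len2 + 1));
	int *row2 = malloc(sizeof(int) * (len2 + 1));
	int i, j;

	if (!row1 || !row2) {
		free(row1);
		free(row2);
		return -1;
	}

	/* empty prefix of string1 vs. first j chars of string2 */
	for (j = 0; j <= len2; j++)
		row1[j] = j;
	for (i = 0; i < len1; i++) {
		int *tmp;

		/* first i+1 chars of string1 vs. empty prefix of string2 */
		row2[0] = i + 1;
		for (j = 0; j < len2; j++) {
			/* substitution (free if the characters match) */
			row2[j + 1] = row1[j] + (string1[i] != string2[j]);
			/* skip string1[i] */
			if (row2[j + 1] > row1[j + 1] + 1)
				row2[j + 1] = row1[j + 1] + 1;
			/* skip string2[j] */
			if (row2[j + 1] > row2[j] + 1)
				row2[j + 1] = row2[j] + 1;
		}

		/* the row just computed becomes the previous row */
		tmp = row1;
		row1 = row2;
		row2 = tmp;
	}

	i = row1[len2];
	free(row1);
	free(row2);

	return i;
}
```

With row1[0] set to 0, the first substitution step reads a defined
value and the result stays within 0..max(len1, len2).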

-Peff

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-22  1:14       ` Johannes Schindelin
  2007-06-22  5:41         ` Jeff King
@ 2007-06-22  7:17         ` Johannes Sixt
  2007-06-22 10:39           ` Johannes Schindelin
  1 sibling, 1 reply; 44+ messages in thread
From: Johannes Sixt @ 2007-06-22  7:17 UTC (permalink / raw)
  To: git

Johannes Schindelin wrote:
>         The dangerous thing is that the score can get negative now.
>  ...
> +               score = (int)(src_copied * MAX_SCORE / max_size)
> +                       - levenshtein(src->path, dst->path);

Does that also mean that you can't ever have a rename with a score of
100%?

(I haven't studied the algorithms and assume that levenshtein(a,b) == 0
only if a==b, and that without the -levenshtein(...) the score can grow
to 100%.)

-- Hannes

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-22  5:41         ` Jeff King
@ 2007-06-22 10:22           ` Johannes Schindelin
  0 siblings, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 10:22 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster

Hi,

On Fri, 22 Jun 2007, Jeff King wrote:

> On Fri, Jun 22, 2007 at 02:14:43AM +0100, Johannes Schindelin wrote:
> 
> > @@ -313,20 +297,24 @@ void diffcore_rename(struct diff_options *options)
> >  			if (rename_dst[i].pair)
> >  				continue; /* dealt with an earlier round */
> >  			for (j = 0; j < rename_src_nr; j++) {
> > -				int k;
> > +				int k, distance;
> >  				struct diff_filespec *one = rename_src[j].one;
> >  				if (!is_exact_match(one, two, contents_too))
> >  					continue;
> >  
> > +				distance = levenshtein(one->path, two->path);
> >  				/* see if there is a basename match, too */
> >  				for (k = j; k < rename_src_nr; k++) {
> 
> This loop can start at k = j+1, since otherwise we are just checking
> rename_src[j] against itself.

Right.

> > +int levenshtein(const char *string1, const char *string2)
> > +{
> > +	int len1 = strlen(string1), len2 = strlen(string2);
> > +	int *row1 = xmalloc(sizeof(int) * (len2 + 1));
> > +	int *row2 = xmalloc(sizeof(int) * (len2 + 1));
> > +	int i, j;
> > +
> > +	for (j = 1; j <= len2; j++)
> > +		row1[j] = j;
> 
> This loop must start at j=0, not j=1; otherwise you have an undefined
> value in row1[0], which gets read when setting row2[1], and you get
> a totally meaningless distance (I got -1209667248 on my test case!).

Sorry for that. I originally had an xcalloc in there, and did not look at 
that loop afterwards.

And I completely forgot that on my laptop (on which I did this patch), I 
had forgotten to add

	ALL_CFLAGS += -DXMALLOC_POISON=1

to config.mak.

Ciao,
Dscho

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-22  7:17         ` Johannes Sixt
@ 2007-06-22 10:39           ` Johannes Schindelin
  2007-06-22 10:52             ` 100% (was: [PATCH] diffcore-rename: favour identical basenames) David Kastrup
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 10:39 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Hi,

On Fri, 22 Jun 2007, Johannes Sixt wrote:

> Johannes Schindelin wrote:
> >         The dangerous thing is that the score can get negative now.
> >  ...
> > +               score = (int)(src_copied * MAX_SCORE / max_size)
> > +                       - levenshtein(src->path, dst->path);
> 
> Does that also mean that you can't ever have a rename with a score of
> 100%?
> 
> (I haven't studied the algorithms and assume that levenshtein(a,b) == 0
> only if a==b, and that without the -levenshtein(...) the score can grow
> to 100%.)

There is a different code path for identical contents. So yes, you can 
still hit 100%, but it is now much, much harder to hit a score close to 
100% [*1*].

The obviously correct way to do this is to have a subscore, and use it 
_strictly_ only when the score is identical.

I see two ways to do this properly:

- introduce a name_distance struct member, just below the score. This 
  means that estimate_similarity has to "return" two values instead of 
  one, and score_compare gets a bit more complex, too. Or

- change the score to unsigned long, and shift the score to higher bits, 
  adding a constant minus the Levenshtein distance. It is safe to assume 
  that the filenames are shorter than 16384 bytes (PATH_MAX is actually 
  much smaller than that), and even if two filenames of that length are 
  completely different, the distance can not be larger than twice that 
  number, i.e. 16384 deletions + 16384 insertions. Therefore, you could 
  pick 32768 as that constant.
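
To make the two options concrete, here is a minimal sketch of both.
All names (scored_pair, scored_pair_cmp, combine_score, NAME_BIAS,
NAME_BITS) are hypothetical, and the 16-bit shift is simply the
smallest one that leaves room for the 32768 constant; none of this is
code from diffcore-rename.c:

```c
/* Option 1: keep the name distance as a separate tie-breaker. */
struct scored_pair {
	int score;		/* content similarity, higher is better */
	int name_distance;	/* Levenshtein distance, lower is better */
};

/* qsort-style comparison; the best candidate sorts first. */
static int scored_pair_cmp(const void *a_, const void *b_)
{
	const struct scored_pair *a = a_, *b = b_;

	if (a->score != b->score)
		return b->score - a->score;
	/* only consulted when the content scores are identical */
	return a->name_distance - b->name_distance;
}

/* Option 2: pack both into one unsigned long. */
#define NAME_BITS 16
#define NAME_BIAS 32768	/* assumed upper bound on the distance */

static unsigned long combine_score(unsigned int score, unsigned int distance)
{
	if (distance > NAME_BIAS)
		distance = NAME_BIAS;	/* keep the low part non-negative */
	/* content score in the high bits, inverted distance in the low bits */
	return ((unsigned long)score << NAME_BITS) + (NAME_BIAS - distance);
}

/* Recover the plain content score from a packed value. */
static unsigned int content_score(unsigned long combined)
{
	return (unsigned int)(combined >> NAME_BITS);
}
```

In both variants a better content score always wins, and the name
distance only decides between candidates whose content scores are
equal.  With a score of at most 60000, the packed value still fits
in 32 bits.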

However, I find both solutions ugly. Besides, I am not interested in the 
feature myself, only the implementation of Levenshtein was interesting, 
and I thought I would just post the code here. So I did only the minimal stuff 
on top of the interesting one to make it sort of work.

If somebody wants to pick up the ball, be my guest, because I am out of 
that game.

Ciao,
Dscho

Footnote:

*1* Actually, it is not _that_ bad. The score is not a value between 0 and 
    100, IOW it is _not_ what you see in the output of "diff -M". It is an 
    unsigned short between 0 and MAX_SCORE, which is defined in 
    diffcore.h as 60000.0.

    The Levenshtein distance between two filenames cannot be larger than 
    the sum of their lengths, so it should be relatively safe. That is, if 
    you don't have such insanely long paths as e.g. egit. But even there, 
    the paths share most of their directories, and therefore the distances 
    should be much, much smaller in real life.

* 100% (was: [PATCH] diffcore-rename: favour identical basenames)
  2007-06-22 10:39           ` Johannes Schindelin
@ 2007-06-22 10:52             ` David Kastrup
  2007-06-22 12:49               ` Johannes Schindelin
  0 siblings, 1 reply; 44+ messages in thread
From: David Kastrup @ 2007-06-22 10:52 UTC (permalink / raw)
  To: git


> Footnote:
>
> *1* Actually, it is not _that_ bad. The score is not a value between 0 and 
>     100, IOW it is _not_ what you see in the output of "diff -M". It is an 
>     unsigned short between 0 and MAX_SCORE, which is defined in 
>     diffcore.h as 60000.0.
>
>     The Levenshtein distance between two filenames cannot be larger than 
>     the sum of their lengths, so it should be relatively safe. That is, if 
>     you don't have such insanely long paths as e.g. egit. But even there, 
>     the paths share most of their directories, and therefore the distances 
>     should be much, much smaller in real life.

As a note aside: would it be possible to always round downwards when
computing similarities or converting between them?

I very much would like to see the 100% figure reserved for identity.
This is particularly relevant when interpreting the output of git-diff
--name-status with regard to R100, C100 and similar flags.
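
To illustrate the difference, a small sketch that assumes the
percentage is derived from the internal score by scaling against
MAX_SCORE (60000, per the footnote quoted above).  The two helper
names are made up for this example and the round-to-nearest variant
is an assumption about the current conversion, not a quote of it:

```c
#define MAX_SCORE 60000.0	/* value quoted above from diffcore.h */

/* Round to nearest: anything from 99.5% upwards is shown as 100. */
static int percent_rounded(int score)
{
	return (int)(0.5 + score * 100.0 / MAX_SCORE);
}

/* Round down: 100 is shown only when score == MAX_SCORE. */
static int percent_floor(int score)
{
	return score * 100 / (int)MAX_SCORE;
}
```

A score of 59800 would be displayed as 100 by the first helper and
as 99 by the second; only a score of exactly 60000 yields 100 when
rounding down.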

-- 
David Kastrup

* Re: 100% (was: [PATCH] diffcore-rename: favour identical basenames)
  2007-06-22 10:52             ` 100% (was: [PATCH] diffcore-rename: favour identical basenames) David Kastrup
@ 2007-06-22 12:49               ` Johannes Schindelin
       [not found]                 ` <86abusi1fw.fsf@lola.quinscape.zz>
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 12:49 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

Hi,

On Fri, 22 Jun 2007, David Kastrup wrote:

> As a note aside: would it be possible to always round downwards when 
> computing similarities or converting between them?

I'd rather not. This would be counterintuitive. People expect rounded 
values.

> I very much would like to see the 100% figure reserved for identity.
> This is particularly relevant when interpreting the output of git-diff
> --name-status with regard to R100, C100 and similar flags.

You should never depend on the output of --name-status if you're 
interested in identifying identical files, but on the object names.

Ciao,
Dscho

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-21 16:20       ` Linus Torvalds
  2007-06-21 17:52         ` Junio C Hamano
@ 2007-06-22 15:19         ` Andy Parkins
  2007-06-22 15:28           ` Johannes Schindelin
  1 sibling, 1 reply; 44+ messages in thread
From: Andy Parkins @ 2007-06-22 15:19 UTC (permalink / raw)
  To: git
  Cc: Linus Torvalds, Jeff King, Johannes Schindelin, Shawn O. Pearce,
	govindsalinas, gitster

On Thursday 2007 June 21, Linus Torvalds wrote:

> The files are *identical* for christ sake! Following their history, it
> doesn't matter *which* base you follow, since regardless, they've come to
> the same point!
>
> So in that sense, the current git behaviour is actually perfectly fine.

Perhaps not.  (Please don't read this as meaning I disagree with your 
favour-the-identical-filename patch at all - in fact I think that would 
address the case I give below).

What if two files with different filenames and content converge at some point 
in history, then diverge again?  If git is tracking renames merely by content 
and picks the wrong one, then the history of fileA suddenly becomes the 
history of fileB.

Andy

-- 
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-22 15:19         ` Andy Parkins
@ 2007-06-22 15:28           ` Johannes Schindelin
  2007-06-22 17:51             ` Aidan Van Dyk
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-22 15:28 UTC (permalink / raw)
  To: Andy Parkins
  Cc: git, Linus Torvalds, Jeff King, Shawn O. Pearce, govindsalinas,
	gitster

Hi,

On Fri, 22 Jun 2007, Andy Parkins wrote:

> What if two files with different filenames and content converge at some 
> point in history, then diverge again?  If git is tracking renames merely 
> by content and picks the wrong one, then the history of fileA suddenly 
> becomes the history of fileB.

This is becoming highly ethereal. Like "I could imagine that some day in 
future, some person could devise a device, that might allow you to do 
something that I can not explain, because I have not even thought of it".

IOW show me a reasonable example, and we'll talk business.

Ciao,
Dscho

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-22 15:28           ` Johannes Schindelin
@ 2007-06-22 17:51             ` Aidan Van Dyk
  0 siblings, 0 replies; 44+ messages in thread
From: Aidan Van Dyk @ 2007-06-22 17:51 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Andy Parkins, git, Linus Torvalds, Jeff King, Shawn O. Pearce,
	govindsalinas, gitster

* Johannes Schindelin <Johannes.Schindelin@gmx.de> [070622 13:34]:
> Hi,
> 
> On Fri, 22 Jun 2007, Andy Parkins wrote:
> 
> > What if two files with different filenames and content converge at some 
> > point in history, then diverge again?  If git is tracking renames merely 
> > by content and picks the wrong one, then the history of fileA suddenly 
> > becomes the history of fileB.
> 
> This is becoming highly ethereal. Like "I could imagine that some day in 
> future, some person could devise a device, that might allow you to do 
> something that I can not explain, because I have not even thought of it".
> 
> IOW show me a reasonable example, and we'll talk business.

The one time the "content-only" rename tracking bit me was
after a merge, resulting in conflicts that were unnecessary:

-*-*-*-*-A-B-C-D
	  \
	   *-E-*

At A, there were 2 files:
	dir1/foo
	dir2/foo
They were template files that happened to be the same in 2 themes.

In E, "foo" was renamed to "foo-bar" in all the template directories.
Git detected this not as 2 renames, but as:
	dir1/foo-bar renamed from dir1/foo
	dir2/foo-bar copied from dir1/foo
	dir2/foo deleted

Meanwhile, work was happening in B, C, and D, changing foo in both
templates identically.

When the branch with E was merged back into ABCD, there was a merge
conflict with dir2/foo being deleted in one branch, and edited in the
other.

In this case, the simple "basename" comparison wouldn't have even been
enough.  

But the merge was easy enough (because no edits were made in the E
branch to those files, just the renames) that I could resolve it easily.

I don't know if preventing this easy-to-fix merge conflict is worth the
"likeness of names" logic necessary to avoid it...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


* Re: 100%
       [not found]                 ` <86abusi1fw.fsf@lola.quinscape.zz>
@ 2007-06-23  1:31                   ` Johannes Schindelin
  2007-06-23 10:18                     ` 100% René Scharfe
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23  1:31 UTC (permalink / raw)
  To: David Kastrup; +Cc: git

Hi,

On Fri, 22 Jun 2007, David Kastrup wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > On Fri, 22 Jun 2007, David Kastrup wrote:
> >
> >> As a note aside: would it be possible to always round downwards when 
> >> computing similarities or converting between them?
> >
> > I'd rather not. This would be counterintuitive. People expect rounded 
> > values.
> 
> Which people?

Me, for one. Thank you very much.

> The people I know will expect "100% identical" or even "100.0% 
> identical" to mean identical, period.  They will be quite surprised to 
> hear that "99.95%" is supposed to be included.

Granted, 100.0% means as close as you can get to "completely" with 4 
digits. But if you have an integer, you better use the complete range, 
rather than arbitrarily make one number more important than others.

For if you see an integer, you usually assume a rounded value. If you 
don't, you're hopeless.

> Also, for any kind of decision made upon percentages, it is much more 
> relevant to be able to draw a line at 50% rather than at 49.5%.

I do see too many people in my day job who take the numbers they see for 
absolute truths, so I cannot take that statement seriously, sorry.

> Could you name a _single_ use case where rounding down could cause an 
> actual problem or even inconvenience for people?

Could you name a _single_ use case where it does not?

I mean, honestly, really. Really, really, really. A number is only a weak 
_indicator_, and an integer even more so, for what is _really_ going on.

> >> I very much would like to see the 100% figure reserved for identity.  
> >> This is particularly relevant when interpreting the output of 
> >> git-diff --name-status with regard to R100, C100 and similar flags.
> >
> > You should never depend on the output of --name-status if you're 
> > interested in identifying identical files, but on the object names.
> 
> Which is rather inconvenient.

Frankly, I am getting bored.

This argument crops up every so often. "If you did that, _I_ could be more 
lazy, and the _hell_ with other people who expect otherwise!".

No, really.

> I _know_ that one can't rely on the output of --name-status right now.

And I _know_ that you can't rely on integer numbers. Or _any_ number which 
is not _completely_ precise.

Really, I am getting bored with this discussion.

Ciao,
Dscho

* Re: [PATCH] diffcore-rename: favour identical basenames
  2007-06-21 11:52   ` [PATCH] diffcore-rename: favour identical basenames Johannes Schindelin
  2007-06-21 13:19     ` Jeff King
@ 2007-06-23  5:44     ` Junio C Hamano
  1 sibling, 0 replies; 44+ messages in thread
From: Junio C Hamano @ 2007-06-23  5:44 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Linus Torvalds, Shawn O. Pearce, git, govindsalinas, gitster

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> When there are several candidates for a rename source, and one of them
> has an identical basename to the rename target, take that one.
>
> Noticed by Govind Salinas, posted by Shawn O. Pearce, partial patch
> by Linus Torvalds.
>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>
> 	On Wed, 20 Jun 2007, Linus Torvalds wrote:
>
> 	> I think we should just consider the basename as an "added 
> 	> similarity  bonus".

Thanks, I obviously agree with both of you.

* Re: 100%
  2007-06-23  1:31                   ` 100% Johannes Schindelin
@ 2007-06-23 10:18                     ` René Scharfe
  2007-06-23 10:56                       ` 100% Johannes Schindelin
  0 siblings, 1 reply; 44+ messages in thread
From: René Scharfe @ 2007-06-23 10:18 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: David Kastrup, git

Johannes Schindelin schrieb:
> On Fri, 22 Jun 2007, David Kastrup wrote:
>> The people I know will expect "100% identical" or even "100.0% 
>> identical" to mean identical, period.  They will be quite surprised to 
>> hear that "99.95%" is supposed to be included.
> 
> Granted, 100.0% means as close as you can get to "completely" with 4 
> digits. But if you have an integer, you better use the complete range, 
> rather than arbitrarily make one number more important than others.
> 
> For if you see an integer, you usually assume a rounded value. If you 
> don't, you're hopeless.

Why hopeless?  It's a useful convention to define "100%" as "complete
(not rounded)".  See it this way: 50% of the time, a given percent value
will be shown as one point less than its "true" value, but you gain the
ability to indicate full completeness.  And that's an interesting piece
of information.  The price is small given that the needed accuracy is
more in the range of 10 percentage points (I assume).
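
To put numbers on the "50% of the time" claim, here is a minimal sketch. It assumes true similarity values sampled on a hypothetical 0.01% grid and stored as integer hundredths of a percent; the function names are made up for illustration and are not taken from git.

```python
def shown_nearest(hundredths):
    """Whole percent shown when rounding half up to the nearest integer."""
    return (hundredths + 50) // 100


def shown_floor(hundredths):
    """Whole percent shown when always rounding down."""
    return hundredths // 100


# Hypothetical sample: every value from 0.00% to 100.00% in 0.01% steps.
samples = range(0, 10001)

# How often rounding down shows one point less than rounding to nearest.
lower = sum(1 for h in samples if shown_floor(h) == shown_nearest(h) - 1)
same = sum(1 for h in samples if shown_floor(h) == shown_nearest(h))

# Which true values end up displayed as 100 under each convention.
full_floor = [h for h in samples if shown_floor(h) == 100]
full_nearest = [h for h in samples if shown_nearest(h) == 100]

print(lower, same)              # about half and half
print(len(full_floor))          # only the exact 100.00% sample
print(min(full_nearest) / 100)  # smallest true value still shown as 100
```

On this grid roughly half of the samples lose one point, and in exchange only the exact 100.00% sample is displayed as 100.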

It's more a question of how to make sure everybody knows what the
numbers mean -- but that's why we have a directory named
"Documentation". :-D  And even a person that hasn't read the docs is
unlikely to really get harmed by inexact percentages, right?

René

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: 100%
  2007-06-23 10:18                     ` 100% René Scharfe
@ 2007-06-23 10:56                       ` Johannes Schindelin
  2007-06-23 11:41                         ` 100% René Scharfe
  2007-06-23 19:33                         ` 100% Junio C Hamano
  0 siblings, 2 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 10:56 UTC (permalink / raw)
  To: René Scharfe; +Cc: David Kastrup, git

Hi,

On Sat, 23 Jun 2007, René Scharfe wrote:

> Johannes Schindelin wrote:
> > On Fri, 22 Jun 2007, David Kastrup wrote:
> >> The people I know will expect "100% identical" or even "100.0% 
> >> identical" to mean identical, period.  They will be quite surprised to 
> >> hear that "99.95%" is supposed to be included.
> > 
> > Granted, 100.0% means as close as you can get to "completely" with 4 
> > digits. But if you have an integer, you better use the complete range, 
> > rather than arbitrarily make one number more important than others.
> > 
> > For if you see an integer, you usually assume a rounded value. If you 
> > don't, you're hopeless.
> 
> Why hopeless?  It's a useful convention to define "100%" as "complete
> (not rounded)".

By the same reasoning, you could say "never round down to 0%, because I 
want to know when there is no similarity".

You cannot be exact when you have to cut off fractions, so why try for 
_exactly_ one number?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: 100%
  2007-06-23 10:56                       ` 100% Johannes Schindelin
@ 2007-06-23 11:41                         ` René Scharfe
  2007-06-23 12:00                           ` 100% Johannes Schindelin
  2007-06-23 19:33                         ` 100% Junio C Hamano
  1 sibling, 1 reply; 44+ messages in thread
From: René Scharfe @ 2007-06-23 11:41 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: David Kastrup, git

Johannes Schindelin wrote:
> Hi,
> 
> On Sat, 23 Jun 2007, René Scharfe wrote:
> 
>> Johannes Schindelin wrote:
>>> On Fri, 22 Jun 2007, David Kastrup wrote:
>>>> The people I know will expect "100% identical" or even "100.0% 
>>>> identical" to mean identical, period.  They will be quite surprised to 
>>>> hear that "99.95%" is supposed to be included.
>>> Granted, 100.0% means as close as you can get to "completely" with 4 
>>> digits. But if you have an integer, you better use the complete range, 
>>> rather than arbitrarily make one number more important than others.
>>>
>>> For if you see an integer, you usually assume a rounded value. If you 
>>> don't, you're hopeless.
>> Why hopeless?  It's a useful convention to define "100%" as "complete
>> (not rounded)".
> 
> By the same reasoning, you could say "never round down to 0%, because I 
> want to know when there is no similarity".
> 
> You cannot be exact when you have to cut off fractions, so why try for 
> _exactly_ one number?

Because completeness is special.  If just one bit was available, I'd use
it to indicate equality.  That's what the authors of cmp(1) did, too. :)

And 0% is not special, at least not in a useful way that I can think of.
I.e. there is no practical difference between "no two lines match" and
"one percent of the lines match".  If you're really interested in
similarities with an index below 10% then you'd better work with
absolute numbers instead of rounded percentages.

If someone came around with an interest in those cases with exactly 0%
similarity, then we might need to decide between rounding up or down.
But even in that hypothetical situation I think "equality" is still more
interesting a data point than "really everything differs".

René

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: 100%
  2007-06-23 11:41                         ` 100% René Scharfe
@ 2007-06-23 12:00                           ` Johannes Schindelin
  2007-06-23 12:11                             ` 100% René Scharfe
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 12:00 UTC (permalink / raw)
  To: René Scharfe; +Cc: David Kastrup, git

Hi,

On Sat, 23 Jun 2007, René Scharfe wrote:

> Johannes Schindelin wrote:
>
> > By the same reasoning, you could say "never round down to 0%, because 
> > I want to know when there is no similarity".
> > 
> > You cannot be exact when you have to cut off fractions, so why try for 
> > _exactly_ one number?
> 
> Because completeness is special.

I am not convinced. My vote is still for the _common_ practice of just 
rounding. IOW keep it as is.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: 100%
  2007-06-23 12:00                           ` 100% Johannes Schindelin
@ 2007-06-23 12:11                             ` René Scharfe
  2007-06-23 12:21                               ` 100% Johannes Schindelin
  0 siblings, 1 reply; 44+ messages in thread
From: René Scharfe @ 2007-06-23 12:11 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: David Kastrup, git

Johannes Schindelin wrote:
> Hi,
> 
> On Sat, 23 Jun 2007, René Scharfe wrote:
> 
>> Johannes Schindelin wrote:
>>
>>> By the same reasoning, you could say "never round down to 0%, because 
>>> I want to know when there is no similarity".
>>>
>>> You cannot be exact when you have to cut off fractions, so why try for 
>>> _exactly_ one number?
>> Because completeness is special.
> 
> I am not convinced. My vote is still for the _common_ practice of just 
> rounding. IOW keep it as is.

As I already hinted at, the common result of comparing two files, as
done by e.g. cmp(1), is one bit that indicates equality.  This
information is lost when using up/down rounding, but it is retained when
rounding down.  It's _not_ common to be unable to determine equality
from the result of a file compare.
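
A small sketch of that point, using integer arithmetic only. The inputs `matched` and `total` are hypothetical byte counts chosen for illustration; git's real similarity score is computed differently, so treat this as a toy model of the two rounding conventions.

```python
def similarity_floor(matched, total):
    """Percentage rounded down: 100 is reached only when matched == total."""
    return matched * 100 // total


def similarity_nearest(matched, total):
    """Percentage rounded half up, i.e. floor(100 * matched / total + 1/2)."""
    return (matched * 200 + total) // (2 * total)


# 9995 of 10000 bytes in common is a true similarity of 99.95%.
print(similarity_floor(9995, 10000))     # stays below 100
print(similarity_nearest(9995, 10000))   # displayed as 100

# With rounding down, "index == 100" works as the cmp(1)-style equality bit.
print(similarity_floor(10000, 10000) == 100)
```

With rounding down, a reader can recover the equality bit from the number alone; with rounding to nearest, a displayed 100 only says "at least 99.5%".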

René

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: 100%
  2007-06-23 12:11                             ` 100% René Scharfe
@ 2007-06-23 12:21                               ` Johannes Schindelin
  2007-06-24 22:23                                 ` 100% René Scharfe
  0 siblings, 1 reply; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 12:21 UTC (permalink / raw)
  To: René Scharfe; +Cc: David Kastrup, git

Hi,

On Sat, 23 Jun 2007, René Scharfe wrote:

> As I already hinted at, the common result of comparing two files, as 
> done by e.g. cmp(1), is one bit that indicates equality.  This 
> information is lost when using up/down rounding, but it is retained when 
> rounding down.  It's _not_ common to be unable to determine equality 
> from the result of a file compare.

And as _I_ already hinted, this does not matter. The whole purpose of having 
a number here instead of a bit is to have a larger range. In practice, I 
bet that the 100% cases are really uninteresting. At least here, they are.

For example, if you move a Java class from one package into another, you 
have to change the package name in the file. Guess what, I am perfectly 
okay if the rename detector says "100% similarity" here. Because if it is 
closer to 100% than to 99%, dammit, I want to see 100%, not 99%.

Nuff said about this subject.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: 100%
  2007-06-23 10:56                       ` 100% Johannes Schindelin
  2007-06-23 11:41                         ` 100% René Scharfe
@ 2007-06-23 19:33                         ` Junio C Hamano
  2007-06-23 20:41                           ` 100% Johannes Schindelin
  1 sibling, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2007-06-23 19:33 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: René Scharfe, David Kastrup, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> By the same reasoning, you could say "never round down to 0%, because I 
> want to know when there is no similarity".
>
> You cannot be exact when you have to cut off fractions, so why try for 
> _exactly_ one number?

R0 or C0 would not happen in real life, so 0% is a moot issue.

However, wasn't it you who followed that "certain numbers
are special" logic in diffstat?

You advocated "diff --stat" should draw at least one +/- for a
patch that adds/removes lines.  And I (and others) agreed
because zero is special in the context of that application.
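
For readers who have not seen that behaviour, a minimal sketch of a linear scale that never rounds a non-zero change down to an empty bar. The function name and the exact formula are assumptions made for illustration; this is not quoted from git's diff code.

```python
def scale_linear(count, width, max_count):
    """Map a line count onto a bar of at most `width` characters.

    Zero stays zero, but any non-zero count gets at least one character,
    so a file with a one-line change is still visibly changed.
    """
    if count == 0:
        return 0
    return 1 + count * (width - 1) // max_count


# Hypothetical numbers: the largest change is 1000 lines, the bar is 40 wide.
for added in (0, 1, 500, 1000):
    print(added, "+" * scale_linear(added, 40, 1000))
```

The scale stays linear above the reserved first character, and the special value (zero) keeps its exact meaning.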

I think reserving R100 to mean "identical byte sequences" has
value, when people look at --name-status output, in the context
of "similarity index".

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: 100%
  2007-06-23 19:33                         ` 100% Junio C Hamano
@ 2007-06-23 20:41                           ` Johannes Schindelin
  0 siblings, 0 replies; 44+ messages in thread
From: Johannes Schindelin @ 2007-06-23 20:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: René Scharfe, David Kastrup, git

Hi,

On Sat, 23 Jun 2007, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
> 
> > By the same reasoning, you could say "never round down to 0%, because I 
> > want to know when there is no similarity".
> >
> > You cannot be exact when you have to cut off fractions, so why try for 
> > _exactly_ one number?
> 
> R0 or C0 would not happen in real life, so 0% is a moot issue.

It is, but not when you look at the formula.

> However, wasn't it you who followed that "certain numbers
> are special" logic in diffstat?
> 
> You advocated "diff --stat" should draw at least one +/- for a
> patch that adds/removes lines.  And I (and others) agreed
> because zero is special in the context of that application.

Actually, it was not me, but I implemented the version that we have now. 
I was reasonably scared that a non-linear diffstat would end up in git, 
therefore I wrote a linear one.

The important thing to note here is that the diffstat as-is makes _better_ 
use of the limited scale that is available.

And as you pointed out, the low end of the scale is not really 
interesting. The interesting parts are those around 100%. By rounding down 
you make less use of the available scale.

> I think reserving R100 to mean "identical byte sequences" has value, 
> when people look at --name-status output, in the context of "similarity 
> index".

Ah, whatever. You do what you want.

Yes, this interpretation has value. No, it is not the only one that has 
value. I am much more used to rounding, since at the end of the day it 
makes better use of the scale, it is commonly used, and therefore _I_ 
expect it (and no, I will not read the documentation when I expect to know 
what it means).

But hey, I don't care any more. AFAIAC you can change it from rounding to 
rounding down, and next year to rounding-up. We could even have an 
algorithm which rounds down only in odd years, and I still would not care.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: 100%
  2007-06-23 12:21                               ` 100% Johannes Schindelin
@ 2007-06-24 22:23                                 ` René Scharfe
  0 siblings, 0 replies; 44+ messages in thread
From: René Scharfe @ 2007-06-24 22:23 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: David Kastrup, git

Johannes Schindelin wrote:
> Hi,
> 
> On Sat, 23 Jun 2007, René Scharfe wrote:
> 
>> As I already hinted at, the common result of comparing two files, as 
>> done by e.g. cmp(1), is one bit that indicates equality.  This 
>> information is lost when using up/down rounding, but it is retained when 
>> rounding down.  It's _not_ common to be unable to determine equality 
>> from the result of a file compare.
> 
> And as _I_ already hinted, this does not matter. The whole purpose of having 
> a number here instead of a bit is to have a larger range. In practice, I 
> bet that the 100% cases are really uninteresting. At least here, they are.

You would lose your bet since both David and I expressed interest in
that pure 100% thing.

Rounding down instead of up/down doesn't affect the size of either the
input or the output range.  It affects the boundary of the input range
(-0.499 .. 100.499 versus 0.000 .. 100.999), but I can't find a problem
with that.
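
The same thing as a sketch, in case the ranges look odd. It computes, for each displayed value k, the half-open interval of true values that map to it. Exact fractions are used to avoid float noise; the function names are made up for illustration.

```python
from fractions import Fraction


def bucket_nearest(k):
    """[lo, hi) of true values displayed as k when rounding half up."""
    return (Fraction(2 * k - 1, 2), Fraction(2 * k + 1, 2))


def bucket_floor(k):
    """[lo, hi) of true values displayed as k when rounding down."""
    return (Fraction(k), Fraction(k + 1))


for name, bucket in (("nearest", bucket_nearest), ("floor", bucket_floor)):
    lo = bucket(0)[0]      # lower edge of the bucket for "0"
    hi = bucket(100)[1]    # upper edge (exclusive) of the bucket for "100"
    widths = {bucket(k)[1] - bucket(k)[0] for k in range(101)}
    print(name, float(lo), float(hi), widths)
```

Both conventions have 101 output values and buckets of width one; only the edges of the input range shift by half a point.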

> For example, if you move a Java class from one package into another, you 
> have to change the package name in the file. Guess what, I am perfectly 
> okay if the rename detector says "100% similarity" here. Because if it is 
> closer to 100% than to 99%, dammit, I want to see 100%, not 99%.

That uses a side effect of rounding and won't work for small files.  And
of course (if the file is large enough) there could be other changes
"hidden" in a similarity index value of 100% that was rounded up.
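
A worked example of both effects, under a deliberately simple assumption: similarity is the share of unchanged lines. That is a toy model for illustration only, not how git scores renames, and the function names are hypothetical.

```python
def shown_nearest(same, total):
    """Whole percent shown when rounding half up (integer arithmetic)."""
    return (same * 200 + total) // (2 * total)


def first_size_shown_as_100(changed_lines=1):
    """Smallest file size (in lines) at which the change rounds up to 100."""
    total = changed_lines + 1
    while shown_nearest(total - changed_lines, total) != 100:
        total += 1
    return total


# A one-line "package" edit in a 10-line file is far from 100.
print(shown_nearest(9, 10))

# The file has to be this long before one changed line disappears.
print(first_size_shown_as_100(1))

# In a 10000-line file, 40 changed lines still display as 100.
print(shown_nearest(9960, 10000))
```

So in this model the Java case only shows up as 100 once the file is fairly long, and at that size a displayed 100 can also hide dozens of other changed lines.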

> Nuff said about this subject.

Yes, let's advance this topic to the coding stage.

René

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2007-06-24 22:23 UTC | newest]

Thread overview: 44+ messages
2007-06-21  3:06 Basename matching during rename/copy detection Shawn O. Pearce
2007-06-21  3:13 ` Junio C Hamano
2007-06-21  8:00   ` Andy Parkins
2007-06-21  8:07     ` Junio C Hamano
2007-06-21  9:50       ` Andy Parkins
2007-06-21 11:52         ` Johannes Schindelin
2007-06-21 12:44           ` Andy Parkins
2007-06-21 12:53             ` Matthieu Moy
2007-06-21 13:10               ` Jeff King
2007-06-21 13:18               ` Johannes Schindelin
2007-06-21 13:25                 ` Matthieu Moy
2007-06-21 13:52                   ` Johannes Schindelin
2007-06-21 15:37                     ` Steven Grimm
2007-06-21 15:53                       ` Johannes Schindelin
2007-06-21 16:57                         ` Steven Grimm
2007-06-21 13:22             ` Johannes Schindelin
2007-06-21  3:42 ` Linus Torvalds
2007-06-21 11:52   ` [PATCH] diffcore-rename: favour identical basenames Johannes Schindelin
2007-06-21 13:19     ` Jeff King
2007-06-21 14:03       ` Johannes Schindelin
2007-06-21 16:20       ` Linus Torvalds
2007-06-21 17:52         ` Junio C Hamano
2007-06-21 18:24           ` Linus Torvalds
2007-06-22 15:19         ` Andy Parkins
2007-06-22 15:28           ` Johannes Schindelin
2007-06-22 17:51             ` Aidan Van Dyk
2007-06-22  1:14       ` Johannes Schindelin
2007-06-22  5:41         ` Jeff King
2007-06-22 10:22           ` Johannes Schindelin
2007-06-22  7:17         ` Johannes Sixt
2007-06-22 10:39           ` Johannes Schindelin
2007-06-22 10:52             ` 100% (was: [PATCH] diffcore-rename: favour identical basenames) David Kastrup
2007-06-22 12:49               ` Johannes Schindelin
     [not found]                 ` <86abusi1fw.fsf@lola.quinscape.zz>
2007-06-23  1:31                   ` 100% Johannes Schindelin
2007-06-23 10:18                     ` 100% René Scharfe
2007-06-23 10:56                       ` 100% Johannes Schindelin
2007-06-23 11:41                         ` 100% René Scharfe
2007-06-23 12:00                           ` 100% Johannes Schindelin
2007-06-23 12:11                             ` 100% René Scharfe
2007-06-23 12:21                               ` 100% Johannes Schindelin
2007-06-24 22:23                                 ` 100% René Scharfe
2007-06-23 19:33                         ` 100% Junio C Hamano
2007-06-23 20:41                           ` 100% Johannes Schindelin
2007-06-23  5:44     ` [PATCH] diffcore-rename: favour identical basenames Junio C Hamano
