[PATCH] Ignore end-of-line style when computing similarity score for rename detection

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] Ignore end-of-line style when computing similarity score for rename detection
@ 2007-06-28  2:39 Steven Grimm
  2007-06-28  2:46 ` Steven Grimm
  2007-06-28  4:29 ` Junio C Hamano
  0 siblings, 2 replies; 11+ messages in thread
From: Steven Grimm @ 2007-06-28  2:39 UTC (permalink / raw)
  To: 'git'

Signed-off-by: Steven Grimm <koreth@midwinter.com>
---
A couple of source files got checked into my code base with DOS-style 
end-of-line characters. I converted them to UNIX-style (the convention 
for this project) in my branch. Then later, I renamed a couple of them.

Meanwhile, back in the original branch, someone else fixed a bug in one 
of the files and checked it in, still with DOS-style line endings.

When I merged that change into my branch, git didn't detect the rename 
because the fact that every line has a change (the end-of-line 
character) dropped the similarity score way too low.

This patch teaches git to ignore end-of-line style when looking for 
potential rename candidates. A separate question, which I expect may be 
more controversial, is what to do with conflict markers; with this 
patch, the entire file is still marked as in conflict if the end-of-line 
style changes (but it's still an improvement in that we at least detect 
the rename now.)


 diffcore-delta.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/diffcore-delta.c b/diffcore-delta.c
index 7338a40..10bbf95 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -143,9 +143,12 @@ static struct spanhash_top *hash_chars(unsigned 
char *buf, unsigned int sz)
                unsigned int c = *buf++;
                unsigned int old_1 = accum1;
                sz--;
-               accum1 = (accum1 << 7) ^ (accum2 >> 25);
-               accum2 = (accum2 << 7) ^ (old_1 >> 25);
-               accum1 += c;
+               /* Ignore \r\n vs. \n when computing similarity. */
+               if (c != '\r') {
+                       accum1 = (accum1 << 7) ^ (accum2 >> 25);
+                       accum2 = (accum2 << 7) ^ (old_1 >> 25);
+                       accum1 += c;
+               }
                if (++n < 64 && c != '\n')
                        continue;
                hashval = (accum1 + accum2 * 0x61) % HASHBASE;
-- 
1.5.2.2.571.ge134

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH] Ignore end-of-line style when computing similarity score for rename detection
  2007-06-28  2:39 [PATCH] Ignore end-of-line style when computing similarity score for rename detection Steven Grimm
@ 2007-06-28  2:46 ` Steven Grimm
  2007-06-28  7:22   ` Johannes Sixt
  2007-06-28  4:29 ` Junio C Hamano
  1 sibling, 1 reply; 11+ messages in thread
From: Steven Grimm @ 2007-06-28  2:46 UTC (permalink / raw)
  To: git

Signed-off-by: Steven Grimm <koreth@midwinter.com>
---
Okay, let's try this again with an MUA that won't change my tabs to
spaces -- sorry about that.

A couple of source files got checked into my code base with DOS-style
end-of-line characters. I converted them to UNIX-style (the convention
for this project) in my branch. Then later, I renamed a couple of them.

Meanwhile, back in the original branch, someone else fixed a bug in one
of the files and checked it in, still with DOS-style line endings.

When I merged that change into my branch, git didn't detect the rename
because the fact that every line has a change (the end-of-line
character) dropped the similarity score way too low.

This patch teaches git to ignore end-of-line style when looking for
potential rename candidates. A separate question, which I expect may be
more controversial, is what to do with conflict markers; with this
patch, the entire file is still marked as in conflict if the end-of-line
style changes (but it's still an improvement in that we at least detect
the rename now.)

 diffcore-delta.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/diffcore-delta.c b/diffcore-delta.c
index 7338a40..10bbf95 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -143,9 +143,12 @@ static struct spanhash_top *hash_chars(unsigned char *buf, unsigned int sz)
 		unsigned int c = *buf++;
 		unsigned int old_1 = accum1;
 		sz--;
-		accum1 = (accum1 << 7) ^ (accum2 >> 25);
-		accum2 = (accum2 << 7) ^ (old_1 >> 25);
-		accum1 += c;
+		/* Ignore \r\n vs. \n when computing similarity. */
+		if (c != '\r') {
+			accum1 = (accum1 << 7) ^ (accum2 >> 25);
+			accum2 = (accum2 << 7) ^ (old_1 >> 25);
+			accum1 += c;
+		}
 		if (++n < 64 && c != '\n')
 			continue;
 		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
-- 
1.5.2.2.571.ge134

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] Ignore end-of-line style when computing similarity score for rename detection
  2007-06-28  2:39 [PATCH] Ignore end-of-line style when computing similarity score for rename detection Steven Grimm
  2007-06-28  2:46 ` Steven Grimm
@ 2007-06-28  4:29 ` Junio C Hamano
  2007-06-28  6:04   ` Steven Grimm
  1 sibling, 1 reply; 11+ messages in thread
From: Junio C Hamano @ 2007-06-28  4:29 UTC (permalink / raw)
  To: Steven Grimm; +Cc: 'git'

I wonder if we have a path or the attribute cue at this point of
the operation.  Your patch may solve that special case of CRLF
mangled files, but diffcore-delta works both on text and binary
and I am not sure what damange/side-effect you are causing.

And you are not ignoring "\r\n vs \n difference" as the comment
claims, but are discarding '\r' unconditionally.  When we start
doing something like that, I would feel much better if we _know_
we are operating on a text file at least.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] Ignore end-of-line style when computing similarity score for rename detection
  2007-06-28  4:29 ` Junio C Hamano
@ 2007-06-28  6:04   ` Steven Grimm
  2007-06-28  6:18     ` Shawn O. Pearce
  2007-06-28 12:41     ` Johannes Schindelin
  0 siblings, 2 replies; 11+ messages in thread
From: Steven Grimm @ 2007-06-28  6:04 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Signed-off-by: Steven Grimm <koreth@midwinter.com>
---
Junio rightly points out that it would be a mistake to discard \r
characters from binary files when computing similarity scores. So now we
only do it if the file contents test as non-binary.

The file attributes aren't available at this level of the code, but they
could be propagated down from the higher levels if we don't trust
buffer_is_binary() to make an adequately accurate decision.


 diffcore-delta.c |   19 +++++++++++++------
 1 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/diffcore-delta.c b/diffcore-delta.c
index 7338a40..52e648f 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "xdiff-interface.h"
 
 /*
  * Idea here is very simple.
@@ -125,7 +126,8 @@ static struct spanhash_top *add_spanhash(struct spanhash_top *top,
 	}
 }
 
-static struct spanhash_top *hash_chars(unsigned char *buf, unsigned int sz)
+static struct spanhash_top *hash_chars(unsigned char *buf, unsigned int sz,
+				       int is_binary)
 {
 	int i, n;
 	unsigned int accum1, accum2, hashval;
@@ -143,9 +145,12 @@ static struct spanhash_top *hash_chars(unsigned char *buf, unsigned int sz)
 		unsigned int c = *buf++;
 		unsigned int old_1 = accum1;
 		sz--;
-		accum1 = (accum1 << 7) ^ (accum2 >> 25);
-		accum2 = (accum2 << 7) ^ (old_1 >> 25);
-		accum1 += c;
+		/* Ignore \r\n vs. \n when computing text file similarity. */
+		if (c != '\r' && ! is_binary) {
+			accum1 = (accum1 << 7) ^ (accum2 >> 25);
+			accum2 = (accum2 << 7) ^ (old_1 >> 25);
+			accum1 += c;
+		}
 		if (++n < 64 && c != '\n')
 			continue;
 		hashval = (accum1 + accum2 * 0x61) % HASHBASE;
@@ -172,14 +177,16 @@ int diffcore_count_changes(void *src, unsigned long src_size,
 	if (src_count_p)
 		src_count = *src_count_p;
 	if (!src_count) {
-		src_count = hash_chars(src, src_size);
+		int src_is_binary = buffer_is_binary(src, src_size);
+		src_count = hash_chars(src, src_size, src_is_binary);
 		if (src_count_p)
 			*src_count_p = src_count;
 	}
 	if (dst_count_p)
 		dst_count = *dst_count_p;
 	if (!dst_count) {
-		dst_count = hash_chars(dst, dst_size);
+		int dst_is_binary = buffer_is_binary(dst, dst_size);
+		dst_count = hash_chars(dst, dst_size, dst_is_binary);
 		if (dst_count_p)
 			*dst_count_p = dst_count;
 	}
-- 
1.5.2.2.571.ge134

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] Ignore end-of-line style when computing similarity score for rename detection
  2007-06-28  6:04   ` Steven Grimm
@ 2007-06-28  6:18     ` Shawn O. Pearce
  2007-06-29  6:34       ` Junio C Hamano
  2007-06-28 12:41     ` Johannes Schindelin
  1 sibling, 1 reply; 11+ messages in thread
From: Shawn O. Pearce @ 2007-06-28  6:18 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Junio C Hamano, git

Steven Grimm <koreth@midwinter.com> wrote:
> Junio rightly points out that it would be a mistake to discard \r
> characters from binary files when computing similarity scores. So now we
> only do it if the file contents test as non-binary.
> 
> The file attributes aren't available at this level of the code, but they
> could be propagated down from the higher levels if we don't trust
> buffer_is_binary() to make an adequately accurate decision.

Ick.  If we can get the attributes into diff_filespec this is
pretty easy, as you can do a crlf->lf conversion on both files if
both are considered to be text, but it doesn't look like it would
be very easy to get the attributes into the diff_filespec.

Actually even better if you can also run the in/out filter things.
I'm thinking of say an XML file that has had whitespace formatting
changes, but whose XSD and processors ignore unnecessary whitespace.
Be nice if the rename detection actually was able to canonicalize
both files before detecting the rename, assuming both files had a
canonicalizer input filter defined that does that...

Of course diff.c defines a nice diff_is_binary() at file scope that
does at least a "can we diff this" decision.  Might be good if that
could be reused for the rename detection.

OK, that's far more than I actually know about diffcore.  This is
one for Junio, Linus, you, and those who are less tired than I feel
right now...  ;-)

Personally I'd rather see us doing the right thing (use attributes
and fallback on guessing if no preference is stated either way)
over doing something half-a**ed (only guessing).

-- 
Shawn.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Ignore end-of-line style when computing similarity score for  rename detection
  2007-06-28  2:46 ` Steven Grimm
@ 2007-06-28  7:22   ` Johannes Sixt
  2007-06-28  8:16     ` Junio C Hamano
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Sixt @ 2007-06-28  7:22 UTC (permalink / raw)
  To: git

Steven Grimm wrote:
> 
> Signed-off-by: Steven Grimm <koreth@midwinter.com>
> ---
> Okay, let's try this again with an MUA that won't change my tabs to
> spaces -- sorry about that.
> 
> A couple of source files got checked into my code base with DOS-style
> end-of-line characters. I converted them to UNIX-style (the convention
> for this project) in my branch. Then later, I renamed a couple of them.
> 
> Meanwhile, back in the original branch, someone else fixed a bug in one
> of the files and checked it in, still with DOS-style line endings.
> 
> When I merged that change into my branch, git didn't detect the rename
> because the fact that every line has a change (the end-of-line
> character) dropped the similarity score way too low.
> 
> This patch teaches git to ignore end-of-line style when looking for
> potential rename candidates. A separate question, which I expect may be
> more controversial, is what to do with conflict markers; with this
> patch, the entire file is still marked as in conflict if the end-of-line
> style changes (but it's still an improvement in that we at least detect
> the rename now.)

I think that nobody would object to have a use-case description like
this in the commit message...

-- Hannes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Ignore end-of-line style when computing similarity score for  rename detection
  2007-06-28  7:22   ` Johannes Sixt
@ 2007-06-28  8:16     ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2007-06-28  8:16 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git

Johannes Sixt <J.Sixt@eudaptics.com> writes:

> Steven Grimm wrote:
>> 
>> Signed-off-by: Steven Grimm <koreth@midwinter.com>
>> ---
>> Okay, let's try this again with an MUA that won't change my tabs to
>> spaces -- sorry about that.
>> 
>> A couple of source files got checked into my code base with DOS-style
>> end-of-line characters. I converted them to UNIX-style (the convention
>> for this project) in my branch. Then later, I renamed a couple of them.
>> 
>> Meanwhile, back in the original branch, someone else fixed a bug in one
>> of the files and checked it in, still with DOS-style line endings.
>> 
>> When I merged that change into my branch, git didn't detect the rename
>> because the fact that every line has a change (the end-of-line
>> character) dropped the similarity score way too low.
>> 
>> This patch teaches git to ignore end-of-line style when looking for
>> potential rename candidates. A separate question, which I expect may be
>> more controversial, is what to do with conflict markers; with this
>> patch, the entire file is still marked as in conflict if the end-of-line
>> style changes (but it's still an improvement in that we at least detect
>> the rename now.)
>
> I think that nobody would object to have a use-case description like
> this in the commit message...

Oh, yeah.  100% agreed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Ignore end-of-line style when computing similarity score for rename detection
  2007-06-28  6:04   ` Steven Grimm
  2007-06-28  6:18     ` Shawn O. Pearce
@ 2007-06-28 12:41     ` Johannes Schindelin
  2007-06-28 18:17       ` Steven Grimm
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Schindelin @ 2007-06-28 12:41 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Junio C Hamano, git

Hi,

On Wed, 27 Jun 2007, Steven Grimm wrote:

> Junio rightly points out that it would be a mistake to discard \r
> characters from binary files when computing similarity scores.

Somehow I think that this should be triggered by "--ignore-space-at-eol", 
_and_ be accompanied by a test case.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Ignore end-of-line style when computing similarity score for rename detection
  2007-06-28 12:41     ` Johannes Schindelin
@ 2007-06-28 18:17       ` Steven Grimm
  2007-06-29 10:19         ` Johannes Schindelin
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Grimm @ 2007-06-28 18:17 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git

Johannes Schindelin wrote:
> Somehow I think that this should be triggered by "--ignore-space-at-eol", 
> _and_ be accompanied by a test case.
>   

Should --ignore-space-at-eol be an option to git-merge? Merges are where 
this functionality matters; for simple diffs, --ignore-space-at-eol 
actually already covers it. If we allow that option, should we also 
allow other git-diff options like --ignore-all-space and 
--ignore-space-change? What are the semantics of an autoresolved merge 
with those options in effect -- are they only used for rename detection, 
or do we, e.g., not flag conflicts with only whitespace changes? And if 
we don't, which version do we accept automatically?

-Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Ignore end-of-line style when computing similarity score for rename detection
  2007-06-28  6:18     ` Shawn O. Pearce
@ 2007-06-29  6:34       ` Junio C Hamano
  0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2007-06-29  6:34 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Steven Grimm, git

"Shawn O. Pearce" <spearce@spearce.org> writes:

> Actually even better if you can also run the in/out filter things.
> I'm thinking of say an XML file that has had whitespace formatting
> changes, but whose XSD and processors ignore unnecessary whitespace.
> Be nice if the rename detection actually was able to canonicalize
> both files before detecting the rename, assuming both files had a
> canonicalizer input filter defined that does that...

This certainly is tempting, but I suspect that should be left to
later rounds.  I suspect that it would introduce a concept of
two different kinds of diffs, one to be mechanically processed
(i.e. use in merge with "git-merge-recursive", and apply with
"git-am"), and another to be inspected by humans to understand.
It often may be useful to munge the input for the latter case,
even though the output from comparing munged input files may not
be readily usable for mechanical application.

Which is not a bad thing per-se, but needs to be done carefully.

A replacement set for Steven to comment on follows.

  [PATCH 1/4] diffcore_count_changes: pass diffcore_filespec
  [PATCH 2/4] diffcore_filespec: add is_binary
  [PATCH 3/4] diffcore-delta.c: update the comment on the algorithm.
  [PATCH 4/4] diffcore-delta.c: Ignore CR in CRLF for text files

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] Ignore end-of-line style when computing similarity score for rename detection
  2007-06-28 18:17       ` Steven Grimm
@ 2007-06-29 10:19         ` Johannes Schindelin
  0 siblings, 0 replies; 11+ messages in thread
From: Johannes Schindelin @ 2007-06-29 10:19 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Junio C Hamano, git

Hi,

On Thu, 28 Jun 2007, Steven Grimm wrote:

> Johannes Schindelin wrote:
> > Somehow I think that this should be triggered by "--ignore-space-at-eol",
> > _and_ be accompanied by a test case.
> >   
> 
> Should --ignore-space-at-eol be an option to git-merge? Merges are where 
> this functionality matters; for simple diffs, --ignore-space-at-eol 
> actually already covers it.

Good point. However, I fail to see how the similarity detection should be 
so decoupled from the application. IOW what good is it if two files are 
rated similar if the merge cannot handle the CRLF/LF differences properly?

So two points here: since the merges are what you target, you definitely 
should mention that in the commit message. And you should make sure that 
all this trickery only kicks in when the merge has a chance to succeed.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2007-06-29 10:25 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-28  2:39 [PATCH] Ignore end-of-line style when computing similarity score for rename detection Steven Grimm
2007-06-28  2:46 ` Steven Grimm
2007-06-28  7:22   ` Johannes Sixt
2007-06-28  8:16     ` Junio C Hamano
2007-06-28  4:29 ` Junio C Hamano
2007-06-28  6:04   ` Steven Grimm
2007-06-28  6:18     ` Shawn O. Pearce
2007-06-29  6:34       ` Junio C Hamano
2007-06-28 12:41     ` Johannes Schindelin
2007-06-28 18:17       ` Steven Grimm
2007-06-29 10:19         ` Johannes Schindelin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).