git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] treat any file with NUL as binary
@ 2008-01-15 14:28 Dmitry Potapov
  2008-01-15 21:03 ` Steffen Prohaska
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Dmitry Potapov @ 2008-01-15 14:28 UTC (permalink / raw)
  To: Junio C Hamano, git; +Cc: Steffen Prohaska, Dmitry Potapov

There are two heuristics in Git to detect whether a file is binary
or text. One in xdiff-interface.c relied on existing NUL byte at
the beginning. However, convert.c used a different heuristic, which
relied that the number of non-printable symbols is less than 1%.

Due to difference in approaches whether a file is binary or not,
it was possible that a file that diff treats as binary will not be
treated as text by CRLF conversation. This is very confusing for
a user who seeing that 'git diff' shows file as binary expects it
to be added as binary.

This patch makes is_binary to consider any file that contains at
least one NUL character as binary.
---

Junio,

I believe that the current behavior where 'git diff' shows me a file
as binary and then adds it as text with crlf conversation is a bug.

Though, it is not very likely to happen, it still possible cases where
a binary file contains large amount of text. For instance, a tar file
of text files can be such a file. Probably, word processor that store
text in binary format may also generate a file with more 99% printable
characters. So, such files will be considered as text by current convert
heuristic. Still such files are considered by diff due present of a NUL
character. This is very confusing for a user to see 'git diff' saying
that a file is binary and then having it converted as text. Because I
don't think that any real text file (especially one that requires CRLF
conversation) may contain NUL character, I believe this change should
improve binary heuristic and avoid user confusion.

So, please, consider it for inclusion as a bug fix.

 convert.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/convert.c b/convert.c
index 5adef4f..a51da1f 100644
--- a/convert.c
+++ b/convert.c
@@ -17,8 +17,8 @@
 #define CRLF_INPUT	2
 
 struct text_stat {
-	/* CR, LF and CRLF counts */
-	unsigned cr, lf, crlf;
+	/* NUL, CR, LF and CRLF counts */
+	unsigned nul, cr, lf, crlf;
 
 	/* These are just approximations! */
 	unsigned printable, nonprintable;
@@ -51,6 +51,9 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 			case '\b': case '\t': case '\033': case '\014':
 				stats->printable++;
 				break;
+			case 0:
+				stats->nul++;
+				/* fall through */
 			default:
 				stats->nonprintable++;
 			}
@@ -66,6 +69,8 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 static int is_binary(unsigned long size, struct text_stat *stats)
 {
 
+	if (stats->nul)
+		return 1;
 	if ((stats->printable >> 7) < stats->nonprintable)
 		return 1;
 	/*
-- 
1.5.4.rc3.3.g0321b

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] treat any file with NUL as binary
  2008-01-15 14:28 [PATCH] treat any file with NUL as binary Dmitry Potapov
@ 2008-01-15 21:03 ` Steffen Prohaska
  2008-01-15 23:11 ` Junio C Hamano
  2008-01-16  1:21 ` Junio C Hamano
  2 siblings, 0 replies; 7+ messages in thread
From: Steffen Prohaska @ 2008-01-15 21:03 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: Junio C Hamano, git


On Jan 15, 2008, at 3:28 PM, Dmitry Potapov wrote:

> There are two heuristics in Git to detect whether a file is binary
> or text. One in xdiff-interface.c relied on existing NUL byte at
> the beginning. However, convert.c used a different heuristic, which
> relied that the number of non-printable symbols is less than 1%.
>
> Due to difference in approaches whether a file is binary or not,
> it was possible that a file that diff treats as binary will not be
> treated as text by CRLF conversation. This is very confusing for
> a user who seeing that 'git diff' shows file as binary expects it
> to be added as binary.
>
> This patch makes is_binary to consider any file that contains at
> least one NUL character as binary.

Shouldn't the commit message explicitly mention that the solution
is to make the check in convert.c stricter than the check in
xdiff-interface.c?  I think a comment in xdiff-interface.c
would also be a good thing to remember future developers about
this.


> ---
>
> Junio,
>
> I believe that the current behavior where 'git diff' shows me a file
> as binary and then adds it as text with crlf conversation is a bug.
>
> Though, it is not very likely to happen, it still possible cases where
> a binary file contains large amount of text. For instance, a tar file
> of text files can be such a file. Probably, word processor that store
> text in binary format may also generate a file with more 99% printable
> characters. So, such files will be considered as text by current  
> convert
> heuristic. Still such files are considered by diff due present of a  
> NUL
> character. This is very confusing for a user to see 'git diff' saying
> that a file is binary and then having it converted as text. Because I
> don't think that any real text file (especially one that requires CRLF
> conversation) may contain NUL character, I believe this change should
> improve binary heuristic and avoid user confusion.

I think this is a good idea.  When reading the code for the first
time it took me some time to accept that we really have different
ways for detecting a binary file and to understand how these
detections are related.

I also agree with Dimitry that convert.c should be stricter than
xdiff-interface.c, because everything else could be confusing. ...


> So, please, consider it for inclusion as a bug fix.

... H\x05ence, this should be considered a bug fix.

The patch looks good to me.

	Steffen

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] treat any file with NUL as binary
  2008-01-15 14:28 [PATCH] treat any file with NUL as binary Dmitry Potapov
  2008-01-15 21:03 ` Steffen Prohaska
@ 2008-01-15 23:11 ` Junio C Hamano
  2008-01-16  1:13   ` Dmitry Potapov
  2008-01-16  1:21 ` Junio C Hamano
  2 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2008-01-15 23:11 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: git, Steffen Prohaska

Dmitry Potapov <dpotapov@gmail.com> writes:

> So, please, consider it for inclusion as a bug fix.

Somebody has to go back to the "git log" output and the list
archive to see if you two did not forget other ramifications,
because I vaguely recall this 1% thing was done for a reason and
Linus had a very good argument (at least back then the argument
sounded very good to me) supporting the deliberate difference
between the two "binary" heuristics.

If I did not have that vague recollection and all I had to judge
the proposed change were the issue in this thread, I'd probably
agree this is a good change, though.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] treat any file with NUL as binary
  2008-01-15 23:11 ` Junio C Hamano
@ 2008-01-16  1:13   ` Dmitry Potapov
  2008-01-16  1:16     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Potapov @ 2008-01-16  1:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Steffen Prohaska

On Tue, Jan 15, 2008 at 03:11:07PM -0800, Junio C Hamano wrote:
> Dmitry Potapov <dpotapov@gmail.com> writes:
> 
> > So, please, consider it for inclusion as a bug fix.
> 
> Somebody has to go back to the "git log" output and the list
> archive to see if you two did not forget other ramifications,
> because I vaguely recall this 1% thing was done for a reason and
> Linus had a very good argument (at least back then the argument
> sounded very good to me) supporting the deliberate difference
> between the two "binary" heuristics.

First of all, my patch does not make them being the same, it just
makes one being stricter than the other, and I explained why it
is the tight thing to do.

Second, it is difficult for me to find to what particular words
of Linus *you* refer to. However, if it is something like this
post:

http://article.gmane.org/gmane.comp.version-control.git/39618

Then it seems to me, Linus sounded more in favor of that change than
against it. His main argument was against 'diff' heuristic, which he
felt was not strict enough for CRLF translation: "It's *much* better
to miss some CRLF translation than to do too much of it."


Dmitry

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] treat any file with NUL as binary
  2008-01-16  1:13   ` Dmitry Potapov
@ 2008-01-16  1:16     ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2008-01-16  1:16 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: git, Steffen Prohaska

Dmitry Potapov <dpotapov@gmail.com> writes:

> Then it seems to me, Linus sounded more in favor of that change than
> against it. His main argument was against 'diff' heuristic, which he
> felt was not strict enough for CRLF translation: "It's *much* better
> to miss some CRLF translation than to do too much of it."

Yeah, you found the right one, and I agreed with the argument
back then, and I agree with it now.  I think NUL heuristics is a
good one and obviously it makes it tighter than what diff uses,
which should be what we are aiming for in this thread.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] treat any file with NUL as binary
  2008-01-15 14:28 [PATCH] treat any file with NUL as binary Dmitry Potapov
  2008-01-15 21:03 ` Steffen Prohaska
  2008-01-15 23:11 ` Junio C Hamano
@ 2008-01-16  1:21 ` Junio C Hamano
  2008-01-16  1:59   ` Dmitry Potapov
  2 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2008-01-16  1:21 UTC (permalink / raw)
  To: Dmitry Potapov; +Cc: git, Steffen Prohaska

Dmitry Potapov <dpotapov@gmail.com> writes:

> There are two heuristics in Git to detect whether a file is binary
> or text. One in xdiff-interface.c relied on existing NUL byte at

"relies on" (not past tense); we may want to say that it is
stolen from GNU diff to be compatible.

> the beginning. However, convert.c used a different heuristic, which
> relied that the number of non-printable symbols is less than 1%.
>
> Due to difference in approaches whether a file is binary or not,
> it was possible that a file that diff treats as binary will not be
> treated as text by CRLF conversation. This is very confusing for

"conversion".

> a user who seeing that 'git diff' shows file as binary expects it

"sees".

> to be added as binary.
>
> This patch makes is_binary to consider any file that contains at
> least one NUL character as binary.
> ---
>
> So, please, consider it for inclusion as a bug fix.

Please typofix and apply "s/.$/, to ensure that the heuristics
used for CRLF conversion is tighter than what is used by diff./"
or something like that at the end.

Also please add sign-off.  The patch looks correct.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] treat any file with NUL as binary
  2008-01-16  1:21 ` Junio C Hamano
@ 2008-01-16  1:59   ` Dmitry Potapov
  0 siblings, 0 replies; 7+ messages in thread
From: Dmitry Potapov @ 2008-01-16  1:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Steffen Prohaska, Dmitry Potapov

There are two heuristics in Git to detect whether a file is binary
or text. One in xdiff-interface.c (which is taken from GNU diff)
relies on existence of the NUL byte at the beginning. However,
convert.c used a different heuristic, which relied on the percent
of non-printable symbols (less than 1% for text files).

Due to differences in detection whether a file is binary or not,
it was possible that a file that diff treats as binary could be
treated as text by CRLF conversion. This is very confusing for a
user who sees that 'git diff' shows the file as binary expects it
to be added as binary.

This patch makes is_binary to consider any file that contains at
least one NUL character as binary, to ensure that the heuristics
used for CRLF conversion is tighter than what is used by diff.

Signed-off-by: Dmitry Potapov <dpotapov@gmail.com>
---
 convert.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/convert.c b/convert.c
index 5adef4f..a51da1f 100644
--- a/convert.c
+++ b/convert.c
@@ -17,8 +17,8 @@
 #define CRLF_INPUT	2
 
 struct text_stat {
-	/* CR, LF and CRLF counts */
-	unsigned cr, lf, crlf;
+	/* NUL, CR, LF and CRLF counts */
+	unsigned nul, cr, lf, crlf;
 
 	/* These are just approximations! */
 	unsigned printable, nonprintable;
@@ -51,6 +51,9 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 			case '\b': case '\t': case '\033': case '\014':
 				stats->printable++;
 				break;
+			case 0:
+				stats->nul++;
+				/* fall through */
 			default:
 				stats->nonprintable++;
 			}
@@ -66,6 +69,8 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 static int is_binary(unsigned long size, struct text_stat *stats)
 {
 
+	if (stats->nul)
+		return 1;
 	if ((stats->printable >> 7) < stats->nonprintable)
 		return 1;
 	/*
-- 
1.5.3.5

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-01-16  1:59 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-15 14:28 [PATCH] treat any file with NUL as binary Dmitry Potapov
2008-01-15 21:03 ` Steffen Prohaska
2008-01-15 23:11 ` Junio C Hamano
2008-01-16  1:13   ` Dmitry Potapov
2008-01-16  1:16     ` Junio C Hamano
2008-01-16  1:21 ` Junio C Hamano
2008-01-16  1:59   ` Dmitry Potapov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).