git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] treat any file with NUL as binary
@ 2008-01-15 14:28 Dmitry Potapov
  2008-01-15 21:03 ` Steffen Prohaska
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Dmitry Potapov @ 2008-01-15 14:28 UTC (permalink / raw)
  To: Junio C Hamano, git; +Cc: Steffen Prohaska, Dmitry Potapov

There are two heuristics in Git to detect whether a file is binary
or text. One in xdiff-interface.c relied on existing NUL byte at
the beginning. However, convert.c used a different heuristic, which
relied that the number of non-printable symbols is less than 1%.

Due to difference in approaches whether a file is binary or not,
it was possible that a file that diff treats as binary will not be
treated as text by CRLF conversation. This is very confusing for
a user who seeing that 'git diff' shows file as binary expects it
to be added as binary.

This patch makes is_binary to consider any file that contains at
least one NUL character as binary.
---

Junio,

I believe that the current behavior where 'git diff' shows me a file
as binary and then adds it as text with crlf conversation is a bug.

Though, it is not very likely to happen, it still possible cases where
a binary file contains large amount of text. For instance, a tar file
of text files can be such a file. Probably, word processor that store
text in binary format may also generate a file with more 99% printable
characters. So, such files will be considered as text by current convert
heuristic. Still such files are considered by diff due present of a NUL
character. This is very confusing for a user to see 'git diff' saying
that a file is binary and then having it converted as text. Because I
don't think that any real text file (especially one that requires CRLF
conversation) may contain NUL character, I believe this change should
improve binary heuristic and avoid user confusion.

So, please, consider it for inclusion as a bug fix.

 convert.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/convert.c b/convert.c
index 5adef4f..a51da1f 100644
--- a/convert.c
+++ b/convert.c
@@ -17,8 +17,8 @@
 #define CRLF_INPUT	2
 
 struct text_stat {
-	/* CR, LF and CRLF counts */
-	unsigned cr, lf, crlf;
+	/* NUL, CR, LF and CRLF counts */
+	unsigned nul, cr, lf, crlf;
 
 	/* These are just approximations! */
 	unsigned printable, nonprintable;
@@ -51,6 +51,9 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 			case '\b': case '\t': case '\033': case '\014':
 				stats->printable++;
 				break;
+			case 0:
+				stats->nul++;
+				/* fall through */
 			default:
 				stats->nonprintable++;
 			}
@@ -66,6 +69,8 @@ static void gather_stats(const char *buf, unsigned long size, struct text_stat *
 static int is_binary(unsigned long size, struct text_stat *stats)
 {
 
+	if (stats->nul)
+		return 1;
 	if ((stats->printable >> 7) < stats->nonprintable)
 		return 1;
 	/*
-- 
1.5.4.rc3.3.g0321b

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-01-16  1:59 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-15 14:28 [PATCH] treat any file with NUL as binary Dmitry Potapov
2008-01-15 21:03 ` Steffen Prohaska
2008-01-15 23:11 ` Junio C Hamano
2008-01-16  1:13   ` Dmitry Potapov
2008-01-16  1:16     ` Junio C Hamano
2008-01-16  1:21 ` Junio C Hamano
2008-01-16  1:59   ` Dmitry Potapov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).