[RFC/PATCH] fix "git diff" to create wrong UTF-8 text

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC/PATCH] fix "git diff" to create wrong UTF-8 text
@ 2008-01-01 23:20 Tsugikazu Shibata
  2008-01-02  5:26 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Tsugikazu Shibata @ 2008-01-01 23:20 UTC (permalink / raw)
  To: git; +Cc: tshibata

Hello,

I met a problem in patch text from "git diff" for UTF-8 text.
Patch text following to "@@" sometimes cut the string with max
80bytes. In case of UTF-8 text written in Japanese and English, most
of Japanese character are consist of 3 bytes for a character and also
ASCII character is single byte.
So, cut the string with 80bytes may cause cut off 1 or 2 byte for a
character at the bottom. This will cause the broken code of result of
"git diff".

It seems no problem to read such patch text for the patch command but
the problem is not readable for me. ie. Emacs cannot handle the
encoding for such file and show me octal numbers.

The patch below is my quick and dirty solution (but It works fine !)
I tested this patch with using Linux kernel document
(Documentation/ja_JP/HOWTO)
I believe this should be work for another language using UTF-8 and
solve this issue.

Please note that this is focused only for UTF-8 but we may need to
support another encoding.
So, How can we turn on this UTF-8 processing?
Any suggestions are welcome.

Thanks,

Sigined-off-by: Tsugikazu Shibata <tshibata@ab.jp.nec.com>
---

diff -upr git-1.5.3.7/xdiff/xutils.c git-1.5.3.7-dev/xdiff/xutils.c
--- git-1.5.3.7/xdiff/xutils.c	2007-12-02 06:21:12.000000000 +0900
+++ git-1.5.3.7-dev/xdiff/xutils.c	2007-12-31 01:30:51.000000000 +0900
@@ -332,6 +332,32 @@ long xdl_atol(char const *str, char cons
 }

+/* return utf character size of bytes */
+int utf8charsize(const unsigned char c) {
+	int l;
+	if ( c < 0x7f ) l = 1;
+	else if (( c > 0xc0) && ( c < 0xdf)) l=2;
+	else if (( c > 0xe0) && ( c < 0xef)) l=3;
+	else if (( c > 0xf0) && ( c < 0xf7)) l=4;
+	else if (( c > 0xf8) && ( c < 0xfb)) l=5;
+	else if (( c > 0xfc) && ( c < 0xfd)) l=6;
+	else l=1; /* fale safe */
+	return l;
+}
+
+int utf8width(const char *up, int len) {
+        int cs;
+        int l=len;
+        const char *p = up;
+        while ((l > 0) && (p[0] != '\0')) {
+		cs = utf8charsize(p[0]);
+		if (l >= cs) {
+			l -= cs; p += cs;
+		} else l=0; /* do not split multi byte char. */
+        }
+        return p-up;
+}
+
 int xdl_emit_hunk_hdr(long s1, long c1, long s2, long c2,
 		      const char *func, long funclen, xdemitcb_t *ecb) {
 	int nb = 0;
@@ -368,6 +394,7 @@ int xdl_emit_hunk_hdr(long s1, long c1,
 		buf[nb++] = ' ';
 		if (funclen > sizeof(buf) - nb - 1)
 			funclen = sizeof(buf) - nb - 1;
+		funclen = utf8width(func, funclen);
 		memcpy(buf + nb, func, funclen);
 		nb += funclen;
 	}

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC/PATCH] fix "git diff" to create wrong UTF-8 text
  2008-01-01 23:20 [RFC/PATCH] fix "git diff" to create wrong UTF-8 text Tsugikazu Shibata
@ 2008-01-02  5:26 ` Junio C Hamano
  2008-01-02  9:49   ` [PATCH 1/2] utf8_width(): allow non NUL-terminated input Junio C Hamano
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Junio C Hamano @ 2008-01-02  5:26 UTC (permalink / raw)
  To: Tsugikazu Shibata; +Cc: git

Tsugikazu Shibata <tshibata@ab.jp.nec.com> writes:

> I believe this should be work for another language using UTF-8 and
> solve this issue.
> ...
> @@ -368,6 +394,7 @@ int xdl_emit_hunk_hdr(long s1, long c1,
>  		buf[nb++] = ' ';
>  		if (funclen > sizeof(buf) - nb - 1)
>  			funclen = sizeof(buf) - nb - 1;
> +		funclen = utf8width(func, funclen);
>  		memcpy(buf + nb, func, funclen);
>  		nb += funclen;
>  	}

I'd rather not do this in xdiff/ level for two reasons.

We consider the functions there strictly "borrowed code", and
ideally I'd rather even not to have that "chop down to funclen"
logic at that level.

The code at that level does not know what paths it is dealing
with and cannot consult git specific data (i.e. attributes); it
would make it harder to enhance it later by introducing per-path
encoding information.

How about...

 * (optional) lift the funclen limit from xdl_emit_hunk_hdr()
   and xdl_emit_diff() and have them preserve the full line;

 * Around ll.554 in diff.c::fn_out_consume(), look at
   line[i..eol] and apply the "chomp between character" logic
   there.  I think it is very sensible to use the UTF-8 logic by
   default as you did above (but I suspect you may be able to
   reuse helper functions in utf8.c such as git_wcwidth(),
   instead of rolling your own).  Chomp the funcname line given
   from xdiff layer in this function.

 * (future) make the length of the funcname line configurable
   either from the command line or configuration.

 * (future) add per-path blob encoding information (default to
   UTF-8) to struct emit_callback, and initialize it from the
   gitattributes(5) mechanism, just like we have added ws_rule
   recently there.  Use that to decide how inter-character chomp
   works to customize the logic in diff.c::fn_out_consume() you
   would introduce above.

I think the two future enhancements I listed above as examples
would be cleaner to implement if the line chomping is done in
fn_out_consume().

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] utf8_width(): allow non NUL-terminated input
  2008-01-02  5:26 ` Junio C Hamano
@ 2008-01-02  9:49   ` Junio C Hamano
  2008-01-02  9:50   ` [PATCH 2/2] diff: do not chomp hunk-header in the middle of a character Junio C Hamano
  2008-01-02  9:50   ` [PATCH 3/2] attribute "coding": specify blob encoding Junio C Hamano
  2 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2008-01-02  9:49 UTC (permalink / raw)
  To: Tsugikazu Shibata; +Cc: git

The original interface assumed that the input string to be
always terminated with NUL, but that wasn't too useful.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * This is a preparatory step for the next one that follows what
   I outlined in my response.  As we are in pre-1.5.4 freeze, I
   won't be applying this to 'master' but this and the one that
   implements the sane truncation might be 'next' material.

 utf8.c |   68 ++++++++++++++++++++++++++++++++++++++-------------------------
 utf8.h |    2 +-
 2 files changed, 42 insertions(+), 28 deletions(-)

diff --git a/utf8.c b/utf8.c
index 9efcdb9..5d641fa 100644
--- a/utf8.c
+++ b/utf8.c
@@ -157,57 +157,71 @@ static int git_wcwidth(ucs_char_t ch)
  * pointed to by the variable start. The pointer is updated to point at
  * the next character. If it was not valid UTF-8, the pointer is set to NULL.
  */
-int utf8_width(const char **start)
+int utf8_width(const char **start, size_t *remainder_p)
 {
 	unsigned char *s = (unsigned char *)*start;
 	ucs_char_t ch;
+	size_t remainder, incr;
 
-	if (*s < 0x80) {
+	/*
+	 * A caller that assumes NUL terminated text can choose
+	 * not to bother with the remainder length.
+	 */
+	remainder = (remainder_p ? *remainder_p : 999);
+
+	if (remainder < 1) {
+		goto invalid;
+	} else if (*s < 0x80) {
 		/* 0xxxxxxx */
 		ch = *s;
-		*start += 1;
+		incr = 1;
 	} else if ((s[0] & 0xe0) == 0xc0) {
 		/* 110XXXXx 10xxxxxx */
-		if ((s[1] & 0xc0) != 0x80 ||
-				/* overlong? */
-				(s[0] & 0xfe) == 0xc0)
+		if (remainder < 2 ||
+		    (s[1] & 0xc0) != 0x80 ||
+		    (s[0] & 0xfe) == 0xc0)
 			goto invalid;
 		ch = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
-		*start += 2;
+		incr = 2;
 	} else if ((s[0] & 0xf0) == 0xe0) {
 		/* 1110XXXX 10Xxxxxx 10xxxxxx */
-		if ((s[1] & 0xc0) != 0x80 ||
-				(s[2] & 0xc0) != 0x80 ||
-				/* overlong? */
-				(s[0] == 0xe0 && (s[1] & 0xe0) == 0x80) ||
-				/* surrogate? */
-				(s[0] == 0xed && (s[1] & 0xe0) == 0xa0) ||
-				/* U+FFFE or U+FFFF? */
-				(s[0] == 0xef && s[1] == 0xbf &&
-				 (s[2] & 0xfe) == 0xbe))
+		if (remainder < 3 ||
+		    (s[1] & 0xc0) != 0x80 ||
+		    (s[2] & 0xc0) != 0x80 ||
+		    /* overlong? */
+		    (s[0] == 0xe0 && (s[1] & 0xe0) == 0x80) ||
+		    /* surrogate? */
+		    (s[0] == 0xed && (s[1] & 0xe0) == 0xa0) ||
+		    /* U+FFFE or U+FFFF? */
+		    (s[0] == 0xef && s[1] == 0xbf &&
+		     (s[2] & 0xfe) == 0xbe))
 			goto invalid;
 		ch = ((s[0] & 0x0f) << 12) |
 			((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
-		*start += 3;
+		incr = 3;
 	} else if ((s[0] & 0xf8) == 0xf0) {
 		/* 11110XXX 10XXxxxx 10xxxxxx 10xxxxxx */
-		if ((s[1] & 0xc0) != 0x80 ||
-				(s[2] & 0xc0) != 0x80 ||
-				(s[3] & 0xc0) != 0x80 ||
-				/* overlong? */
-				(s[0] == 0xf0 && (s[1] & 0xf0) == 0x80) ||
-				/* > U+10FFFF? */
-				(s[0] == 0xf4 && s[1] > 0x8f) || s[0] > 0xf4)
+		if (remainder < 4 ||
+		    (s[1] & 0xc0) != 0x80 ||
+		    (s[2] & 0xc0) != 0x80 ||
+		    (s[3] & 0xc0) != 0x80 ||
+		    /* overlong? */
+		    (s[0] == 0xf0 && (s[1] & 0xf0) == 0x80) ||
+		    /* > U+10FFFF? */
+		    (s[0] == 0xf4 && s[1] > 0x8f) || s[0] > 0xf4)
 			goto invalid;
 		ch = ((s[0] & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
 			((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
-		*start += 4;
+		incr = 4;
 	} else {
 invalid:
 		*start = NULL;
 		return 0;
 	}
 
+	*start += incr;
+	if (remainder_p)
+		*remainder_p = remainder - incr;
 	return git_wcwidth(ch);
 }
 
@@ -218,7 +232,7 @@ int is_utf8(const char *text)
 			text++;
 			continue;
 		}
-		utf8_width(&text);
+		utf8_width(&text, NULL);
 		if (!text)
 			return 0;
 	}
@@ -278,7 +292,7 @@ int print_wrapped_text(const char *text, int indent, int indent2, int width)
 			continue;
 		}
 		if (assume_utf8)
-			w += utf8_width(&text);
+			w += utf8_width(&text, NULL);
 		else {
 			w++;
 			text++;
diff --git a/utf8.h b/utf8.h
index 15db6f1..4295a19 100644
--- a/utf8.h
+++ b/utf8.h
@@ -1,7 +1,7 @@
 #ifndef GIT_UTF8_H
 #define GIT_UTF8_H
 
-int utf8_width(const char **start);
+int utf8_width(const char **start, size_t *remain);
 int is_utf8(const char *text);
 int is_encoding_utf8(const char *name);
 
-- 
1.5.4.rc2.13.g690bd

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] diff: do not chomp hunk-header in the middle of a character
  2008-01-02  5:26 ` Junio C Hamano
  2008-01-02  9:49   ` [PATCH 1/2] utf8_width(): allow non NUL-terminated input Junio C Hamano
@ 2008-01-02  9:50   ` Junio C Hamano
  2008-01-02  9:50   ` [PATCH 3/2] attribute "coding": specify blob encoding Junio C Hamano
  2 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2008-01-02  9:50 UTC (permalink / raw)
  To: Tsugikazu Shibata; +Cc: git

We truncate hunk-header line at 80 bytes, but that 80th byte
could be in the middle of a character, which is bad.  This uses
utf8_width() function to make sure we do not cut a character in
the middle.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * And this implements the sane_truncate_line() that
   unconditionally assumes UTF-8.

   The 80-byte cutoff is meant to chomp at 80 display columns,
   and in many encodings including UTF-8 80 cols are usually
   more than 80 bytes.  To implement 80 column chomping, we
   would need to tweak the low level xdl_find_func() interface
   further so that it gives us back more than 80-bytes.  Ideally
   that layer should not be doing any chomping.

   Another thing I noticed but did not bother to fix was that
   the emit_line() interface is doing bad things with color.
   It emits:

	start-color text (= line, len) reset-color

   and the text often ends with LF.  We should emit reset-color
   before the terminating LF instead.

 diff.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/diff.c b/diff.c
index 5bdc111..b9159bc 100644
--- a/diff.c
+++ b/diff.c
@@ -10,6 +10,7 @@
 #include "color.h"
 #include "attr.h"
 #include "run-command.h"
+#include "utf8.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
@@ -465,10 +466,13 @@ static void diff_words_show(struct diff_words_data *diff_words)
 	}
 }
 
+typedef unsigned long (*sane_truncate_fn)(char *line, unsigned long len);
+
 struct emit_callback {
 	struct xdiff_emit_state xm;
 	int nparents, color_diff;
 	unsigned ws_rule;
+	sane_truncate_fn truncate;
 	const char **label_path;
 	struct diff_words_data *diff_words;
 	int *found_changesp;
@@ -521,6 +525,27 @@ static void emit_add_line(const char *reset, struct emit_callback *ecbdata, cons
 	}
 }
 
+static unsigned long sane_truncate_line(struct emit_callback *ecb, char *line, unsigned long len)
+{
+	const char *cp;
+	unsigned long width, allot;
+
+	if (ecb->truncate)
+		return ecb->truncate(line, len);
+	cp = line;
+	width = 0;
+	allot = len;
+	while (0 < len) {
+		int w = utf8_width(&cp, &len);
+		if (!cp)
+			break; /* truncated in the middle? */
+		width += w;
+		if (allot <= width)
+			break;
+	}
+	return allot - len;
+}
+
 static void fn_out_consume(void *priv, char *line, unsigned long len)
 {
 	int i;
@@ -551,8 +576,11 @@ static void fn_out_consume(void *priv, char *line, unsigned long len)
 		;
 	if (2 <= i && i < len && line[i] == ' ') {
 		ecbdata->nparents = i - 1;
+		len = sane_truncate_line(ecbdata, line, len);
 		emit_line(diff_get_color(ecbdata->color_diff, DIFF_FRAGINFO),
 			  reset, line, len);
+		if (line[len-1] != '\n')
+			putchar('\n');
 		return;
 	}
 
-- 
1.5.4.rc2.13.g690bd

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/2] attribute "coding": specify blob encoding
  2008-01-02  5:26 ` Junio C Hamano
  2008-01-02  9:49   ` [PATCH 1/2] utf8_width(): allow non NUL-terminated input Junio C Hamano
  2008-01-02  9:50   ` [PATCH 2/2] diff: do not chomp hunk-header in the middle of a character Junio C Hamano
@ 2008-01-02  9:50   ` Junio C Hamano
  2008-01-03 21:23     ` しらいしななこ
  2 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2008-01-02  9:50 UTC (permalink / raw)
  To: Tsugikazu Shibata; +Cc: git

This teaches "diff hunk header" function about custom character
encoding per path via gitattributes(5) so that it can sensibly
chomp a line without truncating a character in the middle.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * This is not intended for serious inclusion, but was done more
   as a demonstration of the concept, hence [3/2].

 Makefile   |    5 ++-
 coding.c   |   62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 coding.h   |   10 +++++++++
 diff.c     |   48 ++++++++++++++++++++++++++++++++++++++++-----
 diffcore.h |    1 +
 5 files changed, 118 insertions(+), 8 deletions(-)
 create mode 100644 coding.c
 create mode 100644 coding.h

diff --git a/Makefile b/Makefile
index 21c80e6..de205e6 100644
--- a/Makefile
+++ b/Makefile
@@ -293,7 +293,8 @@ LIB_H = \
 	run-command.h strbuf.h tag.h tree.h git-compat-util.h revision.h \
 	tree-walk.h log-tree.h dir.h path-list.h unpack-trees.h builtin.h \
 	utf8.h reflog-walk.h patch-ids.h attr.h decorate.h progress.h \
-	mailmap.h remote.h parse-options.h transport.h diffcore.h hash.h
+	mailmap.h remote.h parse-options.h transport.h diffcore.h hash.h \
+	coding.h
 
 DIFF_OBJS = \
 	diff.o diff-lib.o diffcore-break.o diffcore-order.o \
@@ -316,7 +317,7 @@ LIB_OBJS = \
 	alloc.o merge-file.o path-list.o help.o unpack-trees.o $(DIFF_OBJS) \
 	color.o wt-status.o archive-zip.o archive-tar.o shallow.o utf8.o \
 	convert.o attr.o decorate.o progress.o mailmap.o symlinks.o remote.o \
-	transport.o bundle.o walker.o parse-options.o ws.o
+	transport.o bundle.o walker.o parse-options.o ws.o coding.o
 
 BUILTIN_OBJS = \
 	builtin-add.o \
diff --git a/coding.c b/coding.c
new file mode 100644
index 0000000..63c5dc7
--- /dev/null
+++ b/coding.c
@@ -0,0 +1,62 @@
+#include "coding.h"
+
+unsigned long truncate_euc_jp(char *line, unsigned long len)
+{
+	/*
+	 * line contains len bytes but the end of the string could
+	 * be incomplete (chomped in the middle of a character).
+	 * grab complete lines and count the display width, and
+	 * return the number of bytes to chomp at (which could
+	 * be smaller than 'len' given, to discard the corrupt
+	 * characters at the end
+	 */
+	char *ep = line + len;
+	unsigned long sofar = 0;
+
+	while (line < ep) {
+		int ch = (*line++ & 0xFF);
+		int width;
+
+		if (ch < 0x80) {
+			/* JIS X 0201 lower half */
+			sofar++;
+			width = 1;
+		} else if (0xA1 <= ch && ch <= 0xFE) {
+			/* JIS X 0208 */
+			if (ep <= line)
+				goto invalid;
+			ch = (*line++ & 0xFF);
+			if (ch < 0xA1 || 0xFE < ch)
+				goto invalid;
+			width = 2;
+			sofar += 2;
+		} else if (ch == 0x8E) {
+			/* JIS X 0201 upper half */
+			if (ep <= line)
+				goto invalid;
+			ch = (*line++ & 0xFF);
+			if (ch <= 0xA1 || 0xDF < ch)
+				goto invalid;
+			width = 1;
+			sofar += 2;
+		} else if (ch == 0x8F) {
+			/* JIS X 0212 */
+			if (ep <= line)
+				goto invalid;
+			ch = (*line++ & 0xFF);
+			if (ch < 0xA1 || 0xFE < ch)
+				goto invalid;
+			if (ep <= line)
+				goto invalid;
+			ch = (*line++ & 0xFF);
+			if (ch < 0xA1 || 0xFE < ch)
+				goto invalid;
+			width = 2;
+			sofar += 3;
+		} else {
+		invalid:
+			return sofar;
+		}
+	}
+	return sofar;
+}
diff --git a/coding.h b/coding.h
new file mode 100644
index 0000000..0e02701
--- /dev/null
+++ b/coding.h
@@ -0,0 +1,10 @@
+#ifndef CODING_H
+#define CODING_H
+
+#include "cache.h"
+
+typedef unsigned long (*sane_truncate_fn)(char *line, unsigned long len);
+
+unsigned long truncate_euc_jp(char *, unsigned long);
+
+#endif
diff --git a/diff.c b/diff.c
index b9159bc..40c8d2b 100644
--- a/diff.c
+++ b/diff.c
@@ -11,6 +11,7 @@
 #include "attr.h"
 #include "run-command.h"
 #include "utf8.h"
+#include "coding.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
@@ -466,8 +467,6 @@ static void diff_words_show(struct diff_words_data *diff_words)
 	}
 }
 
-typedef unsigned long (*sane_truncate_fn)(char *line, unsigned long len);
-
 struct emit_callback {
 	struct xdiff_emit_state xm;
 	int nparents, color_diff;
@@ -1125,30 +1124,34 @@ static void emit_binary_diff(mmfile_t *one, mmfile_t *two)
 static void setup_diff_attr_check(struct git_attr_check *check)
 {
 	static struct git_attr *attr_diff;
+	static struct git_attr *attr_coding;
 
 	if (!attr_diff) {
 		attr_diff = git_attr("diff", 4);
+		attr_coding = git_attr("coding", 6);
 	}
 	check[0].attr = attr_diff;
+	check[1].attr = attr_coding;
 }
 
 static void diff_filespec_check_attr(struct diff_filespec *one)
 {
-	struct git_attr_check attr_diff_check;
+	struct git_attr_check attr_diff_check[2];
 	int check_from_data = 0;
 
 	if (one->checked_attr)
 		return;
 
-	setup_diff_attr_check(&attr_diff_check);
+	setup_diff_attr_check(attr_diff_check);
 	one->is_binary = 0;
 	one->funcname_pattern_ident = NULL;
+	one->coding = NULL;
 
-	if (!git_checkattr(one->path, 1, &attr_diff_check)) {
+	if (!git_checkattr(one->path, 2, attr_diff_check)) {
 		const char *value;
 
 		/* binaryness */
-		value = attr_diff_check.value;
+		value = attr_diff_check[0].value;
 		if (ATTR_TRUE(value))
 			;
 		else if (ATTR_FALSE(value))
@@ -1161,6 +1164,15 @@ static void diff_filespec_check_attr(struct diff_filespec *one)
 			;
 		else
 			one->funcname_pattern_ident = value;
+
+		/* coding */
+		value = attr_diff_check[1].value;
+		if (ATTR_TRUE(value) ||
+		    ATTR_FALSE(value) ||
+		    ATTR_UNSET(value))
+			;
+		else
+			one->coding = value;
 	}
 
 	if (check_from_data) {
@@ -1233,6 +1245,27 @@ static const char *diff_funcname_pattern(struct diff_filespec *one)
 	return NULL;
 }
 
+static struct {
+	const char *coding;
+	sane_truncate_fn fn;
+} builtin_truncate_fn[] = {
+	{ "euc-jp", truncate_euc_jp },
+	{ "utf-8", NULL },
+};
+
+static sane_truncate_fn diff_truncate_line_fn(struct diff_filespec *one)
+{
+	int i;
+
+	diff_filespec_check_attr(one);
+	if (!one->coding)
+		return NULL;
+	for (i = 0; i < ARRAY_SIZE(builtin_truncate_fn); i++)
+		if (!strcmp(one->coding, builtin_truncate_fn[i].coding))
+			return builtin_truncate_fn[i].fn;
+	return NULL;
+}
+
 static void builtin_diff(const char *name_a,
 			 const char *name_b,
 			 struct diff_filespec *one,
@@ -1318,6 +1351,9 @@ static void builtin_diff(const char *name_a,
 		ecbdata.color_diff = DIFF_OPT_TST(o, COLOR_DIFF);
 		ecbdata.found_changesp = &o->found_changes;
 		ecbdata.ws_rule = whitespace_rule(name_b ? name_b : name_a);
+		ecbdata.truncate = diff_truncate_line_fn(one);
+		if (!ecbdata.truncate)
+			ecbdata.truncate = diff_truncate_line_fn(two);
 		xpp.flags = XDF_NEED_MINIMAL | o->xdl_opts;
 		xecfg.ctxlen = o->context;
 		xecfg.flags = XDL_EMIT_FUNCNAMES;
diff --git a/diffcore.h b/diffcore.h
index cc96c20..711a4df 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -28,6 +28,7 @@ struct diff_filespec {
 	void *data;
 	void *cnt_data;
 	const char *funcname_pattern_ident;
+	const char *coding;      /* blob text coding */
 	unsigned long size;
 	int count;               /* Reference count */
 	int xfrm_flags;		 /* for use by the xfrm */
-- 
1.5.4.rc2.13.g690bd

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/2] attribute "coding": specify blob encoding
  2008-01-02  9:50   ` [PATCH 3/2] attribute "coding": specify blob encoding Junio C Hamano
@ 2008-01-03 21:23     ` しらいしななこ
  2008-01-03 21:54       ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: しらいしななこ @ 2008-01-03 21:23 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Tsugikazu Shibata, git

Quoting Junio C Hamano <gitster@pobox.com>:

> This teaches "diff hunk header" function about custom character
> encoding per path via gitattributes(5) so that it can sensibly
> chomp a line without truncating a character in the middle.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>
>  * This is not intended for serious inclusion, but was done more
>    as a demonstration of the concept, hence [3/2].

Why not?  It looks a useful addition for us Japanese people.

> +static struct {
> +	const char *coding;
> +	sane_truncate_fn fn;
> +} builtin_truncate_fn[] = {
> +	{ "euc-jp", truncate_euc_jp },
> +	{ "utf-8", NULL },
> +};

Can you also do JIS and Shift JIS?  I ask because many of my old notes are in Shift JIS and I think it is the same for many other people.

-- 
Nanako Shiraishi
http://ivory.ap.teacup.com/nanako3/

----------------------------------------------------------------------
Get a free email account with anti spam protection.
http://www.bluebottle.com/tag/2

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/2] attribute "coding": specify blob encoding
  2008-01-03 21:23     ` しらいしななこ
@ 2008-01-03 21:54       ` Junio C Hamano
  2008-01-04 16:16         ` Tsugikazu Shibata
  0 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2008-01-03 21:54 UTC (permalink / raw)
  To: しらいしななこ
  Cc: Tsugikazu Shibata, git

しらいしななこ  <nanako3@bluebottle.com> writes:

> Quoting Junio C Hamano <gitster@pobox.com>:
> 
>> This teaches "diff hunk header" function about custom character
>> encoding per path via gitattributes(5) so that it can sensibly
>> chomp a line without truncating a character in the middle.
>>
>> Signed-off-by: Junio C Hamano <gitster@pobox.com>
>> ---
>>
>>  * This is not intended for serious inclusion, but was done more
>>    as a demonstration of the concept, hence [3/2].
>
> Why not?  It looks a useful addition for us Japanese people.

    (offtopic) I was once told that "us Japanese people" is a
    bad thing to say in public because it sets an unfriendly
    tone by creating a psychological divide between "us" and
    "others".  After all I am one of you ;-)

The reason I do not like the patch as-is is because I have
doubts about the way "coding" acts in the patch.

There already is clean/smudge filter mechanism.  Even if your
work tree has files in euc-jp or Shift_JIS, you could choose to
internally use UTF-8 at git object level.  Then the part that
deals with diff hunk headers (the topic of the patch we are
discussing) would have to work only on UTF-8 data.

	Side note: when getting the data from a file in the work
	tree, we convert into internal representation before
	running diff (see diff.c::diff_populate_filespec()), but
	we do not convert it back to external representation by
	running the smudge filter on the diff output.  We might
	optionally want to but if somebody is going to do this,
	the patch accepting side also needs to be modified to
	reverse the conversion.

The solution with clean/smudge is not applicable to everybody.
It needs to be agreed upon project-wide what encoding is used as
the canonical encoding for the project, and when the project
chooses to use UTF-8, the above would become a cleaner and
workable approach.

If the project, on the other hand, chooses to use a non UTF-8
encoding (e.g. euc-jp) as the canonical representation,
something like my patch may be necessary.

Between these two ways to skin the cat, I do not want to close
the door for either one of them too early, although I am
somewhat partial to "internally everything is UTF-8" approach.

Maybe we would want to use "coding" (short, sweet and nice name
for an attribute) to mean a canned smudge/clean filter pair that
runs to/from UTF-8 iconv, making the "internally, everything is
UTF-8" approach a more officially supported option.  If we
choose to go that route, the way "coding" attribute was used in
the patch directly conflicts with that design, as the world view
my "coding" patch takes is "whatever coding project chooses is
used internally, and the attribute is used to teach coding
specific actions to the underlying logic".

>> +static struct {
>> +	const char *coding;
>> +	sane_truncate_fn fn;
>> +} builtin_truncate_fn[] = {
>> +	{ "euc-jp", truncate_euc_jp },
>> +	{ "utf-8", NULL },
>> +};
>
>Can you also do JIS and Shift JIS?  I ask because many of my
>old notes are in Shift JIS and I think it is the same for many
>other people. 

I guess somebody else could (hint, hint,...).  Shift_JIS should
be more or less straightforward to add.

With the current code structure, however, ISO-2022 (you said
"JIS" -- Japanese often use that word to mean 7-bit ISO-2022 and
so did you in this context) is a bit cumbersome to handle, as
you cannot just truncate but also have to add a few escape bytes
to go back to ASCII at the end of line.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/2] attribute "coding": specify blob encoding
  2008-01-03 21:54       ` Junio C Hamano
@ 2008-01-04 16:16         ` Tsugikazu Shibata
  2008-01-04 20:53           ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Tsugikazu Shibata @ 2008-01-04 16:16 UTC (permalink / raw)
  To: gitster; +Cc: nanako3, git, tshibata

On Thu, 03 Jan 2008 13:54:58 -0800, gitster wrote:
> しらいしななこ  <nanako3@bluebottle.com> writes:
> 
> > Quoting Junio C Hamano <gitster@pobox.com>:
> > 
> >> This teaches "diff hunk header" function about custom character
> >> encoding per path via gitattributes(5) so that it can sensibly
> >> chomp a line without truncating a character in the middle.
> >>
> >> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> >> ---
> >>
> >>  * This is not intended for serious inclusion, but was done more
> >>    as a demonstration of the concept, hence [3/2].
> >
...
> >> +static struct {
> >> +	const char *coding;
> >> +	sane_truncate_fn fn;
> >> +} builtin_truncate_fn[] = {
> >> +	{ "euc-jp", truncate_euc_jp },
> >> +	{ "utf-8", NULL },
> >> +};
> >
> >Can you also do JIS and Shift JIS?  I ask because many of my
> >old notes are in Shift JIS and I think it is the same for many
> >other people. 
> 
> I guess somebody else could (hint, hint,...).  Shift_JIS should
> be more or less straightforward to add.
> 
> With the current code structure, however, ISO-2022 (you said
> "JIS" -- Japanese often use that word to mean 7-bit ISO-2022 and
> so did you in this context) is a bit cumbersome to handle, as
> you cannot just truncate but also have to add a few escape bytes
> to go back to ASCII at the end of line.

I guess that there are many other encodings and support everything are
not reasonable.
In my experience, It seems that chopping a multi-byte character
occurs the field after "@@". I believe this field is for the name of
the appropriate function. Also, I believe most of the function names
are in ASCII. 
So, question is why we should think of this field in case of non
programming language files ? 
In case of text file using any coding, Should we need to add something
after "@@"?
How about not to add anything after "@@" when the file name was .txt
or no extension (ie. HOWTO)?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/2] attribute "coding": specify blob encoding
  2008-01-04 16:16         ` Tsugikazu Shibata
@ 2008-01-04 20:53           ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2008-01-04 20:53 UTC (permalink / raw)
  To: Tsugikazu Shibata; +Cc: nanako3, git

Tsugikazu Shibata <tshibata@ab.jp.nec.com> writes:

> So, question is why we should think of this field in case of non
> programming language files ? 
> In case of text file using any coding, Should we need to add something
> after "@@"?

See gitattributes(5), "Defining a custom hunk-header".  The
associated configuration variable for this feature is somewhat
misnamed [*1*], but if you do not want to see anything, I think
you should be able to define a custom pattern that match
emptyness.

For .txt files, I always hoped that someday, somebody clever
would come up with a custom pattern to grok AsciiDoc mark-up to
pick a label-looking line so that the hunks in patches to our
documentation would be easier to find.  A simpler and more
generally applicable possibility is to find the previous line
that starts a paragraph (e.g. you would pick "For .txt files..."
for this part because it is the non-blank line that immediately
follows the first blank line before this part).  But I think the
code to find an appropriate string for hunk header is not
capable of looking at more than one line, so neither of these
would be easy to implement without changing the underlying logic
and semantics of "diff.*.funcname" configuration.

[Footnote]

*1* It may be misnamed but it is in line with GNU diff, though.
They call -p "show-c-function" and -F "show-function-line" --
both assume function names are what people are interested in
having on hunk headers.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2008-01-04 20:54 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-01 23:20 [RFC/PATCH] fix "git diff" to create wrong UTF-8 text Tsugikazu Shibata
2008-01-02  5:26 ` Junio C Hamano
2008-01-02  9:49   ` [PATCH 1/2] utf8_width(): allow non NUL-terminated input Junio C Hamano
2008-01-02  9:50   ` [PATCH 2/2] diff: do not chomp hunk-header in the middle of a character Junio C Hamano
2008-01-02  9:50   ` [PATCH 3/2] attribute "coding": specify blob encoding Junio C Hamano
2008-01-03 21:23     ` しらいしななこ
2008-01-03 21:54       ` Junio C Hamano
2008-01-04 16:16         ` Tsugikazu Shibata
2008-01-04 20:53           ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).