* [RFC/PATCH] fix "git diff" to create wrong UTF-8 text
@ 2008-01-01 23:20 Tsugikazu Shibata
  2008-01-02  5:26 ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Tsugikazu Shibata @ 2008-01-01 23:20 UTC
  To: git; +Cc: tshibata

Hello,

I ran into a problem with the patch text that "git diff" produces for
UTF-8 files. The text following "@@" in a hunk header is sometimes
truncated at a maximum of 80 bytes. In UTF-8 text that mixes Japanese
and English, most Japanese characters take 3 bytes each, while an
ASCII character takes a single byte.
So truncating the string at 80 bytes can cut off the last 1 or 2 bytes
of the final character, which leaves a broken byte sequence in the
output of "git diff".
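
To make the failure mode concrete, here is a minimal standalone sketch
(not part of the patch). The helper name cut_splits_char is
hypothetical. It relies only on the fact that UTF-8 continuation bytes
have the bit pattern 10xxxxxx, so a cut is unsafe when the first
dropped byte is a continuation byte.

```c
#include <string.h>

/*
 * Hypothetical helper for illustration only.
 * Returns 1 if cutting `s` after `limit` bytes would land inside a
 * multi-byte UTF-8 sequence, i.e. the first dropped byte is a
 * continuation byte (bit pattern 10xxxxxx). Returns 0 otherwise.
 */
static int cut_splits_char(const char *s, size_t limit)
{
	size_t len = strlen(s);

	if (limit >= len)
		return 0; /* nothing is cut off */
	return ((unsigned char)s[limit] & 0xc0) == 0x80;
}
```

For example, with "ab" followed by U+3042 (bytes e3 81 82), a limit of
2 cuts cleanly before the character, while a limit of 3 or 4 leaves 1
or 2 stray bytes of it at the end of the string.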

The patch command seems to have no problem reading such patch text,
but the result is not readable for me; i.e., Emacs cannot determine
the encoding of such a file and shows me octal numbers instead.

The patch below is my quick and dirty solution (but it works fine!).
I tested it using a Linux kernel document
(Documentation/ja_JP/HOWTO).
I believe it should also work for other languages written in UTF-8
and solve this issue.

Please note that this handles only UTF-8, but we may need to support
other encodings.
So, how can we turn on this UTF-8 processing?
Any suggestions are welcome.

Thanks,

Signed-off-by: Tsugikazu Shibata <tshibata@ab.jp.nec.com>
---

diff -upr git-1.5.3.7/xdiff/xutils.c git-1.5.3.7-dev/xdiff/xutils.c
--- git-1.5.3.7/xdiff/xutils.c	2007-12-02 06:21:12.000000000 +0900
+++ git-1.5.3.7-dev/xdiff/xutils.c	2007-12-31 01:30:51.000000000 +0900
@@ -332,6 +332,32 @@ long xdl_atol(char const *str, char cons
 }


+/* return the byte length of the UTF-8 sequence starting with lead byte c */
+int utf8charsize(const unsigned char c) {
+	int l;
+	if (c <= 0x7f) l = 1;
+	else if ((c >= 0xc0) && (c <= 0xdf)) l = 2;
+	else if ((c >= 0xe0) && (c <= 0xef)) l = 3;
+	else if ((c >= 0xf0) && (c <= 0xf7)) l = 4;
+	else if ((c >= 0xf8) && (c <= 0xfb)) l = 5;
+	else if ((c >= 0xfc) && (c <= 0xfd)) l = 6;
+	else l = 1; /* fail safe: continuation or invalid byte */
+	return l;
+}
+
+int utf8width(const char *up, int len) {
+	int cs;
+	int l = len;
+	const char *p = up;
+	while ((l > 0) && (p[0] != '\0')) {
+		cs = utf8charsize(p[0]);
+		if (l >= cs) {
+			l -= cs; p += cs;
+		} else l = 0; /* do not split multi byte char. */
+	}
+	return p - up;
+}
+
 int xdl_emit_hunk_hdr(long s1, long c1, long s2, long c2,
 		      const char *func, long funclen, xdemitcb_t *ecb) {
 	int nb = 0;
@@ -368,6 +394,7 @@ int xdl_emit_hunk_hdr(long s1, long c1,
 		buf[nb++] = ' ';
 		if (funclen > sizeof(buf) - nb - 1)
 			funclen = sizeof(buf) - nb - 1;
+		funclen = utf8width(func, funclen);
 		memcpy(buf + nb, func, funclen);
 		nb += funclen;
 	}
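
For comparison with the forward scan in utf8width() above, here is a
sketch of a different technique: start at the byte limit and step
backwards over continuation bytes until a character boundary is found.
This is not what the patch does, and the name utf8_safe_cut is
hypothetical. It assumes the input is well-formed UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the patch.
 * Returns the largest length <= limit at which the `len` bytes at `s`
 * can be cut without splitting a UTF-8 sequence.
 * Assumes `s` holds well-formed UTF-8.
 */
static size_t utf8_safe_cut(const char *s, size_t len, size_t limit)
{
	size_t n = limit < len ? limit : len;

	if (n == len)
		return n; /* nothing is cut, so nothing can be split */
	/*
	 * s[n] is the first byte that would be dropped. While it is a
	 * continuation byte (10xxxxxx), the cut is inside a character,
	 * so move the cut point back by one byte.
	 */
	while (n > 0 && ((unsigned char)s[n] & 0xc0) == 0x80)
		n--;
	return n;
}
```

On well-formed input the loop steps back only a few bytes, however
long the string is, whereas the forward scan walks the whole prefix.
The forward scan has the advantage of not needing to look at the byte
just past the limit.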


Thread overview: 9+ messages
2008-01-01 23:20 [RFC/PATCH] fix "git diff" to create wrong UTF-8 text Tsugikazu Shibata
2008-01-02  5:26 ` Junio C Hamano
2008-01-02  9:49   ` [PATCH 1/2] utf8_width(): allow non NUL-terminated input Junio C Hamano
2008-01-02  9:50   ` [PATCH 2/2] diff: do not chomp hunk-header in the middle of a character Junio C Hamano
2008-01-02  9:50   ` [PATCH 3/2] attribute "coding": specify blob encoding Junio C Hamano
2008-01-03 21:23     ` しらいしななこ
2008-01-03 21:54       ` Junio C Hamano
2008-01-04 16:16         ` Tsugikazu Shibata
2008-01-04 20:53           ` Junio C Hamano
