From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Ping Yin" <pkufranky@gmail.com>,
"git mailing list" <git@vger.kernel.org>
Subject: Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
Date: Tue, 29 Nov 2022 13:54:21 -0500
Message-ID: <Y4ZVXWNHO25IFYQL@coredump.intra.peff.net>
In-Reply-To: <Y4ZOHwwgtztwhbhr@coredump.intra.peff.net>
On Tue, Nov 29, 2022 at 01:23:27PM -0500, Jeff King wrote:
> > I suspect that "--word-diff" internal is not even aware what a
> > character is, but if you assume UTF-8 (precomposed), then you should
> > be able to tell where the character boundary is by only looking at
> > the high-bit patterns to avoid producing such an output.
>
> Agreed that we should probably avoid breaking characters. But what
> puzzles me more is that we break it between B8 and BA, and not
> elsewhere. Why not between E4 and B8? Why not between BA and "1"?
>
> If the rule is "break on ascii whitespace", then I'd have expected the
> whole four-character sequence to be taken as a unit. In other words, it
> should not have to care what a character is, as long as the bytes
> for space characters cannot appear inside other characters (which is
> true of utf8).

Even more puzzling is that it produces the expected output for me:

  [note that \x is a bash-ism]
  $ printf '\xe4\xb8\xba1' >one
  $ printf '\xe4\xb8\xba2' >two
  $ git diff --no-index --word-diff one two
  diff --git a/one b/two
  index 9ae469fc41..576e6e32d8 100644
  --- a/one
  +++ b/two
  @@ -1 +1 @@
  [-为1-]{+为2+}

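
To make the byte-level argument above concrete, here is a rough sketch that classifies each byte of the test input by its high bits. The helper name classify_utf8_bytes is made up for illustration, and the input uses octal escapes so that it does not depend on the \x bash-ism:

```shell
# Hypothetical helper: print each input byte (in decimal) with its UTF-8 role.
#   0..127   -> ascii         (0xxxxxxx)
#   128..191 -> continuation  (10xxxxxx)
#   192..255 -> lead          (11xxxxxx)
classify_utf8_bytes() {
    od -An -v -tu1 | tr -s ' ' '\n' | while read -r b; do
        [ -n "$b" ] || continue        # skip the empty field from od's padding
        if [ "$b" -lt 128 ]; then
            echo "$b ascii"
        elif [ "$b" -lt 192 ]; then
            echo "$b continuation"
        else
            echo "$b lead"
        fi
    done
}

# Same bytes as the "one" file above: e4 b8 ba, followed by "1".
printf '\344\270\2721' | classify_utf8_bytes
# 228 lead
# 184 continuation
# 186 continuation
# 49 ascii
```

Every byte of the multi-byte character has the high bit set, so none of them can be mistaken for ASCII whitespace; only the trailing "1" is in the ASCII range.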
I wonder if OP has diff.wordRegex config (or attributes triggering a
diff.*.wordRegex) that is doing something else.
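
One way to check for such settings, sketched as a small helper. The function name show_word_regex_settings is made up for illustration; the git commands themselves are standard, though --show-origin assumes a reasonably recent git (2.8 or later, as far as I know):

```shell
# Hypothetical helper: report word-regex settings that could affect <file>.
# Run it from inside the repository in question.
show_word_regex_settings() {
    file=$1
    # Any diff.wordRegex or diff.<driver>.wordRegex, with the file it came
    # from. "git config" exits non-zero when nothing matches, hence the
    # fallback message.
    git config --show-origin --get-regexp '^diff\.(.*\.)?wordregex$' ||
        echo "no wordRegex config found"
    # The diff driver (if any) that gitattributes assign to the file.
    git check-attr diff -- "$file"
}
```

Calling it as "show_word_regex_settings one" should list any matching config entries and then print the value of the "diff" attribute for that path. If the attribute names a driver, the corresponding diff.<driver>.wordRegex entry is the one to look at.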
-Peff