From: Junio C Hamano <gitster@pobox.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Ping Yin <pkufranky@gmail.com>, git mailing list <git@vger.kernel.org>
Subject: Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese
Date: Tue, 29 Nov 2022 20:32:58 +0900
Message-ID: <xmqqlenu2dxx.fsf@gitster.g>
In-Reply-To: <221129.867czejabi.gmgdl@evledraar.gmail.com> ("Ævar Arnfjörð Bjarmason"'s message of "Tue, 29 Nov 2022 11:52:38 +0100")
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>> or (if chinese can not be displayed correctly)
>>
>> - <E4><B8><BA>1
>> + <E4><B8><BA>2
>>
>> Actual result of "git diff --color-words"
>>
>> <E4><B8>[-<BA>1-]{+<BA>2+}
>> ...
> I think we could provide new ways to do per-language diffs, right now
> you can use --word-diff-regex, but it would be handy to e.g. have a
> built-in collection of those (or other non-regex boundary algorithms)
> for Chinese etc.
I think you are making this unnecessarily complex.
The only thing that needs noticing in the above example, I think, is
that the three-byte sequence E4-B8-BA is supposed to be a single
Unicode character, and the actual result depicted can happen only if
we (incorrectly) chop that single character in the middle.
No matter what language we are using, we shouldn't do that.
I suspect that the "--word-diff" internals are not even aware of what
a character is, but if you assume UTF-8 (precomposed), then you should
be able to tell where the character boundaries are by looking only at
the high-bit patterns, and so avoid producing such output.
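
To illustrate the high-bit check, here is a minimal sketch in Python
rather than anything taken from the diff code. It assumes the input is
valid UTF-8, where every continuation byte has the bit pattern 10xxxxxx.
The helper names is_continuation() and snap_to_char_start() are
hypothetical and do not exist in Git.

```python
def is_continuation(byte: int) -> bool:
    """True for a UTF-8 continuation byte (top two bits are 10)."""
    return (byte & 0xC0) == 0x80


def snap_to_char_start(buf: bytes, pos: int) -> int:
    """Move a proposed split offset left until it sits on a character start.

    Hypothetical helper: assumes buf is valid UTF-8. Offsets 0 and
    len(buf) are always boundaries and are returned unchanged.
    """
    while 0 < pos < len(buf) and is_continuation(buf[pos]):
        pos -= 1
    return pos


# The line from the report: U+4E3A (bytes E4 B8 BA) followed by "1".
line = b"\xe4\xb8\xba1"

# The reported output split the line at byte offset 2, in front of BA.
# BA is a continuation byte, so the split moves back to offset 0.
print(snap_to_char_start(line, 2))  # 0
# Offset 3 points at the ASCII "1", which is already a boundary.
print(snap_to_char_start(line, 3))  # 3
```

With a check like this applied wherever a word boundary is chosen, the
split in front of BA would not be possible, whatever the language of
the text.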
Thread overview: 12+ messages
2022-11-29 3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin
2022-11-29 3:49 ` Ping Yin
2022-11-29 8:18 ` Bagas Sanjaya
2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
2022-11-29 11:32 ` Junio C Hamano [this message]
2022-11-29 18:23 ` Jeff King
2022-11-29 18:54 ` Jeff King
2022-12-01 7:08 ` Ping Yin
2022-12-01 7:33 ` Ping Yin
2022-12-01 14:51 ` Phillip Wood
2022-12-01 15:51 ` Ping Yin
2022-12-01 20:06 ` Jeff King