* [bug] git diff --word-diff gives wrong result for utf-8 chinese
@ 2022-11-29 3:46 Ping Yin
2022-11-29 3:49 ` Ping Yin
2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason
0 siblings, 2 replies; 12+ messages in thread
From: Ping Yin @ 2022-11-29 3:46 UTC (permalink / raw)
To: mailinggit list
[-- Attachment #1: Type: text/plain, Size: 359 bytes --]
Result of "git diff"
- 为1
+ 为2
or (if chinese can not be displayed correctly)
- <E4><B8><BA>1
+ <E4><B8><BA>2
Actual result of "git diff --color-words"
<E4><B8>[-<BA>1-]{+<BA>2+}
Expected result of "git diff --color-words"
为[-1-]{+2+}
or (if chinese can not be displayed correctly)
<E4><B8><BA>[-1-]{+2+}
Ping Yin
[-- Attachment #2: image.png --]
[-- Type: image/png, Size: 5682 bytes --]
[-- Attachment #3: image.png --]
[-- Type: image/png, Size: 9992 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-11-29 3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin @ 2022-11-29 3:49 ` Ping Yin 2022-11-29 8:18 ` Bagas Sanjaya 2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason 1 sibling, 1 reply; 12+ messages in thread From: Ping Yin @ 2022-11-29 3:49 UTC (permalink / raw) To: mailinggit list sorry, typo, s/--color-words/--word-diff/g Ping Yin On Tue, Nov 29, 2022 at 11:46 AM Ping Yin <pkufranky@gmail.com> wrote: > > Result of "git diff" > > - 为1 > + 为2 > > or (if chinese can not be displayed correctly) > > - <E4><B8><BA>1 > + <E4><B8><BA>2 > > Actual result of "git diff --color-words" > > <E4><B8>[-<BA>1-]{+<BA>2+} > > Expected result of "git diff --color-words" > > 为[-1-]{+2+} > > or (if chinese can not be displayed correctly) > > <E4><B8><BA>[-1-]{+2+} > > > Ping Yin ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-11-29 3:49 ` Ping Yin @ 2022-11-29 8:18 ` Bagas Sanjaya 0 siblings, 0 replies; 12+ messages in thread From: Bagas Sanjaya @ 2022-11-29 8:18 UTC (permalink / raw) To: Ping Yin, mailinggit list On 11/29/22 10:49, Ping Yin wrote: > sorry, typo, s/--color-words/--word-diff/g > Hi, welcome to Git mailing list! Please remind yourself: * Do not send HTML mails, send plain-text ones instead. Many mailing lists (including vger.kernel.org that powers Git ML) reject HTML emails for these are likely spam. Make sure your email isn't mangled (tabs and spaces as-is, no line wrapping). * Do not top-post, reply inline with appropriate context instead. I have to cut the reply context as a result. * When you submit a patch and people reply with their reviews, engage with them (either sending revised patch addressing the reviews or reply with justification). They will ignore you if you ignore them. Thanks. -- An old man doll... just what I always wanted! - Clara ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-11-29 3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin 2022-11-29 3:49 ` Ping Yin @ 2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason 2022-11-29 11:32 ` Junio C Hamano 1 sibling, 1 reply; 12+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2022-11-29 10:52 UTC (permalink / raw) To: Ping Yin; +Cc: mailinggit list On Tue, Nov 29 2022, Ping Yin wrote: > Result of "git diff" > > - 为1 > + 为2 > > or (if chinese can not be displayed correctly) > > - <E4><B8><BA>1 > + <E4><B8><BA>2 > > Actual result of "git diff --color-words" > > <E4><B8>[-<BA>1-]{+<BA>2+} > > Expected result of "git diff --color-words" > > 为[-1-]{+2+} > > or (if chinese can not be displayed correctly) I think we could provide new ways to do per-language diffs, right now you can use --word-diff-regex, but it would be handy to e.g. have a built-in collection of those (or other non-regex boundary algorithms) for Chinese etc. But as for considering this a bug, or changing the existing behavior I think we'd need to deal with: * We (approximately) split on space now, which is certainly ASCII-biased, and outside of CJK fairly somewhat universal. * If we're going to split on "real words" in some cross-language aware way, are we going to run into conflicts between what different languages would consider sensible rules? * We probably don't want to make the "diff" dependent on the user's locale, but e.g. saying "I want a Chinese diff" via a CLI option would be OK. * Even for say Chinese, there's probably interesting edge cases when it's combined with other languages or character sets (e.g. Chinese + HTML). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason @ 2022-11-29 11:32 ` Junio C Hamano 2022-11-29 18:23 ` Jeff King 0 siblings, 1 reply; 12+ messages in thread From: Junio C Hamano @ 2022-11-29 11:32 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: Ping Yin, mailinggit list Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> or (if chinese can not be displayed correctly) >> >> - <E4><B8><BA>1 >> + <E4><B8><BA>2 >> >> Actual result of "git diff --color-words" >> >> <E4><B8>[-<BA>1-]{+<BA>2+} >> ... > I think we could provide new ways to do per-language diffs, right now > you can use --word-diff-regex, but it would be handy to e.g. have a > built-in collection of those (or other non-regex boundary algorithms) > for Chinese etc. I think you are thinking it with unnecessaarily complexity. The only thing that needs noticing in the above example, I think is, that the three-byte sequence E4-B8-BA in the example is supposed to be a single unicode character, and the actual result depicted can happen only if we (incorrectly) chomp that single character in the middle. No matter what language we are using, we shouldn't do that. I suspect that "--word-diff" internal is not even aware what a character is, but if you assume UTF-8 (precomposed), then you should be able to tell where the character boundary is by only looking at the high-bit patterns to avoid producing such an output. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-11-29 11:32 ` Junio C Hamano @ 2022-11-29 18:23 ` Jeff King 2022-11-29 18:54 ` Jeff King 0 siblings, 1 reply; 12+ messages in thread From: Jeff King @ 2022-11-29 18:23 UTC (permalink / raw) To: Junio C Hamano Cc: Ævar Arnfjörð Bjarmason, Ping Yin, mailinggit list On Tue, Nov 29, 2022 at 08:32:58PM +0900, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > > >> or (if chinese can not be displayed correctly) > >> > >> - <E4><B8><BA>1 > >> + <E4><B8><BA>2 > >> > >> Actual result of "git diff --color-words" > >> > >> <E4><B8>[-<BA>1-]{+<BA>2+} > >> ... > > I think we could provide new ways to do per-language diffs, right now > > you can use --word-diff-regex, but it would be handy to e.g. have a > > built-in collection of those (or other non-regex boundary algorithms) > > for Chinese etc. > > I think you are thinking it with unnecessaarily complexity. > > The only thing that needs noticing in the above example, I think is, > that the three-byte sequence E4-B8-BA in the example is supposed to > be a single unicode character, and the actual result depicted can > happen only if we (incorrectly) chomp that single character in the > middle. > > No matter what language we are using, we shouldn't do that. > > I suspect that "--word-diff" internal is not even aware what a > character is, but if you assume UTF-8 (precomposed), then you should > be able to tell where the character boundary is by only looking at > the high-bit patterns to avoid producing such an output. Agreed that we should probably avoid breaking characters. But what puzzles me more is that we break it between B8 and BA, and not elsewhere. Why not between E4 and B8? Why not between BA and "1"? If the rule is "break on ascii whitespace", then I'd have expected the whole four-character sequence to be taken as a unit. In other words, it does should not have to care that a character is, as long as the bytes for space characters cannot appear inside other characters (which is true of utf8). -Peff ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-11-29 18:23 ` Jeff King @ 2022-11-29 18:54 ` Jeff King 2022-12-01 7:08 ` Ping Yin 2022-12-01 7:33 ` Ping Yin 0 siblings, 2 replies; 12+ messages in thread From: Jeff King @ 2022-11-29 18:54 UTC (permalink / raw) To: Junio C Hamano Cc: Ævar Arnfjörð Bjarmason, Ping Yin, mailinggit list On Tue, Nov 29, 2022 at 01:23:27PM -0500, Jeff King wrote: > > I suspect that "--word-diff" internal is not even aware what a > > character is, but if you assume UTF-8 (precomposed), then you should > > be able to tell where the character boundary is by only looking at > > the high-bit patterns to avoid producing such an output. > > Agreed that we should probably avoid breaking characters. But what > puzzles me more is that we break it between B8 and BA, and not > elsewhere. Why not between E4 and B8? Why not between BA and "1"? > > If the rule is "break on ascii whitespace", then I'd have expected the > whole four-character sequence to be taken as a unit. In other words, it > does should not have to care that a character is, as long as the bytes > for space characters cannot appear inside other characters (which is > true of utf8). Even more puzzling is that it produces the expected output for me: [note that \x is a bash-ism] $ printf '\xe4\xb8\xba1' >one $ printf '\xe4\xb8\xba2' >two $ git diff --no-index --word-diff one two diff --git a/one b/two index 9ae469fc41..576e6e32d8 100644 --- a/one +++ b/two @@ -1 +1 @@ [-为1-]{+为2+} I wonder if OP has diff.wordRegex config (or attributes triggering a diff.*.wordRegex) that is doing something else. -Peff ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-11-29 18:54 ` Jeff King @ 2022-12-01 7:08 ` Ping Yin 2022-12-01 7:33 ` Ping Yin 1 sibling, 0 replies; 12+ messages in thread From: Ping Yin @ 2022-12-01 7:08 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, mailinggit list > I wonder if OP has diff.wordRegex config (or attributes triggering a > diff.*.wordRegex) that is doing something else. Wow, you are right, sorry for the noise. $ git config -l | grep word diff.wordregex=[[:alnum:]_]+|[^[:space:]] ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-11-29 18:54 ` Jeff King 2022-12-01 7:08 ` Ping Yin @ 2022-12-01 7:33 ` Ping Yin 2022-12-01 14:51 ` Phillip Wood 1 sibling, 1 reply; 12+ messages in thread From: Ping Yin @ 2022-12-01 7:33 UTC (permalink / raw) To: Jeff King Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, mailinggit list > > If the rule is "break on ascii whitespace", Is there a way to achieve this: break english by word, and break chinese by utf-8 character ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-12-01 7:33 ` Ping Yin @ 2022-12-01 14:51 ` Phillip Wood 2022-12-01 15:51 ` Ping Yin 2022-12-01 20:06 ` Jeff King 0 siblings, 2 replies; 12+ messages in thread From: Phillip Wood @ 2022-12-01 14:51 UTC (permalink / raw) To: Ping Yin, Jeff King Cc: Junio C Hamano, Ævar Arnfjörð Bjarmason, mailinggit list Hi Ping On 01/12/2022 07:33, Ping Yin wrote: >>> If the rule is "break on ascii whitespace", > > Is there a way to achieve this: break english by word, and break > chinese by utf-8 character You could extend your current regex so that it matches whole utf-8 codepoints which is what git does for the builtin userdiff regexes. I've not tested it but I think git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf '[\xc0-\xff][\x80-\xbf]+')" should work. The downside is that you end up with a .gitconfig that is not valid utf-8. Perhaps someone else has a clever idea to get around that. Best Wishes Phillip ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-12-01 14:51 ` Phillip Wood @ 2022-12-01 15:51 ` Ping Yin 2022-12-01 20:06 ` Jeff King 1 sibling, 0 replies; 12+ messages in thread From: Ping Yin @ 2022-12-01 15:51 UTC (permalink / raw) To: phillip.wood Cc: Jeff King, Junio C Hamano, Ævar Arnfjörð Bjarmason, mailinggit list Ping Yin On Thu, Dec 1, 2022 at 10:51 PM Phillip Wood <phillip.wood123@gmail.com> wrote: > > Hi Ping > > On 01/12/2022 07:33, Ping Yin wrote: > > git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf > '[\xc0-\xff][\x80-\xbf]+')" > > should work. The downside is that you end up with a .gitconfig that is > not valid utf-8. Perhaps someone else has a clever idea to get around that. Wow, it works. Thanks very much. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [bug] git diff --word-diff gives wrong result for utf-8 chinese 2022-12-01 14:51 ` Phillip Wood 2022-12-01 15:51 ` Ping Yin @ 2022-12-01 20:06 ` Jeff King 1 sibling, 0 replies; 12+ messages in thread From: Jeff King @ 2022-12-01 20:06 UTC (permalink / raw) To: phillip.wood Cc: Ping Yin, Junio C Hamano, Ævar Arnfjörð Bjarmason, mailinggit list On Thu, Dec 01, 2022 at 02:51:29PM +0000, Phillip Wood wrote: > On 01/12/2022 07:33, Ping Yin wrote: > > > > If the rule is "break on ascii whitespace", > > > > Is there a way to achieve this: break english by word, and break > > chinese by utf-8 character > > You could extend your current regex so that it matches whole utf-8 > codepoints which is what git does for the builtin userdiff regexes. I've not > tested it but I think > > git config --global diff.wordregex "[[:alnum:]_]+|[^[:space:]]|$(printf > '[\xc0-\xff][\x80-\xbf]+')" > > should work. The downside is that you end up with a .gitconfig that is not > valid utf-8. Perhaps someone else has a clever idea to get around that. I think in more advanced regular expression engines you can do stuff like matching "[\x{4e00}-\x{9fcc}]", or even "\p{Han}". But I don't know that the stock libc regex is capable of anything like this, even with EREs. That's the only option Git provides for matching word regexes, but in theory we could support libpcre. We already can optionally build against it; we would just need config/plumbing to get it into diff.c:find_word_boundaries(). -Peff ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2022-12-01 20:07 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-11-29 3:46 [bug] git diff --word-diff gives wrong result for utf-8 chinese Ping Yin 2022-11-29 3:49 ` Ping Yin 2022-11-29 8:18 ` Bagas Sanjaya 2022-11-29 10:52 ` Ævar Arnfjörð Bjarmason 2022-11-29 11:32 ` Junio C Hamano 2022-11-29 18:23 ` Jeff King 2022-11-29 18:54 ` Jeff King 2022-12-01 7:08 ` Ping Yin 2022-12-01 7:33 ` Ping Yin 2022-12-01 14:51 ` Phillip Wood 2022-12-01 15:51 ` Ping Yin 2022-12-01 20:06 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).