From: "René Scharfe" <l.s.r@web.de>
To: Johannes Sixt <j6t@kdbg.org>
Cc: "Git List" <git@vger.kernel.org>,
"Diomidis Spinellis" <dds@aueb.gr>,
"Eric Sunshine" <sunshine@sunshineco.com>,
demerphq <demerphq@gmail.com>,
"Mario Grgic" <mario_grgic@hotmail.com>,
"D. Ben Knoble" <ben.knoble@gmail.com>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>, "Jeff King" <peff@peff.net>
Subject: Re: [PATCH] userdiff: support regexec(3) with multi-byte support
Date: Fri, 7 Apr 2023 09:49:10 +0200
Message-ID: <39eb2a9f-83e0-449e-1157-152c43d49b48@web.de>
In-Reply-To: <7fe0aa93-a764-66b0-5015-2f5fbd3901ab@kdbg.org>
On 07.04.23 at 00:35, Johannes Sixt wrote:
> On 06.04.23 at 22:19, René Scharfe wrote:
>> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
>> 2022-08-26) we use the system library for all regular expression
>> matching on macOS, not just for git grep. It supports multi-byte
>> strings and rejects invalid multi-byte characters.
>>
>> This broke all built-in userdiff word regexes in UTF-8 locales because
>> they all include such invalid bytes in expressions that are intended to
>> match multi-byte characters without explicit support for that from the
>> regex engine.
>>
>> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
>> regexes to match a single non-space or multi-byte character. The \xNN
>> characters are invalid if interpreted as UTF-8 because they have their
>> high bit set, which indicates they are part of a multi-byte character,
>> but they are surrounded by single-byte characters.
>
> Perhaps the expression should be "[\xc4\x80-\xf7\xbf\xbf\xbf]+", i.e.,
> sequences of code points U+0080 to U+10FFFF?

regcomp(3) on macOS doesn't like it:

fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[Ā-????]

Looks like it objects to U+10FFFF here; "[\xc4\x80-\xf3\xa0\x80\x80]" is
accepted for example.

\xc4\x80 is U+0100, by the way; U+0080 would be \xc2\x80. And
regcomp(3) doesn't like that either ("[\xc2\x80-\xf3\xa0\x80\x80]"):

fatal: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<U+0080>-]
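
For anyone who wants to check the byte sequences above by hand, here is a
minimal sketch of a UTF-8 encoder plus a raw payload extractor for
four-byte sequences. Both helpers (utf8_encode, utf8_payload4) are
hypothetical names made up for this illustration and are not part of Git
or of the patch. They confirm that U+0080 is \xc2\x80 and U+0100 is
\xc4\x80. They also show that \xf7\xbf\xbf\xbf carries the payload
0x1FFFFF, which lies above U+10FFFF (that one encodes as
\xf4\x8f\xbf\xbf), and that \xf3\xa0\x80\x80 is U+E0000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative helper, not Git code: encode code point `cp` as UTF-8
 * into `out` (room for 4 bytes) and return the number of bytes written,
 * or 0 if `cp` is a surrogate or lies above U+10FFFF.
 */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
	assert(out != NULL);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xc0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		if (cp >= 0xd800 && cp <= 0xdfff)
			return 0; /* surrogates are not encodable */
		out[0] = (unsigned char)(0xe0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		out[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	if (cp <= 0x10ffff) {
		out[0] = (unsigned char)(0xf0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		out[3] = (unsigned char)(0x80 | (cp & 0x3f));
		return 4;
	}
	return 0;
}

/*
 * Illustrative helper: extract the 21 payload bits of a four-byte
 * sequence without any validation, to see which value it would denote.
 */
static uint32_t utf8_payload4(const unsigned char *s)
{
	assert(s != NULL);
	return ((uint32_t)(s[0] & 0x07) << 18) |
	       ((uint32_t)(s[1] & 0x3f) << 12) |
	       ((uint32_t)(s[2] & 0x3f) << 6) |
	       (uint32_t)(s[3] & 0x3f);
}
```

This only illustrates the encoding arithmetic; it says nothing about
which range bounds regcomp(3) on macOS accepts.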
>> Replace that expression with "|[^[:space:]]" if the regex engine
>> supports multi-byte matching, as there is no need to have an explicit
>> range for multi-byte characters then.
>
> This is not equivalent. The original treated a sequence of non-ASCII
> characters as a word. The new version treats each individual non-space
> character (both ASCII and non-ASCII) as a word.

I assume you mean "The original treated [a single non-space as well as]
a sequence of non-ASCII characters [making up a single multi-byte
character] as a word.". That works as intended by 664d44ee7f (userdiff:
simplify word-diff safeguard, 2011-01-11).

The new one doesn't match multi-byte whitespace anymore. What other
differences do they have?
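
To make the comparison concrete, here is a hand-rolled sketch of what the
old tail "[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" matches when the input is
treated as plain bytes with leftmost-longest semantics. The function names
are hypothetical and this is not how Git implements it (Git hands the
expression to regcomp(3)); it assumes that only ASCII whitespace counts as
[[:space:]], as in the C locale.

```c
#include <assert.h>
#include <stddef.h>

/* ASCII whitespace: space, \t, \n, \v, \f and \r. */
static int is_ascii_space(unsigned char c)
{
	return c == ' ' || (c >= '\t' && c <= '\r');
}

/*
 * Length in bytes of the longest match of
 *     [^[:space:]]|[\xc0-\xff][\x80-\xbf]+
 * anchored at s[0], or 0 if neither alternative matches.
 */
static size_t bytewise_word_len(const unsigned char *s, size_t len)
{
	size_t n = 1;

	assert(s != NULL);
	if (!len || is_ascii_space(s[0]))
		return 0;
	if (s[0] < 0xc0)
		return 1; /* only the single non-space alternative applies */
	/* Lead byte: extend over any continuation bytes that follow. */
	while (n < len && s[n] >= 0x80 && s[n] <= 0xbf)
		n++;
	return n;
}
```

Called on "\xc3\xa9\xc3\xa9" (two times U+00E9) it returns 2, i.e. one
multi-byte character per word. Called on "\xe3\x80\x80" (U+3000
IDEOGRAPHIC SPACE) it returns 3, so the byte-oriented expression treats
that character as a word, whereas a multi-byte aware "[^[:space:]]" would
skip it, assuming the locale classifies U+3000 as whitespace.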
René