All of lore.kernel.org
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Hamza Mahfooz" <someguy@effective-light.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Sat, 16 Oct 2021 23:00:58 -0700	[thread overview]
Message-ID: <xmqq1r4k197p.fsf@gitster.g> (raw)
In-Reply-To: <fc7eb9fc-9521-5484-b05f-9c20086fd485@web.de> ("René Scharfe"'s message of "Sat, 16 Oct 2021 21:44:13 +0200")

René Scharfe <l.s.r@web.de> writes:

>>> Literal patterns are those that don't use any wildcards or case-folding.
>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>> in UTF-8 and is not just a literal pattern.
>>>
>>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>>> ASCII chars?
>>> ...
>>     echo 'René Scharfe' >f &&
>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>     1
>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>     f:René Scharfe
>>     0
>>
>> So it's a choose-your-own adventure where you can pick if you're
>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>
> Yes, I do, and it's what Hamza's patch is fixing.

That may be correct but is this discussion still about "Why enable
... for literal patterns that consist of only ASCII"?  Calling "." a
"metacharacter" and wanting it to match anything other than a single
dot would mean the pattern we are discussing is no longer "literal",
isn't it?  I am puzzled.

>> These sorts of patterns demonstrate nicely that the relationship between your
>> pattern being ASCII and wanting or not wanting UTF-8 matching semantics
>> isn't what you might imagine it to be.
>
> Differences are:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: literal
>
> o h i l master hamza rene
> 0 0 0 0      0     1    0
> 0 0 0 1      0     1    0
> 0 0 1 0      0     1    1   <== your first example
> 0 0 1 1      0     1    0
> 0 1 1 1      0     0    1
>
> Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two
> lines).
>
> Turning on PCRE2_UTF for literal matching (fourth line) goes against
> 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
> 2019-07-26).
>
> Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth
> line) also goes against that, so my intuition betrayed me.  When I
> adjust it, I get:
>
> 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
> That looks deceptively simple -- just drop has_non_ascii(p->pattern)
> from the original condition.
>
> Your second example is handle the same by all versions btw.:
>
> o h i l master hamza rene
> 0 1 1 0      1     1    1
>
> René

  reply	other threads:[~2021-10-17  6:01 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano [this message]
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17                   ` Junio C Hamano
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqq1r4k197p.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=l.s.r@web.de \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.