git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Carlo Arenas <carenas@gmail.com>
Cc: "René Scharfe" <l.s.r@web.de>,
	"Andreas Schwab" <schwab@linux-m68k.org>,
	"Hamza Mahfooz" <someguy@effective-light.com>,
	git@vger.kernel.org, "Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Thu, 18 Nov 2021 01:00:26 +0100	[thread overview]
Message-ID: <211118.86pmqy5naz.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <CAPUEsph8NMJtK2r_fuT6UPsgmoPAeGVfTLOj8uz6NaOj4UZ1dg@mail.gmail.com>


On Wed, Nov 17 2021, Carlo Arenas wrote:

> On Wed, Nov 17, 2021 at 1:01 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
>> two bytes in "é" and match them under -i with/without PCRE_UTF.
>
> Is there a real use case for why someone would do that? and how is
> that "literal" valid UTF to warrant setting PCRE2_UTF?

We don't check the validity of the input pattern as correct UTF-8 before
feeding it to PCRE. We don't even check if you're under a UTF-8 locale,
a call to is_utf8_locale() is one of the conditions we'll try (but check
how loose we are under NO_GETTEXT=Y even then), but it's not a necessary
one.

So in practice we'll likely have users pasting say Big5, UTF-32, JIS
encoding or whatever into "git grep" and then grepping arbitrary binary
data.

I don't really have a specific use-case in mind. I'm just trying to
counter the apparent recurring misconception that's popped up more than
once in these recent threads that the UTF-8 *mode* is only something we
need to worry about if the pattern contains non-ASCII.

The implications of UTF-8 as a matching mode go much deeper than that in
Perl's and PCRE's regex engines, e.g. in this case what's considered a
character boundary.

> I would expect that someone including random bytes in the expression
> is really more interested in binary matching anyway and the use of -i
> with it probably should be an error.
>
> Indeed I suspect the fact that pcre2_compile lets it through might be
> a bug in PCRE2
>
> $ pcre2test
> PCRE2 version 10.39 2021-10-29
>   re> /\303/utf,caseless
> data> \303
>  0: \x{c3}
> data> é
> No match

It's really not "random bytes".

We're not talking about someone doing a git grep where the argument is
fed from a "cat" of /dev/urandom. Just plain old boring natural language
encodings in common use can and will conflict with what's considered a
character boundary in UTF-8,

Anyway, I haven't had time to re-page this topic at large into my
wetware and really don't know what we should be doing exactly at this
point to move forward, except my previous suggestion to either revert
the recent changes, or at least to narrow them down so that we get the
old behavior for everything except the revision.c "git log" caller.

I think a really good next step aside from dealing with that immediate
problem would be to harden some of our test data in a way similar to
what we've got in t7816-grep-binary-pattern.sh, but for partial matches,
mixtures of valid/invalid/partial character patterns and subjects etc.

We might be able to lift some tests from existing test suites,
perl.git's t/re/re_tests and pcre2.git's RunGrepTest (And testdata/
directory) are probably good places to start looking.

We don't need to test e.g. PCRE itself independently, but it is worth
testing how whatever special-sauce we add on top (if and when to turn on
its flags based on heuristics) impacts things.

We've also got the problem that whatever code we write will target
multiple PCREv2 versions, which isn't something PCREv2 itself needs to
deal with.


  reply	other threads:[~2021-11-18  0:18 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03   ` Junio C Hamano
2021-10-16 16:25     ` René Scharfe
2021-10-16 17:12       ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44         ` René Scharfe
2021-10-17  6:00           ` Junio C Hamano
2021-10-17  6:55             ` René Scharfe
2021-10-17  9:44               ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43   ` Andreas Schwab
2021-11-15 22:41     ` Ævar Arnfjörð Bjarmason
2021-11-16  2:12       ` Carlo Arenas
2021-11-16  8:41       ` Andreas Schwab
2021-11-16  9:06         ` Carlo Arenas
2021-11-16  9:18           ` Andreas Schwab
2021-11-16  9:29           ` Andreas Schwab
2021-11-16  9:38             ` Carlo Arenas
2021-11-16  9:55               ` Andreas Schwab
2021-11-16 11:00                 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32                   ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35                     ` Hamza Mahfooz
2021-11-17 14:31                       ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22                     ` Carlo Arenas
2021-11-16 18:48                     ` Junio C Hamano
2021-11-17 10:23                   ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18  7:29                     ` Junio C Hamano
2021-11-18 10:15                       ` Carlo Arenas
2021-11-18 12:49                         ` Hamza Mahfooz
2021-11-19  6:58                           ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21                     ` Hamza Mahfooz
2021-11-17 18:46               ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56                 ` Carlo Arenas
2021-11-17 20:59                 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53                   ` Carlo Arenas
2021-11-18  0:00                     ` Ævar Arnfjörð Bjarmason [this message]
2021-11-18 18:17                   ` Junio C Hamano
2021-11-18 20:57                     ` René Scharfe
2021-11-19  7:00                       ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08                         ` René Scharfe
2021-11-19 17:33                           ` Carlo Arenas
2021-11-19 17:11                         ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24   ` Hamza Mahfooz
2021-10-15 19:28     ` Junio C Hamano
2021-10-15 19:40       ` Hamza Mahfooz
2021-10-15 19:49       ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=211118.86pmqy5naz.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=carenas@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=schwab@linux-m68k.org \
    --cc=someguy@effective-light.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).