From: "René Scharfe" <l.s.r@web.de>
To: Carlo Arenas <carenas@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC PATCH] grep: default to posix digits with -P
Date: Thu, 4 Jan 2024 22:14:23 +0100 [thread overview]
Message-ID: <ff21e91a-f2db-4d37-ab9b-da9a492db8d3@web.de> (raw)
In-Reply-To: <CAPUEspgbVx6wbp4UNjjxFO8iNJw3i2FnJxNwn4pk5EbaAP-7gQ@mail.gmail.com>
Am 02.01.24 um 20:02 schrieb Carlo Arenas:
> On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@web.de> wrote:
>>
>> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón:
>>> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>>> }
>>> options |= PCRE2_CASELESS;
>>> }
>>> - if (!opt->ignore_locale && is_utf8_locale() && !literal)
>>> - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>>> + if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>>> + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>>
>>> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> - /*
>>> - * Work around a JIT bug related to invalid Unicode character handling
>>> - * fixed in 10.35:
>>> - * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> - */
>>> - options &= ~PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> + /*
>>> + * Work around a JIT bug related to invalid Unicode character handling
>>> + * fixed in 10.35:
>>> + * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> + */
>>> + options |= PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER
>>> + if (!opt->perl_digit)
>>> + xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT);
>>> #endif
>>> +#endif
>>
>> Why do we need that extra level of indentation?
>
> I was just trying to simplify the code by including all the logic in
> one single set.
>
> The original lack of indentation that was introduced by later fixes to
> the code was IMHO also misguided since the obvious "objective" as set
> in the original code that added PCRE2_UCP was that it should be used
> whenever UTF was also in use as shown by
> acabd2048ee0ee53728100408970ab45a6dab65e.
One level of indentation is correct because the code in question is part
of a function and it's not nested in a loop or conditional. And
preprocessor directives don't affect the indentation of C code. Am I
missing something?
> Of course, we soon found out that the original implementation of
> PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an
> exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8.
>
> Note that the problematic code is only relevant when JIT is also
> enabled, but JIT is almost always enabled.
>
>> The old code turned PCRE2_UCP on by default and turned it off for older
>> versions. The new code enables PCRE2_UCP only for newer versions. The
>> result should be the same, no? So why change that part at all?
>
> Because it gets us a little closer to the real reason why we need to
> disable UCP for anything older than 10.35, and that is that there is a
> bug there that is ONLY relevant if we are using JIT.
How so? If the same flags are set in the end then it seems like a
lateral movement to me.
Do you want to disable JIT compilation for older versions?
> My hope though is that with the release of 10.43 (currently in RC1),
> 10.34 will become irrelevant soon enough and this whole code could be
> cleaned up further.
Cleanup is good, but if we package the workarounds nicely we can keep
them around indefinitely without them bothering us. Debian buster
still has a few months of support left and comes with PCRE2 10.32..
>> But the comment is now off, isn't it? The workaround was turning
>> PCRE2_UCP off for older versions (because those were broken), not
>> turning it on for newer versions (because it would be required by some
>> unfixed regression).
>
> The comment was never correct, because it was turning it off, because
> the combination of JIT + MATCH_INVALID_UTF (which was released in
> 10.34) + UCP is broken.
The code disabled PCRE2_UCP for PCRE2 10.34 and older. The comment
claimed that this was done as a workaround for a bug fixed in 10.35.
You seem to say the same. What was incorrect about the comment?
>> Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and
>> GIT_PCRE2_VERSION_10_43_OR_HIGHER? Keeping them separate would help
>> keep the #ifdef parts as short as possible and maintainable
>> independently.
>
> I thought that nesting them would make it simpler to maintain, since
> there are somehow connected anyway.
>
> The additional flags that are added in PCRE 10.43 are ONLY needed and
> useful on top of PCRE2_UCP, which itself only makes sense on top of
> UTF.
This conditional stacking is complicated. I find the old model where
we say which features we want up front and then disable buggy ones
for specific versions easier to grasp.
>>> @@ -27,13 +34,13 @@ do
>>> if ! test_have_prereq PERF_GREP_ENGINES_THREADS
>>> then
>>> test_perf "grep -P '$pattern'" --prereq PCRE "
>>> - git -P grep -f pat || :
>>> + git grep -P -f pat || :
>>> "
>>> else
>>> for threads in $GIT_PERF_GREP_THREADS
>>> do
>>> test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
>>> - git -c grep.threads=$threads -P grep -f pat || :
>>> + git -c grep.threads=$threads grep -P -f pat || :
>>
>> What's going on here? You remove the -P option of git (--no-pager) and
>> add the -P option of git grep (--perl-regexp). So this perf test never
>> tested PCRE despite its name? Seems worth a separate patch.
>
> My original code was a dud. This would make that at least more obvious,
The change is good, but I don't see any connection to the
grep.perl.digit feature that this patch is primarily about,
so I (still) think it deserves its own patch.
>> Do the performance numbers in acabd2048e (grep: correctly identify utf-8
>> characters with \{b,w} in -P, 2023-01-08) still hold up in that light?
>
> FWIW the performance numbers still hold up, but just because I did the
> tests independently using hyperfine, and indeed when I did the first
> version of this patch, I had a nice reproducible description that
> showed how to get 20% better performance while searching the git
> repository itself for something like 4 digits (as used in Copyright
> dates).
Great!
> My idea was to reply to my own post with instructions on how to test
> this, which basically require to also compile the recently released
> 10.43RC1 and build against that.
OK, so this is about the grep.perl.digit feature, right?
What I meant above was: Is the statement in acabd2048e (grep: correctly
identify utf-8 characters with \{b,w} in -P, 2023-01-08) about 20-40%
performance impact (in which direction, by the way) still measurable
with the fixed perf test script?
With the fix and PCRE2 10.42 and PCRE2_UCP I get:
Test this tree
----------------------------------------------
7822.1: grep -P '\bhow' 0.05(0.02+0.30)
7822.2: grep -P '\bÆvar' 0.05(0.02+0.29)
7822.3: grep -P '\d+ \bÆvar' 0.05(0.02+0.27)
7822.4: grep -P '\bBelón\b' 0.04(0.02+0.23)
7822.5: grep -P '\w{12}\b' 0.03(0.02+0.13)
... and without PCRE2_UCP:
Test this tree
----------------------------------------------
7822.1: grep -P '\bhow' 0.05(0.02+0.26)
7822.2: grep -P '\bÆvar' 0.04(0.02+0.18)
7822.3: grep -P '\d+ \bÆvar' 0.05(0.02+0.26)
7822.4: grep -P '\bBelón\b' 0.05(0.02+0.27)
7822.5: grep -P '\w{12}\b' 0.04(0.02+0.18)
Hmm, inconclusive. Perhaps the test data is too small?
> Indeed, AFAIK the test would fail if built with an older version of
> PCRE, but wasn't sure if making a prerequisite that hardcoded the
> version in test-tool might be the best approach to prevent that,
> especially with all the libification work.
You could extend test-pcre2-config to report whether that feature is
active. Performance tests could set prerequisites based on its output.
> PS. your reply got lost somehow in my SPAM folder, so I apologize for
> the late reply
No worries, I need days to reply without any detour through the
spam folder..
René
next prev parent reply other threads:[~2024-01-04 21:14 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-01 15:03 [RFC PATCH] grep: default to posix digits with -P Carlo Marcelo Arenas Belón
2024-01-01 17:18 ` René Scharfe
2024-01-02 19:02 ` Carlo Arenas
2024-01-04 21:14 ` René Scharfe [this message]
2024-07-10 13:49 ` Mathias Krause
2024-01-01 18:07 ` Eric Sunshine
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ff21e91a-f2db-4d37-ab9b-da9a492db8d3@web.de \
--to=l.s.r@web.de \
--cc=carenas@gmail.com \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).