* Compatibility between GNU and Git grep -P
[not found] ` <c5j7tduynkzhqpcgqc7iei4mmlnlwvfohmj7ryfhifpay6hhtn@ha3apuuzhxzz>
@ 2023-04-21 21:11 ` Paul Eggert
2023-04-21 22:05 ` Carlo Arenas
0 siblings, 1 reply; 3+ messages in thread
From: Paul Eggert @ 2023-04-21 21:11 UTC
To: Carlo Marcelo Arenas Belón, Jim Meyering
Cc: grep-devel, demerphq, pcre2-dev,
Ævar Arnfjörð Bjarmason, Junio C Hamano, git
[-- Attachment #1: Type: text/plain, Size: 1823 bytes --]
In <https://lists.gnu.org/r/grep-devel/2023-04/msg00017.html> Carlo
Marcelo Arenas Belón wrote:
> After using this for a while think the following will be better suited
> for a release because:
>
> * the unreleased PCRE2 code is still changing and is unlikely to be released
> for a couple of months.
> * the current way to configure PCRE2 make it difficult to link with the
> unreleased code (this might be an independent bug), but it is likely that
> the wrong headers might be used by mistake.
> * the tests and documentation were not completely accurate.
Thanks for looking into this. I'm concerned about the resulting patches,
though, because I see recent activity on the Git grep -P side here:
https://lore.kernel.org/git/xmqqzgaf2zpt.fsf@gitster.g/
Bleeding-edge (i.e., "master") GNU grep uses PCRE2_UCP |
PCRE2_EXTRA_ASCII_BSD with unreleased PCRE2 (which introduces
PCRE2_EXTRA_ASCII_BSD), and it uses neither flag with the current PCRE2
release. You're proposing to change GNU grep to never use either flag,
regardless of PCRE2 release.
In contrast, bleeding-edge (i.e., "next") Git grep -P always uses
PCRE2_UCP and never uses PCRE2_EXTRA_ASCII_BSD. I.e., it disagrees with
GNU grep regardless of whether your proposed changes were adopted.
Given Jim's strong desire that \d should match only ASCII digits, I
doubt whether GNU grep will simply use PCRE2_UCP without
PCRE2_EXTRA_ASCII_BSD.
If we want the two grep -P's to stay compatible, I see two ways forward:
1. Leave GNU grep alone and modify Git grep to behave like GNU grep (see
attached patch to Git).
2. Adopt your proposed change to GNU grep, and revert the recent change
to Git grep so that it never uses PCRE2_UCP.
Either way, we should see what the Git folks say about this.
[-- Attachment #2: 0001-grep-be-compatible-with-GNU-grep-P.patch --]
[-- Type: text/x-patch, Size: 1042 bytes --]
From 5f5e54157a01c540bde02c305c8ee5e1a39d4f1c Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Fri, 21 Apr 2023 14:06:25 -0700
Subject: [PATCH] grep: be compatible with GNU grep -P
Use PCRE2_UCP only when PCRE2_EXTRA_ASCII_BSD is defined,
for compatibility with GNU grep.
---
grep.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/grep.c b/grep.c
index 073559f2cd..e9dc8dc0bc 100644
--- a/grep.c
+++ b/grep.c
@@ -320,8 +320,13 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
}
options |= PCRE2_CASELESS;
}
- if (!opt->ignore_locale && is_utf8_locale() && !literal)
- options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
+ if (!opt->ignore_locale && is_utf8_locale() && !literal) {
+ options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+#ifdef PCRE2_EXTRA_ASCII_BSD
+ /* Be compatible with GNU grep -P '\d'. */
+ options |= (PCRE2_UCP | PCRE2_EXTRA_ASCII_BSD);
+#endif
+ }
#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
/*
--
2.39.2
* Re: Compatibility between GNU and Git grep -P
From: Carlo Arenas @ 2023-04-21 22:05 UTC
To: Paul Eggert
Cc: Jim Meyering, grep-devel, demerphq, pcre2-dev,
Ævar Arnfjörð Bjarmason, Junio C Hamano, git
On Fri, Apr 21, 2023 at 2:11 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> In <https://lists.gnu.org/r/grep-devel/2023-04/msg00017.html> Carlo
> Marcelo Arenas Belón wrote:
>
> > After using this for a while think the following will be better suited
> > for a release because:
> >
> > * the unreleased PCRE2 code is still changing and is unlikely to be released
> > for a couple of months.
> > * the current way to configure PCRE2 make it difficult to link with the
> > unreleased code (this might be an independent bug), but it is likely that
> > the wrong headers might be used by mistake.
> > * the tests and documentation were not completely accurate.
Just to clarify: those points were made about the GNU grep codebase, so
they are not really relevant to git's, which has an independent thread[1]
that would be better to use instead, to avoid more confusion.
> Thanks for looking into this. I'm concerned about the resulting patches,
> though, because I see recent activity on the Git grep -P side here:
>
> https://lore.kernel.org/git/xmqqzgaf2zpt.fsf@gitster.g/
This is really not that recent, and has been released already with git
2.40, so at least at that point in time git and grep 3.9 were
consistent. That was changed with grep 3.10 though.
FWIW, it doesn't seem git had any issues (other than the crasher with
PCRE2 10.34) with the transition to matching multibyte digits with
'\d' and which is what perl (and therefore PCRE2) does, but as I
explained in the other thread, I think it might be wise (in the context
of what is usually matched against with git) not to do that in the
long term, and I was therefore working on adding the necessary features
to PCRE2 to make that possible. Note that no decision has been made
though, which is why I didn't even bother sending an RFC patch.
> Given Jim's strong desire that \d should match only ASCII digits, I
> doubt whether GNU grep will simply use PCRE2_UCP without
> PCRE2_EXTRA_ASCII_BSD.
My assumption is that you would also need PCRE2_EXTRA_ASCII_DIGIT, and
indeed bleeding edge pcre2grep[2] had a compatibility option added
assuming as much.
> Either way, we should see what the Git folks say about this.
The proposed patch for git would IMHO just cause the same risk I was
trying to prevent with my proposed change to GNU grep.
There are no plans to release PCRE2 10.43, and based on its regular
cadence a release wouldn't happen for another couple of months, so this
code is a little premature and will need updating either way.
Carlo
[1] https://lore.kernel.org/git/2554712d-e386-3bab-bc6c-1f0e85d999db@cs.ucla.edu/
[2] https://github.com/PCRE2Project/pcre2/commit/3bbdb6dd713b39868934fdc978ba61b81da6d1c5
* Re: Compatibility between GNU and Git grep -P
From: Paul Eggert @ 2023-04-30 6:51 UTC
To: Carlo Arenas
Cc: Jim Meyering, grep-devel, demerphq, pcre2-dev,
Ævar Arnfjörð Bjarmason, Junio C Hamano, git
[-- Attachment #1: Type: text/plain, Size: 2527 bytes --]
On 2023-04-21 15:05, Carlo Arenas wrote:
> This is really not that recent, and has been released already with git
> 2.40, so at least at that point in time git and grep 3.9 were
> consistent. That was changed with grep 3.10 though.
OK, so currently GNU grep -P and git grep -P treat \d differently (the
former matches only ASCII digits; the latter matches all decimal
digits). And the next GNU grep -P (i.e., the one currently on Savannah
master), when combined with a future PCRE2 release, is planned to change
behavior in other areas, but the two programs will continue to treat \d
differently. Not good, obviously.
> FWIW, it doesn't seem git had any issues (other than the crasher with
> PCRE2 10.34) with the transition to matching multibyte digits with
> '\d' and which is what perl (and therefore PCRE2) does
I suspect the issues that Jim is worried about have more to do with
people attacking grep-using programs with data that unexpectedly match
\d. This is more likely to happen with GNU grep -P (which gets all sorts
of weird junk thrown at it) than with Git grep -P (which tends to lead a
more cloistered life). If my suspicion is right, then even if Git users
don't have issues with PCRE2_UCP and \d, that doesn't mean GNU grep
users would be free of such issues.
> My assumption is that you would also need PCRE2_EXTRA_ASCII_DIGIT, and
> indeed bleeding edge pcre2grep[2] had a compatibility option added
> assuming as much.
Although I wasn't aware of that future PCRE2 option, I am not sure GNU
grep -P should use it. As things stand, the next GNU grep -P release,
when combined with the next PCRE2 release, will have \d match ASCII
digits only, and will have [[:digit:]] match all decimal digits. This is
compatible with how plain GNU grep [[:digit:]] works (plain GNU grep
lacks \d of course, so there's no compatibility issue there).
Quite possibly GNU grep -P should retain [[:digit:]] compatibility with
plain grep by not using PCRE2_EXTRA_ASCII_DIGIT. Though it'd be
unfortunate that \d would not mean the same thing as [[:digit:]], that
might be better than the alternative of having [[:digit:]] mean
something different with -P than without -P.
> The proposed patch for git would IMHO just cause the same risk I was
> trying to prevent with my proposed change to GNU grep.
It sounds like this ship has already sailed. At best we can now try to
repair it.
For now I installed the attached documentation patch. I left the grep
code alone as we are so close to a release.
[-- Attachment #2: 0001-doc-improve-doc-for-P-d.patch --]
[-- Type: text/x-patch, Size: 3571 bytes --]
From 8d3afeebcc2bdf2e8fd4ed1c5256e54be95f36a1 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Sat, 29 Apr 2023 23:41:14 -0700
Subject: [PATCH] doc: improve doc for -P '\d'
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This follows up to Carlo Marcelo Arenas Belón’s email
<https://lists.gnu.org/r/grep-devel/2023-04/msg00017.html>
that proposed changing the code too. These patches change
only the documentation since we’re so near a release.
* NEWS: Be less optimistic about the fix for -P '\d',
and warn that behavior is likely to change again.
* doc/grep.texi (grep Programs): Be less specific about -P \d
behavior, since it’s still in flux. Warn about mismatching
Unicode versions, or disagreements about obscure constructs.
---
NEWS | 14 ++++++++------
doc/grep.texi | 13 +++++--------
2 files changed, 13 insertions(+), 14 deletions(-)
diff --git a/NEWS b/NEWS
index c15764c..995d14e 100644
--- a/NEWS
+++ b/NEWS
@@ -4,11 +4,12 @@ GNU grep NEWS -*- outline -*-
** Bug fixes
- With -P, patterns like [\d] now work again. The fix relies on PCRE2
- support for the PCRE2_EXTRA_ASCII_BSD flag planned for PCRE2 10.43.
- With PCRE2 version 10.42 or earlier, behavior reverts to that of
- grep 3.8, in that patterns like \w and \b use ASCII rather than
- Unicode interpretations.
+ With -P, patterns like [\d] now work again. Fixing this has caused
+ grep to revert to the behavior of grep 3.8, in that patterns like \w
+ and \b go back to using ASCII rather than Unicode interpretations.
+ However, future versions of GNU grep and/or PCRE2 are likely to fix
+ this and change the behavior of \w and \b back to Unicode again,
+ without breaking [\d] as 3.10 did.
[bug introduced in grep 3.10]
grep no longer fails on files dated after the year 2038,
@@ -25,7 +26,8 @@ GNU grep NEWS -*- outline -*-
previous versions of grep wouldn't respect the user provided settings for
PCRE_CFLAGS and PCRE_LIBS when building if a libpcre2-8 pkg-config module
- found in the system.
+ was found.
+
* Noteworthy changes in release 3.10 (2023-03-22) [stable]
diff --git a/doc/grep.texi b/doc/grep.texi
index ce6d6dc..ff31d5d 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1154,18 +1154,15 @@ For documentation, refer to @url{https://www.pcre.org/}, with these caveats:
@samp{\d} matches only the ten ASCII digits
(and @samp{\D} matches the complement), regardless of locale.
Use @samp{\p@{Nd@}} to also match non-ASCII digits.
-
-When @command{grep} is built with PCRE2 10.42 and earlier,
-@samp{\d} and @samp{\D} ignore in-regexp directives like @samp{(?aD)}
-and work like @samp{[0-9]} and @samp{[^0-9]} respectively.
-However, later versions of PCRE2 likely will fix this,
-and the plan is for @command{grep} to respect those directives if possible.
+(The behavior of @samp{\d} and @samp{\D} is unspecified after
+in-regexp directives like @samp{(?aD)}.)
@item
Although PCRE tracks the syntax and semantics of Perl's regular
-expressions, the match is not always exact, partly because Perl
+expressions, the match is not always exact. For example, Perl
evolves and a Perl installation may predate or postdate the PCRE2
-installation on the same host.
+installation on the same host, or their Unicode versions may differ,
+or Perl and PCRE2 may disagree about an obscure construct.
@item
By default, @command{grep} applies each regexp to a line at a time,
--
2.39.2