From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Git List" <git@vger.kernel.org>,
	"Hamza Mahfooz" <someguy@effective-light.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Carlo Marcelo Arenas Belón" <carenas@gmail.com>,
	"Andreas Schwab" <schwab@linux-m68k.org>
Subject: Re: [v2.35.0 regression] some PCRE hangs under UTF-8 locale (was: [PATCH 1/2] grep/pcre2: use PCRE2_UTF even with ASCII patterns)
Date: Sun, 30 Jan 2022 10:04:22 +0100
Message-ID: <20220130090422.GA4769@szeder.dev>
In-Reply-To: <dca59178-6e9b-315b-06ee-8e3201aa391c@web.de>

On Sun, Jan 30, 2022 at 08:55:02AM +0100, René Scharfe wrote:
> Am 29.01.22 um 18:25 schrieb SZEDER Gábor:
> > On Sat, Dec 18, 2021 at 08:50:02PM +0100, René Scharfe wrote:
> >> compile_pcre2_pattern() currently uses the option PCRE2_UTF only for
> >> patterns with non-ASCII characters. Patterns with ASCII wildcards can
> >> match non-ASCII strings, though. Without that option PCRE2 mishandles
> >> UTF-8 input, though -- it matches parts of multi-byte characters. Fix
> >> that by using PCRE2_UTF even for ASCII-only patterns.
> >>
> >> This is a remake of the reverted ae39ba431a (grep/pcre2: fix an edge
> >> case concerning ascii patterns and UTF-8 data, 2021-10-15). The change
> >> to the condition and the test are simplified and more targeted.
> >>
> >> Original-patch-by: Hamza Mahfooz <someguy@effective-light.com>
> >> Signed-off-by: René Scharfe <l.s.r@web.de>
> >> ---
> >> grep.c | 2 +-
> >> t/t7812-grep-icase-non-ascii.sh | 6 ++++++
> >> 2 files changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/grep.c b/grep.c
> >> index fe847a0111..5badb6d851 100644
> >> --- a/grep.c
> >> +++ b/grep.c
> >> @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
> >>  		}
> >>  		options |= PCRE2_CASELESS;
> >>  	}
> >> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> >> +	if (!opt->ignore_locale && is_utf8_locale() &&
> >>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> >>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> >>
> >
> > I tried to use 'git grep -P' for the first time ever, and it hung
> > right away, spinning all CPUs at 100%. I could narrow it down, both
> > the complexity of the pattern and the size of the input, see the test
> > below, and it bisects to this patch.
> >
> >
> > --- >8 ---
> >
> > #!/bin/sh
> >
> > test_description='test'
> >
> > . ./test-lib.sh
> >
> > test_expect_success PCRE 'test' '
> > 	# LC_ALL=C works
> > 	LC_ALL=en_US.UTF-8 &&
> > 	cat >ascii <<-\EOF &&
> > 	foo
> > 	 bar
> > 	 baz
> > 	EOF
> > 	cat >utf8 <<-\EOF &&
> > 	foo
> > 	 bar
> > 	 báz
> > 	EOF
> > 	git add ascii utf8 &&
> >
> > 	# These all work as expected:
> > 	git grep --threads=1 -P " " ascii &&
> > 	git grep --threads=1 -P "^ " ascii &&
> > 	git grep --threads=1 -P "\s" ascii &&
> > 	git grep --threads=1 -P "^\s" ascii &&
> > 	git grep --threads=1 -P " " utf8 &&
> > 	git grep --threads=1 -P "^ " utf8 &&
> > 	git grep --threads=1 -P "\s" utf8 &&
> >
> > 	# This hangs (but it does work with basic and extended regexp):
> > 	git grep --threads=1 -P "^\s" utf8
> > '
> >
> > test_done
>
> I get the following result and no hang with PCRE2 10.39:
>
> utf8: bar
> utf8: báz
>
> e0c6029 (Fix inifinite loop when a single byte newline is searched in
> JIT., 2020-05-29) [1] sounds like it might have fixed it. It's part of
> version 10.36.

I saw this hang on two Ubuntu 20.04 based boxes, which predate that
fix you mention only by a month or two, and apparently the almost two
years since then were not enough for this fix to trickle down into
updated 20.04 pcre packages, because:

> Do you still get the error when you disable JIT, i.e. when you use the
> pattern "(*NO_JIT)^\s" instead?

No, with this pattern it works as expected.

So is there a more convenient way to disable PCRE JIT in Git? FWIW,
(non-git) 'grep -P' works with the same patterns.

> René
>
>
> [1] https://github.com/PhilipHazel/pcre2/commit/e0c6029a62db9c2161941ecdf459205382d4d379
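
For anyone stuck on a PCRE2 older than 10.36, the per-pattern workaround
from the exchange above (prefixing the pattern with "(*NO_JIT)") could be
wrapped in a small shell helper. This is only a sketch: the function name
no_jit_pattern is hypothetical and not part of Git, and it assumes a Git
built with PCRE2 support.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): prepend PCRE2's (*NO_JIT)
# start-of-pattern option so the pattern is matched without the JIT.
no_jit_pattern () {
	# %s inserts the pattern verbatim; backslashes in it are not interpreted.
	printf '(*NO_JIT)%s' "$1"
}

# Intended use, assuming a Git built with PCRE2 support:
#   git grep -P "$(no_jit_pattern '^\s')" utf8
```

The helper only rewrites the pattern string, so it leaves Git's own
configuration untouched and can be dropped once a fixed PCRE2 is installed.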