From: "Carlo Marcelo Arenas Belón" <carenas@gmail.com>
To: git@vger.kernel.org
Cc: avarab@gmail.com, gitster@pobox.com,
"Carlo Marcelo Arenas Belón" <carenas@gmail.com>
Subject: [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
Date: Sun, 8 Jan 2023 07:52:17 -0800
Message-ID: <20230108155217.2817-1-carenas@gmail.com>
In-Reply-To: <20230108062335.72114-1-carenas@gmail.com>
When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.
This prevents the character classes from being extended to include
the newly valid characters, and therefore results in failed matches
for expressions that rely on that extension, for example:
$ git grep -P '\bÆvar'
Add PCRE2_UCP so that \w will include Æ and therefore \b could
correctly match the beginning of that word.
This has a performance cost, estimated at between 20% and 40%,
which the added performance test demonstrates.
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
grep.c | 2 +-
t/perf/p7822-grep-perl-character.sh | 42 +++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 1 deletion(-)
create mode 100755 t/perf/p7822-grep-perl-character.sh
diff --git a/grep.c b/grep.c
index 06eed69493..1687f65b64 100644
--- a/grep.c
+++ b/grep.c
@@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
options |= PCRE2_CASELESS;
}
if (!opt->ignore_locale && is_utf8_locale() && !literal)
- options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+ options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
#ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh
new file mode 100755
index 0000000000..87009c60df
--- /dev/null
+++ b/t/perf/p7822-grep-perl-character.sh
@@ -0,0 +1,42 @@
+#!/bin/sh
+
+test_description="git-grep's perl regex
+
+If GIT_PERF_GREP_THREADS is set to a list of threads (e.g. '1 4 8'
+etc.) we will test the patterns under those numbers of threads.
+"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+if test -n "$GIT_PERF_GREP_THREADS"
+then
+ test_set_prereq PERF_GREP_ENGINES_THREADS
+fi
+
+for pattern in \
+ '\bhow' \
+ '\bÆvar' \
+ '\d+ \bÆvar' \
+ '\bBelón\b' \
+ '\w{12}\b'
+do
+ printf '%s\n' "$pattern" >pat
+ if ! test_have_prereq PERF_GREP_ENGINES_THREADS
+ then
+ test_perf "grep -P '$pattern'" --prereq PCRE "
+ git grep -P -f pat || :
+ "
+ else
+ for threads in $GIT_PERF_GREP_THREADS
+ do
+ test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
+ git -c grep.threads=$threads grep -P -f pat || :
+ "
+ done
+ fi
+done
+
+test_done
--
2.39.0.199.g555ddd67e6