From: Hamza Mahfooz <someguy@effective-light.com>
To: git@vger.kernel.org
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Hamza Mahfooz" <someguy@effective-light.com>
Subject: [PATCH v11 3/3] grep: fix an edge case concerning ascii patterns and UTF-8 data
Date: Thu, 7 Oct 2021 16:31:48 -0400
Message-ID: <20211007203148.23888-3-someguy@effective-light.com>
In-Reply-To: <20211007203148.23888-1-someguy@effective-light.com>

If we attempt to grep non-ASCII log message text with an ASCII pattern, we
run into the following issue:

    $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
    grep: (standard input): binary file matches

To fix this, teach the grep code to mark the pattern as UTF-8 (even if
the pattern is composed of only ASCII characters), so long as the log
output is encoded using UTF-8.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
---
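
For readers following the condition changed in compile_pcre2_pattern(),
here is a minimal standalone sketch of the decision the patched code
makes about enabling PCRE2's UTF mode. It is not git code: struct
utf_inputs, pattern_has_non_ascii() and wants_utf_mode() are
hypothetical names introduced only for illustration, and the byte test
is a simplified stand-in for git's has_non_ascii().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the values the patched condition consults. */
struct utf_inputs {
	bool utf8_all_the_things; /* set when the log output encoding is UTF-8 */
	bool ignore_locale;       /* set when the log output encoding is not UTF-8 */
	bool utf8_locale;         /* what is_utf8_locale() would report */
	bool ignore_case;         /* opt->ignore_case */
	bool fixed;               /* p->fixed || p->is_fixed */
};

/* Simplified stand-in for has_non_ascii(): true if any byte has its high bit set. */
static bool pattern_has_non_ascii(const char *s)
{
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return true;
	return false;
}

/* Mirrors the shape of the condition that guards PCRE2_UTF in the patch. */
static bool wants_utf_mode(const struct utf_inputs *in, const char *pattern)
{
	bool non_ascii = pattern_has_non_ascii(pattern);

	/* New branch: ASCII-only pattern, but the data is known to be UTF-8. */
	if (in->utf8_all_the_things && !non_ascii)
		return true;

	/* Pre-existing branch: non-ASCII pattern under a UTF-8 locale. */
	return !in->ignore_locale && in->utf8_locale && non_ascii &&
	       !(!in->ignore_case && in->fixed);
}
```

Under these assumptions, an ASCII pattern such as ".*Thor" selects UTF
mode only when utf8_all_the_things is set, while a latin1 log output
encoding (ignore_locale set, utf8_all_the_things unset) leaves UTF mode
off for every pattern.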
grep.c | 6 +++--
grep.h | 1 +
revision.c | 2 ++
t/t7812-grep-icase-non-ascii.sh | 48 +++++++++++++++++++++++++++++++++
4 files changed, 55 insertions(+), 2 deletions(-)
diff --git a/grep.c b/grep.c
index fe847a0111..978d30f053 100644
--- a/grep.c
+++ b/grep.c
@@ -382,8 +382,10 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
}
options |= PCRE2_CASELESS;
}
- if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
- !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+ if ((opt->utf8_all_the_things && !has_non_ascii(p->pattern)) ||
+ (!opt->ignore_locale && is_utf8_locale() &&
+ has_non_ascii(p->pattern) && !(!opt->ignore_case &&
+ (p->fixed || p->is_fixed))))
options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
#ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
diff --git a/grep.h b/grep.h
index 808ad76f0c..c9ddd637d1 100644
--- a/grep.h
+++ b/grep.h
@@ -167,6 +167,7 @@ struct grep_opt {
int extended_regexp_option;
int pattern_type_option;
int ignore_locale;
+ int utf8_all_the_things;
char colors[NR_GREP_COLORS][COLOR_MAXLEN];
unsigned pre_context;
unsigned post_context;
diff --git a/revision.c b/revision.c
index 0dabb5a0bc..0d751dceb7 100644
--- a/revision.c
+++ b/revision.c
@@ -2874,6 +2874,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
&revs->grep_filter);
if (!is_encoding_utf8(get_log_output_encoding()))
revs->grep_filter.ignore_locale = 1;
+ else
+ revs->grep_filter.utf8_all_the_things = 1;
compile_grep_patterns(&revs->grep_filter);
if (revs->reverse && revs->reflog_info)
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index e5d1e4ea68..42323b31c0 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -53,6 +53,54 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
test_cmp expected actual
'
+test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on UTF-8 data' '
+ cat >expected <<-\EOF &&
+ Author: <BOLD;RED>À Ú Thor<RESET> <author@example.com>
+ EOF
+ test_write_lines "forth" >file4 &&
+ git add file4 &&
+ git commit --author="À Ú Thor <author@example.com>" -m sécond &&
+ git log -1 --color=always --perl-regexp --author=".*Thor" >log &&
+ grep Author log >actual.raw &&
+ test_decode_color <actual.raw >actual &&
+ test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern on ISO-8859-1 data' '
+ cat >expected <<-\EOF &&
+ Commit: Ç<BOLD;RED> O Mîtter <committer@example.com><RESET>
+ EOF
+ test_write_lines "fifth" >file5 &&
+ git add file5 &&
+ export GIT_COMMITTER_NAME="Ç O Mîtter" &&
+ export GIT_COMMITTER_EMAIL="committer@example.com" &&
+ git -c i18n.commitEncoding=latin1 commit -m thïrd &&
+ git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log &&
+ grep Commit: log >actual.raw &&
+ test_decode_color <actual.raw >actual &&
+ test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on UTF-8 data' '
+ cat >expected <<-\EOF &&
+ sé<BOLD;RED>con<RESET>d
+ EOF
+ git log -1 --color=always --perl-regexp --grep="con" >log &&
+ grep con log >actual.raw &&
+ test_decode_color <actual.raw >actual &&
+ test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on ISO-8859-1 data' '
+ cat >expected <<-\EOF &&
+ <BOLD;RED>thïrd<RESET>
+ EOF
+ git -c i18n.logOutputEncoding=latin1 log -1 --color=always --perl-regexp --grep="th.*rd" >log &&
+ grep "th.*rd" log >actual.raw &&
+ test_decode_color <actual.raw >actual &&
+ test_cmp expected actual
+'
+
test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
printf "\\200\\n" >invalid-0x80 &&
echo "ævar" >expected &&
--
2.33.0