* git grep does not find all occurrences on macOS @ 2024-10-14 13:34 David Gstir 2024-10-15 0:26 ` Taylor Blau 0 siblings, 1 reply; 6+ messages in thread From: David Gstir @ 2024-10-14 13:34 UTC (permalink / raw) To: git; +Cc: Richard Weinberger Hi! I encountered rather subtle issue on in git 2.47.0 on macOS 14.7 (installed from Homebrew): git grep will not find all occurrences of string patterns containing a “.” under some conditions. In my case I have an ISO-8859 encoded text file which contains umlauts. If the string I’m grepping for occurs after a non-ASCII character in this file, git grep will not find it. I’ve put up a reproducer here https://github.com/iokill/repro-git-grep-issue, but the gist of it is "git grep quz.baz" on the ISO-8859-encoded file below will not return anything, when it should return the line "quz.baz=3": -->8------------- foo=bar umlauts=äöü quz.baz=3 --8<------------- From what I’ve found so far, this occurs on macOS, but not on Linux and it has to be connected to the regex matching code, because the issue does not occur with --fixed-strings or --perl-regexp. Thanks! - David ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: git grep does not find all occurrences on macOS 2024-10-14 13:34 git grep does not find all occurrences on macOS David Gstir @ 2024-10-15 0:26 ` Taylor Blau 2024-10-15 20:15 ` René Scharfe 0 siblings, 1 reply; 6+ messages in thread From: Taylor Blau @ 2024-10-15 0:26 UTC (permalink / raw) To: David Gstir; +Cc: git, Richard Weinberger, René Scharfe, Jeff King On Mon, Oct 14, 2024 at 03:34:02PM +0200, David Gstir wrote: > Hi! > > I encountered rather subtle issue on in git 2.47.0 on macOS 14.7 (installed from Homebrew): > > git grep will not find all occurrences of string patterns containing a “.” under some > conditions. In my case I have an ISO-8859 encoded text file which contains umlauts. > If the string I’m grepping for occurs after a non-ASCII character in this file, git grep > will not find it. > > I’ve put up a reproducer here https://github.com/iokill/repro-git-grep-issue, but the gist > of it is "git grep quz.baz" on the ISO-8859-encoded file below will not return anything, > when it should return the line "quz.baz=3": Interesting. I have a macOS system, but don't build Git on it regularly enough to test. I CC'd some other active contributors who have recently touched grep.c, perhaps they may have a better insight into what is going on here. Thanks, Taylor ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: git grep does not find all occurrences on macOS 2024-10-15 0:26 ` Taylor Blau @ 2024-10-15 20:15 ` René Scharfe 2024-10-20 11:02 ` [PATCH] grep: disable lookahead on error René Scharfe 0 siblings, 1 reply; 6+ messages in thread From: René Scharfe @ 2024-10-15 20:15 UTC (permalink / raw) To: Taylor Blau, David Gstir; +Cc: git, Richard Weinberger, Jeff King Am 15.10.24 um 02:26 schrieb Taylor Blau: > On Mon, Oct 14, 2024 at 03:34:02PM +0200, David Gstir wrote: >> Hi! >> >> I encountered rather subtle issue on in git 2.47.0 on macOS 14.7 (installed from Homebrew): >> >> git grep will not find all occurrences of string patterns containing a “.” under some >> conditions. In my case I have an ISO-8859 encoded text file which contains umlauts. >> If the string I’m grepping for occurs after a non-ASCII character in this file, git grep >> will not find it. >> >> I’ve put up a reproducer here https://github.com/iokill/repro-git-grep-issue, but the gist >> of it is "git grep quz.baz" on the ISO-8859-encoded file below will not return anything, >> when it should return the line "quz.baz=3": Can reproduce on macOS 15.0.1. Bisects to 1819ad327b (grep: fix multibyte regex handling under macOS, 2022-08-26), which enabled the use of the system's regex engine. grep(1) does find that line. regexec(3) returns REG_ILLSEQ (illegal byte sequence) for that file, which makes sense. Interpreting that result as a non-match of the whole file is not the best way to handle it, though. Reporting the error would be one option. Turning off lookahead and matching each line separately might be better. Would setting the attribute working-tree-encoding help here? Not fully: The file would be converted to UTF-8 before commit, but git grep without a tree argument would still read the raw file, without conversion. Shouldn't it respect the attribute and call convert_to_git()? Using -P to use Perl regular expressions would work in the example. René ^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH] grep: disable lookahead on error 2024-10-15 20:15 ` René Scharfe @ 2024-10-20 11:02 ` René Scharfe 2024-10-21 21:57 ` Taylor Blau 2024-10-22 5:58 ` David Gstir 0 siblings, 2 replies; 6+ messages in thread From: René Scharfe @ 2024-10-20 11:02 UTC (permalink / raw) To: Taylor Blau, David Gstir; +Cc: git, Richard Weinberger, Jeff King regexec(3) can fail. E.g. on macOS it fails if it is used with an UTF-8 locale to match a valid regex against a buffer containing invalid UTF-8 characters. git grep has two ways to search for matches in a file: Either it splits its contents into lines and matches them separately, or it matches the whole content and figures out line boundaries later. The latter is done by look_ahead() and it's quicker in the common case where most files don't contain a match. Fall back to line-by-line matching if look_ahead() encounters an regexec(3) error by propagating errors out of patmatch() and bailing out of look_ahead() if there is one. This way we at least can find matches in lines that contain only valid characters. That matches the behavior of grep(1) on macOS. pcre2match() dies if pcre2_jit_match() or pcre2_match() fail, but since we use the flag PCRE2_MATCH_INVALID_UTF it handles invalid UTF-8 characters gracefully. So implement the fall-back only for regexec(3) and leave the PCRE2 matching unchanged. Reported-by: David Gstir <david@sigma-star.at> Signed-off-by: René Scharfe <l.s.r@web.de> --- grep.c | 30 ++++++++++++++++++++---------- t/t7810-grep.sh | 9 +++++++++ 2 files changed, 29 insertions(+), 10 deletions(-) diff --git a/grep.c b/grep.c index 701e58de04..506f0423c8 100644 --- a/grep.c +++ b/grep.c @@ -906,15 +906,17 @@ static int patmatch(struct grep_pat *p, const char *line, const char *eol, regmatch_t *match, int eflags) { - int hit; - if (p->pcre2_pattern) - hit = !pcre2match(p, line, eol, match, eflags); - else - hit = !regexec_buf(&p->regexp, line, eol - line, 1, match, - eflags); + return !pcre2match(p, line, eol, match, eflags); - return hit; + switch (regexec_buf(&p->regexp, line, eol - line, 1, match, eflags)) { + case 0: + return 1; + case REG_NOMATCH: + return 0; + default: + return -1; + } } static void strip_timestamp(const char *bol, const char **eol_p) @@ -952,6 +954,8 @@ static int headerless_match_one_pattern(struct grep_pat *p, again: hit = patmatch(p, bol, eol, pmatch, eflags); + if (hit < 0) + hit = 0; if (hit && p->word_regexp) { if ((pmatch[0].rm_so < 0) || @@ -1461,6 +1465,8 @@ static int look_ahead(struct grep_opt *opt, regmatch_t m; hit = patmatch(p, bol, bol + *left_p, &m, 0); + if (hit < 0) + return -1; if (!hit || m.rm_so < 0 || m.rm_eo < 0) continue; if (earliest < 0 || m.rm_so < earliest) @@ -1655,9 +1661,13 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle if (try_lookahead && !(last_hit && (show_function || - lno <= last_hit + opt->post_context)) - && look_ahead(opt, &left, &lno, &bol)) - break; + lno <= last_hit + opt->post_context))) { + hit = look_ahead(opt, &left, &lno, &bol); + if (hit < 0) + try_lookahead = 0; + else if (hit) + break; + } eol = end_of_line(bol, &left); if ((ctx == GREP_CONTEXT_HEAD) && (eol == bol)) diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh index af2cf2f78a..64ac4f04ee 100755 --- a/t/t7810-grep.sh +++ b/t/t7810-grep.sh @@ -87,6 +87,7 @@ test_expect_success setup ' # Still a no-op. function dummy() {} EOF + printf "\200\nASCII\n" >invalid-utf8 && if test_have_prereq FUNNYNAMES then echo unusual >"\"unusual\" pathname" && @@ -534,6 +535,14 @@ do test_cmp expected actual ' + test_expect_success "grep $L searches past invalid lines on UTF-8 locale" ' + LC_ALL=en_US.UTF-8 git grep A. invalid-utf8 >actual && + cat >expected <<-EOF && + invalid-utf8:ASCII + EOF + test_cmp expected actual + ' + test_expect_success FUNNYNAMES "grep $L should quote unusual pathnames" ' cat >expected <<-EOF && ${HC}"\"unusual\" pathname":unusual -- 2.47.0 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] grep: disable lookahead on error 2024-10-20 11:02 ` [PATCH] grep: disable lookahead on error René Scharfe @ 2024-10-21 21:57 ` Taylor Blau 2024-10-22 5:58 ` David Gstir 1 sibling, 0 replies; 6+ messages in thread From: Taylor Blau @ 2024-10-21 21:57 UTC (permalink / raw) To: René Scharfe; +Cc: David Gstir, git, Richard Weinberger, Jeff King On Sun, Oct 20, 2024 at 01:02:32PM +0200, René Scharfe wrote: > --- > grep.c | 30 ++++++++++++++++++++---------- > t/t7810-grep.sh | 9 +++++++++ > 2 files changed, 29 insertions(+), 10 deletions(-) Thanks, René. The patch looks good to me, but it would be nice to hear from David who could confirm that it fixes the issue on his end as well. Thanks, Taylor ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] grep: disable lookahead on error 2024-10-20 11:02 ` [PATCH] grep: disable lookahead on error René Scharfe 2024-10-21 21:57 ` Taylor Blau @ 2024-10-22 5:58 ` David Gstir 1 sibling, 0 replies; 6+ messages in thread From: David Gstir @ 2024-10-22 5:58 UTC (permalink / raw) To: René Scharfe; +Cc: Taylor Blau, git, Richard Weinberger, Jeff King Hi René, > On 20.10.2024, at 13:02, René Scharfe <l.s.r@web.de> wrote: > > regexec(3) can fail. E.g. on macOS it fails if it is used with an UTF-8 > locale to match a valid regex against a buffer containing invalid UTF-8 > characters. > > git grep has two ways to search for matches in a file: Either it splits > its contents into lines and matches them separately, or it matches the > whole content and figures out line boundaries later. The latter is done > by look_ahead() and it's quicker in the common case where most files > don't contain a match. > > Fall back to line-by-line matching if look_ahead() encounters an > regexec(3) error by propagating errors out of patmatch() and bailing out > of look_ahead() if there is one. This way we at least can find matches > in lines that contain only valid characters. That matches the behavior > of grep(1) on macOS. > > pcre2match() dies if pcre2_jit_match() or pcre2_match() fail, but since > we use the flag PCRE2_MATCH_INVALID_UTF it handles invalid UTF-8 > characters gracefully. So implement the fall-back only for regexec(3) > and leave the PCRE2 matching unchanged. > > Reported-by: David Gstir <david@sigma-star.at> > Signed-off-by: René Scharfe <l.s.r@web.de> thanks for fixing this! I’ve tested it on my end and your patch works. Feel free to add my Tested-By. Thanks, David ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2024-10-22 5:58 UTC | newest] Thread overview: 6+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-10-14 13:34 git grep does not find all occurrences on macOS David Gstir 2024-10-15 0:26 ` Taylor Blau 2024-10-15 20:15 ` René Scharfe 2024-10-20 11:02 ` [PATCH] grep: disable lookahead on error René Scharfe 2024-10-21 21:57 ` Taylor Blau 2024-10-22 5:58 ` David Gstir
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).