* git grep: ^$ false match at end of file
@ 2025-01-09 23:52 Olly Betts
2025-01-10 11:43 ` Jeff King
0 siblings, 1 reply; 5+ messages in thread
From: Olly Betts @ 2025-01-09 23:52 UTC (permalink / raw)
To: git
git grep '^$' seems to match at the end of the file, reporting a line
number one greater than the number of lines in that file. This does
not match the behaviour of grep.
To reproduce:
$ git init -q git-grep-bug
$ cd git-grep-bug
$ echo test > test.txt
$ git add test.txt
$ git commit -m test
[master (root-commit) 55b48b26] test
1 file changed, 1 insertion(+)
create mode 100644 test.txt
$ git grep -n '^$'
test.txt:2:
$ grep -n '^$' test.txt
$
(The -n option isn't required to trigger it.)
I'm using the git 1:2.47.1-1 packages from Debian unstable. I can also
reproduce with git 1:2.48.0~rc1+next.20250101-1 from Debian
experimental.
Cheers,
Olly
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: git grep: ^$ false match at end of file 2025-01-09 23:52 git grep: ^$ false match at end of file Olly Betts @ 2025-01-10 11:43 ` Jeff King 2025-01-10 12:02 ` Jeff King 0 siblings, 1 reply; 5+ messages in thread From: Jeff King @ 2025-01-10 11:43 UTC (permalink / raw) To: Olly Betts; +Cc: Junio C Hamano, git On Thu, Jan 09, 2025 at 11:52:55PM +0000, Olly Betts wrote: > git grep '^$' seems to match at the end of the file, reporting a line > number one greater than the number of lines in that file. This does > not match the behaviour of grep. > > To reproduce: > > $ git init -q git-grep-bug > $ cd git-grep-bug > $ echo test > test.txt > $ git add test.txt > $ git commit -m test > [master (root-commit) 55b48b26] test > 1 file changed, 1 insertion(+) > create mode 100644 test.txt > $ git grep -n '^$' > test.txt:2: > $ grep -n '^$' test.txt Interesting case. Bisection shows that it started doing that in 34349bea60 (Merge branch 'jc/grep-lookahead', 2010-01-20). So it has been that way for quite a long time. But it is doubly curious, since neither of the parent trees exhibit the behavior. It is the merge itself which causes the problem. In the first-parent tree 34349bea60^1, we are still calling external "grep", which could explain why we don't see any problem. But building with NO_EXTERNAL_GREP (and confirming that it uses the internal code), it doesn't show the problem either! So where did the bug come from? Puzzled. That branch itself contains a merge, e2d2e383d8 (Merge branch 'jc/maint-1.6.4-grep-lookahead' into jc/maint-grep-lookahead, 2010-01-12). If we merge that into 34349bea60^1, the innocent first-parent, then the bug appears. And that brings in a bunch of lookahead code that could plausibly be the problem. I'm still confused why 34349bea60^2 (which does have the lookahead code) doesn't show the bug. I guess there's some bad interaction with what had happened in the meantime along the first-parent branch. Looking at: git diff 34349bea60^2 34349bea60 -- grep.c builtin-grep.c turns up: diff --git a/builtin-grep.c b/builtin-grep.c index 12833733db..da854fa94f 100644 --- a/builtin-grep.c +++ b/builtin-grep.c @@ -182,8 +182,6 @@ static int grep_file(struct grep_opt *opt, const char *filename) error("'%s': %s", filename, strerror(errno)); return 0; } - if (!st.st_size) - return 0; /* empty file -- no grep hit */ if (!S_ISREG(st.st_mode)) return 0; sz = xsize_t(st.st_size); @@ -198,6 +196,7 @@ static int grep_file(struct grep_opt *opt, const char *filename) return 0; } close(i); + data[sz] = 0; if (opt->relative && opt->prefix_length) filename = quote_path_relative(filename, -1, &buf, opt->prefix); i = grep_buffer(opt, filename, data, sz); @@ -223,7 +222,7 @@ static int grep_cache(struct grep_opt *opt, const char **paths, int cached) * are identical, even if worktree file has been modified, so use * cache version instead */ - if (cached || (ce->ce_flags & CE_VALID)) { + if (cached || (ce->ce_flags & CE_VALID) || ce_skip_worktree(ce)) { if (ce_stage(ce)) continue; hit |= grep_sha1(opt, ce->sha1, ce->name, 0); Ah. That middle hunk seems to be the culprit. But that probably means we were looking at uninitialized memory before, and the lookahead code was always wrong (but got lucky when there was a non-zero byte in that final slot). :-/ So probably the issue is the changes from a26345b608 (grep: optimize built-in grep by skipping lines that do not hit, 2010-01-10). I'll stop digging on it for now (but adding Junio to the cc as the author there). Probably it would have been faster just to start with a debugger than to look through the history. ;) -Peff ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: git grep: ^$ false match at end of file 2025-01-10 11:43 ` Jeff King @ 2025-01-10 12:02 ` Jeff King 2025-01-10 12:59 ` Andreas Schwab 0 siblings, 1 reply; 5+ messages in thread From: Jeff King @ 2025-01-10 12:02 UTC (permalink / raw) To: Olly Betts; +Cc: Junio C Hamano, git On Fri, Jan 10, 2025 at 06:43:08AM -0500, Jeff King wrote: > I'll stop digging on it for now (but adding Junio to the cc as the > author there). Probably it would have been faster just to start with a > debugger than to look through the history. ;) OK, my curiosity got the better of me. This fixes it: diff --git a/grep.c b/grep.c index 4e155ee9e6..9eac3dd95d 100644 --- a/grep.c +++ b/grep.c @@ -1470,10 +1470,12 @@ static int look_ahead(struct grep_opt *opt, hit = patmatch(p, bol, bol + *left_p, &m, 0); if (hit < 0) return -1; if (!hit || m.rm_so < 0 || m.rm_eo < 0) continue; + if (m.rm_so == *left_p) + continue; /* don't match nothing */ if (earliest < 0 || m.rm_so < earliest) earliest = m.rm_so; } if (earliest < 0) { but it is weird to me that patmatch() will match "^$" to the end of the buffer at all. It is just calling regexec_buf() behind the scenes, so I guess this is just a weird special case there, and may even depend on the regex implementation. If I pass "-P" to use pcre instead, the problem goes away even without my patch. If we skip look-ahead the problem also goes away. I'd have thought match_line() would have the same problem, but there we process line by line, and regexec_buf() never even sees the newline. So I guess the rationale is: some regexec implementations are weird about this special regex, and we should not trust their result with it on a whole buffer with newlines. -Peff ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: git grep: ^$ false match at end of file 2025-01-10 12:02 ` Jeff King @ 2025-01-10 12:59 ` Andreas Schwab 2025-01-13 6:26 ` Jeff King 0 siblings, 1 reply; 5+ messages in thread From: Andreas Schwab @ 2025-01-10 12:59 UTC (permalink / raw) To: Jeff King; +Cc: Olly Betts, Junio C Hamano, git On Jan 10 2025, Jeff King wrote: > but it is weird to me that patmatch() will match "^$" to the end of the > buffer at all. It is just calling regexec_buf() behind the scenes, so I > guess this is just a weird special case there, and may even depend on > the regex implementation. Shouldn't the matcher be called with REG_NOTEOL in that case? -- Andreas Schwab, schwab@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different." ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: git grep: ^$ false match at end of file 2025-01-10 12:59 ` Andreas Schwab @ 2025-01-13 6:26 ` Jeff King 0 siblings, 0 replies; 5+ messages in thread From: Jeff King @ 2025-01-13 6:26 UTC (permalink / raw) To: Andreas Schwab; +Cc: Olly Betts, Junio C Hamano, git On Fri, Jan 10, 2025 at 01:59:18PM +0100, Andreas Schwab wrote: > On Jan 10 2025, Jeff King wrote: > > > but it is weird to me that patmatch() will match "^$" to the end of the > > buffer at all. It is just calling regexec_buf() behind the scenes, so I > > guess this is just a weird special case there, and may even depend on > > the regex implementation. > > Shouldn't the matcher be called with REG_NOTEOL in that case? Perhaps. If regexec_buf() is assuming we are feeding lines, then without REG_NOTEOL it thinks the end of the buffer is the end of a line. Which makes sense, but trips up this case because we are not feeding lines, but rather a whole buffer. So the final newline is not the start of an empty line, but the true end of the buffer. But what if the buffer doesn't end in a newline? In the example, the file is something like "content\n". But what if it was just "content"? Then the end of the buffer really is the end of a line, isn't it? And REG_NOTEOL would not be appropriate. So without REG_NOTEOL: [this is wrong, per the report] $ echo content >file.txt $ git grep --no-index -n '^$' file.txt file.txt:2: [this is right] $ printf content >file.txt $ git grep --no-index -n '^$' file.txt $ echo $? 1 and with it, like this patch: diff --git a/grep.c b/grep.c index 4e155ee9e6..7e3b6d9474 100644 --- a/grep.c +++ b/grep.c @@ -1467,7 +1467,7 @@ static int look_ahead(struct grep_opt *opt, int hit; regmatch_t m; - hit = patmatch(p, bol, bol + *left_p, &m, 0); + hit = patmatch(p, bol, bol + *left_p, &m, REG_NOTEOL); if (hit < 0) return -1; if (!hit || m.rm_so < 0 || m.rm_eo < 0) we get: [this is now right] $ git grep --no-index -n '^$' file.txt $ echo $? 1 [and this stays right] $ printf content >file.txt $ git grep --no-index -n '^$' file.txt $ echo $? 1 but: [without REG_NOTEOL, this matches] $ printf content >file.txt $ git grep --no-index -n 't$' file.txt file.txt:1:content [but with that flag, it no longer does] $ printf content >file.txt $ git grep --no-index -n 't$' file.txt $ echo $? 1 So I do think "\n" at the end of the buffer is a special case. Perhaps we should always omit it, and then leave REG_NOTEOL unset, making the end of the buffer consistently the end of the final line. Like this, which no longer matches "^$" but does match "t$": diff --git a/grep.c b/grep.c index 4e155ee9e6..c4bb9f1081 100644 --- a/grep.c +++ b/grep.c @@ -1646,6 +1646,8 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle bol = gs->buf; left = gs->size; + if (left && gs->buf[left-1] == '\n') + left--; while (left) { const char *eol; int hit; -Peff ^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-01-13 6:26 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-01-09 23:52 git grep: ^$ false match at end of file Olly Betts 2025-01-10 11:43 ` Jeff King 2025-01-10 12:02 ` Jeff King 2025-01-10 12:59 ` Andreas Schwab 2025-01-13 6:26 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).