git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* git grep: ^$ false match at end of file
@ 2025-01-09 23:52 Olly Betts
  2025-01-10 11:43 ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Olly Betts @ 2025-01-09 23:52 UTC (permalink / raw)
  To: git

git grep '^$' seems to match at the end of the file, reporting a line
number one greater than the number of lines in that file.  This does
not match the behaviour of grep.

To reproduce:

$ git init -q git-grep-bug
$ cd git-grep-bug
$ echo test > test.txt
$ git add test.txt
$ git commit -m test
[master (root-commit) 55b48b26] test
 1 file changed, 1 insertion(+)
 create mode 100644 test.txt
$ git grep -n '^$'
test.txt:2:
$ grep -n '^$' test.txt
$

(The -n option isn't required to trigger it.)

I'm using the git 1:2.47.1-1 packages from Debian unstable.  I can also
reproduce with git 1:2.48.0~rc1+next.20250101-1 from Debian
experimental.

Cheers,
    Olly

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: git grep: ^$ false match at end of file
  2025-01-09 23:52 git grep: ^$ false match at end of file Olly Betts
@ 2025-01-10 11:43 ` Jeff King
  2025-01-10 12:02   ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff King @ 2025-01-10 11:43 UTC (permalink / raw)
  To: Olly Betts; +Cc: Junio C Hamano, git

On Thu, Jan 09, 2025 at 11:52:55PM +0000, Olly Betts wrote:

> git grep '^$' seems to match at the end of the file, reporting a line
> number one greater than the number of lines in that file.  This does
> not match the behaviour of grep.
> 
> To reproduce:
> 
> $ git init -q git-grep-bug
> $ cd git-grep-bug
> $ echo test > test.txt
> $ git add test.txt
> $ git commit -m test
> [master (root-commit) 55b48b26] test
>  1 file changed, 1 insertion(+)
>  create mode 100644 test.txt
> $ git grep -n '^$'
> test.txt:2:
> $ grep -n '^$' test.txt

Interesting case. Bisection shows that it started doing that in
34349bea60 (Merge branch 'jc/grep-lookahead', 2010-01-20). So it has
been that way for quite a long time. But it is doubly curious, since
neither of the parent trees exhibit the behavior. It is the merge itself
which causes the problem.

In the first-parent tree 34349bea60^1, we are still calling external
"grep", which could explain why we don't see any problem. But building
with NO_EXTERNAL_GREP (and confirming that it uses the internal code),
it doesn't show the problem either!

So where did the bug come from? Puzzled.

That branch itself contains a merge, e2d2e383d8 (Merge branch
'jc/maint-1.6.4-grep-lookahead' into jc/maint-grep-lookahead,
2010-01-12). If we merge that into 34349bea60^1, the innocent
first-parent, then the bug appears. And that brings in a bunch of
lookahead code that could plausibly be the problem.

I'm still confused why 34349bea60^2 (which does have the lookahead code)
doesn't show the bug. I guess there's some bad interaction with what had
happened in the meantime along the first-parent branch.

Looking at:

  git diff 34349bea60^2 34349bea60 -- grep.c builtin-grep.c

turns up:

diff --git a/builtin-grep.c b/builtin-grep.c
index 12833733db..da854fa94f 100644
--- a/builtin-grep.c
+++ b/builtin-grep.c
@@ -182,8 +182,6 @@ static int grep_file(struct grep_opt *opt, const char *filename)
 			error("'%s': %s", filename, strerror(errno));
 		return 0;
 	}
-	if (!st.st_size)
-		return 0; /* empty file -- no grep hit */
 	if (!S_ISREG(st.st_mode))
 		return 0;
 	sz = xsize_t(st.st_size);
@@ -198,6 +196,7 @@ static int grep_file(struct grep_opt *opt, const char *filename)
 		return 0;
 	}
 	close(i);
+	data[sz] = 0;
 	if (opt->relative && opt->prefix_length)
 		filename = quote_path_relative(filename, -1, &buf, opt->prefix);
 	i = grep_buffer(opt, filename, data, sz);
@@ -223,7 +222,7 @@ static int grep_cache(struct grep_opt *opt, const char **paths, int cached)
 		 * are identical, even if worktree file has been modified, so use
 		 * cache version instead
 		 */
-		if (cached || (ce->ce_flags & CE_VALID)) {
+		if (cached || (ce->ce_flags & CE_VALID) || ce_skip_worktree(ce)) {
 			if (ce_stage(ce))
 				continue;
 			hit |= grep_sha1(opt, ce->sha1, ce->name, 0);

Ah. That middle hunk seems to be the culprit. But that probably means we
were looking at uninitialized memory before, and the lookahead code was
always wrong (but got lucky when there was a non-zero byte in that final
slot). :-/

So probably the issue is the changes from a26345b608 (grep: optimize
built-in grep by skipping lines that do not hit, 2010-01-10).

I'll stop digging on it for now (but adding Junio to the cc as the
author there). Probably it would have been faster just to start with a
debugger than to look through the history. ;)

-Peff

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: git grep: ^$ false match at end of file
  2025-01-10 11:43 ` Jeff King
@ 2025-01-10 12:02   ` Jeff King
  2025-01-10 12:59     ` Andreas Schwab
  0 siblings, 1 reply; 5+ messages in thread
From: Jeff King @ 2025-01-10 12:02 UTC (permalink / raw)
  To: Olly Betts; +Cc: Junio C Hamano, git

On Fri, Jan 10, 2025 at 06:43:08AM -0500, Jeff King wrote:

> I'll stop digging on it for now (but adding Junio to the cc as the
> author there). Probably it would have been faster just to start with a
> debugger than to look through the history. ;)

OK, my curiosity got the better of me. This fixes it:

diff --git a/grep.c b/grep.c
index 4e155ee9e6..9eac3dd95d 100644
--- a/grep.c
+++ b/grep.c
@@ -1470,10 +1470,12 @@ static int look_ahead(struct grep_opt *opt,
 		hit = patmatch(p, bol, bol + *left_p, &m, 0);
 		if (hit < 0)
 			return -1;
 		if (!hit || m.rm_so < 0 || m.rm_eo < 0)
 			continue;
+		if (m.rm_so == *left_p)
+			continue; /* don't match nothing */
 		if (earliest < 0 || m.rm_so < earliest)
 			earliest = m.rm_so;
 	}
 
 	if (earliest < 0) {

but it is weird to me that patmatch() will match "^$" to the end of the
buffer at all. It is just calling regexec_buf() behind the scenes, so I
guess this is just a weird special case there, and may even depend on
the regex implementation. If I pass "-P" to use pcre instead, the
problem goes away even without my patch.

If we skip look-ahead the problem also goes away. I'd have thought
match_line() would have the same problem, but there we process line by
line, and regexec_buf() never even sees the newline.

So I guess the rationale is: some regexec implementations are weird
about this special regex, and we should not trust their result with it
on a whole buffer with newlines.

-Peff

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: git grep: ^$ false match at end of file
  2025-01-10 12:02   ` Jeff King
@ 2025-01-10 12:59     ` Andreas Schwab
  2025-01-13  6:26       ` Jeff King
  0 siblings, 1 reply; 5+ messages in thread
From: Andreas Schwab @ 2025-01-10 12:59 UTC (permalink / raw)
  To: Jeff King; +Cc: Olly Betts, Junio C Hamano, git

On Jan 10 2025, Jeff King wrote:

> but it is weird to me that patmatch() will match "^$" to the end of the
> buffer at all. It is just calling regexec_buf() behind the scenes, so I
> guess this is just a weird special case there, and may even depend on
> the regex implementation.

Shouldn't the matcher be called with REG_NOTEOL in that case?

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: git grep: ^$ false match at end of file
  2025-01-10 12:59     ` Andreas Schwab
@ 2025-01-13  6:26       ` Jeff King
  0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2025-01-13  6:26 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Olly Betts, Junio C Hamano, git

On Fri, Jan 10, 2025 at 01:59:18PM +0100, Andreas Schwab wrote:

> On Jan 10 2025, Jeff King wrote:
> 
> > but it is weird to me that patmatch() will match "^$" to the end of the
> > buffer at all. It is just calling regexec_buf() behind the scenes, so I
> > guess this is just a weird special case there, and may even depend on
> > the regex implementation.
> 
> Shouldn't the matcher be called with REG_NOTEOL in that case?

Perhaps. If regexec_buf() is assuming we are feeding lines, then without
REG_NOTEOL it thinks the end of the buffer is the end of a line. Which
makes sense, but trips up this case because we are not feeding lines,
but rather a whole buffer. So the final newline is not the start of an
empty line, but the true end of the buffer.

But what if the buffer doesn't end in a newline? In the example, the
file is something like "content\n".  But what if it was just "content"?
Then the end of the buffer really is the end of a line, isn't it? And
REG_NOTEOL would not be appropriate.

So without REG_NOTEOL:

  [this is wrong, per the report]
  $ echo content >file.txt
  $ git grep --no-index -n '^$' file.txt
  file.txt:2:

  [this is right]
  $ printf content >file.txt
  $ git grep --no-index -n '^$' file.txt
  $ echo $?
  1

and with it, like this patch:

diff --git a/grep.c b/grep.c
index 4e155ee9e6..7e3b6d9474 100644
--- a/grep.c
+++ b/grep.c
@@ -1467,7 +1467,7 @@ static int look_ahead(struct grep_opt *opt,
 		int hit;
 		regmatch_t m;
 
-		hit = patmatch(p, bol, bol + *left_p, &m, 0);
+		hit = patmatch(p, bol, bol + *left_p, &m, REG_NOTEOL);
 		if (hit < 0)
 			return -1;
 		if (!hit || m.rm_so < 0 || m.rm_eo < 0)

we get:

  [this is now right]
  $ git grep --no-index -n '^$' file.txt
  $ echo $?
  1

  [and this stays right]
  $ printf content >file.txt
  $ git grep --no-index -n '^$' file.txt
  $ echo $?
  1

but:

  [without REG_NOTEOL, this matches]
  $ printf content >file.txt
  $ git grep --no-index -n 't$' file.txt
  file.txt:1:content

  [but with that flag, it no longer does]
  $ printf content >file.txt
  $ git grep --no-index -n 't$' file.txt
  $ echo $?
  1

So I do think "\n" at the end of the buffer is a special case. Perhaps
we should always omit it, and then leave REG_NOTEOL unset, making the
end of the buffer consistently the end of the final line. Like this,
which no longer matches "^$" but does match "t$":

diff --git a/grep.c b/grep.c
index 4e155ee9e6..c4bb9f1081 100644
--- a/grep.c
+++ b/grep.c
@@ -1646,6 +1646,8 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 
 	bol = gs->buf;
 	left = gs->size;
+	if (left && gs->buf[left-1] == '\n')
+		left--;
 	while (left) {
 		const char *eol;
 		int hit;

-Peff

^ permalink raw reply related	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-01-13  6:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-09 23:52 git grep: ^$ false match at end of file Olly Betts
2025-01-10 11:43 ` Jeff King
2025-01-10 12:02   ` Jeff King
2025-01-10 12:59     ` Andreas Schwab
2025-01-13  6:26       ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).