git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Lidong Yan <yldhome2d2@gmail.com>
Cc: git@vger.kernel.org,  stolee@gmail.com,  ttaylorr@github.com
Subject: Re: [PATCH v2] bloom: enable bloom filter with wildcard pathspec in revision traversal
Date: Fri, 08 Aug 2025 08:50:39 -0700	[thread overview]
Message-ID: <xmqqsei1izhs.fsf@gitster.g> (raw)
In-Reply-To: <20250808065834.22743-1-yldhome2d2@gmail.com> (Lidong Yan's message of "Fri, 8 Aug 2025 14:58:34 +0800")

Lidong Yan <yldhome2d2@gmail.com> writes:

>  static int forbid_bloom_filters(struct pathspec *spec)
>  {
> -	if (spec->has_wildcard)
> -		return 1;
> -	if (spec->magic & ~PATHSPEC_LITERAL)
> +	unsigned int allowed_magic =
> +		PATHSPEC_FROMTOP |
> +		PATHSPEC_MAXDEPTH |
> +		PATHSPEC_LITERAL |
> +		PATHSPEC_GLOB |
> +		PATHSPEC_ATTR;
> +
> +	if (spec->magic & ~allowed_magic)
>  		return 1;
>  	for (size_t nr = 0; nr < spec->nr; nr++)
> -		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
> +		if (spec->items[nr].magic & ~allowed_magic)
>  			return 1;

Quite straight-forward and easy to see that this is a simple
enhancement of the existing code.  Good.

> @@ -693,9 +698,22 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out,
>  	size_t len;
>  	int res = 0;
>  
> +	len = pi->nowildcard_len;
> +	if (len != pi->len) {
> +		/*
> +		 * for path like "/dir/file*", nowildcard part would be
> +		 * "/dir/file", but only "/dir" should be used for the

Leading "/" makes it look as if the pathspec element can begin with
a slash, but it can not, can it?

> +		 * bloom filter
> +		 */
> +		while (len > 0 && pi->match[len - 1] != '/')
> +			len--;

In a tree that has both "builtin/" directory and "builtin.h" file,
what pathspec_element do we get here when we run

	$ git log "builtin*"

We need to be able to catch commits that touch anything in
"builtin/" and also anything at the top-level that begins with
"builtin", like "builtin.h" and other things that might have existed
ever in the history.  I think len goes down to 0 and that is the
correct behaviour.

The code does deal with the case where len is reduced down to zero,
but a bit poorly.  It would allocate a zero-length string, then
realize it frees it, and returns -1.  As it must know how long the
resulting path is before it allocated, and by definition len is
pi->len if it did not have to allocate, it should be able to take
the "goto cleanup" code path before it even attempts to allocate,
and it should not have to do strlen().  But the current code is not
incorrect.

> +	}
>  	/* remove single trailing slash from path, if needed */
> -	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> -		path_alloc = xmemdupz(pi->match, pi->len - 1);
> +	if (len > 0 && pi->match[len - 1] == '/')
> +		len--;
> +
> +	if (len != pi->len) {
> +		path_alloc = xmemdupz(pi->match, len);
>  		path = path_alloc;
>  	} else
>  		path = pi->match;

> +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' '
> +	test_bloom_filters_used "-- A/\* file4" &&

We'd ask the Bloom filter: skip commits that you know that can never
be touching either "A/(anything)" or "file4".

> +	test_bloom_filters_used "-- A/file\*" &&

We'd ask the Bloom filter: skip commits that you know that can never
be touching either "A/".

> +	test_bloom_filters_used "-- * A/\*"

What do we ask?  The second one says that a commit that might touch
"A/(something)" is worth investigating, but what about the first
one?  

Ahhh, that one is not quoted, so the shell expands to existing files
(which presumably do not have any wildcard characters).  OK.  If the
lone * were quoted, we shouldn't be using the Bloom filter.

> +'
> +
> +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' '
>  	test_bloom_filters_not_used "-- file\*" &&

OK, this is exactly the case I wondered about in the above wrt "builtin*"
and I agree with the expectation of this test.

> -	test_bloom_filters_not_used "-- A/\* file4" &&
> -	test_bloom_filters_not_used "-- file4 A/\*" &&
> -	test_bloom_filters_not_used "-- * A/\*"
> +	test_bloom_filters_not_used "-- file\* A/\*" &&

This one I understand.  The first one reduces len down to 0 in the
wildcard stripping loop, and makes us say "a commit that might touch
any path is worth investigating", which amounts to the same thing as
not using the Bloom filter at all.

> +	test_bloom_filters_not_used "-- file\* *" &&

Ditto.  Even if the unquoted * may expand to many concrete paths,
"file\*" that can match any path that begins with "file" that ever
have existed in the history would not help skipping any commit.

> +	test_bloom_filters_not_used "-- \*"

Ditto.

> +'
> +
> +test_expect_success 'git log with path contains various magic signatures' '
> +	cd A &&
> +	test_bloom_filters_used "-- \:\(top\)B" &&
> +	cd .. &&
> +
> +	test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" &&
> +	test_bloom_filters_not_used "-- \:\(icase\)FILE4" &&
> +	test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" &&
> +
> +	test_when_finished "rm -f .gitattributes" &&
> +	cat >.gitattributes <<-EOF &&
> +	A/file1 text
> +	A/B/file2 -text
> +	EOF
> +	test_bloom_filters_used "-- \:\(attr\:text\)A"
>  '

OK.  Good to see negative test cases here, which we sometimes forget
to test when showing off our shiny new toy.

Taking what I suggested above, here is a possible improvement.

 revision.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git i/revision.c w/revision.c
index 2a5b98390e..2a92bdda84 100644
--- i/revision.c
+++ w/revision.c
@@ -696,14 +696,14 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out,
 	char *path_alloc = NULL;
 	const char *path;
 	size_t len;
-	int res = 0;
+	int res = -1; /* be pessimistic */
 
 	len = pi->nowildcard_len;
 	if (len != pi->len) {
 		/*
-		 * for path like "/dir/file*", nowildcard part would be
-		 * "/dir/file", but only "/dir" should be used for the
-		 * bloom filter
+		 * for path like "dir/file*", nowildcard part would be
+		 * "dir/file", but only "dir" should be used for the
+		 * bloom filter.
 		 */
 		while (len > 0 && pi->match[len - 1] != '/')
 			len--;
@@ -712,19 +712,17 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out,
 	if (len > 0 && pi->match[len - 1] == '/')
 		len--;
 
+	if (!len)
+		goto cleanup;
+
 	if (len != pi->len) {
 		path_alloc = xmemdupz(pi->match, len);
 		path = path_alloc;
 	} else
 		path = pi->match;
 
-	len = strlen(path);
-	if (!len) {
-		res = -1;
-		goto cleanup;
-	}
-
 	*out = bloom_keyvec_new(path, len, settings);
+	res = 0;
 
 cleanup:
 	free(path_alloc);

  reply	other threads:[~2025-08-08 15:50 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-07  5:12 [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal Lidong Yan
2025-08-07  6:49 ` Patrick Steinhardt
2025-08-07  8:59   ` Lidong Yan
2025-08-07 16:15   ` Junio C Hamano
2025-08-08  6:40     ` Lidong Yan
2025-08-08  6:58 ` [PATCH v2] " Lidong Yan
2025-08-08 15:50   ` Junio C Hamano [this message]
2025-08-09  2:06     ` Lidong Yan
2025-08-09  2:16     ` [PATCH v3] " Lidong Yan
2025-08-09  4:22       ` [PATCH v4] " Lidong Yan
2025-08-09  7:40         ` Lidong Yan
2025-08-11  6:01         ` [PATCH v5] " Lidong Yan
2025-08-11 15:56           ` Junio C Hamano
2025-08-11 16:08             ` Lidong Yan
2025-08-09 23:51       ` [PATCH v3] " Junio C Hamano
2025-08-10  1:57         ` Lidong Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqsei1izhs.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=stolee@gmail.com \
    --cc=ttaylorr@github.com \
    --cc=yldhome2d2@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).