git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] bloom: use bloom filter given multiple pathspec
@ 2025-06-25 12:55 Lidong Yan
  2025-06-25 12:55 ` [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
                   ` (5 more replies)
  0 siblings, 6 replies; 72+ messages in thread
From: Lidong Yan @ 2025-06-25 12:55 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan, Junio C Hamano, Kai Koponen

git won't use bloom filter for multiple pathspec, which makes the command
  git log -- file1 file2
significantly slower than
  git log -- file1 && git log -- file2

This issue is raised by Kai Koponen at
  https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@mail.gmail.com/

To fix this, revs->bloom_keys[] needs to become an array of bloom_keys[],
one for each literal pathspec element. For convenience, first commit
creates a new struct bloom_keyvec to hold all bloom keys for a single
pathspec. The second commit add for loop to check if any pathspec's keyvec
is contained in a commit's bloom filter, along with code that initialize
destory and test multiple pathspec bloom keyvecs.

With this change, testing on Kai's example shows that
  git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 -- src/clean.bash src/Make.dist
runs as fast as
  git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 -- src/Make.dist && \
  git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 -- src/clean.bash

Lidong Yan (2):
  bloom: replace struct bloom_key * with struct bloom_keyvec
  bloom: enable multiple pathspec bloom keys

 bloom.c              |  47 +++++++++++++++++
 bloom.h              |  14 +++++
 revision.c           | 121 ++++++++++++++++++++++++-------------------
 revision.h           |   5 +-
 t/t4216-log-bloom.sh |  10 ++--
 5 files changed, 137 insertions(+), 60 deletions(-)

-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
@ 2025-06-25 12:55 ` Lidong Yan
  2025-06-25 17:43   ` Junio C Hamano
  2025-06-25 12:55 ` [PATCH 2/2] bloom: enable multiple pathspec bloom keys Lidong Yan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-06-25 12:55 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan

struct rev_info uses bloom_keys and bloom_nr to store the bloom keys
corresponding to a single pathspec. To allow struct rev_info to store
the bloom keys for multiple pathspecs, a new data structure bloom_keyvec
is introduced. Each struct bloom_keyvec corresponds to a single pathspec.
In rev_info, an array bloom_keyvecs is used to store multiple bloom_keyvec
instances, along with its length bloom_keyvec_nr.

New bloom_keyvec_* functions are added to initialize, free, and access
elements in a keyvec. bloom_filter_contains_vec() is added to check
if all key in struct bloom_keyvec is contained in a bloom filter.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 bloom.h    | 14 ++++++++++++++
 revision.c | 31 ++++++++++++++++---------------
 revision.h |  5 +++--
 4 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..497bb44567 100644
--- a/bloom.c
+++ b/bloom.c
@@ -280,6 +280,36 @@ void deinit_bloom_filters(void)
 	deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
 }
 
+void bloom_keyvec_init(struct bloom_keyvec *v, size_t initial_size)
+{
+	ALLOC_ARRAY(v->keys, initial_size);
+	v->nr = initial_size;
+}
+
+void bloom_keyvec_clear(struct bloom_keyvec *v)
+{
+	size_t i;
+	if (!v->keys)
+		return;
+
+	for (i = 0; i < v->nr; i++)
+		clear_bloom_key(&v->keys[i]);
+
+	FREE_AND_NULL(v->keys);
+	v->nr = 0;
+}
+
+struct bloom_key *bloom_keyvec_at(const struct bloom_keyvec *v, size_t i)
+{
+	assert(i < v->nr);
+	return &v->keys[i];
+}
+
+bool bloom_keyvec_empty(const struct bloom_keyvec *v)
+{
+	return v->nr == 0 || !v->keys;
+}
+
 static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 		       const struct hashmap_entry *eptr,
 		       const struct hashmap_entry *entry_or_key,
@@ -540,3 +570,20 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 
 	return 1;
 }
+
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *keys,
+			      const struct bloom_filter_settings *settings)
+{
+	const struct bloom_key *key;
+	int i, ret;
+
+	for (i = 0; i < keys->nr; i++) {
+		key = &keys->keys[i];
+		ret = bloom_filter_contains(filter, key, settings);
+		if (ret <= 0)
+			return ret;
+	}
+
+	return 1;
+}
\ No newline at end of file
diff --git a/bloom.h b/bloom.h
index 6e46489a20..d556af7310 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,11 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+struct bloom_keyvec {
+	struct bloom_key *keys;
+	size_t nr;
+};
+
 int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
@@ -100,6 +105,11 @@ void add_key_to_filter(const struct bloom_key *key,
 void init_bloom_filters(void);
 void deinit_bloom_filters(void);
 
+void bloom_keyvec_init(struct bloom_keyvec *v, size_t initial_size);
+void bloom_keyvec_clear(struct bloom_keyvec *v);
+struct bloom_key *bloom_keyvec_at(const struct bloom_keyvec *v, size_t i);
+bool bloom_keyvec_empty(const struct bloom_keyvec *v);
+
 enum bloom_filter_computed {
 	BLOOM_NOT_COMPUTED = (1 << 0),
 	BLOOM_COMPUTED     = (1 << 1),
@@ -137,4 +147,8 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *keys,
+			      const struct bloom_filter_settings *settings);
+
 #endif
diff --git a/revision.c b/revision.c
index afee111196..cf7dc3b3fa 100644
--- a/revision.c
+++ b/revision.c
@@ -688,6 +688,7 @@ static int forbid_bloom_filters(struct pathspec *spec)
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
+	struct bloom_keyvec *bloom_keyvec;
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
@@ -736,10 +737,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		p++;
 	}
 
-	revs->bloom_keys_nr = path_component_nr;
-	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+	revs->bloom_keyvecs_nr = 1;
+	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+	bloom_keyvec = &revs->bloom_keyvecs[0];
+	bloom_keyvec_init(bloom_keyvec, path_component_nr);
 
-	fill_bloom_key(path, len, &revs->bloom_keys[0],
+	fill_bloom_key(path, len, bloom_keyvec_at(bloom_keyvec, 0),
 		       revs->bloom_filter_settings);
 	path_component_nr = 1;
 
@@ -747,7 +750,8 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	while (p > path) {
 		if (*p == '/')
 			fill_bloom_key(path, p - path,
-				       &revs->bloom_keys[path_component_nr++],
+				       bloom_keyvec_at(bloom_keyvec,
+						       path_component_nr++),
 				       revs->bloom_filter_settings);
 		p--;
 	}
@@ -779,11 +783,8 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
-		result = bloom_filter_contains(filter,
-					       &revs->bloom_keys[j],
-					       revs->bloom_filter_settings);
-	}
+	result = bloom_filter_contains_vec(filter, &revs->bloom_keyvecs[0],
+					   revs->bloom_filter_settings);
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -823,7 +824,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_keys_nr && !nth_parent) {
+	if (revs->bloom_keyvecs_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
@@ -850,7 +851,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
 	if (!t1)
 		return 0;
 
-	if (!nth_parent && revs->bloom_keys_nr) {
+	if (!nth_parent && revs->bloom_keyvecs_nr) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 		if (!bloom_ret)
 			return 1;
@@ -3230,10 +3231,10 @@ void release_revisions(struct rev_info *revs)
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
 
-	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		clear_bloom_key(&revs->bloom_keys[i]);
-	FREE_AND_NULL(revs->bloom_keys);
-	revs->bloom_keys_nr = 0;
+	for (int i = 0; i < revs->bloom_keyvecs_nr; i++)
+		bloom_keyvec_clear(&revs->bloom_keyvecs[i]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..88167b710c 100644
--- a/revision.h
+++ b/revision.h
@@ -63,6 +63,7 @@ struct rev_info;
 struct string_list;
 struct saved_parents;
 struct bloom_key;
+struct bloom_keyvec;
 struct bloom_filter_settings;
 struct option;
 struct parse_opt_ctx_t;
@@ -360,8 +361,8 @@ struct rev_info {
 
 	/* Commit graph bloom filter fields */
 	/* The bloom filter key(s) for the pathspec */
-	struct bloom_key *bloom_keys;
-	int bloom_keys_nr;
+	struct bloom_keyvec *bloom_keyvecs;
+	int bloom_keyvecs_nr;
 
 	/*
 	 * The bloom filter settings used to generate the key.
-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
  2025-06-25 12:55 ` [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-06-25 12:55 ` Lidong Yan
  2025-06-27 13:50   ` Junio C Hamano
  2025-06-27 20:39   ` Junio C Hamano
  2025-06-25 17:32 ` [PATCH 0/2] bloom: use bloom filter given multiple pathspec Junio C Hamano
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 72+ messages in thread
From: Lidong Yan @ 2025-06-25 12:55 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan

Remove `if (spec->nr > 1)` to enable bloom filter given multiple
pathspec. Wrapped for loop around code in prepare_to_use_bloom_filter()
to initialize each pathspec's struct bloom_keyvec. Add for loop
in check_maybe_different_in_bloom_filter() to find if at least one
pathspec's bloom_keyvec is contained in bloom filter.

Add new function release_revisions_bloom_keyvecs() to free all bloom
keyvec owned by rev_info.

Modify t/t4216 to test if bloom filter is still used given multiple
pathspec.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 revision.c           | 122 ++++++++++++++++++++++++-------------------
 t/t4216-log-bloom.sh |  10 ++--
 2 files changed, 73 insertions(+), 59 deletions(-)

diff --git a/revision.c b/revision.c
index cf7dc3b3fa..8818f017f3 100644
--- a/revision.c
+++ b/revision.c
@@ -675,8 +675,6 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
 	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
@@ -685,6 +683,8 @@ static int forbid_bloom_filters(struct pathspec *spec)
 	return 0;
 }
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
@@ -692,7 +692,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
-	int path_component_nr = 1;
+	int path_component_nr;
 
 	if (!revs->commits)
 		return;
@@ -709,51 +709,53 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	pi = &revs->pruning.pathspec.items[0];
-
-	/* remove single trailing slash from path, if needed */
-	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-		path_alloc = xmemdupz(pi->match, pi->len - 1);
-		path = path_alloc;
-	} else
-		path = pi->match;
-
-	len = strlen(path);
-	if (!len) {
-		revs->bloom_filter_settings = NULL;
-		free(path_alloc);
-		return;
-	}
-
-	p = path;
-	while (*p) {
-		/*
-		 * At this point, the path is normalized to use Unix-style
-		 * path separators. This is required due to how the
-		 * changed-path Bloom filters store the paths.
-		 */
-		if (*p == '/')
-			path_component_nr++;
-		p++;
-	}
-
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-	bloom_keyvec = &revs->bloom_keyvecs[0];
-	bloom_keyvec_init(bloom_keyvec, path_component_nr);
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		pi = &revs->pruning.pathspec.items[i];
+		path_component_nr = 1;
+
+		/* remove single trailing slash from path, if needed */
+		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+			path_alloc = xmemdupz(pi->match, pi->len - 1);
+			path = path_alloc;
+		} else
+			path = pi->match;
+
+		len = strlen(path);
+		if (!len)
+			goto fail;
+
+		p = path;
+		while (*p) {
+			/*
+			 * At this point, the path is normalized to use
+			 * Unix-style path separators. This is required due to
+			 * how the changed-path Bloom filters store the paths.
+			 */
+			if (*p == '/')
+				path_component_nr++;
+			p++;
+		}
 
-	fill_bloom_key(path, len, bloom_keyvec_at(bloom_keyvec, 0),
-		       revs->bloom_filter_settings);
-	path_component_nr = 1;
+		bloom_keyvec = &revs->bloom_keyvecs[i];
+		bloom_keyvec_init(bloom_keyvec, path_component_nr);
+
+		fill_bloom_key(path, len, bloom_keyvec_at(bloom_keyvec, 0),
+			       revs->bloom_filter_settings);
+		path_component_nr = 1;
+
+		p = path + len - 1;
+		while (p > path) {
+			if (*p == '/')
+				fill_bloom_key(path, p - path,
+					       bloom_keyvec_at(bloom_keyvec,
+							       path_component_nr++),
+					       revs->bloom_filter_settings);
+			p--;
+		}
 
-	p = path + len - 1;
-	while (p > path) {
-		if (*p == '/')
-			fill_bloom_key(path, p - path,
-				       bloom_keyvec_at(bloom_keyvec,
-						       path_component_nr++),
-				       revs->bloom_filter_settings);
-		p--;
+		FREE_AND_NULL(path_alloc);
 	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
@@ -761,14 +763,19 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		bloom_filter_atexit_registered = 1;
 	}
 
+	return;
+
+fail:
+	revs->bloom_filter_settings = NULL;
 	free(path_alloc);
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result = 1, j;
+	int result = 0;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -783,8 +790,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	result = bloom_filter_contains_vec(filter, &revs->bloom_keyvecs[0],
-					   revs->bloom_filter_settings);
+	for (int i = 0; !result && i < revs->bloom_keyvecs_nr; i++) {
+		result = bloom_filter_contains_vec(filter,
+						   &revs->bloom_keyvecs[i],
+						   revs->bloom_filter_settings);
+	}
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -3202,6 +3212,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
 
 static void release_revisions_topo_walk_info(struct topo_walk_info *info);
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+	for (int i = 0; i < revs->bloom_keyvecs_nr; i++)
+		bloom_keyvec_clear(&revs->bloom_keyvecs[i]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
+}
+
 static void free_void_commit_list(void *list)
 {
 	free_commit_list(list);
@@ -3230,11 +3248,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
-
-	for (int i = 0; i < revs->bloom_keyvecs_nr; i++)
-		bloom_keyvec_clear(&revs->bloom_keyvecs[i]);
-	FREE_AND_NULL(revs->bloom_keyvecs);
-	revs->bloom_keyvecs_nr = 0;
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..46d1900a21 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -138,8 +138,8 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
+test_expect_success 'git log -- multiple path specs use Bloom filters' '
+	test_bloom_filters_used "-- file4 A/file1"
 '
 
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
@@ -151,9 +151,9 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with wildcard that resolves to a multiple paths uses Bloom filters' '
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/2] bloom: use bloom filter given multiple pathspec
  2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
  2025-06-25 12:55 ` [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
  2025-06-25 12:55 ` [PATCH 2/2] bloom: enable multiple pathspec bloom keys Lidong Yan
@ 2025-06-25 17:32 ` Junio C Hamano
  2025-06-26  3:34   ` Lidong Yan
  2025-06-27  6:21 ` [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Lidong Yan
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-06-25 17:32 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, Lidong Yan, Kai Koponen

Lidong Yan <yldhome2d2@gmail.com> writes:

> git won't use bloom filter for multiple pathspec, which makes the command

Let's get the terminology straight.  A pathspec consists of one or
more pathspec elements (or pathspec items).

Also, "git won't" is overly general.  The series title shares the
same issue ("given multiple pathspec" does not even hint that this
is about revision traversal---you are not making filter used with
pathspec with more than one element in other code paths).

Perhaps like:

    The revision traversal limited by pathspec has optimization when
    the pathspec has only one element, it does not use any pathspec
    magic (other than literal), and there is no wildcard.

    While it is much harder to lift the latter two limitations,
    supporting a pathspec with multiple elements is relatively easy.
    Just make sure we hash each of them separately and ask the bloom
    filter about them, and if we see none of them can possibly be
    affected by the commit, we can skip without tree comparison.

or something along that line?

>   git log -- file1 file2
> significantly slower than
>   git log -- file1 && git log -- file2
>
> This issue is raised by Kai Koponen at
>   https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@mail.gmail.com/
>
> To fix this, revs->bloom_keys[] needs to become an array of bloom_keys[],
> one for each literal pathspec element. For convenience, first commit
> creates a new struct bloom_keyvec to hold all bloom keys for a single
> pathspec. The second commit add for loop to check if any pathspec's keyvec
> is contained in a commit's bloom filter, along with code that initialize
> destory and test multiple pathspec bloom keyvecs.

It is nice to outline an approach to the solution one day, and see
an almost exact implementation of it appear on the list a few days
later.  I wish all the development would go like this ;-)

>  bloom.c              |  47 +++++++++++++++++
>  bloom.h              |  14 +++++
>  revision.c           | 121 ++++++++++++++++++++++++-------------------
>  revision.h           |   5 +-
>  t/t4216-log-bloom.sh |  10 ++--

Can we have a set of real tests to make sure that the updated filter
code still identifies commits that touch the files without false
negatives?  False positives are OK as we will follow them with real
tree comparison to determine what exactly got changed, but false
negatives are absolute no-no.

Testing to see that the filter code path is activated is much less
interesting than the filter code path still functions correctly with
these changes presented here.  I have a feeling that with the
changes to the test in this series, you wouldn't even find a bug
where you simply added subpaths for all pathspec elements into a
single array and use the original "bloom has to say 'possibly yes'
to all array elements" logic (which would incorrectly require that
both file1 and file2 must be modified).

Thanks.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-06-25 12:55 ` [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-06-25 17:43   ` Junio C Hamano
  2025-06-26  3:44     ` Lidong Yan
  0 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-06-25 17:43 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, Lidong Yan

Lidong Yan <yldhome2d2@gmail.com> writes:

> +void bloom_keyvec_init(struct bloom_keyvec *v, size_t initial_size)
> +{
> +	ALLOC_ARRAY(v->keys, initial_size);
> +	v->nr = initial_size;
> +}

Hmph, does this ever grow once initialized?  When I outlined the
solution in the earlier discussion, I was wondering a structure that
looks more like

	struct bloom_keyvec {
		size_t count;
		struct bloom_key key[FLEX_ARRAY];
	};

Also when your primary use of an array is to use one element at a
time (as opposed to the entire array as a single "collection"), name
it singular, so that key[4] is more naturally understood as "4-th
key", not keys[4].

> +void bloom_keyvec_clear(struct bloom_keyvec *v)
> +{
> +	size_t i;
> +	if (!v->keys)
> +		return;
> +
> +	for (i = 0; i < v->nr; i++)
> +		clear_bloom_key(&v->keys[i]);

By doing

	for (size_t nr; nr < v->nr; nr++)

you can

 - lose the separate local variable definition at the beginning;
 - avoid confusing "i", which hints to be an "int", to be of type "size_t"
 - limit the scope of "nr" a bit tigher.

If you make keyvec an fixed flex-array, the below would become
unnecessary (and the check for NULL-ness of .keys[] array).

> +
> +	FREE_AND_NULL(v->keys);
> +	v->nr = 0;
> +}

> +struct bloom_key *bloom_keyvec_at(const struct bloom_keyvec *v, size_t i)
> +{

Ditto about abusing the name 'i'.

> +			return ret;
> +	}
> +
> +	return 1;
> +}
> \ No newline at end of file

Tell your editor to be more careful, perhaps?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/2] bloom: use bloom filter given multiple pathspec
  2025-06-25 17:32 ` [PATCH 0/2] bloom: use bloom filter given multiple pathspec Junio C Hamano
@ 2025-06-26  3:34   ` Lidong Yan
  2025-06-26 14:15     ` Junio C Hamano
  0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-06-26  3:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Kai Koponen

Junio C Hamano <gitster@pobox.com> writes:
> 
> Lidong Yan <yldhome2d2@gmail.com> writes:
> 
>> git won't use bloom filter for multiple pathspec, which makes the command
> 
> Let's get the terminology straight.  A pathspec consists of one or
> more pathspec elements (or pathspec items).

Thanks for the clarification. I will be more precise with the terminology in v2.

> Also, "git won't" is overly general.  The series title shares the
> same issue ("given multiple pathspec" does not even hint that this
> is about revision traversal---you are not making filter used with
> pathspec with more than one element in other code paths).
> 
> Perhaps like:
> 
>    The revision traversal limited by pathspec has optimization when
>    the pathspec has only one element, it does not use any pathspec
>    magic (other than literal), and there is no wildcard.
> 
>    While it is much harder to lift the latter two limitations,
>    supporting a pathspec with multiple elements is relatively easy.
>    Just make sure we hash each of them separately and ask the bloom
>    filter about them, and if we see none of them can possibly be
>    affected by the commit, we can skip without tree comparison.
> 
> or something along that line?
> 

What you wrote makes perfect sense to me, I’ll just copy and paste
those paragraphs into my cover letter. And the title would be
  "bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal"

> Can we have a set of real tests to make sure that the updated filter
> code still identifies commits that touch the files without false
> negatives?  False positives are OK as we will follow them with real
> tree comparison to determine what exactly got changed, but false
> negatives are absolute no-no.
> 
> Testing to see that the filter code path is activated is much less
> interesting than the filter code path still functions correctly with
> these changes presented here.  I have a feeling that with the
> changes to the test in this series, you wouldn't even find a bug
> where you simply added subpaths for all pathspec elements into a
> single array and use the original "bloom has to say 'possibly yes'
> to all array elements" logic (which would incorrectly require that
> both file1 and file2 must be modified).

I assume that t4216/test_bloom_filters_used has already verified that
using bloom filters with multiple pathspec elements produces the same
results as when bloom filters are not used. But I would love to add more
test cases to check no false negative happened.

Thanks,
Lidong

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-06-25 17:43   ` Junio C Hamano
@ 2025-06-26  3:44     ` Lidong Yan
  0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-06-26  3:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes
> 
> Lidong Yan <yldhome2d2@gmail.com> writes:
> 
>> +void bloom_keyvec_init(struct bloom_keyvec *v, size_t initial_size)
>> +{
>> + ALLOC_ARRAY(v->keys, initial_size);
>> + v->nr = initial_size;
>> +}
> 
> Hmph, does this ever grow once initialized?  When I outlined the
> solution in the earlier discussion, I was wondering a structure that
> looks more like
> 
> struct bloom_keyvec {
> size_t count;
> struct bloom_key key[FLEX_ARRAY];
> };

I understand. If we don't need to resize the array dynamically, we should
allocate the array elements and the array length together to gain benefits
such as spatial locality.

> Also when your primary use of an array is to use one element at a
> time (as opposed to the entire array as a single "collection"), name
> it singular, so that key[4] is more naturally understood as "4-th
> key", not keys[4].

Got it.

> 
>> +void bloom_keyvec_clear(struct bloom_keyvec *v)
>> +{
>> + size_t i;
>> + if (!v->keys)
>> + return;
>> +
>> + for (i = 0; i < v->nr; i++)
>> + clear_bloom_key(&v->keys[i]);
> 
> By doing
> 
> for (size_t nr; nr < v->nr; nr++)
> 
> you can
> 
> - lose the separate local variable definition at the beginning;
> - avoid confusing "i", which hints to be an "int", to be of type "size_t"
> - limit the scope of "nr" a bit tigher.

Got it.

> 
> If you make keyvec an fixed flex-array, the below would become
> unnecessary (and the check for NULL-ness of .keys[] array).
> 
>> +
>> + FREE_AND_NULL(v->keys);
>> + v->nr = 0;
>> +}

Another benefit to use flex-array.

> 
>> +struct bloom_key *bloom_keyvec_at(const struct bloom_keyvec *v, size_t i)
>> +{
> 
> Ditto about abusing the name 'i'.
> 
>> + return ret;
>> + }
>> +
>> + return 1;
>> +}
>> \ No newline at end of file
> 
> Tell your editor to be more careful, perhaps?

Seems git clang-format doesn’t add newline for me, I will try
.editconfig next time.

Thanks,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/2] bloom: use bloom filter given multiple pathspec
  2025-06-26  3:34   ` Lidong Yan
@ 2025-06-26 14:15     ` Junio C Hamano
  0 siblings, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-06-26 14:15 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, Kai Koponen

Lidong Yan <502024330056@smail.nju.edu.cn> writes:

> I assume that t4216/test_bloom_filters_used has already verified that
> using bloom filters with multiple pathspec elements produces the same
> results as when bloom filters are not used. But I would love to add more
> test cases to check no false negative happened.

If we are sure we have enough test coverage, then it is great. If
not, it is superb if you can add some test coverage there.

Thanks.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
                   ` (2 preceding siblings ...)
  2025-06-25 17:32 ` [PATCH 0/2] bloom: use bloom filter given multiple pathspec Junio C Hamano
@ 2025-06-27  6:21 ` Lidong Yan
  2025-06-28  4:21   ` [PATCH v3 " Lidong Yan
                     ` (2 more replies)
  2025-06-27  6:21 ` [PATCH v2 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
  2025-06-27  6:21 ` [PATCH v2 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
  5 siblings, 3 replies; 72+ messages in thread
From: Lidong Yan @ 2025-06-27  6:21 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan, Junio C Hamano, Kai Koponen

The revision traversal limited by pathspec has optimization when
the pathspec has only one element, it does not use any pathspec
magic (other than literal), and there is no wildcard. The absence
of optimization for multiple pathspec elements in revision traversal
cause an issue raised by Kai Koponen at
  https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@mail.gmail.com/

While it is much harder to lift the latter two limitations,
supporting a pathspec with multiple elements is relatively easy.
Just make sure we hash each of them separately and ask the bloom
filter about them, and if we see none of them can possibly be
affected by the commit, we can skip without tree comparison.

First commit creates a new data structure `struct bloom_keyvec` to hold
all bloom keys for a single pathspec item. Second commit add for loop
to check if any pathspec item's keyvec is contained in a commit's bloom
filter.

With this change, testing on Kai's example shows that
  git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 -- src/clean.bash src/Make.dist
runs as fast as
  git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 -- src/Make.dist && \
  git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 -- src/clean.bash

Lidong Yan (2):
  bloom: replace struct bloom_key * with struct bloom_keyvec
  bloom: optimize multiple pathspec items in revision traversal

 bloom.c              |  31 +++++++++++
 bloom.h              |  20 +++++++
 revision.c           | 121 ++++++++++++++++++++++++-------------------
 revision.h           |   6 +--
 t/t4216-log-bloom.sh |  10 ++--
 5 files changed, 127 insertions(+), 61 deletions(-)

-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v2 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
                   ` (3 preceding siblings ...)
  2025-06-27  6:21 ` [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Lidong Yan
@ 2025-06-27  6:21 ` Lidong Yan
  2025-06-27  6:21 ` [PATCH v2 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
  5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-06-27  6:21 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan

The revision traversal limited by pathspec has optimization when
the pathspec has only one element. To support optimization for
multiple pathspec items, we need to modify the data structures
in struct rev_info.

struct rev_info uses bloom_keys and bloom_nr to store the bloom keys
corresponding to a single pathspec item. To allow struct rev_info
to store bloom keys for multiple pathspec items, a new data structure
`struct bloom_keyvec` is introduced. Each `struct bloom_keyvec`
corresponds to a single pathspec item.

In `struct rev_info`, replace bloom_keys and bloom_nr with bloom_keyvecs
and bloom_keyvec_nr. This commit still optimize one pathspec item, thus
bloom_keyvec_nr can only be 0 or 1.

New *_bloom_keyvec functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
contained in a bloom filter. fill_bloom_keyvec_key() is added to
initialize a key in keyvec.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c    | 31 +++++++++++++++++++++++++++++++
 bloom.h    | 20 ++++++++++++++++++++
 revision.c | 36 ++++++++++++++++++------------------
 revision.h |  6 +++---
 4 files changed, 72 insertions(+), 21 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..8259cfce51 100644
--- a/bloom.c
+++ b/bloom.c
@@ -280,6 +280,25 @@ void deinit_bloom_filters(void)
 	deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
 }
 
+struct bloom_keyvec *create_bloom_keyvec(size_t count)
+{
+	struct bloom_keyvec *vec;
+	size_t sz = sizeof(struct bloom_keyvec);
+	sz += count * sizeof(struct bloom_key);
+	vec = (struct bloom_keyvec *)xcalloc(1, sz);
+	vec->count = count;
+	return vec;
+}
+
+void destroy_bloom_keyvec(struct bloom_keyvec *vec)
+{
+	if (!vec)
+		return;
+	for (size_t nr = 0; nr < vec->count; nr++)
+		clear_bloom_key(&vec->key[nr]);
+	free(vec);
+}
+
 static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 		       const struct hashmap_entry *eptr,
 		       const struct hashmap_entry *entry_or_key,
@@ -540,3 +559,15 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 
 	return 1;
 }
+
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *vec,
+			      const struct bloom_filter_settings *settings)
+{
+	int ret = 1;
+
+	for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
+		ret = bloom_filter_contains(filter, &vec->key[nr], settings);
+
+	return ret;
+}
diff --git a/bloom.h b/bloom.h
index 6e46489a20..9e4e832c8c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,11 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+struct bloom_keyvec {
+	size_t count;
+	struct bloom_key key[FLEX_ARRAY];
+};
+
 int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
@@ -100,6 +105,17 @@ void add_key_to_filter(const struct bloom_key *key,
 void init_bloom_filters(void);
 void deinit_bloom_filters(void);
 
+struct bloom_keyvec *create_bloom_keyvec(size_t count);
+void destroy_bloom_keyvec(struct bloom_keyvec *vec);
+
+static inline void fill_bloom_keyvec_key(const char *data, size_t len,
+					 struct bloom_keyvec *vec, size_t nr,
+					 const struct bloom_filter_settings *settings)
+{
+	assert(nr < vec->count);
+	fill_bloom_key(data, len, &vec->key[nr], settings);
+}
+
 enum bloom_filter_computed {
 	BLOOM_NOT_COMPUTED = (1 << 0),
 	BLOOM_COMPUTED     = (1 << 1),
@@ -137,4 +153,8 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *v,
+			      const struct bloom_filter_settings *settings);
+
 #endif
diff --git a/revision.c b/revision.c
index afee111196..3aa544c137 100644
--- a/revision.c
+++ b/revision.c
@@ -688,6 +688,7 @@ static int forbid_bloom_filters(struct pathspec *spec)
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
+	struct bloom_keyvec *bloom_keyvec;
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
@@ -736,19 +737,21 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		p++;
 	}
 
-	revs->bloom_keys_nr = path_component_nr;
-	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+	revs->bloom_keyvecs_nr = 1;
+	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+	bloom_keyvec = create_bloom_keyvec(path_component_nr);
+	revs->bloom_keyvecs[0] = bloom_keyvec;
 
-	fill_bloom_key(path, len, &revs->bloom_keys[0],
-		       revs->bloom_filter_settings);
+	fill_bloom_keyvec_key(path, len, bloom_keyvec, 0,
+			      revs->bloom_filter_settings);
 	path_component_nr = 1;
 
 	p = path + len - 1;
 	while (p > path) {
 		if (*p == '/')
-			fill_bloom_key(path, p - path,
-				       &revs->bloom_keys[path_component_nr++],
-				       revs->bloom_filter_settings);
+			fill_bloom_keyvec_key(path, p - path, bloom_keyvec,
+					      path_component_nr++,
+					      revs->bloom_filter_settings);
 		p--;
 	}
 
@@ -779,11 +782,8 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
-		result = bloom_filter_contains(filter,
-					       &revs->bloom_keys[j],
-					       revs->bloom_filter_settings);
-	}
+	result = bloom_filter_contains_vec(filter, revs->bloom_keyvecs[0],
+					   revs->bloom_filter_settings);
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -823,7 +823,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_keys_nr && !nth_parent) {
+	if (revs->bloom_keyvecs_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
@@ -850,7 +850,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
 	if (!t1)
 		return 0;
 
-	if (!nth_parent && revs->bloom_keys_nr) {
+	if (!nth_parent && revs->bloom_keyvecs_nr) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 		if (!bloom_ret)
 			return 1;
@@ -3230,10 +3230,10 @@ void release_revisions(struct rev_info *revs)
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
 
-	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		clear_bloom_key(&revs->bloom_keys[i]);
-	FREE_AND_NULL(revs->bloom_keys);
-	revs->bloom_keys_nr = 0;
+	for (int i = 0; i < revs->bloom_keyvecs_nr; i++)
+		destroy_bloom_keyvec(revs->bloom_keyvecs[i]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..ac843f58d0 100644
--- a/revision.h
+++ b/revision.h
@@ -62,7 +62,7 @@ struct repository;
 struct rev_info;
 struct string_list;
 struct saved_parents;
-struct bloom_key;
+struct bloom_keyvec;
 struct bloom_filter_settings;
 struct option;
 struct parse_opt_ctx_t;
@@ -360,8 +360,8 @@ struct rev_info {
 
 	/* Commit graph bloom filter fields */
 	/* The bloom filter key(s) for the pathspec */
-	struct bloom_key *bloom_keys;
-	int bloom_keys_nr;
+	struct bloom_keyvec **bloom_keyvecs;
+	int bloom_keyvecs_nr;
 
 	/*
 	 * The bloom filter settings used to generate the key.
-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 2/2] bloom: optimize multiple pathspec items in revision traversal
  2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
                   ` (4 preceding siblings ...)
  2025-06-27  6:21 ` [PATCH v2 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-06-27  6:21 ` Lidong Yan
  5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-06-27  6:21 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan

Remove `if (spec->nr > 1)` to enable bloom filter given multiple
pathspec items. Add for loop in prepare_to_use_bloom_filter()
to initialize each pathspec item's struct bloom_keyvec. Add for
loop in check_maybe_different_in_bloom_filter() to find if at least one
bloom_keyvec is contained in bloom filter.

Add new function release_revisions_bloom_keyvecs() to free all bloom
keyvec owned by rev_info.

Modify t/t4216 to ensure consistent results between the optimization
for multiple pathspec items using bloom filters and the case without
bloom filter optimization.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 revision.c           | 121 ++++++++++++++++++++++++-------------------
 t/t4216-log-bloom.sh |  10 ++--
 2 files changed, 73 insertions(+), 58 deletions(-)

diff --git a/revision.c b/revision.c
index 3aa544c137..5606f6c7f6 100644
--- a/revision.c
+++ b/revision.c
@@ -675,8 +675,6 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
 	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
@@ -685,6 +683,8 @@ static int forbid_bloom_filters(struct pathspec *spec)
 	return 0;
 }
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
@@ -692,7 +692,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
-	int path_component_nr = 1;
+	int path_component_nr;
 
 	if (!revs->commits)
 		return;
@@ -709,50 +709,53 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	pi = &revs->pruning.pathspec.items[0];
-
-	/* remove single trailing slash from path, if needed */
-	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-		path_alloc = xmemdupz(pi->match, pi->len - 1);
-		path = path_alloc;
-	} else
-		path = pi->match;
-
-	len = strlen(path);
-	if (!len) {
-		revs->bloom_filter_settings = NULL;
-		free(path_alloc);
-		return;
-	}
-
-	p = path;
-	while (*p) {
-		/*
-		 * At this point, the path is normalized to use Unix-style
-		 * path separators. This is required due to how the
-		 * changed-path Bloom filters store the paths.
-		 */
-		if (*p == '/')
-			path_component_nr++;
-		p++;
-	}
-
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-	bloom_keyvec = create_bloom_keyvec(path_component_nr);
-	revs->bloom_keyvecs[0] = bloom_keyvec;
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		pi = &revs->pruning.pathspec.items[i];
+		path_component_nr = 1;
+
+		/* remove single trailing slash from path, if needed */
+		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+			path_alloc = xmemdupz(pi->match, pi->len - 1);
+			path = path_alloc;
+		} else
+			path = pi->match;
+
+		len = strlen(path);
+		if (!len)
+			goto fail;
+
+		p = path;
+		while (*p) {
+			/*
+			 * At this point, the path is normalized to use
+			 * Unix-style path separators. This is required due to
+			 * how the changed-path Bloom filters store the paths.
+			 */
+			if (*p == '/')
+				path_component_nr++;
+			p++;
+		}
 
-	fill_bloom_keyvec_key(path, len, bloom_keyvec, 0,
-			      revs->bloom_filter_settings);
-	path_component_nr = 1;
+		bloom_keyvec = create_bloom_keyvec(path_component_nr);
+		revs->bloom_keyvecs[i] = bloom_keyvec;
+
+		fill_bloom_keyvec_key(path, len, bloom_keyvec, 0,
+			       revs->bloom_filter_settings);
+		path_component_nr = 1;
+
+		p = path + len - 1;
+		while (p > path) {
+			if (*p == '/')
+				fill_bloom_keyvec_key(path, p - path,
+					       bloom_keyvec,
+						   path_component_nr++,
+					       revs->bloom_filter_settings);
+			p--;
+		}
 
-	p = path + len - 1;
-	while (p > path) {
-		if (*p == '/')
-			fill_bloom_keyvec_key(path, p - path, bloom_keyvec,
-					      path_component_nr++,
-					      revs->bloom_filter_settings);
-		p--;
+		FREE_AND_NULL(path_alloc);
 	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
@@ -760,14 +763,19 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		bloom_filter_atexit_registered = 1;
 	}
 
+	return;
+
+fail:
+	revs->bloom_filter_settings = NULL;
 	free(path_alloc);
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result = 1, j;
+	int result = 0;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -782,8 +790,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	result = bloom_filter_contains_vec(filter, revs->bloom_keyvecs[0],
-					   revs->bloom_filter_settings);
+	for (size_t nr = 0; !result && nr < revs->bloom_keyvecs_nr; nr++) {
+		result = bloom_filter_contains_vec(filter,
+						   revs->bloom_keyvecs[nr],
+						   revs->bloom_filter_settings);
+	}
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -3201,6 +3212,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
 
 static void release_revisions_topo_walk_info(struct topo_walk_info *info);
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+	for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
+		destroy_bloom_keyvec(revs->bloom_keyvecs[nr]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
+}
+
 static void free_void_commit_list(void *list)
 {
 	free_commit_list(list);
@@ -3229,11 +3248,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
-
-	for (int i = 0; i < revs->bloom_keyvecs_nr; i++)
-		destroy_bloom_keyvec(revs->bloom_keyvecs[i]);
-	FREE_AND_NULL(revs->bloom_keyvecs);
-	revs->bloom_keyvecs_nr = 0;
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..46d1900a21 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -138,8 +138,8 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
+test_expect_success 'git log -- multiple path specs use Bloom filters' '
+	test_bloom_filters_used "-- file4 A/file1"
 '
 
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
@@ -151,9 +151,9 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with wildcard that resolves to a multiple paths uses Bloom filters' '
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-06-25 12:55 ` [PATCH 2/2] bloom: enable multiple pathspec bloom keys Lidong Yan
@ 2025-06-27 13:50   ` Junio C Hamano
  2025-06-27 14:24     ` Lidong Yan
                       ` (2 more replies)
  2025-06-27 20:39   ` Junio C Hamano
  1 sibling, 3 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-06-27 13:50 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, Lidong Yan

Lidong Yan <yldhome2d2@gmail.com> writes:

> Remove `if (spec->nr > 1)` to enable bloom filter given multiple
> pathspec. Wrapped for loop around code in prepare_to_use_bloom_filter()
> to initialize each pathspec's struct bloom_keyvec. Add for loop
> in check_maybe_different_in_bloom_filter() to find if at least one
> pathspec's bloom_keyvec is contained in bloom filter.

Oy.  That's too dense enumeration but I suspect are all "what the
patch does" that can be read from the diff.  The first sentence
gives "why", which is excellent.

You'd need to check in forbid_bloom_filters() that none of the
pathspec items have magic (other than literal), not just the first
one, no?

Totally outside the topic, but I wonder if we can further optimize
by adding an early rejection using .nowildcard_len?  Instead of
allowing a wildcarded "dir/*" pathspec element from disabling the
Bloom filter altogether, we could say "dir/ is not possibly altered,
so there may be dir/A, dir/B, etc., in the directory, nothing that
would match dir/* wildcard would have been modified", couldn't we?

> diff --git a/revision.c b/revision.c
> index cf7dc3b3fa..8818f017f3 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -675,8 +675,6 @@ static int forbid_bloom_filters(struct pathspec *spec)
>  {
>  	if (spec->has_wildcard)
>  		return 1;
> -	if (spec->nr > 1)
> -		return 1;
>  	if (spec->magic & ~PATHSPEC_LITERAL)
>  		return 1;
>  	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))

This last check only looks at the first item.  It used to be OK
because we didn't look at a pathspec with more than one element, but
now shouldn't we care?

Thanks.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-06-27 13:50   ` Junio C Hamano
@ 2025-06-27 14:24     ` Lidong Yan
  2025-06-27 18:09       ` Junio C Hamano
  2025-07-01  5:52     ` Lidong Yan
  2025-07-01  8:50     ` SZEDER Gábor
  2 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-06-27 14:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:
> Lidong Yan <yldhome2d2@gmail.com> writes:
> 
>> Remove `if (spec->nr > 1)` to enable bloom filter given multiple
>> pathspec. Wrapped for loop around code in prepare_to_use_bloom_filter()
>> to initialize each pathspec's struct bloom_keyvec. Add for loop
>> in check_maybe_different_in_bloom_filter() to find if at least one
>> pathspec's bloom_keyvec is contained in bloom filter.
> 
> Oy.  That's too dense enumeration but I suspect are all "what the
> patch does" that can be read from the diff.  The first sentence
> gives "why", which is excellent.

Got it. I will simply this log message in next version.

> You'd need to check in forbid_bloom_filters() that none of the
> pathspec items have magic (other than literal), not just the first
> one, no?

Yeah, I never notice that. I would add checks in forbid_bloom_filters().
And add test to ensure we don’t use bloom filter if any pathspec item is
not literal.

> Totally outside the topic, but I wonder if we can further optimize
> by adding an early rejection using .nowildcard_len?  Instead of
> allowing a wildcarded "dir/*" pathspec element from disabling the
> Bloom filter altogether, we could say "dir/ is not possibly altered,
> so there may be dir/A, dir/B, etc., in the directory, nothing that
> would match dir/* wildcard would have been modified", couldn't we?
> 

I think it's feasible. In that case, we would need to add a condition
.nowildcard_len > 0 to forbid_bloom_filter. I'm happy to write a new
patch to address this issue.

>> diff --git a/revision.c b/revision.c
>> index cf7dc3b3fa..8818f017f3 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -675,8 +675,6 @@ static int forbid_bloom_filters(struct pathspec *spec)
>> {
>> if (spec->has_wildcard)
>> return 1;
>> - if (spec->nr > 1)
>> - return 1;
>> if (spec->magic & ~PATHSPEC_LITERAL)
>> return 1;
>> if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> 
> This last check only looks at the first item.  It used to be OK
> because we didn't look at a pathspec with more than one element, but
> now shouldn't we care?

Definitely, I would fix that in 3rd version.

Thanks,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-06-27 14:24     ` Lidong Yan
@ 2025-06-27 18:09       ` Junio C Hamano
  0 siblings, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-06-27 18:09 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git

Lidong Yan <502024330056@smail.nju.edu.cn> writes:

>> You'd need to check in forbid_bloom_filters() that none of the
>> pathspec items have magic (other than literal), not just the first
>> one, no?
>
> Yeah, I never notice that. I would add checks in forbid_bloom_filters().
> And add test to ensure we don’t use bloom filter if any pathspec item is
> not literal.

Sounds great.  I was wondering why your tests did not catch it.

>> Totally outside the topic, but I wonder if we can further optimize
>> by adding an early rejection using .nowildcard_len?  Instead of
>> allowing a wildcarded "dir/*" pathspec element from disabling the
>> Bloom filter altogether, we could say "dir/ is not possibly altered,
>> so there may be dir/A, dir/B, etc., in the directory, nothing that
>> would match dir/* wildcard would have been modified", couldn't we?
>
> I think it's feasible. In that case, we would need to add a condition
> .nowildcard_len > 0 to forbid_bloom_filter. I'm happy to write a new
> patch to address this issue.

Let's leave it outside the topic and concentrate on the problem at
hand first.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-06-25 12:55 ` [PATCH 2/2] bloom: enable multiple pathspec bloom keys Lidong Yan
  2025-06-27 13:50   ` Junio C Hamano
@ 2025-06-27 20:39   ` Junio C Hamano
  2025-06-28  2:54     ` Lidong Yan
  1 sibling, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-06-27 20:39 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, Lidong Yan

Lidong Yan <yldhome2d2@gmail.com> writes:

This is a tangent, but I have to say that whoever wrote the original
test does not understand shells very well.  When you have files A
and B in your working tree, to your $command, the following two does
not make any difference:

	$command ?
	$command A B

In fact it cannot even tell which form was used when composing the
command line.  So this original test ...

> -test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
> -	test_bloom_filters_not_used "-- *" &&
> -	test_bloom_filters_not_used "-- file*"

... is misleading to say the least.

> +test_expect_success 'git log with wildcard that resolves to a multiple paths uses Bloom filters' '
> +	test_bloom_filters_used "-- *" &&
> +	test_bloom_filters_used "-- file*"
>  '

I think you should just retitle this to say

	git log with multiple literal paths use Bloom filter

or something.

Also the setup helper test_bloom_filters_{not_,}used helpers call is
written in a way to make it impossible to pass a real wildcard and
see how "$git log" would behave, because it does this:

	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&

It probably should use 'eval' so that the caller can pass a quoted
wildcard, perhaps like

    eval git -c core.commitgraph=false \
	     log --pretty=format:%s "$1" >log_wo_bloom &&

Then a test we can add to see how wildcards prevent Bloom from
kicking in would look like

	test_bloom_filters_used "-- file*" &&
	test_bloom_filters_not_used "-- file4 file\*" &&

The former lets the shell expand file* when the above "eval"
evaluates its (concatenated) strings, while the latter leaves the
backslash before the asterisk in the strings fed to "eval", so the
"log" will see a pathspec with wildcard.

If we were to fix that setup() thing, we of course need to be
a bit careful about existing tests.

Thanks.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-06-27 20:39   ` Junio C Hamano
@ 2025-06-28  2:54     ` Lidong Yan
  0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-06-28  2:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:
> Also the setup helper test_bloom_filters_{not_,}used helpers call is
> written in a way to make it impossible to pass a real wildcard and
> see how "$git log" would behave, because it does this:
> 
> git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&

Yeah, if $1 contains * and because $1 is not quotes, * would trigger file name
expansion.

> It probably should use 'eval' so that the caller can pass a quoted
> wildcard, perhaps like
> 
>    eval git -c core.commitgraph=false \
>     log --pretty=format:%s "$1" >log_wo_bloom &&
> 
> Then a test we can add to see how wildcards prevent Bloom from
> kicking in would look like
> 
> test_bloom_filters_used "-- file*" &&
> test_bloom_filters_not_used "-- file4 file\*" &&
> 
> The former lets the shell expand file* when the above "eval"
> evaluates its (concatenated) strings, while the latter leaves the
> backslash before the asterisk in the strings fed to "eval", so the
> "log" will see a pathspec with wildcard.

Wow, this solution is very clever.

> If we were to fix that setup() thing, we of course need to be
> a bit careful about existing tests.

Though the uses of test_bloom_filters_(not_)used are not too much,
I think replace
  git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
with
  git -c core.commitGraph=false log --pretty="format:%s” “$*" >log_wo_bloom &&

is not better than add the “eval …” solution, I will just use eval

Thanks,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v3 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-06-27  6:21 ` [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Lidong Yan
@ 2025-06-28  4:21   ` Lidong Yan
  2025-07-04 11:14     ` [PATCH v4 0/4] " Lidong Yan
  2025-06-28  4:21   ` [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
  2025-06-28  4:21   ` [PATCH v3 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
  2 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-06-28  4:21 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan, Junio C Hamano

This series enables bloom filter optimization for multiple pathspec
elements. Compared to v2, v3 fixed bugs in forbid_bloom_filter() and
add one more test case in t/t4216-log-bloom.sh.

Lidong Yan (2):
  bloom: replace struct bloom_key * with struct bloom_keyvec
  bloom: optimize multiple pathspec items in revision traversal

 bloom.c              |  31 +++++++++++
 bloom.h              |  20 +++++++
 revision.c           | 126 ++++++++++++++++++++++++-------------------
 revision.h           |   6 +--
 t/t4216-log-bloom.sh |  23 ++++----
 5 files changed, 139 insertions(+), 67 deletions(-)

-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-06-27  6:21 ` [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Lidong Yan
  2025-06-28  4:21   ` [PATCH v3 " Lidong Yan
@ 2025-06-28  4:21   ` Lidong Yan
  2025-07-02 15:08     ` Patrick Steinhardt
  2025-06-28  4:21   ` [PATCH v3 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
  2 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-06-28  4:21 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan

The revision traversal limited by pathspec has optimization when
the pathspec has only one element. To support optimization for
multiple pathspec items, we need to modify the data structures
in struct rev_info.

struct rev_info uses bloom_keys and bloom_nr to store the bloom keys
corresponding to a single pathspec item. To allow struct rev_info
to store bloom keys for multiple pathspec items, a new data structure
`struct bloom_keyvec` is introduced. Each `struct bloom_keyvec`
corresponds to a single pathspec item.

In `struct rev_info`, replace bloom_keys and bloom_nr with bloom_keyvecs
and bloom_keyvec_nr. This commit still optimize one pathspec item, thus
bloom_keyvec_nr can only be 0 or 1.

New *_bloom_keyvec functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
contained in a bloom filter. fill_bloom_keyvec_key() is added to
initialize a key in keyvec.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c    | 31 +++++++++++++++++++++++++++++++
 bloom.h    | 20 ++++++++++++++++++++
 revision.c | 36 ++++++++++++++++++------------------
 revision.h |  6 +++---
 4 files changed, 72 insertions(+), 21 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..8259cfce51 100644
--- a/bloom.c
+++ b/bloom.c
@@ -280,6 +280,25 @@ void deinit_bloom_filters(void)
 	deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
 }
 
+struct bloom_keyvec *create_bloom_keyvec(size_t count)
+{
+	struct bloom_keyvec *vec;
+	size_t sz = sizeof(struct bloom_keyvec);
+	sz += count * sizeof(struct bloom_key);
+	vec = (struct bloom_keyvec *)xcalloc(1, sz);
+	vec->count = count;
+	return vec;
+}
+
+void destroy_bloom_keyvec(struct bloom_keyvec *vec)
+{
+	if (!vec)
+		return;
+	for (size_t nr = 0; nr < vec->count; nr++)
+		clear_bloom_key(&vec->key[nr]);
+	free(vec);
+}
+
 static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 		       const struct hashmap_entry *eptr,
 		       const struct hashmap_entry *entry_or_key,
@@ -540,3 +559,15 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 
 	return 1;
 }
+
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *vec,
+			      const struct bloom_filter_settings *settings)
+{
+	int ret = 1;
+
+	for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
+		ret = bloom_filter_contains(filter, &vec->key[nr], settings);
+
+	return ret;
+}
diff --git a/bloom.h b/bloom.h
index 6e46489a20..9e4e832c8c 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,11 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+struct bloom_keyvec {
+	size_t count;
+	struct bloom_key key[FLEX_ARRAY];
+};
+
 int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
@@ -100,6 +105,17 @@ void add_key_to_filter(const struct bloom_key *key,
 void init_bloom_filters(void);
 void deinit_bloom_filters(void);
 
+struct bloom_keyvec *create_bloom_keyvec(size_t count);
+void destroy_bloom_keyvec(struct bloom_keyvec *vec);
+
+static inline void fill_bloom_keyvec_key(const char *data, size_t len,
+					 struct bloom_keyvec *vec, size_t nr,
+					 const struct bloom_filter_settings *settings)
+{
+	assert(nr < vec->count);
+	fill_bloom_key(data, len, &vec->key[nr], settings);
+}
+
 enum bloom_filter_computed {
 	BLOOM_NOT_COMPUTED = (1 << 0),
 	BLOOM_COMPUTED     = (1 << 1),
@@ -137,4 +153,8 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *v,
+			      const struct bloom_filter_settings *settings);
+
 #endif
diff --git a/revision.c b/revision.c
index afee111196..3aa544c137 100644
--- a/revision.c
+++ b/revision.c
@@ -688,6 +688,7 @@ static int forbid_bloom_filters(struct pathspec *spec)
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
+	struct bloom_keyvec *bloom_keyvec;
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
@@ -736,19 +737,21 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		p++;
 	}
 
-	revs->bloom_keys_nr = path_component_nr;
-	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+	revs->bloom_keyvecs_nr = 1;
+	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+	bloom_keyvec = create_bloom_keyvec(path_component_nr);
+	revs->bloom_keyvecs[0] = bloom_keyvec;
 
-	fill_bloom_key(path, len, &revs->bloom_keys[0],
-		       revs->bloom_filter_settings);
+	fill_bloom_keyvec_key(path, len, bloom_keyvec, 0,
+			      revs->bloom_filter_settings);
 	path_component_nr = 1;
 
 	p = path + len - 1;
 	while (p > path) {
 		if (*p == '/')
-			fill_bloom_key(path, p - path,
-				       &revs->bloom_keys[path_component_nr++],
-				       revs->bloom_filter_settings);
+			fill_bloom_keyvec_key(path, p - path, bloom_keyvec,
+					      path_component_nr++,
+					      revs->bloom_filter_settings);
 		p--;
 	}
 
@@ -779,11 +782,8 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
-		result = bloom_filter_contains(filter,
-					       &revs->bloom_keys[j],
-					       revs->bloom_filter_settings);
-	}
+	result = bloom_filter_contains_vec(filter, revs->bloom_keyvecs[0],
+					   revs->bloom_filter_settings);
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -823,7 +823,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_keys_nr && !nth_parent) {
+	if (revs->bloom_keyvecs_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
@@ -850,7 +850,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
 	if (!t1)
 		return 0;
 
-	if (!nth_parent && revs->bloom_keys_nr) {
+	if (!nth_parent && revs->bloom_keyvecs_nr) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 		if (!bloom_ret)
 			return 1;
@@ -3230,10 +3230,10 @@ void release_revisions(struct rev_info *revs)
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
 
-	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		clear_bloom_key(&revs->bloom_keys[i]);
-	FREE_AND_NULL(revs->bloom_keys);
-	revs->bloom_keys_nr = 0;
+	for (int i = 0; i < revs->bloom_keyvecs_nr; i++)
+		destroy_bloom_keyvec(revs->bloom_keyvecs[i]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..ac843f58d0 100644
--- a/revision.h
+++ b/revision.h
@@ -62,7 +62,7 @@ struct repository;
 struct rev_info;
 struct string_list;
 struct saved_parents;
-struct bloom_key;
+struct bloom_keyvec;
 struct bloom_filter_settings;
 struct option;
 struct parse_opt_ctx_t;
@@ -360,8 +360,8 @@ struct rev_info {
 
 	/* Commit graph bloom filter fields */
 	/* The bloom filter key(s) for the pathspec */
-	struct bloom_key *bloom_keys;
-	int bloom_keys_nr;
+	struct bloom_keyvec **bloom_keyvecs;
+	int bloom_keyvecs_nr;
 
 	/*
 	 * The bloom filter settings used to generate the key.
-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v3 2/2] bloom: optimize multiple pathspec items in revision traversal
  2025-06-27  6:21 ` [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Lidong Yan
  2025-06-28  4:21   ` [PATCH v3 " Lidong Yan
  2025-06-28  4:21   ` [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-06-28  4:21   ` Lidong Yan
  2 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-06-28  4:21 UTC (permalink / raw)
  To: git; +Cc: Lidong Yan

To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add code to initialize and check each pathspec item's bloom_keyvec.

Add new function release_revisions_bloom_keyvecs() to free all bloom
keyvec owned by rev_info.

Add new test cases in t/t4216-log-bloom.sh to ensure
  - consistent results between the optimization for multiple pathspec
    items using bloom filter and the case without bloom filter
    optimization.
  - does not use bloom filter if any pathspec item is not literal.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 revision.c           | 126 ++++++++++++++++++++++++-------------------
 t/t4216-log-bloom.sh |  23 ++++----
 2 files changed, 85 insertions(+), 64 deletions(-)

diff --git a/revision.c b/revision.c
index 3aa544c137..8d73395f26 100644
--- a/revision.c
+++ b/revision.c
@@ -675,16 +675,17 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
-		return 1;
+	for (size_t nr = 0; nr < spec->nr; nr++)
+		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+			return 1;
 
 	return 0;
 }
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
@@ -692,7 +693,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
-	int path_component_nr = 1;
+	int path_component_nr;
 
 	if (!revs->commits)
 		return;
@@ -709,50 +710,53 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	pi = &revs->pruning.pathspec.items[0];
-
-	/* remove single trailing slash from path, if needed */
-	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-		path_alloc = xmemdupz(pi->match, pi->len - 1);
-		path = path_alloc;
-	} else
-		path = pi->match;
-
-	len = strlen(path);
-	if (!len) {
-		revs->bloom_filter_settings = NULL;
-		free(path_alloc);
-		return;
-	}
-
-	p = path;
-	while (*p) {
-		/*
-		 * At this point, the path is normalized to use Unix-style
-		 * path separators. This is required due to how the
-		 * changed-path Bloom filters store the paths.
-		 */
-		if (*p == '/')
-			path_component_nr++;
-		p++;
-	}
-
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-	bloom_keyvec = create_bloom_keyvec(path_component_nr);
-	revs->bloom_keyvecs[0] = bloom_keyvec;
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		pi = &revs->pruning.pathspec.items[i];
+		path_component_nr = 1;
+
+		/* remove single trailing slash from path, if needed */
+		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+			path_alloc = xmemdupz(pi->match, pi->len - 1);
+			path = path_alloc;
+		} else
+			path = pi->match;
+
+		len = strlen(path);
+		if (!len)
+			goto fail;
+
+		p = path;
+		while (*p) {
+			/*
+			 * At this point, the path is normalized to use
+			 * Unix-style path separators. This is required due to
+			 * how the changed-path Bloom filters store the paths.
+			 */
+			if (*p == '/')
+				path_component_nr++;
+			p++;
+		}
 
-	fill_bloom_keyvec_key(path, len, bloom_keyvec, 0,
-			      revs->bloom_filter_settings);
-	path_component_nr = 1;
+		bloom_keyvec = create_bloom_keyvec(path_component_nr);
+		revs->bloom_keyvecs[i] = bloom_keyvec;
+
+		fill_bloom_keyvec_key(path, len, bloom_keyvec, 0,
+			       revs->bloom_filter_settings);
+		path_component_nr = 1;
+
+		p = path + len - 1;
+		while (p > path) {
+			if (*p == '/')
+				fill_bloom_keyvec_key(path, p - path,
+					       bloom_keyvec,
+						   path_component_nr++,
+					       revs->bloom_filter_settings);
+			p--;
+		}
 
-	p = path + len - 1;
-	while (p > path) {
-		if (*p == '/')
-			fill_bloom_keyvec_key(path, p - path, bloom_keyvec,
-					      path_component_nr++,
-					      revs->bloom_filter_settings);
-		p--;
+		FREE_AND_NULL(path_alloc);
 	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
@@ -760,14 +764,19 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		bloom_filter_atexit_registered = 1;
 	}
 
+	return;
+
+fail:
+	revs->bloom_filter_settings = NULL;
 	free(path_alloc);
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result = 1, j;
+	int result = 0;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -782,8 +791,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	result = bloom_filter_contains_vec(filter, revs->bloom_keyvecs[0],
-					   revs->bloom_filter_settings);
+	for (size_t nr = 0; !result && nr < revs->bloom_keyvecs_nr; nr++) {
+		result = bloom_filter_contains_vec(filter,
+						   revs->bloom_keyvecs[nr],
+						   revs->bloom_filter_settings);
+	}
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -3201,6 +3213,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
 
 static void release_revisions_topo_walk_info(struct topo_walk_info *info);
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+	for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
+		destroy_bloom_keyvec(revs->bloom_keyvecs[nr]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
+}
+
 static void free_void_commit_list(void *list)
 {
 	free_commit_list(list);
@@ -3229,11 +3249,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
-
-	for (int i = 0; i < revs->bloom_keyvecs_nr; i++)
-		destroy_bloom_keyvec(revs->bloom_keyvecs[i]);
-	FREE_AND_NULL(revs->bloom_keyvecs);
-	revs->bloom_keyvecs_nr = 0;
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
 	rm -f "$TRASH_DIRECTORY/trace.perf" &&
-	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
-	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+	eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+	eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+		git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
 }
 
 test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
-'
-
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
 	test_bloom_filters_not_used "-- ."
 '
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+	test_bloom_filters_used "-- file4 A/file1" &&
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+	test_bloom_filters_not_used "-- file\*" &&
+	test_bloom_filters_not_used "-- A/\* file4" &&
+	test_bloom_filters_not_used "-- file4 A/\*" &&
+	test_bloom_filters_not_used "-- * A/\*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.50.0.108.g6ae0c543ae


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-06-27 13:50   ` Junio C Hamano
  2025-06-27 14:24     ` Lidong Yan
@ 2025-07-01  5:52     ` Lidong Yan
  2025-07-01 15:19       ` Junio C Hamano
  2025-07-01  8:50     ` SZEDER Gábor
  2 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-01  5:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:
> Totally outside the topic, but I wonder if we can further optimize
> by adding an early rejection using .nowildcard_len?  Instead of
> allowing a wildcarded "dir/*" pathspec element from disabling the
> Bloom filter altogether, we could say "dir/ is not possibly altered,
> so there may be dir/A, dir/B, etc., in the directory, nothing that
> would match dir/* wildcard would have been modified", couldn't we?

I think, except for PATHSPEC_EXCLUDE, all other pathspec magic flags
could potentially be optimized using .nowildcard_len by restricting checks to
just the dir/ part of each pathspec item.

Here;s are all possible pathspec magic
#define PATHSPEC_FROMTOP	(1<<0)
#define PATHSPEC_MAXDEPTH	(1<<1)
#define PATHSPEC_LITERAL	(1<<2)
#define PATHSPEC_GLOB		(1<<3)
#define PATHSPEC_ICASE		(1<<4)
#define PATHSPEC_EXCLUDE	(1<<5)
#define PATHSPEC_ATTR		(1<<6)



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-06-27 13:50   ` Junio C Hamano
  2025-06-27 14:24     ` Lidong Yan
  2025-07-01  5:52     ` Lidong Yan
@ 2025-07-01  8:50     ` SZEDER Gábor
  2025-07-01 11:40       ` Lidong Yan
  2025-07-01 15:43       ` Junio C Hamano
  2 siblings, 2 replies; 72+ messages in thread
From: SZEDER Gábor @ 2025-07-01  8:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Lidong Yan, git, Lidong Yan

On Fri, Jun 27, 2025 at 06:50:22AM -0700, Junio C Hamano wrote:
> Totally outside the topic, but I wonder if we can further optimize
> by adding an early rejection using .nowildcard_len?  Instead of
> allowing a wildcarded "dir/*" pathspec element from disabling the
> Bloom filter altogether, we could say "dir/ is not possibly altered,
> so there may be dir/A, dir/B, etc., in the directory, nothing that
> would match dir/* wildcard would have been modified", couldn't we?

Indeed, that's what I demonstrated back in:

  https://public-inbox.org/git/20200529085038.26008-35-szeder.dev@gmail.com/


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-07-01  8:50     ` SZEDER Gábor
@ 2025-07-01 11:40       ` Lidong Yan
  2025-07-01 15:43       ` Junio C Hamano
  1 sibling, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-01 11:40 UTC (permalink / raw)
  To: git

SZEDER Gábor <szeder.dev@gmail.com> writes:
> 
> On Fri, Jun 27, 2025 at 06:50:22AM -0700, Junio C Hamano wrote:
>> Totally outside the topic, but I wonder if we can further optimize
>> by adding an early rejection using .nowildcard_len?  Instead of
>> allowing a wildcarded "dir/*" pathspec element from disabling the
>> Bloom filter altogether, we could say "dir/ is not possibly altered,
>> so there may be dir/A, dir/B, etc., in the directory, nothing that
>> would match dir/* wildcard would have been modified", couldn't we?
> 
> Indeed, that's what I demonstrated back in:
> 
>  https://public-inbox.org/git/20200529085038.26008-35-szeder.dev@gmail.com/

That's interesting. Though I find the bloom part of code changed and I can't
reuse your patch.

Have you ever considered to optimize other kind of pathspec magic? I am
not perfectly sure whether it is feasible to use the same trick (using bloom filter on dir/ path)
on patchspec magic except PATHSPEC_EXCLUDE.

Thanks,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-07-01  5:52     ` Lidong Yan
@ 2025-07-01 15:19       ` Junio C Hamano
  2025-07-02  7:14         ` Lidong Yan
  0 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-07-01 15:19 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git

Lidong Yan <yldhome2d2@gmail.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>> Totally outside the topic, but I wonder if we can further optimize
>> by adding an early rejection using .nowildcard_len?  Instead of
>> allowing a wildcarded "dir/*" pathspec element from disabling the
>> Bloom filter altogether, we could say "dir/ is not possibly altered,
>> so there may be dir/A, dir/B, etc., in the directory, nothing that
>> would match dir/* wildcard would have been modified", couldn't we?
>
> I think, except for PATHSPEC_EXCLUDE, all other pathspec magic flags
> could potentially be optimized using .nowildcard_len by restricting checks to
> just the dir/ part of each pathspec item.

A good observation.

I do not know about icase; though.  Asking about "Dir/Path" and
getting "Dir/ or Dir/Path cannot possibly be in the set of paths
that were modified" from the changed-path Bloom filter would not
help us optimize the tree comparison out, when we do not want to
miss modifications for "dir/path".

> Here;s are all possible pathspec magic
> #define PATHSPEC_FROMTOP	(1<<0)
> #define PATHSPEC_MAXDEPTH	(1<<1)
> #define PATHSPEC_LITERAL	(1<<2)
> #define PATHSPEC_GLOB		(1<<3)
> #define PATHSPEC_ICASE		(1<<4)
> #define PATHSPEC_EXCLUDE	(1<<5)
> #define PATHSPEC_ATTR		(1<<6)


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-07-01  8:50     ` SZEDER Gábor
  2025-07-01 11:40       ` Lidong Yan
@ 2025-07-01 15:43       ` Junio C Hamano
  1 sibling, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-01 15:43 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: Lidong Yan, git, Lidong Yan

SZEDER Gábor <szeder.dev@gmail.com> writes:

> On Fri, Jun 27, 2025 at 06:50:22AM -0700, Junio C Hamano wrote:
>> Totally outside the topic, but I wonder if we can further optimize
>> by adding an early rejection using .nowildcard_len?  Instead of
>> allowing a wildcarded "dir/*" pathspec element from disabling the
>> Bloom filter altogether, we could say "dir/ is not possibly altered,
>> so there may be dir/A, dir/B, etc., in the directory, nothing that
>> would match dir/* wildcard would have been modified", couldn't we?
>
> Indeed, that's what I demonstrated back in:
>
>   https://public-inbox.org/git/20200529085038.26008-35-szeder.dev@gmail.com/

Heh, I am always impressed that some people seem to have infinitely
long scrollback buffer ;-)  It is curious why nobody else noticed
and advocated for your patch back then.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-07-01 15:19       ` Junio C Hamano
@ 2025-07-02  7:14         ` Lidong Yan
  2025-07-02 15:48           ` Junio C Hamano
  0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-02  7:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, SZEDER Gábor

Junio C Hamano <gitster@pobox.com> writes:
> 
> I do not know about icase; though.  Asking about "Dir/Path" and
> getting "Dir/ or Dir/Path cannot possibly be in the set of paths
> that were modified" from the changed-path Bloom filter would not
> help us optimize the tree comparison out, when we do not want to
> miss modifications for "dir/path".

Make sense, both PATHSPEC_EXCLUDE and PATHSPEC_ICASE shouldn’t be
optimized by bloom filter.

I found that my [PATCH v3 2/2] contains two unaligned parameters. Should I reroll
this patch and introduce the nowildcard_len change in a separate commit?

> 
>> Here;s are all possible pathspec magic
>> #define PATHSPEC_FROMTOP (1<<0)
>> #define PATHSPEC_MAXDEPTH (1<<1)
>> #define PATHSPEC_LITERAL (1<<2)
>> #define PATHSPEC_GLOB (1<<3)
>> #define PATHSPEC_ICASE (1<<4)
>> #define PATHSPEC_EXCLUDE (1<<5)
>> #define PATHSPEC_ATTR (1<<6)


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-06-28  4:21   ` [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-02 15:08     ` Patrick Steinhardt
  2025-07-02 15:49       ` Lidong Yan
  2025-07-02 18:28       ` Junio C Hamano
  0 siblings, 2 replies; 72+ messages in thread
From: Patrick Steinhardt @ 2025-07-02 15:08 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, Lidong Yan

On Sat, Jun 28, 2025 at 12:21:39PM +0800, Lidong Yan wrote:
> diff --git a/bloom.h b/bloom.h
> index 6e46489a20..9e4e832c8c 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -74,6 +74,11 @@ struct bloom_key {
>  	uint32_t *hashes;
>  };
>  
> +struct bloom_keyvec {
> +	size_t count;
> +	struct bloom_key key[FLEX_ARRAY];
> +};
> +

A short comment would help readers understand what the intent of this
data structure is.

>  int load_bloom_filter_from_graph(struct commit_graph *g,
>  				 struct bloom_filter *filter,
>  				 uint32_t graph_pos);
> @@ -100,6 +105,17 @@ void add_key_to_filter(const struct bloom_key *key,
>  void init_bloom_filters(void);
>  void deinit_bloom_filters(void);
>  
> +struct bloom_keyvec *create_bloom_keyvec(size_t count);
> +void destroy_bloom_keyvec(struct bloom_keyvec *vec);

These functions are named very unusually for us -- the first version of
this patch series was following our coding guidelines, but this version
here isn't anymore.

 - The primary data structure that a subsystem 'S' deals with is called
   `struct S`. Functions that operate on `struct S` are named
   `S_<verb>()` and should generally receive a pointer to `struct S` as
   first parameter. E.g.

Second, the functions should probably be called `*_new()` and `*_free()`
instead of `create_*()` and `destroy_*()`.

> +static inline void fill_bloom_keyvec_key(const char *data, size_t len,
> +					 struct bloom_keyvec *vec, size_t nr,
> +					 const struct bloom_filter_settings *settings)
> +{
> +	assert(nr < vec->count);
> +	fill_bloom_key(data, len, &vec->key[nr], settings);
> +}
> +

Similarly, this should probably be called `bloom_keyvec_fill_key()`.

>  enum bloom_filter_computed {
>  	BLOOM_NOT_COMPUTED = (1 << 0),
>  	BLOOM_COMPUTED     = (1 << 1),
> @@ -137,4 +153,8 @@ int bloom_filter_contains(const struct bloom_filter *filter,
>  			  const struct bloom_key *key,
>  			  const struct bloom_filter_settings *settings);
>  
> +int bloom_filter_contains_vec(const struct bloom_filter *filter,
> +			      const struct bloom_keyvec *v,
> +			      const struct bloom_filter_settings *settings);
> +
>  #endif

This one looks alright though.

> diff --git a/revision.c b/revision.c
> index afee111196..3aa544c137 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -779,11 +782,8 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  		return -1;
>  	}
>  
> -	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> -		result = bloom_filter_contains(filter,
> -					       &revs->bloom_keys[j],
> -					       revs->bloom_filter_settings);
> -	}
> +	result = bloom_filter_contains_vec(filter, revs->bloom_keyvecs[0],
> +					   revs->bloom_filter_settings);
>  
>  	if (result)
>  		count_bloom_filter_maybe++;

This conversion feels wrong to me. Why don't we end up iterating through
`revs->bloom_keyvecs_nr` here?  We do indeed change it back in the next
patch to use a for loop.

> @@ -3230,10 +3230,10 @@ void release_revisions(struct rev_info *revs)
>  	line_log_free(revs);
>  	oidset_clear(&revs->missing_commits);
>  
> -	for (int i = 0; i < revs->bloom_keys_nr; i++)
> -		clear_bloom_key(&revs->bloom_keys[i]);
> -	FREE_AND_NULL(revs->bloom_keys);
> -	revs->bloom_keys_nr = 0;
> +	for (int i = 0; i < revs->bloom_keyvecs_nr; i++)

It's puzzling that the number of keys is declared as `int`. It's not an
issue introduced by you, but can we maybe fix it while at it?

Patrick

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-07-02  7:14         ` Lidong Yan
@ 2025-07-02 15:48           ` Junio C Hamano
  2025-07-03  1:52             ` Lidong Yan
  2025-07-04 12:09             ` Lidong Yan
  0 siblings, 2 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-02 15:48 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, SZEDER Gábor

Lidong Yan <yldhome2d2@gmail.com> writes:

> Junio C Hamano <gitster@pobox.com> writes:
>> 
>> I do not know about icase; though.  Asking about "Dir/Path" and
>> getting "Dir/ or Dir/Path cannot possibly be in the set of paths
>> that were modified" from the changed-path Bloom filter would not
>> help us optimize the tree comparison out, when we do not want to
>> miss modifications for "dir/path".
>
> Make sense, both PATHSPEC_EXCLUDE and PATHSPEC_ICASE shouldn’t be
> optimized by bloom filter.

Before concluding so, we may want to double check how Bloom filters
are built on case insensitive systems, though.  If we normalize the
string by downcasing before murmuring the string, the resulting
Bloom filter may have more false positives for those who want to
(ab)use it to optimize case sensitive queries (without affecting
correctness), but case insensitive queries would be helped.  I do
not think we support (or want to support) a repository that spans
across two filesystems with different case sensitivity, so those who
worked on our changed-path Bloom filter subsystem may have already
placed such an optimization, based on the case sensitivity recorded
in the repository (core.ignorecase).

> I found that my [PATCH v3 2/2] contains two unaligned parameters. Should I reroll
> this patch and introduce the nowildcard_len change in a separate commit?

Updating a patch with a fix to obvious known problems is good.

Extending the scope of the series should be left out for a new
separate commit.  It may even be a better idea to hold it while
the current set of patches are still being polished, and then sent
out as a new series after the dust settles (even if you internally
developed that part as a direct extension to the current effort).

Thanks.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-02 15:08     ` Patrick Steinhardt
@ 2025-07-02 15:49       ` Lidong Yan
  2025-07-02 18:28       ` Junio C Hamano
  1 sibling, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-02 15:49 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:
> 
> On Sat, Jun 28, 2025 at 12:21:39PM +0800, Lidong Yan wrote:
>> diff --git a/bloom.h b/bloom.h
>> index 6e46489a20..9e4e832c8c 100644
>> --- a/bloom.h
>> +++ b/bloom.h
>> @@ -74,6 +74,11 @@ struct bloom_key {
>> uint32_t *hashes;
>> };
>> 
>> +struct bloom_keyvec {
>> + size_t count;
>> + struct bloom_key key[FLEX_ARRAY];
>> +};
>> +
> 
> A short comment would help readers understand what the intent of this
> data structure is.

Understood, will add comment for struct bloom_keyvec in v4.

> 
>> int load_bloom_filter_from_graph(struct commit_graph *g,
>> struct bloom_filter *filter,
>> uint32_t graph_pos);
>> @@ -100,6 +105,17 @@ void add_key_to_filter(const struct bloom_key *key,
>> void init_bloom_filters(void);
>> void deinit_bloom_filters(void);
>> 
>> +struct bloom_keyvec *create_bloom_keyvec(size_t count);
>> +void destroy_bloom_keyvec(struct bloom_keyvec *vec);
> 
> These functions are named very unusually for us -- the first version of
> this patch series was following our coding guidelines, but this version
> here isn't anymore.
> 
> - The primary data structure that a subsystem 'S' deals with is called
>   `struct S`. Functions that operate on `struct S` are named
>   `S_<verb>()` and should generally receive a pointer to `struct S` as
>   first parameter. E.g.
> 
> Second, the functions should probably be called `*_new()` and `*_free()`
> instead of `create_*()` and `destroy_*()`.

Though I think create and destroy doesn’t match the 'operate on struct S’
definition, I will rename these function to *_verb in v4.

> 
>> +static inline void fill_bloom_keyvec_key(const char *data, size_t len,
>> + struct bloom_keyvec *vec, size_t nr,
>> + const struct bloom_filter_settings *settings)
>> +{
>> + assert(nr < vec->count);
>> + fill_bloom_key(data, len, &vec->key[nr], settings);
>> +}
>> +
> 
> Similarly, this should probably be called `bloom_keyvec_fill_key()`.

I initially wanted the new function to have a name similar to fill_bloom_key,
but perhaps I should consider renaming it.

> 
>> enum bloom_filter_computed {
>> BLOOM_NOT_COMPUTED = (1 << 0),
>> BLOOM_COMPUTED     = (1 << 1),
>> @@ -137,4 +153,8 @@ int bloom_filter_contains(const struct bloom_filter *filter,
>>  const struct bloom_key *key,
>>  const struct bloom_filter_settings *settings);
>> 
>> +int bloom_filter_contains_vec(const struct bloom_filter *filter,
>> +      const struct bloom_keyvec *v,
>> +      const struct bloom_filter_settings *settings);
>> +
>> #endif
> 
> This one looks alright though.
> 
>> diff --git a/revision.c b/revision.c
>> index afee111196..3aa544c137 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -779,11 +782,8 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>> return -1;
>> }
>> 
>> - for (j = 0; result && j < revs->bloom_keys_nr; j++) {
>> - result = bloom_filter_contains(filter,
>> -       &revs->bloom_keys[j],
>> -       revs->bloom_filter_settings);
>> - }
>> + result = bloom_filter_contains_vec(filter, revs->bloom_keyvecs[0],
>> +   revs->bloom_filter_settings);
>> 
>> if (result)
>> count_bloom_filter_maybe++;
> 
> This conversion feels wrong to me. Why don't we end up iterating through
> `revs->bloom_keyvecs_nr` here?  We do indeed change it back in the next
> patch to use a for loop.

My original intention was to include all the loop-related logic in [PATCH 2/2].
However, the lack of a loop here does make the code look error-prone. I will
add the loop in v4.

> 
>> @@ -3230,10 +3230,10 @@ void release_revisions(struct rev_info *revs)
>> line_log_free(revs);
>> oidset_clear(&revs->missing_commits);
>> 
>> - for (int i = 0; i < revs->bloom_keys_nr; i++)
>> - clear_bloom_key(&revs->bloom_keys[i]);
>> - FREE_AND_NULL(revs->bloom_keys);
>> - revs->bloom_keys_nr = 0;
>> + for (int i = 0; i < revs->bloom_keyvecs_nr; i++)
> 
> It's puzzling that the number of keys is declared as `int`. It's not an
> issue introduced by you, but can we maybe fix it while at it?

Yes, of course.

Thanks for your review,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-02 15:08     ` Patrick Steinhardt
  2025-07-02 15:49       ` Lidong Yan
@ 2025-07-02 18:28       ` Junio C Hamano
  2025-07-03  1:41         ` Lidong Yan
  1 sibling, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-07-02 18:28 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Lidong Yan, git, Lidong Yan

Patrick Steinhardt <ps@pks.im> writes:

>> +static inline void fill_bloom_keyvec_key(const char *data, size_t len,
>> +					 struct bloom_keyvec *vec, size_t nr,
>> +					 const struct bloom_filter_settings *settings)
>> +{
>> +	assert(nr < vec->count);
>> +	fill_bloom_key(data, len, &vec->key[nr], settings);
>> +}
>> +
>
> Similarly, this should probably be called `bloom_keyvec_fill_key()`.

If so, a preliminary clean-up patch in front of the series is in
order, as the <bloom.h> header file, without these patches, already
has the follwoing external API functions and structures declared,
that do not follow your naming rules at all (I have removed the ones
that begin with "bloom_" from the below):

    int load_bloom_filter_from_graph()
    uint32_t murmur3_seeded_v2();
    void fill_bloom_key();
    void clear_bloom_key(s);
    void add_key_to_filter();
    void init_bloom_filters(void);
    void deinit_bloom_filters(void);
    struct bloom_filter *get_or_compute_bloom_filter();
    struct bloom_filter *get_bloom_filter();

It is very dubious that murmur3_seeded_v2() is exposed (nobody would
know it is for Bloom filter subsystem from that name); as far as I
can tell, it is only needed for t/helper testing, and makes me
wonder if we can come up with a better division between the
production code and t/helper/ code around there.

Thanks.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-02 18:28       ` Junio C Hamano
@ 2025-07-03  1:41         ` Lidong Yan
  0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-03  1:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Patrick Steinhardt, git

Junio C Hamano <gitster@pobox.com> writes:

> It is very dubious that murmur3_seeded_v2() is exposed (nobody would
> know it is for Bloom filter subsystem from that name); as far as I
> can tell, it is only needed for t/helper testing, and makes me
> wonder if we can come up with a better division between the
> production code and t/helper/ code around there.
> 
> Thanks.


Maybe we can do something like this:
    struct bloom_filter_settings settings;
    struct bloom_key key;
    uint32_t hash0;

    settings->num_hashes = 1;
    settings->hash_version = 2;
    fill_bloom_key(argv[2], strlen(argv[2]), &key, &setting);
    hash0 = key->hash[0];
    clear_bloom_key(&key);

    return hash0;

In t/helper, so that we don’t need to export murmur3_seeded_v2().

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-07-02 15:48           ` Junio C Hamano
@ 2025-07-03  1:52             ` Lidong Yan
  2025-07-04 12:09             ` Lidong Yan
  1 sibling, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-03  1:52 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:
> 
> Before concluding so, we may want to double check how Bloom filters
> are built on case insensitive systems, though.  If we normalize the
> string by downcasing before murmuring the string, the resulting
> Bloom filter may have more false positives for those who want to
> (ab)use it to optimize case sensitive queries (without affecting
> correctness), but case insensitive queries would be helped.  I do
> not think we support (or want to support) a repository that spans
> across two filesystems with different case sensitivity, so those who
> worked on our changed-path Bloom filter subsystem may have already
> placed such an optimization, based on the case sensitivity recorded
> in the repository (core.ignorecase).

I understand. I should check whether commit graph file's change path
bloom filter is case sensitive. If the change path bloom filter is case insensitive,
we could optimize PATHSPEC_ICASE as well.

> 
> Updating a patch with a fix to obvious known problems is good.
> 
> Extending the scope of the series should be left out for a new
> separate commit.  It may even be a better idea to hold it while
> the current set of patches are still being polished, and then sent
> out as a new series after the dust settles (even if you internally
> developed that part as a direct extension to the current effort).

Got it. So for now I should just polish this current patch. 

Thanks,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v4 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-06-28  4:21   ` [PATCH v3 " Lidong Yan
@ 2025-07-04 11:14     ` Lidong Yan
  2025-07-04 11:14       ` [PATCH v4 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
                         ` (4 more replies)
  0 siblings, 5 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster

This series enables bloom filter optimization for multiple pathspec
elements. Compared to v3, v4 rename *_bloom_key and *_bloom_keyvec methods
to bloom_key_* and bloom_keyvec_*, respectively, to follow the git code
guidelines. Also, it adds a new test helper to return murmur3 hash.

Lidong Yan (4):
  bloom: add test helper to return murmur3 hash
  bloom: rename function operates on bloom_key
  bloom: replace struct bloom_key * with struct bloom_keyvec
  bloom: optimize multiple pathspec items in revision traversal

 blame.c               |   2 +-
 bloom.c               |  52 +++++++++++++++--
 bloom.h               |  41 +++++++++----
 line-log.c            |   4 +-
 revision.c            | 131 +++++++++++++++++++++++-------------------
 revision.h            |   6 +-
 t/helper/test-bloom.c |   8 +--
 t/t4216-log-bloom.sh  |  23 +++++---
 8 files changed, 174 insertions(+), 93 deletions(-)

-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v4 1/4] bloom: add test helper to return murmur3 hash
  2025-07-04 11:14     ` [PATCH v4 0/4] " Lidong Yan
@ 2025-07-04 11:14       ` Lidong Yan
  2025-07-04 11:14       ` [PATCH v4 2/4] bloom: rename function operates on bloom_key Lidong Yan
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster

In bloom.h, murmur3_seeded_v2() is exported for the use of test murmur3
hash. To clarify that murmur3_seeded_v2() is exported solely for testing
purposes, a new helper function test_murmur3_seeded() was added instead
of exporting murmur3_seeded_v2() directly.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c               | 13 ++++++++++++-
 bloom.h               | 12 +++---------
 t/helper/test-bloom.c |  4 ++--
 3 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..946c5e8c98 100644
--- a/bloom.c
+++ b/bloom.c
@@ -107,7 +107,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+static uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -540,3 +540,14 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 
 	return 1;
 }
+
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+				   int version)
+{
+	assert(version == 1 || version == 2);
+
+	if (version == 2)
+		return murmur3_seeded_v2(seed, data, len);
+	else
+		return murmur3_seeded_v1(seed, data, len);
+}
diff --git a/bloom.h b/bloom.h
index 6e46489a20..a9ded1822f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,15 +78,6 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
 
-/*
- * Calculate the murmur3 32-bit hash value for the given data
- * using the given seed.
- * Produces a uniformly distributed hash value.
- * Not considered to be cryptographically secure.
- * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
- */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
-
 void fill_bloom_key(const char *data,
 		    size_t len,
 		    struct bloom_key *key,
@@ -137,4 +128,7 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+				   int version);
+
 #endif
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9aa2c5a592..6a24b6e0a6 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -61,13 +61,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		hashed = test_bloom_murmur3_seeded(0, argv[2], strlen(argv[2]), 2);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
 	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
 		uint32_t hashed;
-		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
+		hashed = test_bloom_murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7, 2);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v4 2/4] bloom: rename function operates on bloom_key
  2025-07-04 11:14     ` [PATCH v4 0/4] " Lidong Yan
  2025-07-04 11:14       ` [PATCH v4 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
@ 2025-07-04 11:14       ` Lidong Yan
  2025-07-04 11:14       ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster

git code style requires that functions operating on a struct S
should be named in the form S_verb. However, the functions operating
on struct bloom_key do not follow this convention. Therefore,
fill_bloom_key() and clear_bloom_key() are renamed to bloom_key_fill()
and bloom_key_clear(), respectively.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 blame.c               | 2 +-
 bloom.c               | 8 ++++----
 bloom.h               | 4 ++--
 line-log.c            | 4 ++--
 revision.c            | 6 +++---
 t/helper/test-bloom.c | 4 ++--
 6 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/blame.c b/blame.c
index 57daa45e89..459043a511 100644
--- a/blame.c
+++ b/blame.c
@@ -1310,7 +1310,7 @@ static void add_bloom_key(struct blame_bloom_data *bd,
 	}
 
 	bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
-	fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+	bloom_key_fill(path, strlen(path), bd->keys[bd->nr], bd->settings);
 	bd->nr++;
 }
 
diff --git a/bloom.c b/bloom.c
index 946c5e8c98..35ff36c31c 100644
--- a/bloom.c
+++ b/bloom.c
@@ -221,7 +221,7 @@ static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 	return seed;
 }
 
-void fill_bloom_key(const char *data,
+void bloom_key_fill(const char *data,
 		    size_t len,
 		    struct bloom_key *key,
 		    const struct bloom_filter_settings *settings)
@@ -243,7 +243,7 @@ void fill_bloom_key(const char *data,
 		key->hashes[i] = hash0 + i * hash1;
 }
 
-void clear_bloom_key(struct bloom_key *key)
+void bloom_key_clear(struct bloom_key *key)
 {
 	FREE_AND_NULL(key->hashes);
 }
@@ -500,9 +500,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			bloom_key_fill(e->path, strlen(e->path), &key, settings);
 			add_key_to_filter(&key, filter, settings);
-			clear_bloom_key(&key);
+			bloom_key_clear(&key);
 		}
 
 	cleanup:
diff --git a/bloom.h b/bloom.h
index a9ded1822f..edf14fef3e 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,11 +78,11 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
 
-void fill_bloom_key(const char *data,
+void bloom_key_fill(const char *data,
 		    size_t len,
 		    struct bloom_key *key,
 		    const struct bloom_filter_settings *settings);
-void clear_bloom_key(struct bloom_key *key);
+void bloom_key_clear(struct bloom_key *key);
 
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
diff --git a/line-log.c b/line-log.c
index 628e3fe3ae..a2aaf869a3 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1172,12 +1172,12 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 0;
 
 	while (!result && range) {
-		fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
+		bloom_key_fill(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
 
 		if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
 			result = 1;
 
-		clear_bloom_key(&key);
+		bloom_key_clear(&key);
 		range = range->next;
 	}
 
diff --git a/revision.c b/revision.c
index afee111196..49fc650ac7 100644
--- a/revision.c
+++ b/revision.c
@@ -739,14 +739,14 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	revs->bloom_keys_nr = path_component_nr;
 	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
 
-	fill_bloom_key(path, len, &revs->bloom_keys[0],
+	bloom_key_fill(path, len, &revs->bloom_keys[0],
 		       revs->bloom_filter_settings);
 	path_component_nr = 1;
 
 	p = path + len - 1;
 	while (p > path) {
 		if (*p == '/')
-			fill_bloom_key(path, p - path,
+			bloom_key_fill(path, p - path,
 				       &revs->bloom_keys[path_component_nr++],
 				       revs->bloom_filter_settings);
 		p--;
@@ -3231,7 +3231,7 @@ void release_revisions(struct rev_info *revs)
 	oidset_clear(&revs->missing_commits);
 
 	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		clear_bloom_key(&revs->bloom_keys[i]);
+		bloom_key_clear(&revs->bloom_keys[i]);
 	FREE_AND_NULL(revs->bloom_keys);
 	revs->bloom_keys_nr = 0;
 }
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6a24b6e0a6..585a107802 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -12,13 +12,13 @@ static struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
 		struct bloom_key key;
 
-		fill_bloom_key(data, strlen(data), &key, &settings);
+		bloom_key_fill(data, strlen(data), &key, &settings);
 		printf("Hashes:");
 		for (size_t i = 0; i < settings.num_hashes; i++)
 			printf("0x%08x|", key.hashes[i]);
 		printf("\n");
 		add_key_to_filter(&key, filter, &settings);
-		clear_bloom_key(&key);
+		bloom_key_clear(&key);
 }
 
 static void print_bloom_filter(struct bloom_filter *filter) {
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-04 11:14     ` [PATCH v4 0/4] " Lidong Yan
  2025-07-04 11:14       ` [PATCH v4 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
  2025-07-04 11:14       ` [PATCH v4 2/4] bloom: rename function operates on bloom_key Lidong Yan
@ 2025-07-04 11:14       ` Lidong Yan
  2025-07-07 11:35         ` Derrick Stolee
  2025-07-04 11:14       ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
  2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
  4 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster

The revision traversal limited by pathspec has optimization when
the pathspec has only one element. To support optimization for
multiple pathspec items, we need to modify the data structures
in struct rev_info.

struct rev_info uses bloom_keys and bloom_nr to store the bloom keys
corresponding to a single pathspec item. To allow struct rev_info
to store bloom keys for multiple pathspec items, a new data structure
`struct bloom_keyvec` is introduced. Each `struct bloom_keyvec`
corresponds to a single pathspec item.

In `struct rev_info`, replace bloom_keys and bloom_nr with bloom_keyvecs
and bloom_keyvec_nr. This commit still optimize one pathspec item, thus
bloom_keyvec_nr can only be 0 or 1.

New bloom_keyvec_* functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
contained in a bloom filter. bloom_keyvec_fill_key() is added to
initialize a key in keyvec.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c    | 31 +++++++++++++++++++++++++++++++
 bloom.h    | 25 +++++++++++++++++++++++++
 revision.c | 39 +++++++++++++++++++++------------------
 revision.h |  6 +++---
 4 files changed, 80 insertions(+), 21 deletions(-)

diff --git a/bloom.c b/bloom.c
index 35ff36c31c..877bda0ef3 100644
--- a/bloom.c
+++ b/bloom.c
@@ -280,6 +280,25 @@ void deinit_bloom_filters(void)
 	deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
 }
 
+struct bloom_keyvec *bloom_keyvec_new(size_t count)
+{
+	struct bloom_keyvec *vec;
+	size_t sz = sizeof(struct bloom_keyvec);
+	sz += count * sizeof(struct bloom_key);
+	vec = (struct bloom_keyvec *)xcalloc(1, sz);
+	vec->count = count;
+	return vec;
+}
+
+void bloom_keyvec_free(struct bloom_keyvec *vec)
+{
+	if (!vec)
+		return;
+	for (size_t nr = 0; nr < vec->count; nr++)
+		bloom_key_clear(&vec->key[nr]);
+	free(vec);
+}
+
 static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 		       const struct hashmap_entry *eptr,
 		       const struct hashmap_entry *entry_or_key,
@@ -541,6 +560,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 	return 1;
 }
 
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *vec,
+			      const struct bloom_filter_settings *settings)
+{
+	int ret = 1;
+
+	for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
+		ret = bloom_filter_contains(filter, &vec->key[nr], settings);
+
+	return ret;
+}
+
 uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
 				   int version)
 {
diff --git a/bloom.h b/bloom.h
index edf14fef3e..3669074f3a 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,16 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+/*
+ * A bloom_keyvec is a vector of bloom_keys, which
+ * can be used to store multiple keys for a single
+ * pathspec item.
+ */
+struct bloom_keyvec {
+	size_t count;
+	struct bloom_key key[FLEX_ARRAY];
+};
+
 int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
@@ -84,6 +94,17 @@ void bloom_key_fill(const char *data,
 		    const struct bloom_filter_settings *settings);
 void bloom_key_clear(struct bloom_key *key);
 
+struct bloom_keyvec *bloom_keyvec_new(size_t count);
+void bloom_keyvec_free(struct bloom_keyvec *vec);
+
+static inline void bloom_keyvec_fill_key(const char *data, size_t len,
+					 struct bloom_keyvec *vec, size_t nr,
+					 const struct bloom_filter_settings *settings)
+{
+	assert(nr < vec->count);
+	bloom_key_fill(data, len, &vec->key[nr], settings);
+}
+
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
 		       const struct bloom_filter_settings *settings);
@@ -128,6 +149,10 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *v,
+			      const struct bloom_filter_settings *settings);
+
 uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
 				   int version);
 
diff --git a/revision.c b/revision.c
index 49fc650ac7..7cbb49617d 100644
--- a/revision.c
+++ b/revision.c
@@ -688,6 +688,7 @@ static int forbid_bloom_filters(struct pathspec *spec)
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
+	struct bloom_keyvec *bloom_keyvec;
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
@@ -736,19 +737,21 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		p++;
 	}
 
-	revs->bloom_keys_nr = path_component_nr;
-	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+	revs->bloom_keyvecs_nr = 1;
+	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+	bloom_keyvec = bloom_keyvec_new(path_component_nr);
+	revs->bloom_keyvecs[0] = bloom_keyvec;
 
-	bloom_key_fill(path, len, &revs->bloom_keys[0],
-		       revs->bloom_filter_settings);
+	bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
+			      revs->bloom_filter_settings);
 	path_component_nr = 1;
 
 	p = path + len - 1;
 	while (p > path) {
 		if (*p == '/')
-			bloom_key_fill(path, p - path,
-				       &revs->bloom_keys[path_component_nr++],
-				       revs->bloom_filter_settings);
+			bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
+					      path_component_nr++,
+					      revs->bloom_filter_settings);
 		p--;
 	}
 
@@ -764,7 +767,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result = 1, j;
+	int result = 0;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -779,10 +782,10 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
-		result = bloom_filter_contains(filter,
-					       &revs->bloom_keys[j],
-					       revs->bloom_filter_settings);
+	for (size_t nr = 0; !result && nr < revs->bloom_keyvecs_nr; nr++) {
+		result = bloom_filter_contains_vec(filter,
+						   revs->bloom_keyvecs[nr],
+						   revs->bloom_filter_settings);
 	}
 
 	if (result)
@@ -823,7 +826,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_keys_nr && !nth_parent) {
+	if (revs->bloom_keyvecs_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
@@ -850,7 +853,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
 	if (!t1)
 		return 0;
 
-	if (!nth_parent && revs->bloom_keys_nr) {
+	if (!nth_parent && revs->bloom_keyvecs_nr) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 		if (!bloom_ret)
 			return 1;
@@ -3230,10 +3233,10 @@ void release_revisions(struct rev_info *revs)
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
 
-	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		bloom_key_clear(&revs->bloom_keys[i]);
-	FREE_AND_NULL(revs->bloom_keys);
-	revs->bloom_keys_nr = 0;
+	for (size_t i = 0; i < revs->bloom_keyvecs_nr; i++)
+		bloom_keyvec_free(revs->bloom_keyvecs[i]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..ac843f58d0 100644
--- a/revision.h
+++ b/revision.h
@@ -62,7 +62,7 @@ struct repository;
 struct rev_info;
 struct string_list;
 struct saved_parents;
-struct bloom_key;
+struct bloom_keyvec;
 struct bloom_filter_settings;
 struct option;
 struct parse_opt_ctx_t;
@@ -360,8 +360,8 @@ struct rev_info {
 
 	/* Commit graph bloom filter fields */
 	/* The bloom filter key(s) for the pathspec */
-	struct bloom_key *bloom_keys;
-	int bloom_keys_nr;
+	struct bloom_keyvec **bloom_keyvecs;
+	int bloom_keyvecs_nr;
 
 	/*
 	 * The bloom filter settings used to generate the key.
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
  2025-07-04 11:14     ` [PATCH v4 0/4] " Lidong Yan
                         ` (2 preceding siblings ...)
  2025-07-04 11:14       ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-04 11:14       ` Lidong Yan
  2025-07-07 11:43         ` Derrick Stolee
  2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
  4 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster

To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add code to initialize and check each pathspec item's bloom_keyvec.

Add new function release_revisions_bloom_keyvecs() to free all bloom
keyvec owned by rev_info.

Add new test cases in t/t4216-log-bloom.sh to ensure
  - consistent results between the optimization for multiple pathspec
    items using bloom filter and the case without bloom filter
    optimization.
  - does not use bloom filter if any pathspec item is not literal.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 revision.c           | 118 ++++++++++++++++++++++++-------------------
 t/t4216-log-bloom.sh |  23 +++++----
 2 files changed, 79 insertions(+), 62 deletions(-)

diff --git a/revision.c b/revision.c
index 7cbb49617d..9a77c0d0bc 100644
--- a/revision.c
+++ b/revision.c
@@ -675,16 +675,17 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
-		return 1;
+	for (size_t nr = 0; nr < spec->nr; nr++)
+		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+			return 1;
 
 	return 0;
 }
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
@@ -692,7 +693,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
-	int path_component_nr = 1;
+	int path_component_nr;
 
 	if (!revs->commits)
 		return;
@@ -709,50 +710,52 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	pi = &revs->pruning.pathspec.items[0];
-
-	/* remove single trailing slash from path, if needed */
-	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-		path_alloc = xmemdupz(pi->match, pi->len - 1);
-		path = path_alloc;
-	} else
-		path = pi->match;
-
-	len = strlen(path);
-	if (!len) {
-		revs->bloom_filter_settings = NULL;
-		free(path_alloc);
-		return;
-	}
-
-	p = path;
-	while (*p) {
-		/*
-		 * At this point, the path is normalized to use Unix-style
-		 * path separators. This is required due to how the
-		 * changed-path Bloom filters store the paths.
-		 */
-		if (*p == '/')
-			path_component_nr++;
-		p++;
-	}
-
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-	bloom_keyvec = bloom_keyvec_new(path_component_nr);
-	revs->bloom_keyvecs[0] = bloom_keyvec;
-
-	bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
-			      revs->bloom_filter_settings);
-	path_component_nr = 1;
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		pi = &revs->pruning.pathspec.items[i];
+		path_component_nr = 1;
+
+		/* remove single trailing slash from path, if needed */
+		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+			path_alloc = xmemdupz(pi->match, pi->len - 1);
+			path = path_alloc;
+		} else
+			path = pi->match;
+
+		len = strlen(path);
+		if (!len)
+			goto fail;
+
+		p = path;
+		while (*p) {
+			/*
+			 * At this point, the path is normalized to use
+			 * Unix-style path separators. This is required due to
+			 * how the changed-path Bloom filters store the paths.
+			 */
+			if (*p == '/')
+				path_component_nr++;
+			p++;
+		}
 
-	p = path + len - 1;
-	while (p > path) {
-		if (*p == '/')
-			bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
-					      path_component_nr++,
-					      revs->bloom_filter_settings);
-		p--;
+		bloom_keyvec = bloom_keyvec_new(path_component_nr);
+		revs->bloom_keyvecs[i] = bloom_keyvec;
+
+		bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
+				      revs->bloom_filter_settings);
+		path_component_nr = 1;
+
+		p = path + len - 1;
+		while (p > path) {
+			if (*p == '/')
+				bloom_keyvec_fill_key(path, p - path,
+						      bloom_keyvec,
+						      path_component_nr++,
+						      revs->bloom_filter_settings);
+			p--;
+		}
+		FREE_AND_NULL(path_alloc);
 	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
@@ -760,7 +763,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		bloom_filter_atexit_registered = 1;
 	}
 
+	return;
+
+fail:
+	revs->bloom_filter_settings = NULL;
 	free(path_alloc);
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
@@ -3204,6 +3212,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
 
 static void release_revisions_topo_walk_info(struct topo_walk_info *info);
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+	for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
+		bloom_keyvec_free(revs->bloom_keyvecs[nr]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
+}
+
 static void free_void_commit_list(void *list)
 {
 	free_commit_list(list);
@@ -3232,11 +3248,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
-
-	for (size_t i = 0; i < revs->bloom_keyvecs_nr; i++)
-		bloom_keyvec_free(revs->bloom_keyvecs[i]);
-	FREE_AND_NULL(revs->bloom_keyvecs);
-	revs->bloom_keyvecs_nr = 0;
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
 	rm -f "$TRASH_DIRECTORY/trace.perf" &&
-	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
-	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+	eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+	eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+		git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
 }
 
 test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
-'
-
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
 	test_bloom_filters_not_used "-- ."
 '
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+	test_bloom_filters_used "-- file4 A/file1" &&
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+	test_bloom_filters_not_used "-- file\*" &&
+	test_bloom_filters_not_used "-- A/\* file4" &&
+	test_bloom_filters_not_used "-- file4 A/\*" &&
+	test_bloom_filters_not_used "-- * A/\*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/2] bloom: enable multiple pathspec bloom keys
  2025-07-02 15:48           ` Junio C Hamano
  2025-07-03  1:52             ` Lidong Yan
@ 2025-07-04 12:09             ` Lidong Yan
  1 sibling, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 12:09 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, SZEDER Gábor

Junio C Hamano <gitster@pobox.com> writes:
> 
> Before concluding so, we may want to double check how Bloom filters
> are built on case insensitive systems, though.  If we normalize the
> string by downcasing before murmuring the string, the resulting
> Bloom filter may have more false positives for those who want to
> (ab)use it to optimize case sensitive queries (without affecting
> correctness), but case insensitive queries would be helped.  I do
> not think we support (or want to support) a repository that spans
> across two filesystems with different case sensitivity, so those who
> worked on our changed-path Bloom filter subsystem may have already
> placed such an optimization, based on the case sensitivity recorded
> in the repository (core.ignorecase).

In bloom.c:get_or_compute_bloom_filter(), the computation of a bloom filter
looks like:
    diff_tree_oid(c’s parent or NULL, &c->object.oid, "", &diffopt);
    diffcore_std(&diffopt);
    struct hashmap path_hashmap;

    for (path : diff_queue_diff) {
        Add all parts of path to path_hashmap;
    }

    for_each(path_hashmap) {
        Add path to filter
    }

All these steps do not check config.ignoreCase, so I believe the Bloom filter we
build in the commit graph is case-sensitive.

To demonstrate this assumption—and since I happen to be a Mac user (where
config.ignoreCase is true by default)—I ran the following commands under the
llvm-project repository:

$ git commit-graph write --split --reachable --changed-paths
$ time git log -5 -t -- README.md > /dev/null
real	0m0.089s
user	0m0.067s
sys	0m0.021s
$ time git log -5 -t -- ':(icase)README.md' > /dev/null
real	0m0.281s
user	0m0.239s
sys	0m0.041s
$ time git log -5 -t -- ‘rEADME.md’ > /dev/null
real	0m0.458s
user	0m0.394s
sys	0m0.061s

And I think it proves that changed-path Bloom filter doesn’t optimize icase
pathspec item in case insensitive file system.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-04 11:14       ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-07 11:35         ` Derrick Stolee
  2025-07-07 14:14           ` Lidong Yan
  0 siblings, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-07 11:35 UTC (permalink / raw)
  To: Lidong Yan; +Cc: 502024330056, git, gitster

On 7/4/2025 7:14 AM, Lidong Yan wrote:
> The revision traversal limited by pathspec has optimization when
> the pathspec has only one element. To support optimization for
> multiple pathspec items, we need to modify the data structures
> in struct rev_info.

You are correct that the revision-walking abandons bloom filters
when there are multiple pathspecs, and fixing this is a valuable
effort.

The need for this change is subtle and could use some extra context
to be sure reviewers understand:

This change is writing over some code that was created in
c525ce95b46 (commit-graph: check all leading directories in changed
path Bloom filters, 2020-07-01) to allow storing multiple bloom
keys during the revision walk. The multiple keys are focusing on
multiple path components of the literal pathspec. 
The point is that after the initialization, the bloom key array is
used directly as a filter for reporting TREESAME commits: a commit
is automatically reported as TREESAME to its first parent if any
bloom key results in a "No, definitely not changed" result with
that commit's bloom filter.

The reason we need a new data structure is that we need to adjust
the conditionals.

BEFORE: "NOT TREESAME if there EXISTS a bloom key that reports NO"

AFTER: "NOT TREESAME if FOR EVERY pathspec there EXISTS a bloom key
        that reports NO."

This "FOR EVERY" condition makes it impossible to use a flat array
of bloom keys for multiple pathspecs, justifying this change.

What is further confusing here is that we already have logic that
deals with arrays of bloom keys, so I expected that the vector was
the single structure storing a list of those arrays. Instead, the
vector is replacing the array itself. This is made clear by using
the vector immediately in the existing implementation.
> +struct bloom_keyvec *bloom_keyvec_new(size_t count)
> +{
> +	struct bloom_keyvec *vec;
> +	size_t sz = sizeof(struct bloom_keyvec);
> +	sz += count * sizeof(struct bloom_key);
> +	vec = (struct bloom_keyvec *)xcalloc(1, sz);
You could use CALLOC_ARRAY() to simplify this and drop
the 'sz' variable.

> +	vec->count = count;
> +	return vec;
> +}
> +
> +void bloom_keyvec_free(struct bloom_keyvec *vec)
> +{
> +	if (!vec)
> +		return;
> +	for (size_t nr = 0; nr < vec->count; nr++)
> +		bloom_key_clear(&vec->key[nr]);
> +	free(vec);
> +}
> +
>  static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
>  		       const struct hashmap_entry *eptr,
>  		       const struct hashmap_entry *entry_or_key,
> @@ -541,6 +560,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
>  	return 1;
>  }
>  
> +int bloom_filter_contains_vec(const struct bloom_filter *filter,
> +			      const struct bloom_keyvec *vec,
> +			      const struct bloom_filter_settings *settings)
> +{
> +	int ret = 1;
> +
> +	for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
> +		ret = bloom_filter_contains(filter, &vec->key[nr], settings);
> +
> +	return ret;
> +}

This implementation is where the subtle detail comes in. Might be worth
a comment to say "if any key in this list is not contained in the filter,
then the filter doesn't match this vector."


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
  2025-07-04 11:14       ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
@ 2025-07-07 11:43         ` Derrick Stolee
  2025-07-07 14:18           ` Lidong Yan
  2025-07-07 15:14           ` Junio C Hamano
  0 siblings, 2 replies; 72+ messages in thread
From: Derrick Stolee @ 2025-07-07 11:43 UTC (permalink / raw)
  To: Lidong Yan; +Cc: 502024330056, git, gitster

On 7/4/2025 7:14 AM, Lidong Yan wrote:
> To enable optimize multiple pathspec items in revision traversal,
> return 0 if all pathspec item is literal in forbid_bloom_filters().
> Add code to initialize and check each pathspec item's bloom_keyvec.
> 
> Add new function release_revisions_bloom_keyvecs() to free all bloom
> keyvec owned by rev_info.
> 
> Add new test cases in t/t4216-log-bloom.sh to ensure
>   - consistent results between the optimization for multiple pathspec
>     items using bloom filter and the case without bloom filter
>     optimization.
>   - does not use bloom filter if any pathspec item is not literal.

This would be a great time to add some performance statistics when
using this feature with multiple pathspecs on some standard repos (git
and the Linux kernel repo are two good examples).

We don't have a great performance script for this, since each test
repo will have different paths to use for comparisons, but you can
use 'hyperfine' to assemble your own comparisons before and after
this change and report them here (and in your cover letter). 
> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
> ---
>  revision.c           | 118 ++++++++++++++++++++++++-------------------
>  t/t4216-log-bloom.sh |  23 +++++----
>  2 files changed, 79 insertions(+), 62 deletions(-)
> 
> diff --git a/revision.c b/revision.c
> index 7cbb49617d..9a77c0d0bc 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -675,16 +675,17 @@ static int forbid_bloom_filters(struct pathspec *spec)
>  {
>  	if (spec->has_wildcard)
>  		return 1;
> -	if (spec->nr > 1)
> -		return 1;
>  	if (spec->magic & ~PATHSPEC_LITERAL)
>  		return 1;
> -	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> -		return 1;
> +	for (size_t nr = 0; nr < spec->nr; nr++)
> +		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
> +			return 1;

This is a good check: if any is non-literal, then we can't use bloom
filters.

> +static void release_revisions_bloom_keyvecs(struct rev_info *revs);
> +
>  static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
> @@ -692,7 +693,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  	char *path_alloc = NULL;
>  	const char *path, *p;
>  	size_t len;
> -	int path_component_nr = 1;
> +	int path_component_nr;

We can move this into the interior of the loop, right?

>  	if (!revs->commits)
>  		return;
> @@ -709,50 +710,52 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  	if (!revs->pruning.pathspec.nr)
>  		return;
>  
> -	pi = &revs->pruning.pathspec.items[0];
> -
> -	/* remove single trailing slash from path, if needed */
> -	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> -		path_alloc = xmemdupz(pi->match, pi->len - 1);
> -		path = path_alloc;
> -	} else
> -		path = pi->match;
> -
> -	len = strlen(path);
> -	if (!len) {
> -		revs->bloom_filter_settings = NULL;
> -		free(path_alloc);
> -		return;
> -	}
> -
> -	p = path;
> -	while (*p) {
> -		/*
> -		 * At this point, the path is normalized to use Unix-style
> -		 * path separators. This is required due to how the
> -		 * changed-path Bloom filters store the paths.
> -		 */
> -		if (*p == '/')
> -			path_component_nr++;
> -		p++;
> -	}
> -
> -	revs->bloom_keyvecs_nr = 1;
> -	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> -	bloom_keyvec = bloom_keyvec_new(path_component_nr);
> -	revs->bloom_keyvecs[0] = bloom_keyvec;
> -
> -	bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
> -			      revs->bloom_filter_settings);
> -	path_component_nr = 1;

The size of this diff is unfortunate. I wonder if there could first
be an extraction of this logic to operate on a single pathspec and
bloom_keyvec in a way that would be an obvious code move, then this
patch could call that method in a loop now that we have an array of
bloom_keyvecs.

I think it would make a cleaner patch and a cleaner final result.

> +	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> +	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> +	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> +		pi = &revs->pruning.pathspec.items[i];
> +		path_component_nr = 1;
> +
> +		/* remove single trailing slash from path, if needed */
> +		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> +			path_alloc = xmemdupz(pi->match, pi->len - 1);
> +			path = path_alloc;
> +		} else
> +			path = pi->match;
> +
> +		len = strlen(path);
> +		if (!len)
> +			goto fail;
> +
> +		p = path;
> +		while (*p) {
> +			/*
> +			 * At this point, the path is normalized to use
> +			 * Unix-style path separators. This is required due to
> +			 * how the changed-path Bloom filters store the paths.
> +			 */
> +			if (*p == '/')
> +				path_component_nr++;
> +			p++;
> +		}
>  
> -	p = path + len - 1;
> -	while (p > path) {
> -		if (*p == '/')
> -			bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
> -					      path_component_nr++,
> -					      revs->bloom_filter_settings);
> -		p--;
> +		bloom_keyvec = bloom_keyvec_new(path_component_nr);
> +		revs->bloom_keyvecs[i] = bloom_keyvec;
> +
> +		bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
> +				      revs->bloom_filter_settings);
> +		path_component_nr = 1;
> +
> +		p = path + len - 1;
> +		while (p > path) {
> +			if (*p == '/')
> +				bloom_keyvec_fill_key(path, p - path,
> +						      bloom_keyvec,
> +						      path_component_nr++,
> +						      revs->bloom_filter_settings);
> +			p--;
> +		}
> +		FREE_AND_NULL(path_alloc);

...

> -test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
> -	test_bloom_filters_not_used "-- file4 A/file1"
> -'
> -

Love to see this.

>  test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
>  	test_bloom_filters_not_used "-- ."
>  '
> @@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
>  	test_bloom_filters_used "-- *renamed"
>  '
>  
> -test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
> -	test_bloom_filters_not_used "-- *" &&
> -	test_bloom_filters_not_used "-- file*"
> +test_expect_success 'git log with multiple literal paths uses Bloom filter' '
> +	test_bloom_filters_used "-- file4 A/file1" &&
> +	test_bloom_filters_used "-- *" &&
> +	test_bloom_filters_used "-- file*"
> +'
> +
> +test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
> +	test_bloom_filters_not_used "-- file\*" &&
> +	test_bloom_filters_not_used "-- A/\* file4" &&
> +	test_bloom_filters_not_used "-- file4 A/\*" &&
> +	test_bloom_filters_not_used "-- * A/\*"
>  '

And these new test cases are great.

Thanks for this work. I'm happy to see the feature be added and my suggestions
are purely cosmetic as the proof is in your tests.

Thanks,
-Stolee 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-07 11:35         ` Derrick Stolee
@ 2025-07-07 14:14           ` Lidong Yan
  0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-07 14:14 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster

Derrick Stolee <stolee@gmail.com> wrote:
> 
> BEFORE: "NOT TREESAME if there EXISTS a bloom key that reports NO"
> 
> AFTER: "NOT TREESAME if FOR EVERY pathspec there EXISTS a bloom key
>        that reports NO."
> 
> This "FOR EVERY" condition makes it impossible to use a flat array
> of bloom keys for multiple pathspecs, justifying this change.
> 
> What is further confusing here is that we already have logic that
> deals with arrays of bloom keys, so I expected that the vector was
> the single structure storing a list of those arrays. Instead, the
> vector is replacing the array itself. This is made clear by using
> the vector immediately in the existing implementation.

I think the problem here is that it clearly enough in my comments.
When writing the code, I thought about converting the original
one-dimensional array struct bloom_key *keys into a two-dimensional
array struct bloom_key **. Then I came up with the idea that this
two-dimensional array could be designed as struct bloom_keyvec *keyvecs,
which might be clearer. Each struct bloom_keyvec would represent all the
bloom_key elements for a single pathspec item.

>> +struct bloom_keyvec *bloom_keyvec_new(size_t count)
>> +{
>> + struct bloom_keyvec *vec;
>> + size_t sz = sizeof(struct bloom_keyvec);
>> + sz += count * sizeof(struct bloom_key);
>> + vec = (struct bloom_keyvec *)xcalloc(1, sz);
> You could use CALLOC_ARRAY() to simplify this and drop
> the 'sz' variable.

You are right. I think you suggest to write struct bloom_keyvec like:

———————-
|        count        |
———————
|         *keys       |  ——>     key0 | key1 | key2 | … |
———————

And I am doing here makes struct bloom_keyvec looks like

———————
|       count        |
———————
|        key[0]      |
———————
|        key[1]      |
———————
|        …            |

Although bloom_keyvec_new() appears more complex, the advantage
is that bloom_keyvec_destroy() no longer needs to free keys manually.
And if I understand correctly, junio had suggested to use the second way
[here](https://lore.kernel.org/git/xmqqtt43u36t.fsf@gitster.g/).

> 
>> + vec->count = count;
>> + return vec;
>> +}
>> +
>> +void bloom_keyvec_free(struct bloom_keyvec *vec)
>> +{
>> + if (!vec)
>> + return;
>> + for (size_t nr = 0; nr < vec->count; nr++)
>> + bloom_key_clear(&vec->key[nr]);
>> + free(vec);
>> +}
>> +
>> static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
>>       const struct hashmap_entry *eptr,
>>       const struct hashmap_entry *entry_or_key,
>> @@ -541,6 +560,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
>> return 1;
>> }
>> 
>> +int bloom_filter_contains_vec(const struct bloom_filter *filter,
>> +      const struct bloom_keyvec *vec,
>> +      const struct bloom_filter_settings *settings)
>> +{
>> + int ret = 1;
>> +
>> + for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
>> + ret = bloom_filter_contains(filter, &vec->key[nr], settings);
>> +
>> + return ret;
>> +}
> 
> This implementation is where the subtle detail comes in. Might be worth
> a comment to say "if any key in this list is not contained in the filter,
> then the filter doesn't match this vector."

I will add comment in the next version

Thank you for your review,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
  2025-07-07 11:43         ` Derrick Stolee
@ 2025-07-07 14:18           ` Lidong Yan
  2025-07-07 15:14           ` Junio C Hamano
  1 sibling, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-07 14:18 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster

Derrick Stolee <stolee@gmail.com> wrote:
> 
> On 7/4/2025 7:14 AM, Lidong Yan wrote:
>> To enable optimize multiple pathspec items in revision traversal,
>> return 0 if all pathspec item is literal in forbid_bloom_filters().
>> Add code to initialize and check each pathspec item's bloom_keyvec.
>> 
>> Add new function release_revisions_bloom_keyvecs() to free all bloom
>> keyvec owned by rev_info.
>> 
>> Add new test cases in t/t4216-log-bloom.sh to ensure
>>  - consistent results between the optimization for multiple pathspec
>>    items using bloom filter and the case without bloom filter
>>    optimization.
>>  - does not use bloom filter if any pathspec item is not literal.
> 
> This would be a great time to add some performance statistics when
> using this feature with multiple pathspecs on some standard repos (git
> and the Linux kernel repo are two good examples).
> 
> We don't have a great performance script for this, since each test
> repo will have different paths to use for comparisons, but you can
> use 'hyperfine' to assemble your own comparisons before and after
> this change and report them here (and in your cover letter).

Got it, I will add statistics in the next version.

>> + int path_component_nr;
> 
> We can move this into the interior of the loop, right?

Yes, move this into loop looks better.

> 
> The size of this diff is unfortunate. I wonder if there could first
> be an extraction of this logic to operate on a single pathspec and
> bloom_keyvec in a way that would be an obvious code move, then this
> patch could call that method in a loop now that we have an array of
> bloom_keyvecs.

I will try to figure out a way to make diff cleaner.

Thanks,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
  2025-07-07 11:43         ` Derrick Stolee
  2025-07-07 14:18           ` Lidong Yan
@ 2025-07-07 15:14           ` Junio C Hamano
  1 sibling, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-07 15:14 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Lidong Yan, 502024330056, git

Derrick Stolee <stolee@gmail.com> writes:

> On 7/4/2025 7:14 AM, Lidong Yan wrote:
>> To enable optimize multiple pathspec items in revision traversal,
>> return 0 if all pathspec item is literal in forbid_bloom_filters().
>> Add code to initialize and check each pathspec item's bloom_keyvec.
>> 
>> Add new function release_revisions_bloom_keyvecs() to free all bloom
>> keyvec owned by rev_info.
>> 
>> Add new test cases in t/t4216-log-bloom.sh to ensure
>>   - consistent results between the optimization for multiple pathspec
>>     items using bloom filter and the case without bloom filter
>>     optimization.
>>   - does not use bloom filter if any pathspec item is not literal.
>
> This would be a great time to add some performance statistics when
> using this feature with multiple pathspecs on some standard repos (git
> and the Linux kernel repo are two good examples).
>
> We don't have a great performance script for this, since each test
> repo will have different paths to use for comparisons, but you can
> use 'hyperfine' to assemble your own comparisons before and after
> this change and report them here (and in your cover letter). 

Excellent suggestion.  Very much appreciated.

> The size of this diff is unfortunate. I wonder if there could first
> be an extraction of this logic to operate on a single pathspec and
> bloom_keyvec in a way that would be an obvious code move, then this
> patch could call that method in a loop now that we have an array of
> bloom_keyvecs.
>
> I think it would make a cleaner patch and a cleaner final result.

Hmph.  Didn't think of that extra "preliminary clean-up" step
myself, but I think you have a point.  That would likely make
the resulting series easier to follow.

Thanks.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-07-04 11:14     ` [PATCH v4 0/4] " Lidong Yan
                         ` (3 preceding siblings ...)
  2025-07-04 11:14       ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
@ 2025-07-10  8:48       ` Lidong Yan
  2025-07-10  8:48         ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
                           ` (5 more replies)
  4 siblings, 6 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10  8:48 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon

The revision traversal limited by pathspec has optimization when
the pathspec has only one element, it does not use any pathspec
magic (other than literal), and there is no wildcard. The absence
of optimization for multiple pathspec elements in revision traversal
cause an issue raised by Kai Koponen at
  https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@mail.gmail.com/

While it is much harder to lift the latter two limitations,
supporting a pathspec with multiple elements is relatively easy.
Just make sure we hash each of them separately and ask the bloom
filter about them, and if we see none of them can possibly be
affected by the commit, we can skip without tree comparison.

The difference from v4 is:
  - for the bloom_key_* functions, we now pass struct bloom_key *
    as the first parameter.
  - bloom_keyvec_fill_key() and bloom_keyvec_new() have been merged
    into a single function, so that each key is filled during the
    creation of the bloom_keyvec.

Below is a comparison of the time taken to run git log on Git and
LLVM repositories before and after applying this patch.

Setup commit-graph:
  $ cd ~/src/git && git commit-graph write --split --reachable --changed-paths
  $ cd ~/src/llvm && git commit-graph write --split --reachable --changed-paths

Run git log on Git repository
  $ cd ~/src/git
  $ hash -p /usr/bin/git git # my system git binary is 2.43.0
  $ time git log -100 -- commit.c commit-graph.c >/dev/null
  real	0m0.055s
  user	0m0.040s
  sys	  0m0.015s
  $ hash -p ~/bin/git/bin/git git
  $ time git log -100 -- commit.c commit-graph.c >/dev/null
  real	0m0.039s
  user	0m0.020s
  sys	  0m0.020s

Run git log in LLVM repository
  $ cd ~/src/llvm
  $ hash -p /usr/bin/git git # my system git binary is 2.43.0
  $ time git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h >/dev/null
  real	0m2.365s
  user	0m2.252s
  sys	  0m0.113s
  $ hash -p ~/bin/git/bin/git git
  $ time git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h >/dev/null
  real	0m0.240s
  user	0m0.158s
  sys	  0m0.064s

Lidong Yan (4):
  bloom: add test helper to return murmur3 hash
  bloom: rename function operates on bloom_key
  bloom: replace struct bloom_key * with struct bloom_keyvec
  bloom: optimize multiple pathspec items in revision traversal

 blame.c               |  2 +-
 bloom.c               | 84 +++++++++++++++++++++++++++++++++---
 bloom.h               | 54 +++++++++++++++++------
 line-log.c            |  5 ++-
 revision.c            | 99 +++++++++++++++++++------------------------
 revision.h            |  6 +--
 t/helper/test-bloom.c |  8 ++--
 t/t4216-log-bloom.sh  | 23 ++++++----
 8 files changed, 187 insertions(+), 94 deletions(-)

Range-diff against v4:
1:  d6883e9d6c = 1:  31c048dcdb bloom: add test helper to return murmur3 hash
2:  f114556c0f ! 2:  d2603c1752 bloom: rename function operates on bloom_key
    @@ blame.c: static void add_bloom_key(struct blame_bloom_data *bd,
      
      	bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
     -	fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
    -+	bloom_key_fill(path, strlen(path), bd->keys[bd->nr], bd->settings);
    ++	bloom_key_fill(bd->keys[bd->nr], path, strlen(path), bd->settings);
      	bd->nr++;
      }
      
    @@ bloom.c: static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size
      }
      
     -void fill_bloom_key(const char *data,
    -+void bloom_key_fill(const char *data,
    - 		    size_t len,
    - 		    struct bloom_key *key,
    +-		    size_t len,
    +-		    struct bloom_key *key,
    ++void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
      		    const struct bloom_filter_settings *settings)
    + {
    + 	int i;
     @@ bloom.c: void fill_bloom_key(const char *data,
      		key->hashes[i] = hash0 + i * hash1;
      }
    @@ bloom.c: struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
      		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
      			struct bloom_key key;
     -			fill_bloom_key(e->path, strlen(e->path), &key, settings);
    -+			bloom_key_fill(e->path, strlen(e->path), &key, settings);
    ++			bloom_key_fill(&key, e->path, strlen(e->path), settings);
      			add_key_to_filter(&key, filter, settings);
     -			clear_bloom_key(&key);
     +			bloom_key_clear(&key);
    @@ bloom.h: int load_bloom_filter_from_graph(struct commit_graph *g,
      				 uint32_t graph_pos);
      
     -void fill_bloom_key(const char *data,
    -+void bloom_key_fill(const char *data,
    - 		    size_t len,
    - 		    struct bloom_key *key,
    +-		    size_t len,
    +-		    struct bloom_key *key,
    ++void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
      		    const struct bloom_filter_settings *settings);
     -void clear_bloom_key(struct bloom_key *key);
     +void bloom_key_clear(struct bloom_key *key);
    @@ line-log.c: static int bloom_filter_check(struct rev_info *rev,
      
      	while (!result && range) {
     -		fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
    -+		bloom_key_fill(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
    ++		bloom_key_fill(&key, range->path, strlen(range->path),
    ++			       rev->bloom_filter_settings);
      
      		if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
      			result = 1;
    @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
      	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
      
     -	fill_bloom_key(path, len, &revs->bloom_keys[0],
    -+	bloom_key_fill(path, len, &revs->bloom_keys[0],
    ++	bloom_key_fill(&revs->bloom_keys[0], path, len,
      		       revs->bloom_filter_settings);
      	path_component_nr = 1;
      
    @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
      	while (p > path) {
      		if (*p == '/')
     -			fill_bloom_key(path, p - path,
    -+			bloom_key_fill(path, p - path,
    - 				       &revs->bloom_keys[path_component_nr++],
    +-				       &revs->bloom_keys[path_component_nr++],
    ++			bloom_key_fill(&revs->bloom_keys[path_component_nr++],
    ++				       path, p - path,
      				       revs->bloom_filter_settings);
      		p--;
    + 	}
     @@ revision.c: void release_revisions(struct rev_info *revs)
      	oidset_clear(&revs->missing_commits);
      
    @@ t/helper/test-bloom.c: static struct bloom_filter_settings settings = DEFAULT_BL
      		struct bloom_key key;
      
     -		fill_bloom_key(data, strlen(data), &key, &settings);
    -+		bloom_key_fill(data, strlen(data), &key, &settings);
    ++		bloom_key_fill(&key, data, strlen(data), &settings);
      		printf("Hashes:");
      		for (size_t i = 0; i < settings.num_hashes; i++)
      			printf("0x%08x|", key.hashes[i]);
3:  c042907b92 ! 3:  60a3b16bbb bloom: replace struct bloom_key * with struct bloom_keyvec
    @@ Metadata
      ## Commit message ##
         bloom: replace struct bloom_key * with struct bloom_keyvec
     
    -    The revision traversal limited by pathspec has optimization when
    -    the pathspec has only one element. To support optimization for
    -    multiple pathspec items, we need to modify the data structures
    -    in struct rev_info.
    +    Previously, we stored bloom keys in a flat array and marked a commit
    +    as NOT TREESAME if any key reported "definitely not changed".
     
    -    struct rev_info uses bloom_keys and bloom_nr to store the bloom keys
    -    corresponding to a single pathspec item. To allow struct rev_info
    -    to store bloom keys for multiple pathspec items, a new data structure
    -    `struct bloom_keyvec` is introduced. Each `struct bloom_keyvec`
    -    corresponds to a single pathspec item.
    +    To support multiple pathspec items, we now require that for each
    +    pathspec item, there exists a bloom key reporting "definitely not
    +    changed".
     
    -    In `struct rev_info`, replace bloom_keys and bloom_nr with bloom_keyvecs
    -    and bloom_keyvec_nr. This commit still optimize one pathspec item, thus
    -    bloom_keyvec_nr can only be 0 or 1.
    +    This "for every" condition makes a flat array insufficient, so we
    +    introduce a new structure to group keys by a single pathspec item.
    +    `struct bloom_keyvec` is introduced to replace `struct bloom_key *`
    +    and `bloom_key_nr`. And because we want to support multiple pathspec
    +    items, we added a bloom_keyvec * and a bloom_keyvec_nr field to
    +    `struct rev_info` to represent an array of bloom_keyvecs. This commit
    +    still optimize only one pathspec item, thus bloom_keyvec_nr can only
    +    be 0 or 1.
     
         New bloom_keyvec_* functions are added to create and destroy a keyvec.
         bloom_filter_contains_vec() is added to check if all key in keyvec is
    -    contained in a bloom filter. bloom_keyvec_fill_key() is added to
    -    initialize a key in keyvec.
    +    contained in a bloom filter.
     
         Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
     
    @@ bloom.c: void deinit_bloom_filters(void)
      	deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
      }
      
    -+struct bloom_keyvec *bloom_keyvec_new(size_t count)
    ++struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
    ++				      const struct bloom_filter_settings *settings)
     +{
     +	struct bloom_keyvec *vec;
    -+	size_t sz = sizeof(struct bloom_keyvec);
    -+	sz += count * sizeof(struct bloom_key);
    ++	const char *p;
    ++	size_t sz;
    ++	size_t nr = 1;
    ++
    ++	p = path;
    ++	while (*p) {
    ++		/*
    ++		 * At this point, the path is normalized to use Unix-style
    ++		 * path separators. This is required due to how the
    ++		 * changed-path Bloom filters store the paths.
    ++		 */
    ++		if (*p == '/')
    ++			nr++;
    ++		p++;
    ++	}
    ++
    ++	sz = sizeof(struct bloom_keyvec);
    ++	sz += nr * sizeof(struct bloom_key);
     +	vec = (struct bloom_keyvec *)xcalloc(1, sz);
    -+	vec->count = count;
    ++	if (!vec)
    ++		return NULL;
    ++	vec->count = nr;
    ++
    ++	bloom_key_fill(&vec->key[0], path, len, settings);
    ++	nr = 1;
    ++	p = path + len - 1;
    ++	while (p > path) {
    ++		if (*p == '/') {
    ++			bloom_key_fill(&vec->key[nr++], path, p - path, settings);
    ++		}
    ++		p--;
    ++	}
    ++	assert(nr == vec->count);
     +	return vec;
     +}
     +
    @@ bloom.h: struct bloom_key {
      int load_bloom_filter_from_graph(struct commit_graph *g,
      				 struct bloom_filter *filter,
      				 uint32_t graph_pos);
    -@@ bloom.h: void bloom_key_fill(const char *data,
    +@@ bloom.h: void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
      		    const struct bloom_filter_settings *settings);
      void bloom_key_clear(struct bloom_key *key);
      
    -+struct bloom_keyvec *bloom_keyvec_new(size_t count);
    ++/*
    ++ * bloom_keyvec_fill - Allocate and populate a bloom_keyvec with keys for the
    ++ * given path.
    ++ *
    ++ * This function splits the input path by '/' and generates a bloom key for each
    ++ * prefix, in reverse order of specificity. For example, given the input
    ++ * "a/b/c", it will generate bloom keys for:
    ++ *   - "a/b/c"
    ++ *   - "a/b"
    ++ *   - "a"
    ++ *
    ++ * The resulting keys are stored in a newly allocated bloom_keyvec.
    ++ */
    ++struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
    ++				      const struct bloom_filter_settings *settings);
     +void bloom_keyvec_free(struct bloom_keyvec *vec);
    -+
    -+static inline void bloom_keyvec_fill_key(const char *data, size_t len,
    -+					 struct bloom_keyvec *vec, size_t nr,
    -+					 const struct bloom_filter_settings *settings)
    -+{
    -+	assert(nr < vec->count);
    -+	bloom_key_fill(data, len, &vec->key[nr], settings);
    -+}
     +
      void add_key_to_filter(const struct bloom_key *key,
      		       struct bloom_filter *filter,
    @@ bloom.h: int bloom_filter_contains(const struct bloom_filter *filter,
      			  const struct bloom_key *key,
      			  const struct bloom_filter_settings *settings);
      
    ++/*
    ++ * bloom_filter_contains_vec - Check if all keys in a key vector are in the
    ++ * Bloom filter.
    ++ *
    ++ * Returns 1 if **all** keys in the vector are present in the filter,
    ++ * 0 if **any** key is not present.
    ++ */
     +int bloom_filter_contains_vec(const struct bloom_filter *filter,
     +			      const struct bloom_keyvec *v,
     +			      const struct bloom_filter_settings *settings);
    @@ bloom.h: int bloom_filter_contains(const struct bloom_filter *filter,
     
      ## revision.c ##
     @@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
    + 	return 0;
    + }
    + 
    ++static void release_revisions_bloom_keyvecs(struct rev_info *revs);
    ++
      static void prepare_to_use_bloom_filter(struct rev_info *revs)
      {
      	struct pathspec_item *pi;
    @@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
      	char *path_alloc = NULL;
      	const char *path, *p;
      	size_t len;
    +-	int path_component_nr = 1;
    + 
    + 	if (!revs->commits)
    + 		return;
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
    - 		p++;
    - 	}
    + 	if (!revs->pruning.pathspec.nr)
    + 		return;
      
    --	revs->bloom_keys_nr = path_component_nr;
    --	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
     +	revs->bloom_keyvecs_nr = 1;
     +	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
    -+	bloom_keyvec = bloom_keyvec_new(path_component_nr);
    -+	revs->bloom_keyvecs[0] = bloom_keyvec;
    + 	pi = &revs->pruning.pathspec.items[0];
    + 
    + 	/* remove single trailing slash from path, if needed */
    +@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
    + 		path = pi->match;
      
    --	bloom_key_fill(path, len, &revs->bloom_keys[0],
    + 	len = strlen(path);
    +-	if (!len) {
    +-		revs->bloom_filter_settings = NULL;
    +-		free(path_alloc);
    +-		return;
    +-	}
    +-
    +-	p = path;
    +-	while (*p) {
    +-		/*
    +-		 * At this point, the path is normalized to use Unix-style
    +-		 * path separators. This is required due to how the
    +-		 * changed-path Bloom filters store the paths.
    +-		 */
    +-		if (*p == '/')
    +-			path_component_nr++;
    +-		p++;
    +-	}
    +-
    +-	revs->bloom_keys_nr = path_component_nr;
    +-	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
    ++	if (!len)
    ++		goto fail;
    + 
    +-	bloom_key_fill(&revs->bloom_keys[0], path, len,
     -		       revs->bloom_filter_settings);
    -+	bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
    -+			      revs->bloom_filter_settings);
    - 	path_component_nr = 1;
    - 
    - 	p = path + len - 1;
    - 	while (p > path) {
    - 		if (*p == '/')
    --			bloom_key_fill(path, p - path,
    --				       &revs->bloom_keys[path_component_nr++],
    +-	path_component_nr = 1;
    +-
    +-	p = path + len - 1;
    +-	while (p > path) {
    +-		if (*p == '/')
    +-			bloom_key_fill(&revs->bloom_keys[path_component_nr++],
    +-				       path, p - path,
     -				       revs->bloom_filter_settings);
    -+			bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
    -+					      path_component_nr++,
    -+					      revs->bloom_filter_settings);
    - 		p--;
    +-		p--;
    +-	}
    ++	revs->bloom_keyvecs[0] =
    ++		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
    + 
    + 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
    + 		atexit(trace2_bloom_filter_statistics_atexit);
    + 		bloom_filter_atexit_registered = 1;
      	}
      
    -@@ revision.c: static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
    ++	return;
    ++
    ++fail:
    ++	revs->bloom_filter_settings = NULL;
    + 	free(path_alloc);
    ++	release_revisions_bloom_keyvecs(revs);
    + }
    + 
    + static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
      						 struct commit *commit)
      {
      	struct bloom_filter *filter;
    @@ revision.c: static int rev_same_tree_as_empty(struct rev_info *revs, struct comm
      		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
      		if (!bloom_ret)
      			return 1;
    +@@ revision.c: static void release_revisions_mailmap(struct string_list *mailmap)
    + 
    + static void release_revisions_topo_walk_info(struct topo_walk_info *info);
    + 
    ++static void release_revisions_bloom_keyvecs(struct rev_info *revs)
    ++{
    ++	for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
    ++		bloom_keyvec_free(revs->bloom_keyvecs[nr]);
    ++	FREE_AND_NULL(revs->bloom_keyvecs);
    ++	revs->bloom_keyvecs_nr = 0;
    ++}
    ++
    + static void free_void_commit_list(void *list)
    + {
    + 	free_commit_list(list);
     @@ revision.c: void release_revisions(struct rev_info *revs)
    + 	clear_decoration(&revs->treesame, free);
      	line_log_free(revs);
      	oidset_clear(&revs->missing_commits);
    - 
    +-
     -	for (int i = 0; i < revs->bloom_keys_nr; i++)
     -		bloom_key_clear(&revs->bloom_keys[i]);
     -	FREE_AND_NULL(revs->bloom_keys);
     -	revs->bloom_keys_nr = 0;
    -+	for (size_t i = 0; i < revs->bloom_keyvecs_nr; i++)
    -+		bloom_keyvec_free(revs->bloom_keyvecs[i]);
    -+	FREE_AND_NULL(revs->bloom_keyvecs);
    -+	revs->bloom_keyvecs_nr = 0;
    ++	release_revisions_bloom_keyvecs(revs);
      }
      
      static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
4:  de4d554b55 ! 4:  198a7da17c bloom: optimize multiple pathspec items in revision traversal
    @@ Commit message
     
         To enable optimize multiple pathspec items in revision traversal,
         return 0 if all pathspec item is literal in forbid_bloom_filters().
    -    Add code to initialize and check each pathspec item's bloom_keyvec.
    -
    -    Add new function release_revisions_bloom_keyvecs() to free all bloom
    -    keyvec owned by rev_info.
    +    Add for loops to initialize and check each pathspec item's bloom_keyvec
    +    when optimization is possible.
     
         Add new test cases in t/t4216-log-bloom.sh to ensure
           - consistent results between the optimization for multiple pathspec
    @@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
      
      	return 0;
      }
    - 
    -+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
    -+
    - static void prepare_to_use_bloom_filter(struct rev_info *revs)
    - {
    - 	struct pathspec_item *pi;
    -@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
    - 	char *path_alloc = NULL;
    - 	const char *path, *p;
    - 	size_t len;
    --	int path_component_nr = 1;
    -+	int path_component_nr;
    - 
    - 	if (!revs->commits)
    - 		return;
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
      	if (!revs->pruning.pathspec.nr)
      		return;
      
    +-	revs->bloom_keyvecs_nr = 1;
    +-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
     -	pi = &revs->pruning.pathspec.items[0];
    --
    ++	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
    ++	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
    ++	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
    ++		pi = &revs->pruning.pathspec.items[i];
    + 
     -	/* remove single trailing slash from path, if needed */
     -	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
     -		path_alloc = xmemdupz(pi->match, pi->len - 1);
     -		path = path_alloc;
     -	} else
     -		path = pi->match;
    --
    --	len = strlen(path);
    --	if (!len) {
    --		revs->bloom_filter_settings = NULL;
    --		free(path_alloc);
    --		return;
    --	}
    --
    --	p = path;
    --	while (*p) {
    --		/*
    --		 * At this point, the path is normalized to use Unix-style
    --		 * path separators. This is required due to how the
    --		 * changed-path Bloom filters store the paths.
    --		 */
    --		if (*p == '/')
    --			path_component_nr++;
    --		p++;
    --	}
    --
    --	revs->bloom_keyvecs_nr = 1;
    --	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
    --	bloom_keyvec = bloom_keyvec_new(path_component_nr);
    --	revs->bloom_keyvecs[0] = bloom_keyvec;
    --
    --	bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
    --			      revs->bloom_filter_settings);
    --	path_component_nr = 1;
    -+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
    -+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
    -+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
    -+		pi = &revs->pruning.pathspec.items[i];
    -+		path_component_nr = 1;
    -+
     +		/* remove single trailing slash from path, if needed */
     +		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
     +			path_alloc = xmemdupz(pi->match, pi->len - 1);
     +			path = path_alloc;
     +		} else
     +			path = pi->match;
    -+
    + 
    +-	len = strlen(path);
    +-	if (!len)
    +-		goto fail;
     +		len = strlen(path);
     +		if (!len)
     +			goto fail;
    -+
    -+		p = path;
    -+		while (*p) {
    -+			/*
    -+			 * At this point, the path is normalized to use
    -+			 * Unix-style path separators. This is required due to
    -+			 * how the changed-path Bloom filters store the paths.
    -+			 */
    -+			if (*p == '/')
    -+				path_component_nr++;
    -+			p++;
    -+		}
      
    --	p = path + len - 1;
    --	while (p > path) {
    --		if (*p == '/')
    --			bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
    --					      path_component_nr++,
    --					      revs->bloom_filter_settings);
    --		p--;
    -+		bloom_keyvec = bloom_keyvec_new(path_component_nr);
    -+		revs->bloom_keyvecs[i] = bloom_keyvec;
    -+
    -+		bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
    -+				      revs->bloom_filter_settings);
    -+		path_component_nr = 1;
    -+
    -+		p = path + len - 1;
    -+		while (p > path) {
    -+			if (*p == '/')
    -+				bloom_keyvec_fill_key(path, p - path,
    -+						      bloom_keyvec,
    -+						      path_component_nr++,
    -+						      revs->bloom_filter_settings);
    -+			p--;
    -+		}
    +-	revs->bloom_keyvecs[0] =
    +-		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
    ++		revs->bloom_keyvecs[i] =
    ++			bloom_keyvec_new(path, len, revs->bloom_filter_settings);
     +		FREE_AND_NULL(path_alloc);
    - 	}
    ++	}
      
      	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
    -@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
    - 		bloom_filter_atexit_registered = 1;
    - 	}
    - 
    -+	return;
    -+
    -+fail:
    -+	revs->bloom_filter_settings = NULL;
    - 	free(path_alloc);
    -+	release_revisions_bloom_keyvecs(revs);
    - }
    - 
    - static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
    -@@ revision.c: static void release_revisions_mailmap(struct string_list *mailmap)
    - 
    - static void release_revisions_topo_walk_info(struct topo_walk_info *info);
    - 
    -+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
    -+{
    -+	for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
    -+		bloom_keyvec_free(revs->bloom_keyvecs[nr]);
    -+	FREE_AND_NULL(revs->bloom_keyvecs);
    -+	revs->bloom_keyvecs_nr = 0;
    -+}
    -+
    - static void free_void_commit_list(void *list)
    - {
    - 	free_commit_list(list);
    -@@ revision.c: void release_revisions(struct rev_info *revs)
    - 	clear_decoration(&revs->treesame, free);
    - 	line_log_free(revs);
    - 	oidset_clear(&revs->missing_commits);
    --
    --	for (size_t i = 0; i < revs->bloom_keyvecs_nr; i++)
    --		bloom_keyvec_free(revs->bloom_keyvecs[i]);
    --	FREE_AND_NULL(revs->bloom_keyvecs);
    --	revs->bloom_keyvecs_nr = 0;
    -+	release_revisions_bloom_keyvecs(revs);
    - }
    - 
    - static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
    + 		atexit(trace2_bloom_filter_statistics_atexit);
     
      ## t/t4216-log-bloom.sh ##
     @@ t/t4216-log-bloom.sh: sane_unset GIT_TRACE2_CONFIG_PARAMS
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v5 1/4] bloom: add test helper to return murmur3 hash
  2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
@ 2025-07-10  8:48         ` Lidong Yan
  2025-07-10  8:48         ` [PATCH v5 2/4] bloom: rename function operates on bloom_key Lidong Yan
                           ` (4 subsequent siblings)
  5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10  8:48 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon

In bloom.h, murmur3_seeded_v2() is exported for the use of test murmur3
hash. To clarify that murmur3_seeded_v2() is exported solely for testing
purposes, a new helper function test_murmur3_seeded() was added instead
of exporting murmur3_seeded_v2() directly.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c               | 13 ++++++++++++-
 bloom.h               | 12 +++---------
 t/helper/test-bloom.c |  4 ++--
 3 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..946c5e8c98 100644
--- a/bloom.c
+++ b/bloom.c
@@ -107,7 +107,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+static uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -540,3 +540,14 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 
 	return 1;
 }
+
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+				   int version)
+{
+	assert(version == 1 || version == 2);
+
+	if (version == 2)
+		return murmur3_seeded_v2(seed, data, len);
+	else
+		return murmur3_seeded_v1(seed, data, len);
+}
diff --git a/bloom.h b/bloom.h
index 6e46489a20..a9ded1822f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,15 +78,6 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
 
-/*
- * Calculate the murmur3 32-bit hash value for the given data
- * using the given seed.
- * Produces a uniformly distributed hash value.
- * Not considered to be cryptographically secure.
- * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
- */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
-
 void fill_bloom_key(const char *data,
 		    size_t len,
 		    struct bloom_key *key,
@@ -137,4 +128,7 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+				   int version);
+
 #endif
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9aa2c5a592..6a24b6e0a6 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -61,13 +61,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		hashed = test_bloom_murmur3_seeded(0, argv[2], strlen(argv[2]), 2);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
 	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
 		uint32_t hashed;
-		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
+		hashed = test_bloom_murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7, 2);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v5 2/4] bloom: rename function operates on bloom_key
  2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
  2025-07-10  8:48         ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
@ 2025-07-10  8:48         ` Lidong Yan
  2025-07-10  8:48         ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10  8:48 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon

git code style requires that functions operating on a struct S
should be named in the form S_verb. However, the functions operating
on struct bloom_key do not follow this convention. Therefore,
fill_bloom_key() and clear_bloom_key() are renamed to bloom_key_fill()
and bloom_key_clear(), respectively.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 blame.c               |  2 +-
 bloom.c               | 10 ++++------
 bloom.h               |  6 ++----
 line-log.c            |  5 +++--
 revision.c            |  8 ++++----
 t/helper/test-bloom.c |  4 ++--
 6 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/blame.c b/blame.c
index 57daa45e89..811c6d8f9f 100644
--- a/blame.c
+++ b/blame.c
@@ -1310,7 +1310,7 @@ static void add_bloom_key(struct blame_bloom_data *bd,
 	}
 
 	bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
-	fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+	bloom_key_fill(bd->keys[bd->nr], path, strlen(path), bd->settings);
 	bd->nr++;
 }
 
diff --git a/bloom.c b/bloom.c
index 946c5e8c98..5523d198c8 100644
--- a/bloom.c
+++ b/bloom.c
@@ -221,9 +221,7 @@ static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 	return seed;
 }
 
-void fill_bloom_key(const char *data,
-		    size_t len,
-		    struct bloom_key *key,
+void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
 		    const struct bloom_filter_settings *settings)
 {
 	int i;
@@ -243,7 +241,7 @@ void fill_bloom_key(const char *data,
 		key->hashes[i] = hash0 + i * hash1;
 }
 
-void clear_bloom_key(struct bloom_key *key)
+void bloom_key_clear(struct bloom_key *key)
 {
 	FREE_AND_NULL(key->hashes);
 }
@@ -500,9 +498,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			bloom_key_fill(&key, e->path, strlen(e->path), settings);
 			add_key_to_filter(&key, filter, settings);
-			clear_bloom_key(&key);
+			bloom_key_clear(&key);
 		}
 
 	cleanup:
diff --git a/bloom.h b/bloom.h
index a9ded1822f..603bc1f90f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,11 +78,9 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
 
-void fill_bloom_key(const char *data,
-		    size_t len,
-		    struct bloom_key *key,
+void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
 		    const struct bloom_filter_settings *settings);
-void clear_bloom_key(struct bloom_key *key);
+void bloom_key_clear(struct bloom_key *key);
 
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
diff --git a/line-log.c b/line-log.c
index 628e3fe3ae..07f2154e84 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1172,12 +1172,13 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 0;
 
 	while (!result && range) {
-		fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
+		bloom_key_fill(&key, range->path, strlen(range->path),
+			       rev->bloom_filter_settings);
 
 		if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
 			result = 1;
 
-		clear_bloom_key(&key);
+		bloom_key_clear(&key);
 		range = range->next;
 	}
 
diff --git a/revision.c b/revision.c
index afee111196..a7eadff0a5 100644
--- a/revision.c
+++ b/revision.c
@@ -739,15 +739,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	revs->bloom_keys_nr = path_component_nr;
 	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
 
-	fill_bloom_key(path, len, &revs->bloom_keys[0],
+	bloom_key_fill(&revs->bloom_keys[0], path, len,
 		       revs->bloom_filter_settings);
 	path_component_nr = 1;
 
 	p = path + len - 1;
 	while (p > path) {
 		if (*p == '/')
-			fill_bloom_key(path, p - path,
-				       &revs->bloom_keys[path_component_nr++],
+			bloom_key_fill(&revs->bloom_keys[path_component_nr++],
+				       path, p - path,
 				       revs->bloom_filter_settings);
 		p--;
 	}
@@ -3231,7 +3231,7 @@ void release_revisions(struct rev_info *revs)
 	oidset_clear(&revs->missing_commits);
 
 	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		clear_bloom_key(&revs->bloom_keys[i]);
+		bloom_key_clear(&revs->bloom_keys[i]);
 	FREE_AND_NULL(revs->bloom_keys);
 	revs->bloom_keys_nr = 0;
 }
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6a24b6e0a6..3283544bd3 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -12,13 +12,13 @@ static struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
 		struct bloom_key key;
 
-		fill_bloom_key(data, strlen(data), &key, &settings);
+		bloom_key_fill(&key, data, strlen(data), &settings);
 		printf("Hashes:");
 		for (size_t i = 0; i < settings.num_hashes; i++)
 			printf("0x%08x|", key.hashes[i]);
 		printf("\n");
 		add_key_to_filter(&key, filter, &settings);
-		clear_bloom_key(&key);
+		bloom_key_clear(&key);
 }
 
 static void print_bloom_filter(struct bloom_filter *filter) {
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
  2025-07-10  8:48         ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
  2025-07-10  8:48         ` [PATCH v5 2/4] bloom: rename function operates on bloom_key Lidong Yan
@ 2025-07-10  8:48         ` Lidong Yan
  2025-07-10 16:17           ` Junio C Hamano
  2025-07-10  8:48         ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
                           ` (2 subsequent siblings)
  5 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-10  8:48 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon

Previously, we stored bloom keys in a flat array and marked a commit
as NOT TREESAME if any key reported "definitely not changed".

To support multiple pathspec items, we now require that for each
pathspec item, there exists a bloom key reporting "definitely not
changed".

This "for every" condition makes a flat array insufficient, so we
introduce a new structure to group keys by a single pathspec item.
`struct bloom_keyvec` is introduced to replace `struct bloom_key *`
and `bloom_key_nr`. And because we want to support multiple pathspec
items, we added a bloom_keyvec * and a bloom_keyvec_nr field to
`struct rev_info` to represent an array of bloom_keyvecs. This commit
still optimize only one pathspec item, thus bloom_keyvec_nr can only
be 0 or 1.

New bloom_keyvec_* functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
contained in a bloom filter.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c    | 61 ++++++++++++++++++++++++++++++++++++++++++++
 bloom.h    | 38 +++++++++++++++++++++++++++
 revision.c | 75 ++++++++++++++++++++++--------------------------------
 revision.h |  6 ++---
 4 files changed, 132 insertions(+), 48 deletions(-)

diff --git a/bloom.c b/bloom.c
index 5523d198c8..b86015f6d1 100644
--- a/bloom.c
+++ b/bloom.c
@@ -278,6 +278,55 @@ void deinit_bloom_filters(void)
 	deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
 }
 
+struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
+				      const struct bloom_filter_settings *settings)
+{
+	struct bloom_keyvec *vec;
+	const char *p;
+	size_t sz;
+	size_t nr = 1;
+
+	p = path;
+	while (*p) {
+		/*
+		 * At this point, the path is normalized to use Unix-style
+		 * path separators. This is required due to how the
+		 * changed-path Bloom filters store the paths.
+		 */
+		if (*p == '/')
+			nr++;
+		p++;
+	}
+
+	sz = sizeof(struct bloom_keyvec);
+	sz += nr * sizeof(struct bloom_key);
+	vec = (struct bloom_keyvec *)xcalloc(1, sz);
+	if (!vec)
+		return NULL;
+	vec->count = nr;
+
+	bloom_key_fill(&vec->key[0], path, len, settings);
+	nr = 1;
+	p = path + len - 1;
+	while (p > path) {
+		if (*p == '/') {
+			bloom_key_fill(&vec->key[nr++], path, p - path, settings);
+		}
+		p--;
+	}
+	assert(nr == vec->count);
+	return vec;
+}
+
+void bloom_keyvec_free(struct bloom_keyvec *vec)
+{
+	if (!vec)
+		return;
+	for (size_t nr = 0; nr < vec->count; nr++)
+		bloom_key_clear(&vec->key[nr]);
+	free(vec);
+}
+
 static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 		       const struct hashmap_entry *eptr,
 		       const struct hashmap_entry *entry_or_key,
@@ -539,6 +588,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 	return 1;
 }
 
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *vec,
+			      const struct bloom_filter_settings *settings)
+{
+	int ret = 1;
+
+	for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
+		ret = bloom_filter_contains(filter, &vec->key[nr], settings);
+
+	return ret;
+}
+
 uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
 				   int version)
 {
diff --git a/bloom.h b/bloom.h
index 603bc1f90f..3602d32054 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,16 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+/*
+ * A bloom_keyvec is a vector of bloom_keys, which
+ * can be used to store multiple keys for a single
+ * pathspec item.
+ */
+struct bloom_keyvec {
+	size_t count;
+	struct bloom_key key[FLEX_ARRAY];
+};
+
 int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
@@ -82,6 +92,23 @@ void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
 		    const struct bloom_filter_settings *settings);
 void bloom_key_clear(struct bloom_key *key);
 
+/*
+ * bloom_keyvec_fill - Allocate and populate a bloom_keyvec with keys for the
+ * given path.
+ *
+ * This function splits the input path by '/' and generates a bloom key for each
+ * prefix, in reverse order of specificity. For example, given the input
+ * "a/b/c", it will generate bloom keys for:
+ *   - "a/b/c"
+ *   - "a/b"
+ *   - "a"
+ *
+ * The resulting keys are stored in a newly allocated bloom_keyvec.
+ */
+struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
+				      const struct bloom_filter_settings *settings);
+void bloom_keyvec_free(struct bloom_keyvec *vec);
+
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
 		       const struct bloom_filter_settings *settings);
@@ -126,6 +153,17 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+/*
+ * bloom_filter_contains_vec - Check if all keys in a key vector are in the
+ * Bloom filter.
+ *
+ * Returns 1 if **all** keys in the vector are present in the filter,
+ * 0 if **any** key is not present.
+ */
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *v,
+			      const struct bloom_filter_settings *settings);
+
 uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
 				   int version);
 
diff --git a/revision.c b/revision.c
index a7eadff0a5..22bcfab7f9 100644
--- a/revision.c
+++ b/revision.c
@@ -685,13 +685,15 @@ static int forbid_bloom_filters(struct pathspec *spec)
 	return 0;
 }
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
+	struct bloom_keyvec *bloom_keyvec;
 	char *path_alloc = NULL;
 	const char *path, *p;
 	size_t len;
-	int path_component_nr = 1;
 
 	if (!revs->commits)
 		return;
@@ -708,6 +710,8 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
+	revs->bloom_keyvecs_nr = 1;
+	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
 	pi = &revs->pruning.pathspec.items[0];
 
 	/* remove single trailing slash from path, if needed */
@@ -718,53 +722,30 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		path = pi->match;
 
 	len = strlen(path);
-	if (!len) {
-		revs->bloom_filter_settings = NULL;
-		free(path_alloc);
-		return;
-	}
-
-	p = path;
-	while (*p) {
-		/*
-		 * At this point, the path is normalized to use Unix-style
-		 * path separators. This is required due to how the
-		 * changed-path Bloom filters store the paths.
-		 */
-		if (*p == '/')
-			path_component_nr++;
-		p++;
-	}
-
-	revs->bloom_keys_nr = path_component_nr;
-	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+	if (!len)
+		goto fail;
 
-	bloom_key_fill(&revs->bloom_keys[0], path, len,
-		       revs->bloom_filter_settings);
-	path_component_nr = 1;
-
-	p = path + len - 1;
-	while (p > path) {
-		if (*p == '/')
-			bloom_key_fill(&revs->bloom_keys[path_component_nr++],
-				       path, p - path,
-				       revs->bloom_filter_settings);
-		p--;
-	}
+	revs->bloom_keyvecs[0] =
+		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
 		bloom_filter_atexit_registered = 1;
 	}
 
+	return;
+
+fail:
+	revs->bloom_filter_settings = NULL;
 	free(path_alloc);
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result = 1, j;
+	int result = 0;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -779,10 +760,10 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
-		result = bloom_filter_contains(filter,
-					       &revs->bloom_keys[j],
-					       revs->bloom_filter_settings);
+	for (size_t nr = 0; !result && nr < revs->bloom_keyvecs_nr; nr++) {
+		result = bloom_filter_contains_vec(filter,
+						   revs->bloom_keyvecs[nr],
+						   revs->bloom_filter_settings);
 	}
 
 	if (result)
@@ -823,7 +804,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_keys_nr && !nth_parent) {
+	if (revs->bloom_keyvecs_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
@@ -850,7 +831,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
 	if (!t1)
 		return 0;
 
-	if (!nth_parent && revs->bloom_keys_nr) {
+	if (!nth_parent && revs->bloom_keyvecs_nr) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 		if (!bloom_ret)
 			return 1;
@@ -3201,6 +3182,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
 
 static void release_revisions_topo_walk_info(struct topo_walk_info *info);
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+	for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
+		bloom_keyvec_free(revs->bloom_keyvecs[nr]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
+}
+
 static void free_void_commit_list(void *list)
 {
 	free_commit_list(list);
@@ -3229,11 +3218,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
-
-	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		bloom_key_clear(&revs->bloom_keys[i]);
-	FREE_AND_NULL(revs->bloom_keys);
-	revs->bloom_keys_nr = 0;
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..ac843f58d0 100644
--- a/revision.h
+++ b/revision.h
@@ -62,7 +62,7 @@ struct repository;
 struct rev_info;
 struct string_list;
 struct saved_parents;
-struct bloom_key;
+struct bloom_keyvec;
 struct bloom_filter_settings;
 struct option;
 struct parse_opt_ctx_t;
@@ -360,8 +360,8 @@ struct rev_info {
 
 	/* Commit graph bloom filter fields */
 	/* The bloom filter key(s) for the pathspec */
-	struct bloom_key *bloom_keys;
-	int bloom_keys_nr;
+	struct bloom_keyvec **bloom_keyvecs;
+	int bloom_keyvecs_nr;
 
 	/*
 	 * The bloom filter settings used to generate the key.
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal
  2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
                           ` (2 preceding siblings ...)
  2025-07-10  8:48         ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-10  8:48         ` Lidong Yan
  2025-07-10 13:51           ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
  2025-07-10 13:55           ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
  2025-07-10 13:49         ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
  2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
  5 siblings, 2 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10  8:48 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon

To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.

Add new test cases in t/t4216-log-bloom.sh to ensure
  - consistent results between the optimization for multiple pathspec
    items using bloom filter and the case without bloom filter
    optimization.
  - does not use bloom filter if any pathspec item is not literal.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 revision.c           | 38 ++++++++++++++++++++------------------
 t/t4216-log-bloom.sh | 23 ++++++++++++++---------
 2 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/revision.c b/revision.c
index 22bcfab7f9..f25a61bb6c 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
-		return 1;
+	for (size_t nr = 0; nr < spec->nr; nr++)
+		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+			return 1;
 
 	return 0;
 }
@@ -710,23 +709,26 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-	pi = &revs->pruning.pathspec.items[0];
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		pi = &revs->pruning.pathspec.items[i];
 
-	/* remove single trailing slash from path, if needed */
-	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-		path_alloc = xmemdupz(pi->match, pi->len - 1);
-		path = path_alloc;
-	} else
-		path = pi->match;
+		/* remove single trailing slash from path, if needed */
+		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+			path_alloc = xmemdupz(pi->match, pi->len - 1);
+			path = path_alloc;
+		} else
+			path = pi->match;
 
-	len = strlen(path);
-	if (!len)
-		goto fail;
+		len = strlen(path);
+		if (!len)
+			goto fail;
 
-	revs->bloom_keyvecs[0] =
-		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
+		revs->bloom_keyvecs[i] =
+			bloom_keyvec_new(path, len, revs->bloom_filter_settings);
+		FREE_AND_NULL(path_alloc);
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
 	rm -f "$TRASH_DIRECTORY/trace.perf" &&
-	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
-	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+	eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+	eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+		git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
 }
 
 test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
-'
-
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
 	test_bloom_filters_not_used "-- ."
 '
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+	test_bloom_filters_used "-- file4 A/file1" &&
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+	test_bloom_filters_not_used "-- file\*" &&
+	test_bloom_filters_not_used "-- A/\* file4" &&
+	test_bloom_filters_not_used "-- file4 A/\*" &&
+	test_bloom_filters_not_used "-- * A/\*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.50.0.107.g33b6ec8c79


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
                           ` (3 preceding siblings ...)
  2025-07-10  8:48         ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
@ 2025-07-10 13:49         ` Derrick Stolee
  2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
  5 siblings, 0 replies; 72+ messages in thread
From: Derrick Stolee @ 2025-07-10 13:49 UTC (permalink / raw)
  To: Lidong Yan; +Cc: 502024330056, git, gitster, toon

On 7/10/2025 4:48 AM, Lidong Yan wrote:

> The difference from v4 is:
>   - for the bloom_key_* functions, we now pass struct bloom_key *
>     as the first parameter.
>   - bloom_keyvec_fill_key() and bloom_keyvec_new() have been merged
>     into a single function, so that each key is filled during the
>     creation of the bloom_keyvec.
> 
> Below is a comparison of the time taken to run git log on Git and
> LLVM repositories before and after applying this patch.
> 
> Setup commit-graph:
>   $ cd ~/src/git && git commit-graph write --split --reachable --changed-paths
>   $ cd ~/src/llvm && git commit-graph write --split --reachable --changed-paths
> 
> Run git log on Git repository
>   $ cd ~/src/git
>   $ hash -p /usr/bin/git git # my system git binary is 2.43.0
>   $ time git log -100 -- commit.c commit-graph.c >/dev/null
>   real	0m0.055s
>   user	0m0.040s
>   sys	  0m0.015s
>   $ hash -p ~/bin/git/bin/git git
>   $ time git log -100 -- commit.c commit-graph.c >/dev/null
>   real	0m0.039s
>   user	0m0.020s
>   sys	  0m0.020s
> 
> Run git log in LLVM repository
>   $ cd ~/src/llvm
>   $ hash -p /usr/bin/git git # my system git binary is 2.43.0
>   $ time git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h >/dev/null
>   real	0m2.365s
>   user	0m2.252s
>   sys	  0m0.113s
>   $ hash -p ~/bin/git/bin/git git
>   $ time git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h >/dev/null
>   real	0m0.240s
>   user	0m0.158s
>   sys	  0m0.064s

Thanks for these stats, though I'd recommend trying hyperfine [1]
which with this setup would let you run these experiments as the
following:

$ $ hyperfine --warmup=3 \
> -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- commit.c commit-graph.c' \
> -n 'new' '~/_git/git/git log -100 -- commit.c commit-graph.c'
Benchmark 1: old
  Time (mean ± σ):      73.1 ms ±   2.9 ms    [User: 48.8 ms, System: 23.9 ms]
  Range (min … max):    69.9 ms …  84.5 ms    42 runs
 
Benchmark 2: new
  Time (mean ± σ):      55.1 ms ±   2.9 ms    [User: 30.5 ms, System: 24.4 ms]
  Range (min … max):    51.1 ms …  61.2 ms    52 runs
 
Summary
  'new' ran
    1.33 ± 0.09 times faster than 'old'

And for LLVM:

$ hyperfine --warmup=3 \
> -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h' \
> -n 'new' '~/_git/git/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h'

Benchmark 1: old
  Time (mean ± σ):      1.974 s ±  0.006 s    [User: 1.877 s, System: 0.097 s]
  Range (min … max):    1.960 s …  1.983 s    10 runs
 
Benchmark 2: new
  Time (mean ± σ):     262.9 ms ±   2.4 ms    [User: 214.2 ms, System: 48.4 ms]
  Range (min … max):   257.7 ms … 266.2 ms    11 runs
 
Summary
  'new' ran
    7.51 ± 0.07 times faster than 'old'

[1] https://github.com/sharkdp/hyperfine

Finally, putting these performance numbers in the commit message will
make the results permanently findable in the repo history.

> 3:  c042907b92 ! 3:  60a3b16bbb bloom: replace struct bloom_key * with struct bloom_keyvec

>     -+struct bloom_keyvec *bloom_keyvec_new(size_t count)
>     ++struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
>     ++				      const struct bloom_filter_settings *settings)
>      +{
>      +	struct bloom_keyvec *vec;
>     -+	size_t sz = sizeof(struct bloom_keyvec);
>     -+	sz += count * sizeof(struct bloom_key);
>     ++	const char *p;
>     ++	size_t sz;
>     ++	size_t nr = 1;
>     ++
>     ++	p = path;
>     ++	while (*p) {
>     ++		/*
>     ++		 * At this point, the path is normalized to use Unix-style
>     ++		 * path separators. This is required due to how the
>     ++		 * changed-path Bloom filters store the paths.
>     ++		 */
>     ++		if (*p == '/')
>     ++			nr++;
>     ++		p++;
>     ++	}
>     ++
>     ++	sz = sizeof(struct bloom_keyvec);
>     ++	sz += nr * sizeof(struct bloom_key);
>      +	vec = (struct bloom_keyvec *)xcalloc(1, sz);
>     -+	vec->count = count;
>     ++	if (!vec)
>     ++		return NULL;
>     ++	vec->count = nr;
>     ++
>     ++	bloom_key_fill(&vec->key[0], path, len, settings);
>     ++	nr = 1;
>     ++	p = path + len - 1;
>     ++	while (p > path) {
>     ++		if (*p == '/') {
>     ++			bloom_key_fill(&vec->key[nr++], path, p - path, settings);
>     ++		}
>     ++		p--;
>     ++	}
>     ++	assert(nr == vec->count);
>      +	return vec;
>      +}

These additions to bloom_keyvec_new() certainly help simplify some
of the code movement I was talking about, but there is more that
can be done to simplify patch 4. I'll send a couple example patches
in reply to patch 4 with what I mean.

Overall, the patch series is correct. My complaints are stylistic
more than anything.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key
  2025-07-10  8:48         ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
@ 2025-07-10 13:51           ` Derrick Stolee
  2025-07-10 15:42             ` Lidong Yan
  2025-07-10 13:55           ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
  1 sibling, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-10 13:51 UTC (permalink / raw)
  To: Lidong Yan; +Cc: 502024330056, git, gitster, toon

On 7/10/2025 4:48 AM, Lidong Yan wrote:
> @@ -710,23 +709,26 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  	if (!revs->pruning.pathspec.nr)
>  		return;
>  
> -	revs->bloom_keyvecs_nr = 1;
> -	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> -	pi = &revs->pruning.pathspec.items[0];
> +	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> +	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> +	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> +		pi = &revs->pruning.pathspec.items[i];
>  
> -	/* remove single trailing slash from path, if needed */
> -	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> -		path_alloc = xmemdupz(pi->match, pi->len - 1);
> -		path = path_alloc;
> -	} else
> -		path = pi->match;
> +		/* remove single trailing slash from path, if needed */
> +		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> +			path_alloc = xmemdupz(pi->match, pi->len - 1);
> +			path = path_alloc;
> +		} else
> +			path = pi->match;
>  
> -	len = strlen(path);
> -	if (!len)
> -		goto fail;
> +		len = strlen(path);
> +		if (!len)
> +			goto fail;
>  
> -	revs->bloom_keyvecs[0] =
> -		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> +		revs->bloom_keyvecs[i] =
> +			bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> +		FREE_AND_NULL(path_alloc);
> +	}

This diff is still bigger than I was hoping, so I'm sending a couple
of patches that simplify this code movement. Feel free to ignore
them as being too nit-picky.

--- >8 ---

From 69fa36dc615e140ae842b536f7da792beaebb272 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Thu, 10 Jul 2025 08:06:29 -0400
Subject: [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key

When preparing to use bloom filters in a revision walk, Git populates a
boom_keyvec with an array of bloom keys for the components of a path.
Before we create the ability to map multiple pathspecs to multiple
bloom_keyvecs, extract the conversion from a pathspec to a bloom_keyvec
into its own helper method. This simplifies the state that persists in
prepare_to_use_bloom_filter() as well as makes the next change much
simpler.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 revision.c | 50 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/revision.c b/revision.c
index 22bcfab7f93..4c09b594c55 100644
--- a/revision.c
+++ b/revision.c
@@ -687,14 +687,37 @@ static int forbid_bloom_filters(struct pathspec *spec)
 
 static void release_revisions_bloom_keyvecs(struct rev_info *revs);
 
-static void prepare_to_use_bloom_filter(struct rev_info *revs)
+static int convert_pathspec_to_filter(const struct pathspec_item *pi,
+				      struct bloom_keyvec **bloom_keyvec,
+				      const struct bloom_filter_settings *settings)
 {
-	struct pathspec_item *pi;
-	struct bloom_keyvec *bloom_keyvec;
-	char *path_alloc = NULL;
-	const char *path, *p;
 	size_t len;
+	const char *path;
+	char *path_alloc = NULL;
+	int res = 0;
+
+	/* remove single trailing slash from path, if needed */
+	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+		path_alloc = xmemdupz(pi->match, pi->len - 1);
+		path = path_alloc;
+	} else
+		path = pi->match;
+
+	len = strlen(path);
+	if (!len) {
+		res = -1;
+		goto cleanup;
+	}
+
+	*bloom_keyvec = bloom_keyvec_new(path, len, settings);
 
+cleanup:
+	FREE_AND_NULL(path_alloc);
+	return res;
+}
+
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
 	if (!revs->commits)
 		return;
 
@@ -712,22 +735,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	revs->bloom_keyvecs_nr = 1;
 	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-	pi = &revs->pruning.pathspec.items[0];
 
-	/* remove single trailing slash from path, if needed */
-	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-		path_alloc = xmemdupz(pi->match, pi->len - 1);
-		path = path_alloc;
-	} else
-		path = pi->match;
-
-	len = strlen(path);
-	if (!len)
+	if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
+				       &revs->bloom_keyvecs[0],
+				       revs->bloom_filter_settings))
 		goto fail;
 
-	revs->bloom_keyvecs[0] =
-		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
-
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
 		bloom_filter_atexit_registered = 1;
@@ -737,7 +750,6 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 fail:
 	revs->bloom_filter_settings = NULL;
-	free(path_alloc);
 	release_revisions_bloom_keyvecs(revs);
 }
 
-- 
2.47.2.vfs.0.2



^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision
  2025-07-10  8:48         ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
  2025-07-10 13:51           ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
@ 2025-07-10 13:55           ` Derrick Stolee
  2025-07-10 15:49             ` Lidong Yan
  1 sibling, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-10 13:55 UTC (permalink / raw)
  To: Lidong Yan; +Cc: 502024330056, git, gitster, toon

On 7/10/2025 4:48 AM, Lidong Yan wrote:
> @@ -710,23 +709,26 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  	if (!revs->pruning.pathspec.nr)
>  		return;
>  
> -	revs->bloom_keyvecs_nr = 1;
> -	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> -	pi = &revs->pruning.pathspec.items[0];
> +	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> +	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> +	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> +		pi = &revs->pruning.pathspec.items[i];
>  
> -	/* remove single trailing slash from path, if needed */
> -	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> -		path_alloc = xmemdupz(pi->match, pi->len - 1);
> -		path = path_alloc;
> -	} else
> -		path = pi->match;
> +		/* remove single trailing slash from path, if needed */
> +		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> +			path_alloc = xmemdupz(pi->match, pi->len - 1);
> +			path = path_alloc;
> +		} else
> +			path = pi->match;
>  
> -	len = strlen(path);
> -	if (!len)
> -		goto fail;
> +		len = strlen(path);
> +		if (!len)
> +			goto fail;
>  
> -	revs->bloom_keyvecs[0] =
> -		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> +		revs->bloom_keyvecs[i] =
> +			bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> +		FREE_AND_NULL(path_alloc);
> +	}

Focus on the change to this diff when the patch below is applied on
top of the 3.5/4 I sent earlier, resulting in this diff:

@@ -733,13 +732,14 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-
-	if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
-				       &revs->bloom_keyvecs[0],
-				       revs->bloom_filter_settings))
-		goto fail;
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[i],
+					       &revs->bloom_keyvecs[i],
+					       revs->bloom_filter_settings))
+			goto fail;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);

Also, I've included the hyperfine performance output in the commit
message.

Thanks,
-Stolee


--- >8 ---

From fe255b1acfbe90fa8e4c335435ae18ee95e6243c Mon Sep 17 00:00:00 2001
From: Lidong Yan <502024330056@smail.nju.edu.cn>
Date: Thu, 10 Jul 2025 08:04:34 -0400
Subject: [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision
 traversal

To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.

Add new test cases in t/t4216-log-bloom.sh to ensure
  - consistent results between the optimization for multiple pathspec
    items using bloom filter and the case without bloom filter
    optimization.
  - does not use bloom filter if any pathspec item is not literal.

With these optimizations, we get some improvements for multi-pathspec runs
of 'git log'. First, in the Git repository we see these modest results:

Benchmark 1: old
  Time (mean ± σ):      73.1 ms ±   2.9 ms
  Range (min … max):    69.9 ms …  84.5 ms    42 runs

Benchmark 2: new
  Time (mean ± σ):      55.1 ms ±   2.9 ms
  Range (min … max):    51.1 ms …  61.2 ms    52 runs

Summary
  'new' ran
    1.33 ± 0.09 times faster than 'old'

But in a larger repo, such as the LLVM project repo below, we get even
better results:

Benchmark 1: old
  Time (mean ± σ):      1.974 s ±  0.006 s
  Range (min … max):    1.960 s …  1.983 s    10 runs

Benchmark 2: new
  Time (mean ± σ):     262.9 ms ±   2.4 ms
  Range (min … max):   257.7 ms … 266.2 ms    11 runs

Summary
  'new' ran
    7.51 ± 0.07 times faster than 'old'

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 revision.c           | 22 +++++++++++-----------
 t/t4216-log-bloom.sh | 23 ++++++++++++++---------
 2 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/revision.c b/revision.c
index 4c09b594c55..ca8c1dde8ca 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
-		return 1;
+	for (size_t nr = 0; nr < spec->nr; nr++)
+		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+			return 1;
 
 	return 0;
 }
@@ -733,13 +732,14 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-
-	if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
-				       &revs->bloom_keyvecs[0],
-				       revs->bloom_filter_settings))
-		goto fail;
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[i],
+					       &revs->bloom_keyvecs[i],
+					       revs->bloom_filter_settings))
+			goto fail;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac1..639868ac562 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
 	rm -f "$TRASH_DIRECTORY/trace.perf" &&
-	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
-	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+	eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+	eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+		git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
 }
 
 test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
-'
-
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
 	test_bloom_filters_not_used "-- ."
 '
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+	test_bloom_filters_used "-- file4 A/file1" &&
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+	test_bloom_filters_not_used "-- file\*" &&
+	test_bloom_filters_not_used "-- A/\* file4" &&
+	test_bloom_filters_not_used "-- file4 A/\*" &&
+	test_bloom_filters_not_used "-- * A/\*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.47.2.vfs.0.2



^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key
  2025-07-10 13:51           ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
@ 2025-07-10 15:42             ` Lidong Yan
  0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 15:42 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, toon

Derrick Stolee <stolee@gmail.com> wrote:
> 
> This diff is still bigger than I was hoping, so I'm sending a couple
> of patches that simplify this code movement. Feel free to ignore
> them as being too nit-picky.
> 
> --- >8 ---
> 
> From 69fa36dc615e140ae842b536f7da792beaebb272 Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <stolee@gmail.com>
> Date: Thu, 10 Jul 2025 08:06:29 -0400
> Subject: [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key
> 
> When preparing to use bloom filters in a revision walk, Git populates a
> boom_keyvec with an array of bloom keys for the components of a path.
> Before we create the ability to map multiple pathspecs to multiple
> bloom_keyvecs, extract the conversion from a pathspec to a bloom_keyvec
> into its own helper method. This simplifies the state that persists in
> prepare_to_use_bloom_filter() as well as makes the next change much
> simpler.
> 
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> revision.c | 50 +++++++++++++++++++++++++++++++-------------------
> 1 file changed, 31 insertions(+), 19 deletions(-)
> 
> diff --git a/revision.c b/revision.c
> index 22bcfab7f93..4c09b594c55 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -687,14 +687,37 @@ static int forbid_bloom_filters(struct pathspec *spec)
> 
> static void release_revisions_bloom_keyvecs(struct rev_info *revs);
> 
> -static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +static int convert_pathspec_to_filter(const struct pathspec_item *pi,
> +      struct bloom_keyvec **bloom_keyvec,
> +      const struct bloom_filter_settings *settings)
> {
> - struct pathspec_item *pi;
> - struct bloom_keyvec *bloom_keyvec;
> - char *path_alloc = NULL;
> - const char *path, *p;
> size_t len;
> + const char *path;
> + char *path_alloc = NULL;
> + int res = 0;
> +
> + /* remove single trailing slash from path, if needed */
> + if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> + path_alloc = xmemdupz(pi->match, pi->len - 1);
> + path = path_alloc;
> + } else
> + path = pi->match;
> +
> + len = strlen(path);
> + if (!len) {
> + res = -1;
> + goto cleanup;
> + }
> +
> + *bloom_keyvec = bloom_keyvec_new(path, len, settings);
> 
> +cleanup:
> + FREE_AND_NULL(path_alloc);

I think we don’t need to NULL path_alloc here, but it doesn’t hurt.

> + return res;
> +}
> +
> +static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +{
> if (!revs->commits)
> return;
> 
> @@ -712,22 +735,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> 
> revs->bloom_keyvecs_nr = 1;
> CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> - pi = &revs->pruning.pathspec.items[0];
> 
> - /* remove single trailing slash from path, if needed */
> - if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> - path_alloc = xmemdupz(pi->match, pi->len - 1);
> - path = path_alloc;
> - } else
> - path = pi->match;
> -
> - len = strlen(path);
> - if (!len)
> + if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
> +       &revs->bloom_keyvecs[0],
> +       revs->bloom_filter_settings))
> goto fail;
> 
> - revs->bloom_keyvecs[0] =
> - bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> -
> if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> atexit(trace2_bloom_filter_statistics_atexit);
> bloom_filter_atexit_registered = 1;
> @@ -737,7 +750,6 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> 
> fail:
> revs->bloom_filter_settings = NULL;
> - free(path_alloc);
> release_revisions_bloom_keyvecs(revs);
> }
> 
> -- 
> 2.47.2.vfs.0.2

This looks perfect to me. I would squash patch 3.5 to patch 3 in v6.

Thanks for your patch,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision
  2025-07-10 13:55           ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
@ 2025-07-10 15:49             ` Lidong Yan
  0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 15:49 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, toon

Derrick Stolee <stolee@gmail.com> wrote:
> 
> --- >8 ---
> 
> From fe255b1acfbe90fa8e4c335435ae18ee95e6243c Mon Sep 17 00:00:00 2001
> From: Lidong Yan <502024330056@smail.nju.edu.cn>
> Date: Thu, 10 Jul 2025 08:04:34 -0400
> Subject: [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision
> traversal
> 
> To enable optimize multiple pathspec items in revision traversal,
> return 0 if all pathspec item is literal in forbid_bloom_filters().
> Add for loops to initialize and check each pathspec item's bloom_keyvec
> when optimization is possible.
> 
> Add new test cases in t/t4216-log-bloom.sh to ensure
>  - consistent results between the optimization for multiple pathspec
>    items using bloom filter and the case without bloom filter
>    optimization.
>  - does not use bloom filter if any pathspec item is not literal.
> 
> With these optimizations, we get some improvements for multi-pathspec runs
> of 'git log'. First, in the Git repository we see these modest results:
> 
> Benchmark 1: old
>  Time (mean ± σ):      73.1 ms ±   2.9 ms
>  Range (min … max):    69.9 ms …  84.5 ms    42 runs
> 
> Benchmark 2: new
>  Time (mean ± σ):      55.1 ms ±   2.9 ms
>  Range (min … max):    51.1 ms …  61.2 ms    52 runs
> 
> Summary
>  'new' ran
>    1.33 ± 0.09 times faster than 'old'
> 
> But in a larger repo, such as the LLVM project repo below, we get even
> better results:
> 
> Benchmark 1: old
>  Time (mean ± σ):      1.974 s ±  0.006 s
>  Range (min … max):    1.960 s …  1.983 s    10 runs
> 
> Benchmark 2: new
>  Time (mean ± σ):     262.9 ms ±   2.4 ms
>  Range (min … max):   257.7 ms … 266.2 ms    11 runs
> 
> Summary
>  'new' ran
>    7.51 ± 0.07 times faster than 'old'

Hyperfine do looks better. I will put this into commit message and
cover letter in v6.

> 
> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> revision.c           | 22 +++++++++++-----------
> t/t4216-log-bloom.sh | 23 ++++++++++++++---------
> 2 files changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/revision.c b/revision.c
> index 4c09b594c55..ca8c1dde8ca 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
> {
> if (spec->has_wildcard)
> return 1;
> - if (spec->nr > 1)
> - return 1;
> if (spec->magic & ~PATHSPEC_LITERAL)
> return 1;
> - if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> - return 1;
> + for (size_t nr = 0; nr < spec->nr; nr++)
> + if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
> + return 1;
> 
> return 0;
> }
> @@ -733,13 +732,14 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> if (!revs->pruning.pathspec.nr)
> return;
> 
> - revs->bloom_keyvecs_nr = 1;
> - CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> -
> - if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
> -       &revs->bloom_keyvecs[0],
> -       revs->bloom_filter_settings))
> - goto fail;
> + revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> + CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> + for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> + if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[i],
> +       &revs->bloom_keyvecs[i],
> +       revs->bloom_filter_settings))
> + goto fail;
> + }
> 
> if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> atexit(trace2_bloom_filter_statistics_atexit);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 8910d53cac1..639868ac562 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
> 
> setup () {
> rm -f "$TRASH_DIRECTORY/trace.perf" &&
> - git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
> - GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
> + eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
> + eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
> + git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
> }
> 
> test_bloom_filters_used () {
> @@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
> test_bloom_filters_not_used "--walk-reflogs -- A"
> '
> 
> -test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
> - test_bloom_filters_not_used "-- file4 A/file1"
> -'
> -
> test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
> test_bloom_filters_not_used "-- ."
> '
> @@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
> test_bloom_filters_used "-- *renamed"
> '
> 
> -test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
> - test_bloom_filters_not_used "-- *" &&
> - test_bloom_filters_not_used "-- file*"
> +test_expect_success 'git log with multiple literal paths uses Bloom filter' '
> + test_bloom_filters_used "-- file4 A/file1" &&
> + test_bloom_filters_used "-- *" &&
> + test_bloom_filters_used "-- file*"
> +'
> +
> +test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
> + test_bloom_filters_not_used "-- file\*" &&
> + test_bloom_filters_not_used "-- A/\* file4" &&
> + test_bloom_filters_not_used "-- file4 A/\*" &&
> + test_bloom_filters_not_used "-- * A/\*"
> '
> 
> test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
> -- 
> 2.47.2.vfs.0.2
> 

Looks great, I will apply this above patch 3.

Thanks,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-10  8:48         ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-10 16:17           ` Junio C Hamano
  2025-07-11 12:46             ` Lidong Yan
  0 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-07-10 16:17 UTC (permalink / raw)
  To: Lidong Yan; +Cc: 502024330056, git, toon

Lidong Yan <yldhome2d2@gmail.com> writes:

>  static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
> +	struct bloom_keyvec *bloom_keyvec;

This new variable is no longer used, since the code to create a new
keyvec is in a helper function and its return value is directly
stored in the array of keyvecs.

>  	char *path_alloc = NULL;
>  	const char *path, *p;

And the "p" variable no longer is used, because the logic it used to
create a new keyvec is moved elsewhere.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-10 16:17           ` Junio C Hamano
@ 2025-07-11 12:46             ` Lidong Yan
  2025-07-11 15:06               ` Junio C Hamano
  0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-11 12:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, toon

Junio C Hamano <gitster@pobox.com> write:
> 
> Lidong Yan <yldhome2d2@gmail.com> writes:
> 
>> static void prepare_to_use_bloom_filter(struct rev_info *revs)
>> {
>> struct pathspec_item *pi;
>> + struct bloom_keyvec *bloom_keyvec;
> 
> This new variable is no longer used, since the code to create a new
> keyvec is in a helper function and its return value is directly
> stored in the array of keyvecs.
> 
>> char *path_alloc = NULL;
>> const char *path, *p;
> 
> And the "p" variable no longer is used, because the logic it used to
> create a new keyvec is moved elsewhere.

Will fix in v6, Thanks,
Lidong


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-11 12:46             ` Lidong Yan
@ 2025-07-11 15:06               ` Junio C Hamano
  0 siblings, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-11 15:06 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, toon

Lidong Yan <502024330056@smail.nju.edu.cn> writes:

> Junio C Hamano <gitster@pobox.com> write:
>> 
>> Lidong Yan <yldhome2d2@gmail.com> writes:
>> 
>>> static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>> {
>>> struct pathspec_item *pi;
>>> + struct bloom_keyvec *bloom_keyvec;
>> 
>> This new variable is no longer used, since the code to create a new
>> keyvec is in a helper function and its return value is directly
>> stored in the array of keyvecs.
>> 
>>> char *path_alloc = NULL;
>>> const char *path, *p;
>> 
>> And the "p" variable no longer is used, because the logic it used to
>> create a new keyvec is moved elsewhere.
>
> Will fix in v6, Thanks,

FWIW, what I queued have these two already removed from v5.

Thanks.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
                           ` (4 preceding siblings ...)
  2025-07-10 13:49         ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
@ 2025-07-12  9:35         ` Lidong Yan
  2025-07-12  9:35           ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
                             ` (5 more replies)
  5 siblings, 6 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12  9:35 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee

The revision traversal limited by pathspec has optimization when
the pathspec has only one element, it does not use any pathspec
magic (other than literal), and there is no wildcard. The absence
of optimization for multiple pathspec elements in revision traversal
cause an issue raised by Kai Koponen at
  https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@mail.gmail.com/

While it is much harder to lift the latter two limitations,
supporting a pathspec with multiple elements is relatively easy.
Just make sure we hash each of them separately and ask the bloom
filter about them, and if we see none of them can possibly be
affected by the commit, we can skip without tree comparison.

The difference from v5 is:
  - extract convert pathspec item to bloom_keyvec logic to
    a separate function, which simplifies the prepare_to_use_bloom_filter()
    function.
  - fix few bugs in v5.

Below is a comparison of the time taken to run git log on Git and
LLVM repositories before and after applying this patch. These statistics
are given by Derrick Stolee at
  https://lore.kernel.org/git/afb68948-218b-4b56-9faa-29578ef9c73c@gmail.com/

Setup commit-graph:
  $ cd ~/src/git && git commit-graph write --split --reachable --changed-paths
  $ cd ~/src/llvm && git commit-graph write --split --reachable --changed-paths

Running hyperfine [1] on Git repository:

  $ hyperfine --warmup=3 \
  > -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- commit.c commit-graph.c' \
  > -n 'new' '~/_git/git/git log -100 -- commit.c commit-graph.c'

Benchmark 1: old
  Time (mean ± σ):      73.1 ms ±   2.9 ms    [User: 48.8 ms, System: 23.9 ms]
  Range (min … max):    69.9 ms …  84.5 ms    42 runs

Benchmark 2: new
  Time (mean ± σ):      55.1 ms ±   2.9 ms    [User: 30.5 ms, System: 24.4 ms]
  Range (min … max):    51.1 ms …  61.2 ms    52 runs

Summary
  'new' ran
    1.33 ± 0.09 times faster than 'old'

And for LLVM:

  $ hyperfine --warmup=3 \
  > -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h' \
  > -n 'new' '~/_git/git/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h'

Benchmark 1: old
  Time (mean ± σ):      1.974 s ±  0.006 s    [User: 1.877 s, System: 0.097 s]
  Range (min … max):    1.960 s …  1.983 s    10 runs

Benchmark 2: new
  Time (mean ± σ):     262.9 ms ±   2.4 ms    [User: 214.2 ms, System: 48.4 ms]
  Range (min … max):   257.7 ms … 266.2 ms    11 runs

Summary
  'new' ran
    7.51 ± 0.07 times faster than 'old'

[1] https://github.com/sharkdp/hyperfine

Lidong Yan (5):
  bloom: add test helper to return murmur3 hash
  bloom: rename function operates on bloom_key
  bloom: replace struct bloom_key * with struct bloom_keyvec
  revision: make helper for pathspec to bloom keyvec
  To enable optimize multiple pathspec items in revision traversal,
    return 0 if all pathspec item is literal in forbid_bloom_filters().
    Add for loops to initialize and check each pathspec item's
    bloom_keyvec when optimization is possible.

 blame.c               |   2 +-
 bloom.c               |  84 ++++++++++++++++++++++++++---
 bloom.h               |  54 ++++++++++++++-----
 line-log.c            |   5 +-
 revision.c            | 122 +++++++++++++++++++++---------------------
 revision.h            |   6 +--
 t/helper/test-bloom.c |   8 +--
 t/t4216-log-bloom.sh  |  23 ++++----
 8 files changed, 204 insertions(+), 100 deletions(-)

Range-diff against v5:
1:  4d8f60e5ff = 1:  f5ab19063d bloom: add test helper to return murmur3 hash
2:  acee03e397 = 2:  51a180daa6 bloom: rename function operates on bloom_key
3:  d7690bd02c ! 3:  e17249ab4b bloom: replace struct bloom_key * with struct bloom_keyvec
    @@ bloom.h: void bloom_key_fill(struct bloom_key *key, const char *data, size_t len
      void bloom_key_clear(struct bloom_key *key);
      
     +/*
    -+ * bloom_keyvec_fill - Allocate and populate a bloom_keyvec with keys for the
    ++ * bloom_keyvec_new - Allocate and populate a bloom_keyvec with keys for the
     + * given path.
     + *
     + * This function splits the input path by '/' and generates a bloom key for each
    @@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
      static void prepare_to_use_bloom_filter(struct rev_info *revs)
      {
      	struct pathspec_item *pi;
    -+	struct bloom_keyvec *bloom_keyvec;
      	char *path_alloc = NULL;
    - 	const char *path, *p;
    +-	const char *path, *p;
    ++	const char *path;
      	size_t len;
     -	int path_component_nr = 1;
      
-:  ---------- > 4:  b3c1f5bcd1 revision: make helper for pathspec to bloom keyvec
4:  e577aa1bfd ! 5:  785bd43674 bloom: optimize multiple pathspec items in revision traversal
    @@ Metadata
     Author: Lidong Yan <yldhome2d2@gmail.com>
     
      ## Commit message ##
    -    bloom: optimize multiple pathspec items in revision traversal
    -
         To enable optimize multiple pathspec items in revision traversal,
         return 0 if all pathspec item is literal in forbid_bloom_filters().
         Add for loops to initialize and check each pathspec item's bloom_keyvec
         when optimization is possible.
     
         Add new test cases in t/t4216-log-bloom.sh to ensure
    -      - consistent results between the optimization for multiple pathspec
    -        items using bloom filter and the case without bloom filter
    -        optimization.
    -      - does not use bloom filter if any pathspec item is not literal.
    +     - consistent results between the optimization for multiple pathspec
    +       items using bloom filter and the case without bloom filter
    +       optimization.
    +     - does not use bloom filter if any pathspec item is not literal.
    +
    +    With these optimizations, we get some improvements for multi-pathspec runs
    +    of 'git log'. First, in the Git repository we see these modest results:
    +
    +    Benchmark 1: old
    +     Time (mean ± σ):      73.1 ms ±   2.9 ms
    +     Range (min … max):    69.9 ms …  84.5 ms    42 runs
    +
    +    Benchmark 2: new
    +     Time (mean ± σ):      55.1 ms ±   2.9 ms
    +     Range (min … max):    51.1 ms …  61.2 ms    52 runs
    +
    +    Summary
    +     'new' ran
    +       1.33 ± 0.09 times faster than 'old'
    +
    +    But in a larger repo, such as the LLVM project repo below, we get even
    +    better results:
    +
    +    Benchmark 1: old
    +     Time (mean ± σ):      1.974 s ±  0.006 s
    +     Range (min … max):    1.960 s …  1.983 s    10 runs
    +
    +    Benchmark 2: new
    +     Time (mean ± σ):     262.9 ms ±   2.4 ms
    +     Range (min … max):   257.7 ms … 266.2 ms    11 runs
    +
    +    Summary
    +     'new' ran
    +       7.51 ± 0.07 times faster than 'old'
     
         Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
    +    Signed-off-by: Derrick Stolee <stolee@gmail.com>
     
      ## revision.c ##
     @@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
    @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
      
     -	revs->bloom_keyvecs_nr = 1;
     -	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
    --	pi = &revs->pruning.pathspec.items[0];
     +	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
     +	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
    -+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
    -+		pi = &revs->pruning.pathspec.items[i];
      
    --	/* remove single trailing slash from path, if needed */
    --	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
    --		path_alloc = xmemdupz(pi->match, pi->len - 1);
    --		path = path_alloc;
    --	} else
    --		path = pi->match;
    -+		/* remove single trailing slash from path, if needed */
    -+		if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
    -+			path_alloc = xmemdupz(pi->match, pi->len - 1);
    -+			path = path_alloc;
    -+		} else
    -+			path = pi->match;
    - 
    --	len = strlen(path);
    --	if (!len)
    +-	if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
    +-					     &revs->pruning.pathspec.items[0],
    +-					     revs->bloom_filter_settings))
     -		goto fail;
    -+		len = strlen(path);
    -+		if (!len)
    ++	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
    ++		if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
    ++						     &revs->pruning.pathspec.items[i],
    ++						     revs->bloom_filter_settings))
     +			goto fail;
    - 
    --	revs->bloom_keyvecs[0] =
    --		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
    -+		revs->bloom_keyvecs[i] =
    -+			bloom_keyvec_new(path, len, revs->bloom_filter_settings);
    -+		FREE_AND_NULL(path_alloc);
     +	}
      
      	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v6 1/5] bloom: add test helper to return murmur3 hash
  2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
@ 2025-07-12  9:35           ` Lidong Yan
  2025-07-12  9:35           ` [PATCH v6 2/5] bloom: rename function operates on bloom_key Lidong Yan
                             ` (4 subsequent siblings)
  5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12  9:35 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee

In bloom.h, murmur3_seeded_v2() is exported for the use of test murmur3
hash. To clarify that murmur3_seeded_v2() is exported solely for testing
purposes, a new helper function test_murmur3_seeded() was added instead
of exporting murmur3_seeded_v2() directly.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c               | 13 ++++++++++++-
 bloom.h               | 12 +++---------
 t/helper/test-bloom.c |  4 ++--
 3 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..946c5e8c98 100644
--- a/bloom.c
+++ b/bloom.c
@@ -107,7 +107,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
  * Not considered to be cryptographically secure.
  * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
  */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+static uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
 {
 	const uint32_t c1 = 0xcc9e2d51;
 	const uint32_t c2 = 0x1b873593;
@@ -540,3 +540,14 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 
 	return 1;
 }
+
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+				   int version)
+{
+	assert(version == 1 || version == 2);
+
+	if (version == 2)
+		return murmur3_seeded_v2(seed, data, len);
+	else
+		return murmur3_seeded_v1(seed, data, len);
+}
diff --git a/bloom.h b/bloom.h
index 6e46489a20..a9ded1822f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,15 +78,6 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
 
-/*
- * Calculate the murmur3 32-bit hash value for the given data
- * using the given seed.
- * Produces a uniformly distributed hash value.
- * Not considered to be cryptographically secure.
- * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
- */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
-
 void fill_bloom_key(const char *data,
 		    size_t len,
 		    struct bloom_key *key,
@@ -137,4 +128,7 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+				   int version);
+
 #endif
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9aa2c5a592..6a24b6e0a6 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -61,13 +61,13 @@ int cmd__bloom(int argc, const char **argv)
 		uint32_t hashed;
 		if (argc < 3)
 			usage(bloom_usage);
-		hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+		hashed = test_bloom_murmur3_seeded(0, argv[2], strlen(argv[2]), 2);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
 	if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
 		uint32_t hashed;
-		hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
+		hashed = test_bloom_murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7, 2);
 		printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
 	}
 
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 2/5] bloom: rename function operates on bloom_key
  2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
  2025-07-12  9:35           ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
@ 2025-07-12  9:35           ` Lidong Yan
  2025-07-12  9:35           ` [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
                             ` (3 subsequent siblings)
  5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12  9:35 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee

git code style requires that functions operating on a struct S
should be named in the form S_verb. However, the functions operating
on struct bloom_key do not follow this convention. Therefore,
fill_bloom_key() and clear_bloom_key() are renamed to bloom_key_fill()
and bloom_key_clear(), respectively.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 blame.c               |  2 +-
 bloom.c               | 10 ++++------
 bloom.h               |  6 ++----
 line-log.c            |  5 +++--
 revision.c            |  8 ++++----
 t/helper/test-bloom.c |  4 ++--
 6 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/blame.c b/blame.c
index 57daa45e89..811c6d8f9f 100644
--- a/blame.c
+++ b/blame.c
@@ -1310,7 +1310,7 @@ static void add_bloom_key(struct blame_bloom_data *bd,
 	}
 
 	bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
-	fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+	bloom_key_fill(bd->keys[bd->nr], path, strlen(path), bd->settings);
 	bd->nr++;
 }
 
diff --git a/bloom.c b/bloom.c
index 946c5e8c98..5523d198c8 100644
--- a/bloom.c
+++ b/bloom.c
@@ -221,9 +221,7 @@ static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
 	return seed;
 }
 
-void fill_bloom_key(const char *data,
-		    size_t len,
-		    struct bloom_key *key,
+void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
 		    const struct bloom_filter_settings *settings)
 {
 	int i;
@@ -243,7 +241,7 @@ void fill_bloom_key(const char *data,
 		key->hashes[i] = hash0 + i * hash1;
 }
 
-void clear_bloom_key(struct bloom_key *key)
+void bloom_key_clear(struct bloom_key *key)
 {
 	FREE_AND_NULL(key->hashes);
 }
@@ -500,9 +498,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 
 		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
 			struct bloom_key key;
-			fill_bloom_key(e->path, strlen(e->path), &key, settings);
+			bloom_key_fill(&key, e->path, strlen(e->path), settings);
 			add_key_to_filter(&key, filter, settings);
-			clear_bloom_key(&key);
+			bloom_key_clear(&key);
 		}
 
 	cleanup:
diff --git a/bloom.h b/bloom.h
index a9ded1822f..603bc1f90f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,11 +78,9 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
 
-void fill_bloom_key(const char *data,
-		    size_t len,
-		    struct bloom_key *key,
+void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
 		    const struct bloom_filter_settings *settings);
-void clear_bloom_key(struct bloom_key *key);
+void bloom_key_clear(struct bloom_key *key);
 
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
diff --git a/line-log.c b/line-log.c
index 628e3fe3ae..07f2154e84 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1172,12 +1172,13 @@ static int bloom_filter_check(struct rev_info *rev,
 		return 0;
 
 	while (!result && range) {
-		fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
+		bloom_key_fill(&key, range->path, strlen(range->path),
+			       rev->bloom_filter_settings);
 
 		if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
 			result = 1;
 
-		clear_bloom_key(&key);
+		bloom_key_clear(&key);
 		range = range->next;
 	}
 
diff --git a/revision.c b/revision.c
index afee111196..a7eadff0a5 100644
--- a/revision.c
+++ b/revision.c
@@ -739,15 +739,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	revs->bloom_keys_nr = path_component_nr;
 	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
 
-	fill_bloom_key(path, len, &revs->bloom_keys[0],
+	bloom_key_fill(&revs->bloom_keys[0], path, len,
 		       revs->bloom_filter_settings);
 	path_component_nr = 1;
 
 	p = path + len - 1;
 	while (p > path) {
 		if (*p == '/')
-			fill_bloom_key(path, p - path,
-				       &revs->bloom_keys[path_component_nr++],
+			bloom_key_fill(&revs->bloom_keys[path_component_nr++],
+				       path, p - path,
 				       revs->bloom_filter_settings);
 		p--;
 	}
@@ -3231,7 +3231,7 @@ void release_revisions(struct rev_info *revs)
 	oidset_clear(&revs->missing_commits);
 
 	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		clear_bloom_key(&revs->bloom_keys[i]);
+		bloom_key_clear(&revs->bloom_keys[i]);
 	FREE_AND_NULL(revs->bloom_keys);
 	revs->bloom_keys_nr = 0;
 }
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6a24b6e0a6..3283544bd3 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -12,13 +12,13 @@ static struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
 		struct bloom_key key;
 
-		fill_bloom_key(data, strlen(data), &key, &settings);
+		bloom_key_fill(&key, data, strlen(data), &settings);
 		printf("Hashes:");
 		for (size_t i = 0; i < settings.num_hashes; i++)
 			printf("0x%08x|", key.hashes[i]);
 		printf("\n");
 		add_key_to_filter(&key, filter, &settings);
-		clear_bloom_key(&key);
+		bloom_key_clear(&key);
 }
 
 static void print_bloom_filter(struct bloom_filter *filter) {
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec
  2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
  2025-07-12  9:35           ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
  2025-07-12  9:35           ` [PATCH v6 2/5] bloom: rename function operates on bloom_key Lidong Yan
@ 2025-07-12  9:35           ` Lidong Yan
  2025-07-12  9:35           ` [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec Lidong Yan
                             ` (2 subsequent siblings)
  5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12  9:35 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee

Previously, we stored bloom keys in a flat array and marked a commit
as NOT TREESAME if any key reported "definitely not changed".

To support multiple pathspec items, we now require that for each
pathspec item, there exists a bloom key reporting "definitely not
changed".

This "for every" condition makes a flat array insufficient, so we
introduce a new structure to group keys by a single pathspec item.
`struct bloom_keyvec` is introduced to replace `struct bloom_key *`
and `bloom_key_nr`. And because we want to support multiple pathspec
items, we added a bloom_keyvec * and a bloom_keyvec_nr field to
`struct rev_info` to represent an array of bloom_keyvecs. This commit
still optimize only one pathspec item, thus bloom_keyvec_nr can only
be 0 or 1.

New bloom_keyvec_* functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
contained in a bloom filter.

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 bloom.c    | 61 +++++++++++++++++++++++++++++++++++++++++++
 bloom.h    | 38 +++++++++++++++++++++++++++
 revision.c | 76 +++++++++++++++++++++---------------------------------
 revision.h |  6 ++---
 4 files changed, 132 insertions(+), 49 deletions(-)

diff --git a/bloom.c b/bloom.c
index 5523d198c8..b86015f6d1 100644
--- a/bloom.c
+++ b/bloom.c
@@ -278,6 +278,55 @@ void deinit_bloom_filters(void)
 	deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
 }
 
+struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
+				      const struct bloom_filter_settings *settings)
+{
+	struct bloom_keyvec *vec;
+	const char *p;
+	size_t sz;
+	size_t nr = 1;
+
+	p = path;
+	while (*p) {
+		/*
+		 * At this point, the path is normalized to use Unix-style
+		 * path separators. This is required due to how the
+		 * changed-path Bloom filters store the paths.
+		 */
+		if (*p == '/')
+			nr++;
+		p++;
+	}
+
+	sz = sizeof(struct bloom_keyvec);
+	sz += nr * sizeof(struct bloom_key);
+	vec = (struct bloom_keyvec *)xcalloc(1, sz);
+	if (!vec)
+		return NULL;
+	vec->count = nr;
+
+	bloom_key_fill(&vec->key[0], path, len, settings);
+	nr = 1;
+	p = path + len - 1;
+	while (p > path) {
+		if (*p == '/') {
+			bloom_key_fill(&vec->key[nr++], path, p - path, settings);
+		}
+		p--;
+	}
+	assert(nr == vec->count);
+	return vec;
+}
+
+void bloom_keyvec_free(struct bloom_keyvec *vec)
+{
+	if (!vec)
+		return;
+	for (size_t nr = 0; nr < vec->count; nr++)
+		bloom_key_clear(&vec->key[nr]);
+	free(vec);
+}
+
 static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
 		       const struct hashmap_entry *eptr,
 		       const struct hashmap_entry *entry_or_key,
@@ -539,6 +588,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 	return 1;
 }
 
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *vec,
+			      const struct bloom_filter_settings *settings)
+{
+	int ret = 1;
+
+	for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
+		ret = bloom_filter_contains(filter, &vec->key[nr], settings);
+
+	return ret;
+}
+
 uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
 				   int version)
 {
diff --git a/bloom.h b/bloom.h
index 603bc1f90f..92ab2100d3 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,16 @@ struct bloom_key {
 	uint32_t *hashes;
 };
 
+/*
+ * A bloom_keyvec is a vector of bloom_keys, which
+ * can be used to store multiple keys for a single
+ * pathspec item.
+ */
+struct bloom_keyvec {
+	size_t count;
+	struct bloom_key key[FLEX_ARRAY];
+};
+
 int load_bloom_filter_from_graph(struct commit_graph *g,
 				 struct bloom_filter *filter,
 				 uint32_t graph_pos);
@@ -82,6 +92,23 @@ void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
 		    const struct bloom_filter_settings *settings);
 void bloom_key_clear(struct bloom_key *key);
 
+/*
+ * bloom_keyvec_new - Allocate and populate a bloom_keyvec with keys for the
+ * given path.
+ *
+ * This function splits the input path by '/' and generates a bloom key for each
+ * prefix, in reverse order of specificity. For example, given the input
+ * "a/b/c", it will generate bloom keys for:
+ *   - "a/b/c"
+ *   - "a/b"
+ *   - "a"
+ *
+ * The resulting keys are stored in a newly allocated bloom_keyvec.
+ */
+struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
+				      const struct bloom_filter_settings *settings);
+void bloom_keyvec_free(struct bloom_keyvec *vec);
+
 void add_key_to_filter(const struct bloom_key *key,
 		       struct bloom_filter *filter,
 		       const struct bloom_filter_settings *settings);
@@ -126,6 +153,17 @@ int bloom_filter_contains(const struct bloom_filter *filter,
 			  const struct bloom_key *key,
 			  const struct bloom_filter_settings *settings);
 
+/*
+ * bloom_filter_contains_vec - Check if all keys in a key vector are in the
+ * Bloom filter.
+ *
+ * Returns 1 if **all** keys in the vector are present in the filter,
+ * 0 if **any** key is not present.
+ */
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+			      const struct bloom_keyvec *v,
+			      const struct bloom_filter_settings *settings);
+
 uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
 				   int version);
 
diff --git a/revision.c b/revision.c
index a7eadff0a5..e4e0c83b0c 100644
--- a/revision.c
+++ b/revision.c
@@ -685,13 +685,14 @@ static int forbid_bloom_filters(struct pathspec *spec)
 	return 0;
 }
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
 static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
 	char *path_alloc = NULL;
-	const char *path, *p;
+	const char *path;
 	size_t len;
-	int path_component_nr = 1;
 
 	if (!revs->commits)
 		return;
@@ -708,6 +709,8 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
+	revs->bloom_keyvecs_nr = 1;
+	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
 	pi = &revs->pruning.pathspec.items[0];
 
 	/* remove single trailing slash from path, if needed */
@@ -718,53 +721,30 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		path = pi->match;
 
 	len = strlen(path);
-	if (!len) {
-		revs->bloom_filter_settings = NULL;
-		free(path_alloc);
-		return;
-	}
-
-	p = path;
-	while (*p) {
-		/*
-		 * At this point, the path is normalized to use Unix-style
-		 * path separators. This is required due to how the
-		 * changed-path Bloom filters store the paths.
-		 */
-		if (*p == '/')
-			path_component_nr++;
-		p++;
-	}
-
-	revs->bloom_keys_nr = path_component_nr;
-	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+	if (!len)
+		goto fail;
 
-	bloom_key_fill(&revs->bloom_keys[0], path, len,
-		       revs->bloom_filter_settings);
-	path_component_nr = 1;
-
-	p = path + len - 1;
-	while (p > path) {
-		if (*p == '/')
-			bloom_key_fill(&revs->bloom_keys[path_component_nr++],
-				       path, p - path,
-				       revs->bloom_filter_settings);
-		p--;
-	}
+	revs->bloom_keyvecs[0] =
+		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
 		bloom_filter_atexit_registered = 1;
 	}
 
+	return;
+
+fail:
+	revs->bloom_filter_settings = NULL;
 	free(path_alloc);
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result = 1, j;
+	int result = 0;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -779,10 +759,10 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
-		result = bloom_filter_contains(filter,
-					       &revs->bloom_keys[j],
-					       revs->bloom_filter_settings);
+	for (size_t nr = 0; !result && nr < revs->bloom_keyvecs_nr; nr++) {
+		result = bloom_filter_contains_vec(filter,
+						   revs->bloom_keyvecs[nr],
+						   revs->bloom_filter_settings);
 	}
 
 	if (result)
@@ -823,7 +803,7 @@ static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_keys_nr && !nth_parent) {
+	if (revs->bloom_keyvecs_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
@@ -850,7 +830,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
 	if (!t1)
 		return 0;
 
-	if (!nth_parent && revs->bloom_keys_nr) {
+	if (!nth_parent && revs->bloom_keyvecs_nr) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 		if (!bloom_ret)
 			return 1;
@@ -3201,6 +3181,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
 
 static void release_revisions_topo_walk_info(struct topo_walk_info *info);
 
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+	for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
+		bloom_keyvec_free(revs->bloom_keyvecs[nr]);
+	FREE_AND_NULL(revs->bloom_keyvecs);
+	revs->bloom_keyvecs_nr = 0;
+}
+
 static void free_void_commit_list(void *list)
 {
 	free_commit_list(list);
@@ -3229,11 +3217,7 @@ void release_revisions(struct rev_info *revs)
 	clear_decoration(&revs->treesame, free);
 	line_log_free(revs);
 	oidset_clear(&revs->missing_commits);
-
-	for (int i = 0; i < revs->bloom_keys_nr; i++)
-		bloom_key_clear(&revs->bloom_keys[i]);
-	FREE_AND_NULL(revs->bloom_keys);
-	revs->bloom_keys_nr = 0;
+	release_revisions_bloom_keyvecs(revs);
 }
 
 static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..ac843f58d0 100644
--- a/revision.h
+++ b/revision.h
@@ -62,7 +62,7 @@ struct repository;
 struct rev_info;
 struct string_list;
 struct saved_parents;
-struct bloom_key;
+struct bloom_keyvec;
 struct bloom_filter_settings;
 struct option;
 struct parse_opt_ctx_t;
@@ -360,8 +360,8 @@ struct rev_info {
 
 	/* Commit graph bloom filter fields */
 	/* The bloom filter key(s) for the pathspec */
-	struct bloom_key *bloom_keys;
-	int bloom_keys_nr;
+	struct bloom_keyvec **bloom_keyvecs;
+	int bloom_keyvecs_nr;
 
 	/*
 	 * The bloom filter settings used to generate the key.
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec
  2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
                             ` (2 preceding siblings ...)
  2025-07-12  9:35           ` [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-12  9:35           ` Lidong Yan
  2025-07-12  9:35           ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
  2025-07-14 16:53           ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
  5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12  9:35 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee

When preparing to use bloom filters in a revision walk, Git populates a
boom_keyvec with an array of bloom keys for the components of a path.
Before we create the ability to map multiple pathspecs to multiple
bloom_keyvecs, extract the conversion from a pathspec to a bloom_keyvec
into its own helper method. This simplifies the state that persists in
prepare_to_use_bloom_filter() as well as makes the future change much
simpler.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 revision.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/revision.c b/revision.c
index e4e0c83b0c..1614c6ce0d 100644
--- a/revision.c
+++ b/revision.c
@@ -687,13 +687,37 @@ static int forbid_bloom_filters(struct pathspec *spec)
 
 static void release_revisions_bloom_keyvecs(struct rev_info *revs);
 
-static void prepare_to_use_bloom_filter(struct rev_info *revs)
+static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out,
+					    const struct pathspec_item *pi,
+					    const struct bloom_filter_settings *settings)
 {
-	struct pathspec_item *pi;
 	char *path_alloc = NULL;
 	const char *path;
 	size_t len;
+	int res = 0;
+
+	/* remove single trailing slash from path, if needed */
+	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+		path_alloc = xmemdupz(pi->match, pi->len - 1);
+		path = path_alloc;
+	} else
+		path = pi->match;
+
+	len = strlen(path);
+	if (!len) {
+		res = -1;
+		goto cleanup;
+	}
 
+	*out = bloom_keyvec_new(path, len, settings);
+
+cleanup:
+	free(path_alloc);
+	return res;
+}
+
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
 	if (!revs->commits)
 		return;
 
@@ -711,22 +735,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	revs->bloom_keyvecs_nr = 1;
 	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-	pi = &revs->pruning.pathspec.items[0];
 
-	/* remove single trailing slash from path, if needed */
-	if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-		path_alloc = xmemdupz(pi->match, pi->len - 1);
-		path = path_alloc;
-	} else
-		path = pi->match;
-
-	len = strlen(path);
-	if (!len)
+	if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
+					     &revs->pruning.pathspec.items[0],
+					     revs->bloom_filter_settings))
 		goto fail;
 
-	revs->bloom_keyvecs[0] =
-		bloom_keyvec_new(path, len, revs->bloom_filter_settings);
-
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
 		bloom_filter_atexit_registered = 1;
@@ -736,7 +750,6 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 fail:
 	revs->bloom_filter_settings = NULL;
-	free(path_alloc);
 	release_revisions_bloom_keyvecs(revs);
 }
 
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible.
  2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
                             ` (3 preceding siblings ...)
  2025-07-12  9:35           ` [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec Lidong Yan
@ 2025-07-12  9:35           ` Lidong Yan
  2025-07-12  9:47             ` Lidong Yan
  2025-07-14 16:53           ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
  5 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-12  9:35 UTC (permalink / raw)
  To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee

Add new test cases in t/t4216-log-bloom.sh to ensure
 - consistent results between the optimization for multiple pathspec
   items using bloom filter and the case without bloom filter
   optimization.
 - does not use bloom filter if any pathspec item is not literal.

With these optimizations, we get some improvements for multi-pathspec runs
of 'git log'. First, in the Git repository we see these modest results:

Benchmark 1: old
 Time (mean ± σ):      73.1 ms ±   2.9 ms
 Range (min … max):    69.9 ms …  84.5 ms    42 runs

Benchmark 2: new
 Time (mean ± σ):      55.1 ms ±   2.9 ms
 Range (min … max):    51.1 ms …  61.2 ms    52 runs

Summary
 'new' ran
   1.33 ± 0.09 times faster than 'old'

But in a larger repo, such as the LLVM project repo below, we get even
better results:

Benchmark 1: old
 Time (mean ± σ):      1.974 s ±  0.006 s
 Range (min … max):    1.960 s …  1.983 s    10 runs

Benchmark 2: new
 Time (mean ± σ):     262.9 ms ±   2.4 ms
 Range (min … max):   257.7 ms … 266.2 ms    11 runs

Summary
 'new' ran
   7.51 ± 0.07 times faster than 'old'

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 revision.c           | 21 +++++++++++----------
 t/t4216-log-bloom.sh | 23 ++++++++++++++---------
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/revision.c b/revision.c
index 1614c6ce0d..cf7198c0ea 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
-		return 1;
+	for (size_t nr = 0; nr < spec->nr; nr++)
+		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+			return 1;
 
 	return 0;
 }
@@ -733,13 +732,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
 
-	if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
-					     &revs->pruning.pathspec.items[0],
-					     revs->bloom_filter_settings))
-		goto fail;
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
+						     &revs->pruning.pathspec.items[i],
+						     revs->bloom_filter_settings))
+			goto fail;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
 	rm -f "$TRASH_DIRECTORY/trace.perf" &&
-	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
-	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+	eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+	eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+		git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
 }
 
 test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
-'
-
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
 	test_bloom_filters_not_used "-- ."
 '
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+	test_bloom_filters_used "-- file4 A/file1" &&
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+	test_bloom_filters_not_used "-- file\*" &&
+	test_bloom_filters_not_used "-- A/\* file4" &&
+	test_bloom_filters_not_used "-- file4 A/\*" &&
+	test_bloom_filters_not_used "-- * A/\*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible.
  2025-07-12  9:35           ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
@ 2025-07-12  9:47             ` Lidong Yan
  2025-07-12  9:51               ` [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision Lidong Yan
  0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-12  9:47 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, gitster, stolee

Lidong Yan <yldhome2d2@gmail.com> writes:
> 
> Add new test cases in t/t4216-log-bloom.sh to ensure
> - consistent results between the optimization for multiple pathspec
>   items using bloom filter and the case without bloom filter
>   optimization.
> - does not use bloom filter if any pathspec item is not literal.
> 
> With these optimizations, we get some improvements for multi-pathspec runs
> of 'git log'. First, in the Git repository we see these modest results:

Sorry, seems like I wrote bad commit message, I will resend patch 5/5 soon.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
  2025-07-12  9:47             ` Lidong Yan
@ 2025-07-12  9:51               ` Lidong Yan
  2025-07-14 16:51                 ` Derrick Stolee
  0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-12  9:51 UTC (permalink / raw)
  To: yldhome2d2; +Cc: git, gitster, stolee, Lidong Yan

To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.

Add new test cases in t/t4216-log-bloom.sh to ensure
 - consistent results between the optimization for multiple pathspec
   items using bloom filter and the case without bloom filter
   optimization.
 - does not use bloom filter if any pathspec item is not literal.

With these optimizations, we get some improvements for multi-pathspec runs
of 'git log'. First, in the Git repository we see these modest results:

Benchmark 1: old
 Time (mean ± σ):      73.1 ms ±   2.9 ms
 Range (min … max):    69.9 ms …  84.5 ms    42 runs

Benchmark 2: new
 Time (mean ± σ):      55.1 ms ±   2.9 ms
 Range (min … max):    51.1 ms …  61.2 ms    52 runs

Summary
 'new' ran
   1.33 ± 0.09 times faster than 'old'

But in a larger repo, such as the LLVM project repo below, we get even
better results:

Benchmark 1: old
 Time (mean ± σ):      1.974 s ±  0.006 s
 Range (min … max):    1.960 s …  1.983 s    10 runs

Benchmark 2: new
 Time (mean ± σ):     262.9 ms ±   2.4 ms
 Range (min … max):   257.7 ms … 266.2 ms    11 runs

Summary
 'new' ran
   7.51 ± 0.07 times faster than 'old'

Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 revision.c           | 21 +++++++++++----------
 t/t4216-log-bloom.sh | 23 ++++++++++++++---------
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/revision.c b/revision.c
index 1614c6ce0d..cf7198c0ea 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
-		return 1;
+	for (size_t nr = 0; nr < spec->nr; nr++)
+		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+			return 1;
 
 	return 0;
 }
@@ -733,13 +732,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
 
-	if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
-					     &revs->pruning.pathspec.items[0],
-					     revs->bloom_filter_settings))
-		goto fail;
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
+						     &revs->pruning.pathspec.items[i],
+						     revs->bloom_filter_settings))
+			goto fail;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
 	rm -f "$TRASH_DIRECTORY/trace.perf" &&
-	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
-	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+	eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+	eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+		git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
 }
 
 test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
-'
-
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
 	test_bloom_filters_not_used "-- ."
 '
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+	test_bloom_filters_used "-- file4 A/file1" &&
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+	test_bloom_filters_not_used "-- file\*" &&
+	test_bloom_filters_not_used "-- A/\* file4" &&
+	test_bloom_filters_not_used "-- file4 A/\*" &&
+	test_bloom_filters_not_used "-- * A/\*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
  2025-07-12  9:51               ` [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision Lidong Yan
@ 2025-07-14 16:51                 ` Derrick Stolee
  2025-07-14 17:01                   ` Junio C Hamano
  0 siblings, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-14 16:51 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, gitster, Lidong Yan

On 7/12/2025 5:51 AM, Lidong Yan wrote:
> To enable optimize multiple pathspec items in revision traversal,
> return 0 if all pathspec item is literal in forbid_bloom_filters().
> Add for loops to initialize and check each pathspec item's bloom_keyvec
> when optimization is possible.

The patch itself is good.

> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>

Here, I'll just point out that your sign-off should follow mine
because you were the last to touch the patch. In this way, the
sign-off gives a kind of timestamp to who made the most-recent
changes (and that those changes have that person's sign-off,
and may not have been vetted by previous signers).

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
                             ` (4 preceding siblings ...)
  2025-07-12  9:35           ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
@ 2025-07-14 16:53           ` Derrick Stolee
  2025-07-14 17:02             ` Junio C Hamano
  2025-07-15  1:34             ` Lidong Yan
  5 siblings, 2 replies; 72+ messages in thread
From: Derrick Stolee @ 2025-07-14 16:53 UTC (permalink / raw)
  To: Lidong Yan; +Cc: 502024330056, git, gitster, toon

On 7/12/2025 5:35 AM, Lidong Yan wrote:

> The difference from v5 is:
>   - extract convert pathspec item to bloom_keyvec logic to
>     a separate function, which simplifies the prepare_to_use_bloom_filter()
>     function.
>   - fix few bugs in v5.

Thanks for making these changes. Including your fixed patch 5, this
version looks ready to me.

I wouldn't say "fix a few bugs" but instead "fix some compile-time
linting complaints when using DEVELOPER=1" to be clear that the
functionality hasn't changed but the code is cleaner.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
  2025-07-14 16:51                 ` Derrick Stolee
@ 2025-07-14 17:01                   ` Junio C Hamano
  2025-07-15  1:37                     ` Lidong Yan
  0 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-07-14 17:01 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Lidong Yan, git, Lidong Yan

Derrick Stolee <stolee@gmail.com> writes:

> On 7/12/2025 5:51 AM, Lidong Yan wrote:
>> To enable optimize multiple pathspec items in revision traversal,
>> return 0 if all pathspec item is literal in forbid_bloom_filters().
>> Add for loops to initialize and check each pathspec item's bloom_keyvec
>> when optimization is possible.
>
> The patch itself is good.
>
>> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>
> Here, I'll just point out that your sign-off should follow mine
> because you were the last to touch the patch. In this way, the
> sign-off gives a kind of timestamp to who made the most-recent
> changes (and that those changes have that person's sign-off,
> and may not have been vetted by previous signers).

Thanks for pointing it out.  Also perhaps a single-liner attribution
to clarify who did what, e.g.

	Signed-off-by: Derrick
	[ly: did this and that to derrick's code to adjust]
	Signed-off-by: Lidong

would be more helpful.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-07-14 16:53           ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
@ 2025-07-14 17:02             ` Junio C Hamano
  2025-07-15  1:34             ` Lidong Yan
  1 sibling, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-14 17:02 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Lidong Yan, 502024330056, git, toon

Derrick Stolee <stolee@gmail.com> writes:

> On 7/12/2025 5:35 AM, Lidong Yan wrote:
>
>> The difference from v5 is:
>>   - extract convert pathspec item to bloom_keyvec logic to
>>     a separate function, which simplifies the prepare_to_use_bloom_filter()
>>     function.
>>   - fix few bugs in v5.
>
> Thanks for making these changes. Including your fixed patch 5, this
> version looks ready to me.
>
> I wouldn't say "fix a few bugs" but instead "fix some compile-time
> linting complaints when using DEVELOPER=1" to be clear that the
> functionality hasn't changed but the code is cleaner.

Thanks both, for polishing and reviewing.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-07-14 16:53           ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
  2025-07-14 17:02             ` Junio C Hamano
@ 2025-07-15  1:34             ` Lidong Yan
  2025-07-15  2:48               ` Derrick Stolee
  1 sibling, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-15  1:34 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, gitster, toon

Derrick Stolee <stolee@gmail.com> wrote:
> 
> On 7/12/2025 5:35 AM, Lidong Yan wrote:
> 
>> The difference from v5 is:
>>  - extract convert pathspec item to bloom_keyvec logic to
>>    a separate function, which simplifies the prepare_to_use_bloom_filter()
>>    function.
>>  - fix few bugs in v5.
> 
> Thanks for making these changes. Including your fixed patch 5, this
> version looks ready to me.
> 
> I wouldn't say "fix a few bugs" but instead "fix some compile-time
> linting complaints when using DEVELOPER=1" to be clear that the
> functionality hasn't changed but the code is cleaner.

I just learned that `make DEVELOPER=1` treats warnings as errors.
Since this is just a cover letter issue, I feel it might not be worth rerolling
the patch again.

Thanks,
Lidong

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
  2025-07-14 17:01                   ` Junio C Hamano
@ 2025-07-15  1:37                     ` Lidong Yan
  2025-07-15  2:56                       ` [RESEND][PATCH " Lidong Yan
  0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-15  1:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee, git

Junio C Hamano <gitster@pobox.com> writes:
> 
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 7/12/2025 5:51 AM, Lidong Yan wrote:
>>> To enable optimize multiple pathspec items in revision traversal,
>>> return 0 if all pathspec item is literal in forbid_bloom_filters().
>>> Add for loops to initialize and check each pathspec item's bloom_keyvec
>>> when optimization is possible.
>> 
>> The patch itself is good.
>> 
>>> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
>>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>> 
>> Here, I'll just point out that your sign-off should follow mine
>> because you were the last to touch the patch. In this way, the
>> sign-off gives a kind of timestamp to who made the most-recent
>> changes (and that those changes have that person's sign-off,
>> and may not have been vetted by previous signers).
> 
> Thanks for pointing it out.  Also perhaps a single-liner attribution
> to clarify who did what, e.g.
> 
> Signed-off-by: Derrick
> [ly: did this and that to derrick's code to adjust]
> Signed-off-by: Lidong

I will fix the order of the sign-offs, add the attributes, and resend
PATCH v6 5/5.

Thanks,
Lidong

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-07-15  1:34             ` Lidong Yan
@ 2025-07-15  2:48               ` Derrick Stolee
  2025-07-15 15:09                 ` Junio C Hamano
  0 siblings, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-15  2:48 UTC (permalink / raw)
  To: Lidong Yan; +Cc: git, gitster, toon

On 7/14/2025 9:34 PM, Lidong Yan wrote:
> Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 7/12/2025 5:35 AM, Lidong Yan wrote:
>>
>>> The difference from v5 is:
>>>  - extract convert pathspec item to bloom_keyvec logic to
>>>    a separate function, which simplifies the prepare_to_use_bloom_filter()
>>>    function.
>>>  - fix few bugs in v5.
>>
>> Thanks for making these changes. Including your fixed patch 5, this
>> version looks ready to me.
>>
>> I wouldn't say "fix a few bugs" but instead "fix some compile-time
>> linting complaints when using DEVELOPER=1" to be clear that the
>> functionality hasn't changed but the code is cleaner.
> 
> I just learned that `make DEVELOPER=1` treats warnings as errors.
> Since this is just a cover letter issue, I feel it might not be worth rerolling
> the patch again.

No need to reroll anything, I think. Junio's got the right
fixups in place.

This was just a comment to help you next time.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [RESEND][PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
  2025-07-15  1:37                     ` Lidong Yan
@ 2025-07-15  2:56                       ` Lidong Yan
  0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-15  2:56 UTC (permalink / raw)
  To: 502024330056; +Cc: git, gitster, stolee, Lidong Yan

To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.

Add new test cases in t/t4216-log-bloom.sh to ensure
 - consistent results between the optimization for multiple pathspec
   items using bloom filter and the case without bloom filter
   optimization.
 - does not use bloom filter if any pathspec item is not literal.

With these optimizations, we get some improvements for multi-pathspec runs
of 'git log'. First, in the Git repository we see these modest results:

Benchmark 1: old
 Time (mean ± σ):      73.1 ms ±   2.9 ms
 Range (min … max):    69.9 ms …  84.5 ms    42 runs

Benchmark 2: new
 Time (mean ± σ):      55.1 ms ±   2.9 ms
 Range (min … max):    51.1 ms …  61.2 ms    52 runs

Summary
 'new' ran
   1.33 ± 0.09 times faster than 'old'

But in a larger repo, such as the LLVM project repo below, we get even
better results:

Benchmark 1: old
 Time (mean ± σ):      1.974 s ±  0.006 s
 Range (min … max):    1.960 s …  1.983 s    10 runs

Benchmark 2: new
 Time (mean ± σ):     262.9 ms ±   2.4 ms
 Range (min … max):   257.7 ms … 266.2 ms    11 runs

Summary
 'new' ran
   7.51 ± 0.07 times faster than 'old'

Signed-off-by: Derrick Stolee <stolee@gmail.com>
[ly: rename convert_pathspec_to_filter() to convert_pathspec_to_bloom_keyvec()]
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
 revision.c           | 21 +++++++++++----------
 t/t4216-log-bloom.sh | 23 ++++++++++++++---------
 2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/revision.c b/revision.c
index 1614c6ce0d..cf7198c0ea 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
 {
 	if (spec->has_wildcard)
 		return 1;
-	if (spec->nr > 1)
-		return 1;
 	if (spec->magic & ~PATHSPEC_LITERAL)
 		return 1;
-	if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
-		return 1;
+	for (size_t nr = 0; nr < spec->nr; nr++)
+		if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+			return 1;
 
 	return 0;
 }
@@ -733,13 +732,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 	if (!revs->pruning.pathspec.nr)
 		return;
 
-	revs->bloom_keyvecs_nr = 1;
-	CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+	revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+	CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
 
-	if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
-					     &revs->pruning.pathspec.items[0],
-					     revs->bloom_filter_settings))
-		goto fail;
+	for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+		if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
+						     &revs->pruning.pathspec.items[i],
+						     revs->bloom_filter_settings))
+			goto fail;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
 
 setup () {
 	rm -f "$TRASH_DIRECTORY/trace.perf" &&
-	git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
-	GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+	eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+	eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+		git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
 }
 
 test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
 	test_bloom_filters_not_used "--walk-reflogs -- A"
 '
 
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
-	test_bloom_filters_not_used "-- file4 A/file1"
-'
-
 test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
 	test_bloom_filters_not_used "-- ."
 '
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
 	test_bloom_filters_used "-- *renamed"
 '
 
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
-	test_bloom_filters_not_used "-- *" &&
-	test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+	test_bloom_filters_used "-- file4 A/file1" &&
+	test_bloom_filters_used "-- *" &&
+	test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+	test_bloom_filters_not_used "-- file\*" &&
+	test_bloom_filters_not_used "-- A/\* file4" &&
+	test_bloom_filters_not_used "-- file4 A/\*" &&
+	test_bloom_filters_not_used "-- * A/\*"
 '
 
 test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
  2025-07-15  2:48               ` Derrick Stolee
@ 2025-07-15 15:09                 ` Junio C Hamano
  0 siblings, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-15 15:09 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Lidong Yan, git, toon

Derrick Stolee <stolee@gmail.com> writes:

> No need to reroll anything, I think. Junio's got the right
> fixups in place.
>
> This was just a comment to help you next time.

Thanks.

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2025-07-15 15:09 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
2025-06-25 12:55 ` [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-06-25 17:43   ` Junio C Hamano
2025-06-26  3:44     ` Lidong Yan
2025-06-25 12:55 ` [PATCH 2/2] bloom: enable multiple pathspec bloom keys Lidong Yan
2025-06-27 13:50   ` Junio C Hamano
2025-06-27 14:24     ` Lidong Yan
2025-06-27 18:09       ` Junio C Hamano
2025-07-01  5:52     ` Lidong Yan
2025-07-01 15:19       ` Junio C Hamano
2025-07-02  7:14         ` Lidong Yan
2025-07-02 15:48           ` Junio C Hamano
2025-07-03  1:52             ` Lidong Yan
2025-07-04 12:09             ` Lidong Yan
2025-07-01  8:50     ` SZEDER Gábor
2025-07-01 11:40       ` Lidong Yan
2025-07-01 15:43       ` Junio C Hamano
2025-06-27 20:39   ` Junio C Hamano
2025-06-28  2:54     ` Lidong Yan
2025-06-25 17:32 ` [PATCH 0/2] bloom: use bloom filter given multiple pathspec Junio C Hamano
2025-06-26  3:34   ` Lidong Yan
2025-06-26 14:15     ` Junio C Hamano
2025-06-27  6:21 ` [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Lidong Yan
2025-06-28  4:21   ` [PATCH v3 " Lidong Yan
2025-07-04 11:14     ` [PATCH v4 0/4] " Lidong Yan
2025-07-04 11:14       ` [PATCH v4 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-04 11:14       ` [PATCH v4 2/4] bloom: rename function operates on bloom_key Lidong Yan
2025-07-04 11:14       ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-07 11:35         ` Derrick Stolee
2025-07-07 14:14           ` Lidong Yan
2025-07-04 11:14       ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-07-07 11:43         ` Derrick Stolee
2025-07-07 14:18           ` Lidong Yan
2025-07-07 15:14           ` Junio C Hamano
2025-07-10  8:48       ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
2025-07-10  8:48         ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-10  8:48         ` [PATCH v5 2/4] bloom: rename function operates on bloom_key Lidong Yan
2025-07-10  8:48         ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-10 16:17           ` Junio C Hamano
2025-07-11 12:46             ` Lidong Yan
2025-07-11 15:06               ` Junio C Hamano
2025-07-10  8:48         ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-07-10 13:51           ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
2025-07-10 15:42             ` Lidong Yan
2025-07-10 13:55           ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
2025-07-10 15:49             ` Lidong Yan
2025-07-10 13:49         ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
2025-07-12  9:35         ` [PATCH v6 0/5] " Lidong Yan
2025-07-12  9:35           ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-12  9:35           ` [PATCH v6 2/5] bloom: rename function operates on bloom_key Lidong Yan
2025-07-12  9:35           ` [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-12  9:35           ` [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec Lidong Yan
2025-07-12  9:35           ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
2025-07-12  9:47             ` Lidong Yan
2025-07-12  9:51               ` [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision Lidong Yan
2025-07-14 16:51                 ` Derrick Stolee
2025-07-14 17:01                   ` Junio C Hamano
2025-07-15  1:37                     ` Lidong Yan
2025-07-15  2:56                       ` [RESEND][PATCH " Lidong Yan
2025-07-14 16:53           ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
2025-07-14 17:02             ` Junio C Hamano
2025-07-15  1:34             ` Lidong Yan
2025-07-15  2:48               ` Derrick Stolee
2025-07-15 15:09                 ` Junio C Hamano
2025-06-28  4:21   ` [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-02 15:08     ` Patrick Steinhardt
2025-07-02 15:49       ` Lidong Yan
2025-07-02 18:28       ` Junio C Hamano
2025-07-03  1:41         ` Lidong Yan
2025-06-28  4:21   ` [PATCH v3 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-06-27  6:21 ` [PATCH v2 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-06-27  6:21 ` [PATCH v2 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).