* [PATCH v4 1/4] bloom: add test helper to return murmur3 hash
2025-07-04 11:14 ` [PATCH v4 0/4] " Lidong Yan
@ 2025-07-04 11:14 ` Lidong Yan
2025-07-04 11:14 ` [PATCH v4 2/4] bloom: rename function operates on bloom_key Lidong Yan
` (3 subsequent siblings)
4 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster
In bloom.h, murmur3_seeded_v2() is exported for the use of test murmur3
hash. To clarify that murmur3_seeded_v2() is exported solely for testing
purposes, a new helper function test_murmur3_seeded() was added instead
of exporting murmur3_seeded_v2() directly.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
bloom.c | 13 ++++++++++++-
bloom.h | 12 +++---------
t/helper/test-bloom.c | 4 ++--
3 files changed, 17 insertions(+), 12 deletions(-)
diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..946c5e8c98 100644
--- a/bloom.c
+++ b/bloom.c
@@ -107,7 +107,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
* Not considered to be cryptographically secure.
* Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
*/
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+static uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
{
const uint32_t c1 = 0xcc9e2d51;
const uint32_t c2 = 0x1b873593;
@@ -540,3 +540,14 @@ int bloom_filter_contains(const struct bloom_filter *filter,
return 1;
}
+
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+ int version)
+{
+ assert(version == 1 || version == 2);
+
+ if (version == 2)
+ return murmur3_seeded_v2(seed, data, len);
+ else
+ return murmur3_seeded_v1(seed, data, len);
+}
diff --git a/bloom.h b/bloom.h
index 6e46489a20..a9ded1822f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,15 +78,6 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
-/*
- * Calculate the murmur3 32-bit hash value for the given data
- * using the given seed.
- * Produces a uniformly distributed hash value.
- * Not considered to be cryptographically secure.
- * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
- */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
-
void fill_bloom_key(const char *data,
size_t len,
struct bloom_key *key,
@@ -137,4 +128,7 @@ int bloom_filter_contains(const struct bloom_filter *filter,
const struct bloom_key *key,
const struct bloom_filter_settings *settings);
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+ int version);
+
#endif
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9aa2c5a592..6a24b6e0a6 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -61,13 +61,13 @@ int cmd__bloom(int argc, const char **argv)
uint32_t hashed;
if (argc < 3)
usage(bloom_usage);
- hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+ hashed = test_bloom_murmur3_seeded(0, argv[2], strlen(argv[2]), 2);
printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
}
if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
uint32_t hashed;
- hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
+ hashed = test_bloom_murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7, 2);
printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
}
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v4 2/4] bloom: rename function operates on bloom_key
2025-07-04 11:14 ` [PATCH v4 0/4] " Lidong Yan
2025-07-04 11:14 ` [PATCH v4 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
@ 2025-07-04 11:14 ` Lidong Yan
2025-07-04 11:14 ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
` (2 subsequent siblings)
4 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster
git code style requires that functions operating on a struct S
should be named in the form S_verb. However, the functions operating
on struct bloom_key do not follow this convention. Therefore,
fill_bloom_key() and clear_bloom_key() are renamed to bloom_key_fill()
and bloom_key_clear(), respectively.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
blame.c | 2 +-
bloom.c | 8 ++++----
bloom.h | 4 ++--
line-log.c | 4 ++--
revision.c | 6 +++---
t/helper/test-bloom.c | 4 ++--
6 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/blame.c b/blame.c
index 57daa45e89..459043a511 100644
--- a/blame.c
+++ b/blame.c
@@ -1310,7 +1310,7 @@ static void add_bloom_key(struct blame_bloom_data *bd,
}
bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
- fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+ bloom_key_fill(path, strlen(path), bd->keys[bd->nr], bd->settings);
bd->nr++;
}
diff --git a/bloom.c b/bloom.c
index 946c5e8c98..35ff36c31c 100644
--- a/bloom.c
+++ b/bloom.c
@@ -221,7 +221,7 @@ static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
return seed;
}
-void fill_bloom_key(const char *data,
+void bloom_key_fill(const char *data,
size_t len,
struct bloom_key *key,
const struct bloom_filter_settings *settings)
@@ -243,7 +243,7 @@ void fill_bloom_key(const char *data,
key->hashes[i] = hash0 + i * hash1;
}
-void clear_bloom_key(struct bloom_key *key)
+void bloom_key_clear(struct bloom_key *key)
{
FREE_AND_NULL(key->hashes);
}
@@ -500,9 +500,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
hashmap_for_each_entry(&pathmap, &iter, e, entry) {
struct bloom_key key;
- fill_bloom_key(e->path, strlen(e->path), &key, settings);
+ bloom_key_fill(e->path, strlen(e->path), &key, settings);
add_key_to_filter(&key, filter, settings);
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
}
cleanup:
diff --git a/bloom.h b/bloom.h
index a9ded1822f..edf14fef3e 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,11 +78,11 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
-void fill_bloom_key(const char *data,
+void bloom_key_fill(const char *data,
size_t len,
struct bloom_key *key,
const struct bloom_filter_settings *settings);
-void clear_bloom_key(struct bloom_key *key);
+void bloom_key_clear(struct bloom_key *key);
void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
diff --git a/line-log.c b/line-log.c
index 628e3fe3ae..a2aaf869a3 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1172,12 +1172,12 @@ static int bloom_filter_check(struct rev_info *rev,
return 0;
while (!result && range) {
- fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
+ bloom_key_fill(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
result = 1;
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
range = range->next;
}
diff --git a/revision.c b/revision.c
index afee111196..49fc650ac7 100644
--- a/revision.c
+++ b/revision.c
@@ -739,14 +739,14 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
revs->bloom_keys_nr = path_component_nr;
ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
- fill_bloom_key(path, len, &revs->bloom_keys[0],
+ bloom_key_fill(path, len, &revs->bloom_keys[0],
revs->bloom_filter_settings);
path_component_nr = 1;
p = path + len - 1;
while (p > path) {
if (*p == '/')
- fill_bloom_key(path, p - path,
+ bloom_key_fill(path, p - path,
&revs->bloom_keys[path_component_nr++],
revs->bloom_filter_settings);
p--;
@@ -3231,7 +3231,7 @@ void release_revisions(struct rev_info *revs)
oidset_clear(&revs->missing_commits);
for (int i = 0; i < revs->bloom_keys_nr; i++)
- clear_bloom_key(&revs->bloom_keys[i]);
+ bloom_key_clear(&revs->bloom_keys[i]);
FREE_AND_NULL(revs->bloom_keys);
revs->bloom_keys_nr = 0;
}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6a24b6e0a6..585a107802 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -12,13 +12,13 @@ static struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
struct bloom_key key;
- fill_bloom_key(data, strlen(data), &key, &settings);
+ bloom_key_fill(data, strlen(data), &key, &settings);
printf("Hashes:");
for (size_t i = 0; i < settings.num_hashes; i++)
printf("0x%08x|", key.hashes[i]);
printf("\n");
add_key_to_filter(&key, filter, &settings);
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
}
static void print_bloom_filter(struct bloom_filter *filter) {
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
2025-07-04 11:14 ` [PATCH v4 0/4] " Lidong Yan
2025-07-04 11:14 ` [PATCH v4 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-04 11:14 ` [PATCH v4 2/4] bloom: rename function operates on bloom_key Lidong Yan
@ 2025-07-04 11:14 ` Lidong Yan
2025-07-07 11:35 ` Derrick Stolee
2025-07-04 11:14 ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
4 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster
The revision traversal limited by pathspec has optimization when
the pathspec has only one element. To support optimization for
multiple pathspec items, we need to modify the data structures
in struct rev_info.
struct rev_info uses bloom_keys and bloom_nr to store the bloom keys
corresponding to a single pathspec item. To allow struct rev_info
to store bloom keys for multiple pathspec items, a new data structure
`struct bloom_keyvec` is introduced. Each `struct bloom_keyvec`
corresponds to a single pathspec item.
In `struct rev_info`, replace bloom_keys and bloom_nr with bloom_keyvecs
and bloom_keyvec_nr. This commit still optimize one pathspec item, thus
bloom_keyvec_nr can only be 0 or 1.
New bloom_keyvec_* functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
contained in a bloom filter. bloom_keyvec_fill_key() is added to
initialize a key in keyvec.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
bloom.c | 31 +++++++++++++++++++++++++++++++
bloom.h | 25 +++++++++++++++++++++++++
revision.c | 39 +++++++++++++++++++++------------------
revision.h | 6 +++---
4 files changed, 80 insertions(+), 21 deletions(-)
diff --git a/bloom.c b/bloom.c
index 35ff36c31c..877bda0ef3 100644
--- a/bloom.c
+++ b/bloom.c
@@ -280,6 +280,25 @@ void deinit_bloom_filters(void)
deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
}
+struct bloom_keyvec *bloom_keyvec_new(size_t count)
+{
+ struct bloom_keyvec *vec;
+ size_t sz = sizeof(struct bloom_keyvec);
+ sz += count * sizeof(struct bloom_key);
+ vec = (struct bloom_keyvec *)xcalloc(1, sz);
+ vec->count = count;
+ return vec;
+}
+
+void bloom_keyvec_free(struct bloom_keyvec *vec)
+{
+ if (!vec)
+ return;
+ for (size_t nr = 0; nr < vec->count; nr++)
+ bloom_key_clear(&vec->key[nr]);
+ free(vec);
+}
+
static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
const struct hashmap_entry *eptr,
const struct hashmap_entry *entry_or_key,
@@ -541,6 +560,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
return 1;
}
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+ const struct bloom_keyvec *vec,
+ const struct bloom_filter_settings *settings)
+{
+ int ret = 1;
+
+ for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
+ ret = bloom_filter_contains(filter, &vec->key[nr], settings);
+
+ return ret;
+}
+
uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
int version)
{
diff --git a/bloom.h b/bloom.h
index edf14fef3e..3669074f3a 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,16 @@ struct bloom_key {
uint32_t *hashes;
};
+/*
+ * A bloom_keyvec is a vector of bloom_keys, which
+ * can be used to store multiple keys for a single
+ * pathspec item.
+ */
+struct bloom_keyvec {
+ size_t count;
+ struct bloom_key key[FLEX_ARRAY];
+};
+
int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
@@ -84,6 +94,17 @@ void bloom_key_fill(const char *data,
const struct bloom_filter_settings *settings);
void bloom_key_clear(struct bloom_key *key);
+struct bloom_keyvec *bloom_keyvec_new(size_t count);
+void bloom_keyvec_free(struct bloom_keyvec *vec);
+
+static inline void bloom_keyvec_fill_key(const char *data, size_t len,
+ struct bloom_keyvec *vec, size_t nr,
+ const struct bloom_filter_settings *settings)
+{
+ assert(nr < vec->count);
+ bloom_key_fill(data, len, &vec->key[nr], settings);
+}
+
void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
const struct bloom_filter_settings *settings);
@@ -128,6 +149,10 @@ int bloom_filter_contains(const struct bloom_filter *filter,
const struct bloom_key *key,
const struct bloom_filter_settings *settings);
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+ const struct bloom_keyvec *v,
+ const struct bloom_filter_settings *settings);
+
uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
int version);
diff --git a/revision.c b/revision.c
index 49fc650ac7..7cbb49617d 100644
--- a/revision.c
+++ b/revision.c
@@ -688,6 +688,7 @@ static int forbid_bloom_filters(struct pathspec *spec)
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
+ struct bloom_keyvec *bloom_keyvec;
char *path_alloc = NULL;
const char *path, *p;
size_t len;
@@ -736,19 +737,21 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
p++;
}
- revs->bloom_keys_nr = path_component_nr;
- ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+ revs->bloom_keyvecs_nr = 1;
+ CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+ bloom_keyvec = bloom_keyvec_new(path_component_nr);
+ revs->bloom_keyvecs[0] = bloom_keyvec;
- bloom_key_fill(path, len, &revs->bloom_keys[0],
- revs->bloom_filter_settings);
+ bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
+ revs->bloom_filter_settings);
path_component_nr = 1;
p = path + len - 1;
while (p > path) {
if (*p == '/')
- bloom_key_fill(path, p - path,
- &revs->bloom_keys[path_component_nr++],
- revs->bloom_filter_settings);
+ bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
+ path_component_nr++,
+ revs->bloom_filter_settings);
p--;
}
@@ -764,7 +767,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
struct commit *commit)
{
struct bloom_filter *filter;
- int result = 1, j;
+ int result = 0;
if (!revs->repo->objects->commit_graph)
return -1;
@@ -779,10 +782,10 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
return -1;
}
- for (j = 0; result && j < revs->bloom_keys_nr; j++) {
- result = bloom_filter_contains(filter,
- &revs->bloom_keys[j],
- revs->bloom_filter_settings);
+ for (size_t nr = 0; !result && nr < revs->bloom_keyvecs_nr; nr++) {
+ result = bloom_filter_contains_vec(filter,
+ revs->bloom_keyvecs[nr],
+ revs->bloom_filter_settings);
}
if (result)
@@ -823,7 +826,7 @@ static int rev_compare_tree(struct rev_info *revs,
return REV_TREE_SAME;
}
- if (revs->bloom_keys_nr && !nth_parent) {
+ if (revs->bloom_keyvecs_nr && !nth_parent) {
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
if (bloom_ret == 0)
@@ -850,7 +853,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
if (!t1)
return 0;
- if (!nth_parent && revs->bloom_keys_nr) {
+ if (!nth_parent && revs->bloom_keyvecs_nr) {
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
if (!bloom_ret)
return 1;
@@ -3230,10 +3233,10 @@ void release_revisions(struct rev_info *revs)
line_log_free(revs);
oidset_clear(&revs->missing_commits);
- for (int i = 0; i < revs->bloom_keys_nr; i++)
- bloom_key_clear(&revs->bloom_keys[i]);
- FREE_AND_NULL(revs->bloom_keys);
- revs->bloom_keys_nr = 0;
+ for (size_t i = 0; i < revs->bloom_keyvecs_nr; i++)
+ bloom_keyvec_free(revs->bloom_keyvecs[i]);
+ FREE_AND_NULL(revs->bloom_keyvecs);
+ revs->bloom_keyvecs_nr = 0;
}
static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..ac843f58d0 100644
--- a/revision.h
+++ b/revision.h
@@ -62,7 +62,7 @@ struct repository;
struct rev_info;
struct string_list;
struct saved_parents;
-struct bloom_key;
+struct bloom_keyvec;
struct bloom_filter_settings;
struct option;
struct parse_opt_ctx_t;
@@ -360,8 +360,8 @@ struct rev_info {
/* Commit graph bloom filter fields */
/* The bloom filter key(s) for the pathspec */
- struct bloom_key *bloom_keys;
- int bloom_keys_nr;
+ struct bloom_keyvec **bloom_keyvecs;
+ int bloom_keyvecs_nr;
/*
* The bloom filter settings used to generate the key.
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply related [flat|nested] 72+ messages in thread* Re: [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
2025-07-04 11:14 ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-07 11:35 ` Derrick Stolee
2025-07-07 14:14 ` Lidong Yan
0 siblings, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-07 11:35 UTC (permalink / raw)
To: Lidong Yan; +Cc: 502024330056, git, gitster
On 7/4/2025 7:14 AM, Lidong Yan wrote:
> The revision traversal limited by pathspec has optimization when
> the pathspec has only one element. To support optimization for
> multiple pathspec items, we need to modify the data structures
> in struct rev_info.
You are correct that the revision-walking abandons bloom filters
when there are multiple pathspecs, and fixing this is a valuable
effort.
The need for this change is subtle and could use some extra context
to be sure reviewers understand:
This change is writing over some code that was created in
c525ce95b46 (commit-graph: check all leading directories in changed
path Bloom filters, 2020-07-01) to allow storing multiple bloom
keys during the revision walk. The multiple keys are focusing on
multiple path components of the literal pathspec.
The point is that after the initialization, the bloom key array is
used directly as a filter for reporting TREESAME commits: a commit
is automatically reported as TREESAME to its first parent if any
bloom key results in a "No, definitely not changed" result with
that commit's bloom filter.
The reason we need a new data structure is that we need to adjust
the conditionals.
BEFORE: "NOT TREESAME if there EXISTS a bloom key that reports NO"
AFTER: "NOT TREESAME if FOR EVERY pathspec there EXISTS a bloom key
that reports NO."
This "FOR EVERY" condition makes it impossible to use a flat array
of bloom keys for multiple pathspecs, justifying this change.
What is further confusing here is that we already have logic that
deals with arrays of bloom keys, so I expected that the vector was
the single structure storing a list of those arrays. Instead, the
vector is replacing the array itself. This is made clear by using
the vector immediately in the existing implementation.
> +struct bloom_keyvec *bloom_keyvec_new(size_t count)
> +{
> + struct bloom_keyvec *vec;
> + size_t sz = sizeof(struct bloom_keyvec);
> + sz += count * sizeof(struct bloom_key);
> + vec = (struct bloom_keyvec *)xcalloc(1, sz);
You could use CALLOC_ARRAY() to simplify this and drop
the 'sz' variable.
> + vec->count = count;
> + return vec;
> +}
> +
> +void bloom_keyvec_free(struct bloom_keyvec *vec)
> +{
> + if (!vec)
> + return;
> + for (size_t nr = 0; nr < vec->count; nr++)
> + bloom_key_clear(&vec->key[nr]);
> + free(vec);
> +}
> +
> static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
> const struct hashmap_entry *eptr,
> const struct hashmap_entry *entry_or_key,
> @@ -541,6 +560,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
> return 1;
> }
>
> +int bloom_filter_contains_vec(const struct bloom_filter *filter,
> + const struct bloom_keyvec *vec,
> + const struct bloom_filter_settings *settings)
> +{
> + int ret = 1;
> +
> + for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
> + ret = bloom_filter_contains(filter, &vec->key[nr], settings);
> +
> + return ret;
> +}
This implementation is where the subtle detail comes in. Might be worth
a comment to say "if any key in this list is not contained in the filter,
then the filter doesn't match this vector."
^ permalink raw reply [flat|nested] 72+ messages in thread* Re: [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
2025-07-07 11:35 ` Derrick Stolee
@ 2025-07-07 14:14 ` Lidong Yan
0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-07 14:14 UTC (permalink / raw)
To: Derrick Stolee; +Cc: git, gitster
Derrick Stolee <stolee@gmail.com> wrote:
>
> BEFORE: "NOT TREESAME if there EXISTS a bloom key that reports NO"
>
> AFTER: "NOT TREESAME if FOR EVERY pathspec there EXISTS a bloom key
> that reports NO."
>
> This "FOR EVERY" condition makes it impossible to use a flat array
> of bloom keys for multiple pathspecs, justifying this change.
>
> What is further confusing here is that we already have logic that
> deals with arrays of bloom keys, so I expected that the vector was
> the single structure storing a list of those arrays. Instead, the
> vector is replacing the array itself. This is made clear by using
> the vector immediately in the existing implementation.
I think the problem here is that it clearly enough in my comments.
When writing the code, I thought about converting the original
one-dimensional array struct bloom_key *keys into a two-dimensional
array struct bloom_key **. Then I came up with the idea that this
two-dimensional array could be designed as struct bloom_keyvec *keyvecs,
which might be clearer. Each struct bloom_keyvec would represent all the
bloom_key elements for a single pathspec item.
>> +struct bloom_keyvec *bloom_keyvec_new(size_t count)
>> +{
>> + struct bloom_keyvec *vec;
>> + size_t sz = sizeof(struct bloom_keyvec);
>> + sz += count * sizeof(struct bloom_key);
>> + vec = (struct bloom_keyvec *)xcalloc(1, sz);
> You could use CALLOC_ARRAY() to simplify this and drop
> the 'sz' variable.
You are right. I think you suggest to write struct bloom_keyvec like:
———————-
| count |
———————
| *keys | ——> key0 | key1 | key2 | … |
———————
And I am doing here makes struct bloom_keyvec looks like
———————
| count |
———————
| key[0] |
———————
| key[1] |
———————
| … |
Although bloom_keyvec_new() appears more complex, the advantage
is that bloom_keyvec_destroy() no longer needs to free keys manually.
And if I understand correctly, junio had suggested to use the second way
[here](https://lore.kernel.org/git/xmqqtt43u36t.fsf@gitster.g/).
>
>> + vec->count = count;
>> + return vec;
>> +}
>> +
>> +void bloom_keyvec_free(struct bloom_keyvec *vec)
>> +{
>> + if (!vec)
>> + return;
>> + for (size_t nr = 0; nr < vec->count; nr++)
>> + bloom_key_clear(&vec->key[nr]);
>> + free(vec);
>> +}
>> +
>> static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
>> const struct hashmap_entry *eptr,
>> const struct hashmap_entry *entry_or_key,
>> @@ -541,6 +560,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
>> return 1;
>> }
>>
>> +int bloom_filter_contains_vec(const struct bloom_filter *filter,
>> + const struct bloom_keyvec *vec,
>> + const struct bloom_filter_settings *settings)
>> +{
>> + int ret = 1;
>> +
>> + for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
>> + ret = bloom_filter_contains(filter, &vec->key[nr], settings);
>> +
>> + return ret;
>> +}
>
> This implementation is where the subtle detail comes in. Might be worth
> a comment to say "if any key in this list is not contained in the filter,
> then the filter doesn't match this vector."
I will add comment in the next version
Thank you for your review,
Lidong
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
2025-07-04 11:14 ` [PATCH v4 0/4] " Lidong Yan
` (2 preceding siblings ...)
2025-07-04 11:14 ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-04 11:14 ` Lidong Yan
2025-07-07 11:43 ` Derrick Stolee
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
4 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-04 11:14 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add code to initialize and check each pathspec item's bloom_keyvec.
Add new function release_revisions_bloom_keyvecs() to free all bloom
keyvec owned by rev_info.
Add new test cases in t/t4216-log-bloom.sh to ensure
- consistent results between the optimization for multiple pathspec
items using bloom filter and the case without bloom filter
optimization.
- does not use bloom filter if any pathspec item is not literal.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
revision.c | 118 ++++++++++++++++++++++++-------------------
t/t4216-log-bloom.sh | 23 +++++----
2 files changed, 79 insertions(+), 62 deletions(-)
diff --git a/revision.c b/revision.c
index 7cbb49617d..9a77c0d0bc 100644
--- a/revision.c
+++ b/revision.c
@@ -675,16 +675,17 @@ static int forbid_bloom_filters(struct pathspec *spec)
{
if (spec->has_wildcard)
return 1;
- if (spec->nr > 1)
- return 1;
if (spec->magic & ~PATHSPEC_LITERAL)
return 1;
- if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
- return 1;
+ for (size_t nr = 0; nr < spec->nr; nr++)
+ if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+ return 1;
return 0;
}
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
@@ -692,7 +693,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
char *path_alloc = NULL;
const char *path, *p;
size_t len;
- int path_component_nr = 1;
+ int path_component_nr;
if (!revs->commits)
return;
@@ -709,50 +710,52 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
- pi = &revs->pruning.pathspec.items[0];
-
- /* remove single trailing slash from path, if needed */
- if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
- path_alloc = xmemdupz(pi->match, pi->len - 1);
- path = path_alloc;
- } else
- path = pi->match;
-
- len = strlen(path);
- if (!len) {
- revs->bloom_filter_settings = NULL;
- free(path_alloc);
- return;
- }
-
- p = path;
- while (*p) {
- /*
- * At this point, the path is normalized to use Unix-style
- * path separators. This is required due to how the
- * changed-path Bloom filters store the paths.
- */
- if (*p == '/')
- path_component_nr++;
- p++;
- }
-
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
- bloom_keyvec = bloom_keyvec_new(path_component_nr);
- revs->bloom_keyvecs[0] = bloom_keyvec;
-
- bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
- revs->bloom_filter_settings);
- path_component_nr = 1;
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+ pi = &revs->pruning.pathspec.items[i];
+ path_component_nr = 1;
+
+ /* remove single trailing slash from path, if needed */
+ if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+ path_alloc = xmemdupz(pi->match, pi->len - 1);
+ path = path_alloc;
+ } else
+ path = pi->match;
+
+ len = strlen(path);
+ if (!len)
+ goto fail;
+
+ p = path;
+ while (*p) {
+ /*
+ * At this point, the path is normalized to use
+ * Unix-style path separators. This is required due to
+ * how the changed-path Bloom filters store the paths.
+ */
+ if (*p == '/')
+ path_component_nr++;
+ p++;
+ }
- p = path + len - 1;
- while (p > path) {
- if (*p == '/')
- bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
- path_component_nr++,
- revs->bloom_filter_settings);
- p--;
+ bloom_keyvec = bloom_keyvec_new(path_component_nr);
+ revs->bloom_keyvecs[i] = bloom_keyvec;
+
+ bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
+ revs->bloom_filter_settings);
+ path_component_nr = 1;
+
+ p = path + len - 1;
+ while (p > path) {
+ if (*p == '/')
+ bloom_keyvec_fill_key(path, p - path,
+ bloom_keyvec,
+ path_component_nr++,
+ revs->bloom_filter_settings);
+ p--;
+ }
+ FREE_AND_NULL(path_alloc);
}
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
@@ -760,7 +763,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
bloom_filter_atexit_registered = 1;
}
+ return;
+
+fail:
+ revs->bloom_filter_settings = NULL;
free(path_alloc);
+ release_revisions_bloom_keyvecs(revs);
}
static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
@@ -3204,6 +3212,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
static void release_revisions_topo_walk_info(struct topo_walk_info *info);
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+ for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
+ bloom_keyvec_free(revs->bloom_keyvecs[nr]);
+ FREE_AND_NULL(revs->bloom_keyvecs);
+ revs->bloom_keyvecs_nr = 0;
+}
+
static void free_void_commit_list(void *list)
{
free_commit_list(list);
@@ -3232,11 +3248,7 @@ void release_revisions(struct rev_info *revs)
clear_decoration(&revs->treesame, free);
line_log_free(revs);
oidset_clear(&revs->missing_commits);
-
- for (size_t i = 0; i < revs->bloom_keyvecs_nr; i++)
- bloom_keyvec_free(revs->bloom_keyvecs[i]);
- FREE_AND_NULL(revs->bloom_keyvecs);
- revs->bloom_keyvecs_nr = 0;
+ release_revisions_bloom_keyvecs(revs);
}
static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
setup () {
rm -f "$TRASH_DIRECTORY/trace.perf" &&
- git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
- GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+ eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+ eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+ git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
}
test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
test_bloom_filters_not_used "--walk-reflogs -- A"
'
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
- test_bloom_filters_not_used "-- file4 A/file1"
-'
-
test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
test_bloom_filters_not_used "-- ."
'
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
test_bloom_filters_used "-- *renamed"
'
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
- test_bloom_filters_not_used "-- *" &&
- test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+ test_bloom_filters_used "-- file4 A/file1" &&
+ test_bloom_filters_used "-- *" &&
+ test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+ test_bloom_filters_not_used "-- file\*" &&
+ test_bloom_filters_not_used "-- A/\* file4" &&
+ test_bloom_filters_not_used "-- file4 A/\*" &&
+ test_bloom_filters_not_used "-- * A/\*"
'
test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply related [flat|nested] 72+ messages in thread* Re: [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
2025-07-04 11:14 ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
@ 2025-07-07 11:43 ` Derrick Stolee
2025-07-07 14:18 ` Lidong Yan
2025-07-07 15:14 ` Junio C Hamano
0 siblings, 2 replies; 72+ messages in thread
From: Derrick Stolee @ 2025-07-07 11:43 UTC (permalink / raw)
To: Lidong Yan; +Cc: 502024330056, git, gitster
On 7/4/2025 7:14 AM, Lidong Yan wrote:
> To enable optimize multiple pathspec items in revision traversal,
> return 0 if all pathspec item is literal in forbid_bloom_filters().
> Add code to initialize and check each pathspec item's bloom_keyvec.
>
> Add new function release_revisions_bloom_keyvecs() to free all bloom
> keyvec owned by rev_info.
>
> Add new test cases in t/t4216-log-bloom.sh to ensure
> - consistent results between the optimization for multiple pathspec
> items using bloom filter and the case without bloom filter
> optimization.
> - does not use bloom filter if any pathspec item is not literal.
This would be a great time to add some performance statistics when
using this feature with multiple pathspecs on some standard repos (git
and the Linux kernel repo are two good examples).
We don't have a great performance script for this, since each test
repo will have different paths to use for comparisons, but you can
use 'hyperfine' to assemble your own comparisons before and after
this change and report them here (and in your cover letter).
> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
> ---
> revision.c | 118 ++++++++++++++++++++++++-------------------
> t/t4216-log-bloom.sh | 23 +++++----
> 2 files changed, 79 insertions(+), 62 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index 7cbb49617d..9a77c0d0bc 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -675,16 +675,17 @@ static int forbid_bloom_filters(struct pathspec *spec)
> {
> if (spec->has_wildcard)
> return 1;
> - if (spec->nr > 1)
> - return 1;
> if (spec->magic & ~PATHSPEC_LITERAL)
> return 1;
> - if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> - return 1;
> + for (size_t nr = 0; nr < spec->nr; nr++)
> + if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
> + return 1;
This is a good check: if any is non-literal, then we can't use bloom
filters.
> +static void release_revisions_bloom_keyvecs(struct rev_info *revs);
> +
> static void prepare_to_use_bloom_filter(struct rev_info *revs)
> {
> struct pathspec_item *pi;
> @@ -692,7 +693,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> char *path_alloc = NULL;
> const char *path, *p;
> size_t len;
> - int path_component_nr = 1;
> + int path_component_nr;
We can move this into the interior of the loop, right?
> if (!revs->commits)
> return;
> @@ -709,50 +710,52 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> if (!revs->pruning.pathspec.nr)
> return;
>
> - pi = &revs->pruning.pathspec.items[0];
> -
> - /* remove single trailing slash from path, if needed */
> - if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> - path_alloc = xmemdupz(pi->match, pi->len - 1);
> - path = path_alloc;
> - } else
> - path = pi->match;
> -
> - len = strlen(path);
> - if (!len) {
> - revs->bloom_filter_settings = NULL;
> - free(path_alloc);
> - return;
> - }
> -
> - p = path;
> - while (*p) {
> - /*
> - * At this point, the path is normalized to use Unix-style
> - * path separators. This is required due to how the
> - * changed-path Bloom filters store the paths.
> - */
> - if (*p == '/')
> - path_component_nr++;
> - p++;
> - }
> -
> - revs->bloom_keyvecs_nr = 1;
> - CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> - bloom_keyvec = bloom_keyvec_new(path_component_nr);
> - revs->bloom_keyvecs[0] = bloom_keyvec;
> -
> - bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
> - revs->bloom_filter_settings);
> - path_component_nr = 1;
The size of this diff is unfortunate. I wonder if there could first
be an extraction of this logic to operate on a single pathspec and
bloom_keyvec in a way that would be an obvious code move, then this
patch could call that method in a loop now that we have an array of
bloom_keyvecs.
I think it would make a cleaner patch and a cleaner final result.
> + revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> + CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> + for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> + pi = &revs->pruning.pathspec.items[i];
> + path_component_nr = 1;
> +
> + /* remove single trailing slash from path, if needed */
> + if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> + path_alloc = xmemdupz(pi->match, pi->len - 1);
> + path = path_alloc;
> + } else
> + path = pi->match;
> +
> + len = strlen(path);
> + if (!len)
> + goto fail;
> +
> + p = path;
> + while (*p) {
> + /*
> + * At this point, the path is normalized to use
> + * Unix-style path separators. This is required due to
> + * how the changed-path Bloom filters store the paths.
> + */
> + if (*p == '/')
> + path_component_nr++;
> + p++;
> + }
>
> - p = path + len - 1;
> - while (p > path) {
> - if (*p == '/')
> - bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
> - path_component_nr++,
> - revs->bloom_filter_settings);
> - p--;
> + bloom_keyvec = bloom_keyvec_new(path_component_nr);
> + revs->bloom_keyvecs[i] = bloom_keyvec;
> +
> + bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
> + revs->bloom_filter_settings);
> + path_component_nr = 1;
> +
> + p = path + len - 1;
> + while (p > path) {
> + if (*p == '/')
> + bloom_keyvec_fill_key(path, p - path,
> + bloom_keyvec,
> + path_component_nr++,
> + revs->bloom_filter_settings);
> + p--;
> + }
> + FREE_AND_NULL(path_alloc);
...
> -test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
> - test_bloom_filters_not_used "-- file4 A/file1"
> -'
> -
Love to see this.
> test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
> test_bloom_filters_not_used "-- ."
> '
> @@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
> test_bloom_filters_used "-- *renamed"
> '
>
> -test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
> - test_bloom_filters_not_used "-- *" &&
> - test_bloom_filters_not_used "-- file*"
> +test_expect_success 'git log with multiple literal paths uses Bloom filter' '
> + test_bloom_filters_used "-- file4 A/file1" &&
> + test_bloom_filters_used "-- *" &&
> + test_bloom_filters_used "-- file*"
> +'
> +
> +test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
> + test_bloom_filters_not_used "-- file\*" &&
> + test_bloom_filters_not_used "-- A/\* file4" &&
> + test_bloom_filters_not_used "-- file4 A/\*" &&
> + test_bloom_filters_not_used "-- * A/\*"
> '
And these new test cases are great.
Thanks for this work. I'm happy to see the feature be added and my suggestions
are purely cosmetic as the proof is in your tests.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 72+ messages in thread* Re: [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
2025-07-07 11:43 ` Derrick Stolee
@ 2025-07-07 14:18 ` Lidong Yan
2025-07-07 15:14 ` Junio C Hamano
1 sibling, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-07 14:18 UTC (permalink / raw)
To: Derrick Stolee; +Cc: git, gitster
Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/4/2025 7:14 AM, Lidong Yan wrote:
>> To enable optimize multiple pathspec items in revision traversal,
>> return 0 if all pathspec item is literal in forbid_bloom_filters().
>> Add code to initialize and check each pathspec item's bloom_keyvec.
>>
>> Add new function release_revisions_bloom_keyvecs() to free all bloom
>> keyvec owned by rev_info.
>>
>> Add new test cases in t/t4216-log-bloom.sh to ensure
>> - consistent results between the optimization for multiple pathspec
>> items using bloom filter and the case without bloom filter
>> optimization.
>> - does not use bloom filter if any pathspec item is not literal.
>
> This would be a great time to add some performance statistics when
> using this feature with multiple pathspecs on some standard repos (git
> and the Linux kernel repo are two good examples).
>
> We don't have a great performance script for this, since each test
> repo will have different paths to use for comparisons, but you can
> use 'hyperfine' to assemble your own comparisons before and after
> this change and report them here (and in your cover letter).
Got it, I will add statistics in the next version.
>> + int path_component_nr;
>
> We can move this into the interior of the loop, right?
Yes, move this into loop looks better.
>
> The size of this diff is unfortunate. I wonder if there could first
> be an extraction of this logic to operate on a single pathspec and
> bloom_keyvec in a way that would be an obvious code move, then this
> patch could call that method in a loop now that we have an array of
> bloom_keyvecs.
I will try to figure out a way to make diff cleaner.
Thanks,
Lidong
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
2025-07-07 11:43 ` Derrick Stolee
2025-07-07 14:18 ` Lidong Yan
@ 2025-07-07 15:14 ` Junio C Hamano
1 sibling, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-07 15:14 UTC (permalink / raw)
To: Derrick Stolee; +Cc: Lidong Yan, 502024330056, git
Derrick Stolee <stolee@gmail.com> writes:
> On 7/4/2025 7:14 AM, Lidong Yan wrote:
>> To enable optimize multiple pathspec items in revision traversal,
>> return 0 if all pathspec item is literal in forbid_bloom_filters().
>> Add code to initialize and check each pathspec item's bloom_keyvec.
>>
>> Add new function release_revisions_bloom_keyvecs() to free all bloom
>> keyvec owned by rev_info.
>>
>> Add new test cases in t/t4216-log-bloom.sh to ensure
>> - consistent results between the optimization for multiple pathspec
>> items using bloom filter and the case without bloom filter
>> optimization.
>> - does not use bloom filter if any pathspec item is not literal.
>
> This would be a great time to add some performance statistics when
> using this feature with multiple pathspecs on some standard repos (git
> and the Linux kernel repo are two good examples).
>
> We don't have a great performance script for this, since each test
> repo will have different paths to use for comparisons, but you can
> use 'hyperfine' to assemble your own comparisons before and after
> this change and report them here (and in your cover letter).
Excellent suggestion. Very much appreciated.
> The size of this diff is unfortunate. I wonder if there could first
> be an extraction of this logic to operate on a single pathspec and
> bloom_keyvec in a way that would be an obvious code move, then this
> patch could call that method in a loop now that we have an array of
> bloom_keyvecs.
>
> I think it would make a cleaner patch and a cleaner final result.
Hmph. Didn't think of that extra "preliminary clean-up" step
myself, but I think you have a point. That would likely make
the resulting series easier to follow.
Thanks.
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
2025-07-04 11:14 ` [PATCH v4 0/4] " Lidong Yan
` (3 preceding siblings ...)
2025-07-04 11:14 ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
@ 2025-07-10 8:48 ` Lidong Yan
2025-07-10 8:48 ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
` (5 more replies)
4 siblings, 6 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 8:48 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon
The revision traversal limited by pathspec has optimization when
the pathspec has only one element, it does not use any pathspec
magic (other than literal), and there is no wildcard. The absence
of optimization for multiple pathspec elements in revision traversal
cause an issue raised by Kai Koponen at
https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@mail.gmail.com/
While it is much harder to lift the latter two limitations,
supporting a pathspec with multiple elements is relatively easy.
Just make sure we hash each of them separately and ask the bloom
filter about them, and if we see none of them can possibly be
affected by the commit, we can skip without tree comparison.
The difference from v4 is:
- for the bloom_key_* functions, we now pass struct bloom_key *
as the first parameter.
- bloom_keyvec_fill_key() and bloom_keyvec_new() have been merged
into a single function, so that each key is filled during the
creation of the bloom_keyvec.
Below is a comparison of the time taken to run git log on Git and
LLVM repositories before and after applying this patch.
Setup commit-graph:
$ cd ~/src/git && git commit-graph write --split --reachable --changed-paths
$ cd ~/src/llvm && git commit-graph write --split --reachable --changed-paths
Run git log on Git repository
$ cd ~/src/git
$ hash -p /usr/bin/git git # my system git binary is 2.43.0
$ time git log -100 -- commit.c commit-graph.c >/dev/null
real 0m0.055s
user 0m0.040s
sys 0m0.015s
$ hash -p ~/bin/git/bin/git git
$ time git log -100 -- commit.c commit-graph.c >/dev/null
real 0m0.039s
user 0m0.020s
sys 0m0.020s
Run git log in LLVM repository
$ cd ~/src/llvm
$ hash -p /usr/bin/git git # my system git binary is 2.43.0
$ time git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h >/dev/null
real 0m2.365s
user 0m2.252s
sys 0m0.113s
$ hash -p ~/bin/git/bin/git git
$ time git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h >/dev/null
real 0m0.240s
user 0m0.158s
sys 0m0.064s
Lidong Yan (4):
bloom: add test helper to return murmur3 hash
bloom: rename function operates on bloom_key
bloom: replace struct bloom_key * with struct bloom_keyvec
bloom: optimize multiple pathspec items in revision traversal
blame.c | 2 +-
bloom.c | 84 +++++++++++++++++++++++++++++++++---
bloom.h | 54 +++++++++++++++++------
line-log.c | 5 ++-
revision.c | 99 +++++++++++++++++++------------------------
revision.h | 6 +--
t/helper/test-bloom.c | 8 ++--
t/t4216-log-bloom.sh | 23 ++++++----
8 files changed, 187 insertions(+), 94 deletions(-)
Range-diff against v4:
1: d6883e9d6c = 1: 31c048dcdb bloom: add test helper to return murmur3 hash
2: f114556c0f ! 2: d2603c1752 bloom: rename function operates on bloom_key
@@ blame.c: static void add_bloom_key(struct blame_bloom_data *bd,
bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
- fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
-+ bloom_key_fill(path, strlen(path), bd->keys[bd->nr], bd->settings);
++ bloom_key_fill(bd->keys[bd->nr], path, strlen(path), bd->settings);
bd->nr++;
}
@@ bloom.c: static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size
}
-void fill_bloom_key(const char *data,
-+void bloom_key_fill(const char *data,
- size_t len,
- struct bloom_key *key,
+- size_t len,
+- struct bloom_key *key,
++void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings)
+ {
+ int i;
@@ bloom.c: void fill_bloom_key(const char *data,
key->hashes[i] = hash0 + i * hash1;
}
@@ bloom.c: struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
hashmap_for_each_entry(&pathmap, &iter, e, entry) {
struct bloom_key key;
- fill_bloom_key(e->path, strlen(e->path), &key, settings);
-+ bloom_key_fill(e->path, strlen(e->path), &key, settings);
++ bloom_key_fill(&key, e->path, strlen(e->path), settings);
add_key_to_filter(&key, filter, settings);
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
@@ bloom.h: int load_bloom_filter_from_graph(struct commit_graph *g,
uint32_t graph_pos);
-void fill_bloom_key(const char *data,
-+void bloom_key_fill(const char *data,
- size_t len,
- struct bloom_key *key,
+- size_t len,
+- struct bloom_key *key,
++void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings);
-void clear_bloom_key(struct bloom_key *key);
+void bloom_key_clear(struct bloom_key *key);
@@ line-log.c: static int bloom_filter_check(struct rev_info *rev,
while (!result && range) {
- fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
-+ bloom_key_fill(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
++ bloom_key_fill(&key, range->path, strlen(range->path),
++ rev->bloom_filter_settings);
if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
result = 1;
@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
- fill_bloom_key(path, len, &revs->bloom_keys[0],
-+ bloom_key_fill(path, len, &revs->bloom_keys[0],
++ bloom_key_fill(&revs->bloom_keys[0], path, len,
revs->bloom_filter_settings);
path_component_nr = 1;
@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
while (p > path) {
if (*p == '/')
- fill_bloom_key(path, p - path,
-+ bloom_key_fill(path, p - path,
- &revs->bloom_keys[path_component_nr++],
+- &revs->bloom_keys[path_component_nr++],
++ bloom_key_fill(&revs->bloom_keys[path_component_nr++],
++ path, p - path,
revs->bloom_filter_settings);
p--;
+ }
@@ revision.c: void release_revisions(struct rev_info *revs)
oidset_clear(&revs->missing_commits);
@@ t/helper/test-bloom.c: static struct bloom_filter_settings settings = DEFAULT_BL
struct bloom_key key;
- fill_bloom_key(data, strlen(data), &key, &settings);
-+ bloom_key_fill(data, strlen(data), &key, &settings);
++ bloom_key_fill(&key, data, strlen(data), &settings);
printf("Hashes:");
for (size_t i = 0; i < settings.num_hashes; i++)
printf("0x%08x|", key.hashes[i]);
3: c042907b92 ! 3: 60a3b16bbb bloom: replace struct bloom_key * with struct bloom_keyvec
@@ Metadata
## Commit message ##
bloom: replace struct bloom_key * with struct bloom_keyvec
- The revision traversal limited by pathspec has optimization when
- the pathspec has only one element. To support optimization for
- multiple pathspec items, we need to modify the data structures
- in struct rev_info.
+ Previously, we stored bloom keys in a flat array and marked a commit
+ as NOT TREESAME if any key reported "definitely not changed".
- struct rev_info uses bloom_keys and bloom_nr to store the bloom keys
- corresponding to a single pathspec item. To allow struct rev_info
- to store bloom keys for multiple pathspec items, a new data structure
- `struct bloom_keyvec` is introduced. Each `struct bloom_keyvec`
- corresponds to a single pathspec item.
+ To support multiple pathspec items, we now require that for each
+ pathspec item, there exists a bloom key reporting "definitely not
+ changed".
- In `struct rev_info`, replace bloom_keys and bloom_nr with bloom_keyvecs
- and bloom_keyvec_nr. This commit still optimize one pathspec item, thus
- bloom_keyvec_nr can only be 0 or 1.
+ This "for every" condition makes a flat array insufficient, so we
+ introduce a new structure to group keys by a single pathspec item.
+ `struct bloom_keyvec` is introduced to replace `struct bloom_key *`
+ and `bloom_key_nr`. And because we want to support multiple pathspec
+ items, we added a bloom_keyvec * and a bloom_keyvec_nr field to
+ `struct rev_info` to represent an array of bloom_keyvecs. This commit
+ still optimize only one pathspec item, thus bloom_keyvec_nr can only
+ be 0 or 1.
New bloom_keyvec_* functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
- contained in a bloom filter. bloom_keyvec_fill_key() is added to
- initialize a key in keyvec.
+ contained in a bloom filter.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
@@ bloom.c: void deinit_bloom_filters(void)
deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
}
-+struct bloom_keyvec *bloom_keyvec_new(size_t count)
++struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
++ const struct bloom_filter_settings *settings)
+{
+ struct bloom_keyvec *vec;
-+ size_t sz = sizeof(struct bloom_keyvec);
-+ sz += count * sizeof(struct bloom_key);
++ const char *p;
++ size_t sz;
++ size_t nr = 1;
++
++ p = path;
++ while (*p) {
++ /*
++ * At this point, the path is normalized to use Unix-style
++ * path separators. This is required due to how the
++ * changed-path Bloom filters store the paths.
++ */
++ if (*p == '/')
++ nr++;
++ p++;
++ }
++
++ sz = sizeof(struct bloom_keyvec);
++ sz += nr * sizeof(struct bloom_key);
+ vec = (struct bloom_keyvec *)xcalloc(1, sz);
-+ vec->count = count;
++ if (!vec)
++ return NULL;
++ vec->count = nr;
++
++ bloom_key_fill(&vec->key[0], path, len, settings);
++ nr = 1;
++ p = path + len - 1;
++ while (p > path) {
++ if (*p == '/') {
++ bloom_key_fill(&vec->key[nr++], path, p - path, settings);
++ }
++ p--;
++ }
++ assert(nr == vec->count);
+ return vec;
+}
+
@@ bloom.h: struct bloom_key {
int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
-@@ bloom.h: void bloom_key_fill(const char *data,
+@@ bloom.h: void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings);
void bloom_key_clear(struct bloom_key *key);
-+struct bloom_keyvec *bloom_keyvec_new(size_t count);
++/*
++ * bloom_keyvec_fill - Allocate and populate a bloom_keyvec with keys for the
++ * given path.
++ *
++ * This function splits the input path by '/' and generates a bloom key for each
++ * prefix, in reverse order of specificity. For example, given the input
++ * "a/b/c", it will generate bloom keys for:
++ * - "a/b/c"
++ * - "a/b"
++ * - "a"
++ *
++ * The resulting keys are stored in a newly allocated bloom_keyvec.
++ */
++struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
++ const struct bloom_filter_settings *settings);
+void bloom_keyvec_free(struct bloom_keyvec *vec);
-+
-+static inline void bloom_keyvec_fill_key(const char *data, size_t len,
-+ struct bloom_keyvec *vec, size_t nr,
-+ const struct bloom_filter_settings *settings)
-+{
-+ assert(nr < vec->count);
-+ bloom_key_fill(data, len, &vec->key[nr], settings);
-+}
+
void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
@@ bloom.h: int bloom_filter_contains(const struct bloom_filter *filter,
const struct bloom_key *key,
const struct bloom_filter_settings *settings);
++/*
++ * bloom_filter_contains_vec - Check if all keys in a key vector are in the
++ * Bloom filter.
++ *
++ * Returns 1 if **all** keys in the vector are present in the filter,
++ * 0 if **any** key is not present.
++ */
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+ const struct bloom_keyvec *v,
+ const struct bloom_filter_settings *settings);
@@ bloom.h: int bloom_filter_contains(const struct bloom_filter *filter,
## revision.c ##
@@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
+ return 0;
+ }
+
++static void release_revisions_bloom_keyvecs(struct rev_info *revs);
++
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
@@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
char *path_alloc = NULL;
const char *path, *p;
size_t len;
+- int path_component_nr = 1;
+
+ if (!revs->commits)
+ return;
@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
- p++;
- }
+ if (!revs->pruning.pathspec.nr)
+ return;
-- revs->bloom_keys_nr = path_component_nr;
-- ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+ revs->bloom_keyvecs_nr = 1;
+ CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-+ bloom_keyvec = bloom_keyvec_new(path_component_nr);
-+ revs->bloom_keyvecs[0] = bloom_keyvec;
+ pi = &revs->pruning.pathspec.items[0];
+
+ /* remove single trailing slash from path, if needed */
+@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
+ path = pi->match;
-- bloom_key_fill(path, len, &revs->bloom_keys[0],
+ len = strlen(path);
+- if (!len) {
+- revs->bloom_filter_settings = NULL;
+- free(path_alloc);
+- return;
+- }
+-
+- p = path;
+- while (*p) {
+- /*
+- * At this point, the path is normalized to use Unix-style
+- * path separators. This is required due to how the
+- * changed-path Bloom filters store the paths.
+- */
+- if (*p == '/')
+- path_component_nr++;
+- p++;
+- }
+-
+- revs->bloom_keys_nr = path_component_nr;
+- ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
++ if (!len)
++ goto fail;
+
+- bloom_key_fill(&revs->bloom_keys[0], path, len,
- revs->bloom_filter_settings);
-+ bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
-+ revs->bloom_filter_settings);
- path_component_nr = 1;
-
- p = path + len - 1;
- while (p > path) {
- if (*p == '/')
-- bloom_key_fill(path, p - path,
-- &revs->bloom_keys[path_component_nr++],
+- path_component_nr = 1;
+-
+- p = path + len - 1;
+- while (p > path) {
+- if (*p == '/')
+- bloom_key_fill(&revs->bloom_keys[path_component_nr++],
+- path, p - path,
- revs->bloom_filter_settings);
-+ bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
-+ path_component_nr++,
-+ revs->bloom_filter_settings);
- p--;
+- p--;
+- }
++ revs->bloom_keyvecs[0] =
++ bloom_keyvec_new(path, len, revs->bloom_filter_settings);
+
+ if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
+ atexit(trace2_bloom_filter_statistics_atexit);
+ bloom_filter_atexit_registered = 1;
}
-@@ revision.c: static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
++ return;
++
++fail:
++ revs->bloom_filter_settings = NULL;
+ free(path_alloc);
++ release_revisions_bloom_keyvecs(revs);
+ }
+
+ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
struct commit *commit)
{
struct bloom_filter *filter;
@@ revision.c: static int rev_same_tree_as_empty(struct rev_info *revs, struct comm
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
if (!bloom_ret)
return 1;
+@@ revision.c: static void release_revisions_mailmap(struct string_list *mailmap)
+
+ static void release_revisions_topo_walk_info(struct topo_walk_info *info);
+
++static void release_revisions_bloom_keyvecs(struct rev_info *revs)
++{
++ for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
++ bloom_keyvec_free(revs->bloom_keyvecs[nr]);
++ FREE_AND_NULL(revs->bloom_keyvecs);
++ revs->bloom_keyvecs_nr = 0;
++}
++
+ static void free_void_commit_list(void *list)
+ {
+ free_commit_list(list);
@@ revision.c: void release_revisions(struct rev_info *revs)
+ clear_decoration(&revs->treesame, free);
line_log_free(revs);
oidset_clear(&revs->missing_commits);
-
+-
- for (int i = 0; i < revs->bloom_keys_nr; i++)
- bloom_key_clear(&revs->bloom_keys[i]);
- FREE_AND_NULL(revs->bloom_keys);
- revs->bloom_keys_nr = 0;
-+ for (size_t i = 0; i < revs->bloom_keyvecs_nr; i++)
-+ bloom_keyvec_free(revs->bloom_keyvecs[i]);
-+ FREE_AND_NULL(revs->bloom_keyvecs);
-+ revs->bloom_keyvecs_nr = 0;
++ release_revisions_bloom_keyvecs(revs);
}
static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
4: de4d554b55 ! 4: 198a7da17c bloom: optimize multiple pathspec items in revision traversal
@@ Commit message
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
- Add code to initialize and check each pathspec item's bloom_keyvec.
-
- Add new function release_revisions_bloom_keyvecs() to free all bloom
- keyvec owned by rev_info.
+ Add for loops to initialize and check each pathspec item's bloom_keyvec
+ when optimization is possible.
Add new test cases in t/t4216-log-bloom.sh to ensure
- consistent results between the optimization for multiple pathspec
@@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
return 0;
}
-
-+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
-+
- static void prepare_to_use_bloom_filter(struct rev_info *revs)
- {
- struct pathspec_item *pi;
-@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
- char *path_alloc = NULL;
- const char *path, *p;
- size_t len;
-- int path_component_nr = 1;
-+ int path_component_nr;
-
- if (!revs->commits)
- return;
@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
+- revs->bloom_keyvecs_nr = 1;
+- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
- pi = &revs->pruning.pathspec.items[0];
--
++ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
++ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
++ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
++ pi = &revs->pruning.pathspec.items[i];
+
- /* remove single trailing slash from path, if needed */
- if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
- path_alloc = xmemdupz(pi->match, pi->len - 1);
- path = path_alloc;
- } else
- path = pi->match;
--
-- len = strlen(path);
-- if (!len) {
-- revs->bloom_filter_settings = NULL;
-- free(path_alloc);
-- return;
-- }
--
-- p = path;
-- while (*p) {
-- /*
-- * At this point, the path is normalized to use Unix-style
-- * path separators. This is required due to how the
-- * changed-path Bloom filters store the paths.
-- */
-- if (*p == '/')
-- path_component_nr++;
-- p++;
-- }
--
-- revs->bloom_keyvecs_nr = 1;
-- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-- bloom_keyvec = bloom_keyvec_new(path_component_nr);
-- revs->bloom_keyvecs[0] = bloom_keyvec;
--
-- bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
-- revs->bloom_filter_settings);
-- path_component_nr = 1;
-+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
-+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
-+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
-+ pi = &revs->pruning.pathspec.items[i];
-+ path_component_nr = 1;
-+
+ /* remove single trailing slash from path, if needed */
+ if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+ path_alloc = xmemdupz(pi->match, pi->len - 1);
+ path = path_alloc;
+ } else
+ path = pi->match;
-+
+
+- len = strlen(path);
+- if (!len)
+- goto fail;
+ len = strlen(path);
+ if (!len)
+ goto fail;
-+
-+ p = path;
-+ while (*p) {
-+ /*
-+ * At this point, the path is normalized to use
-+ * Unix-style path separators. This is required due to
-+ * how the changed-path Bloom filters store the paths.
-+ */
-+ if (*p == '/')
-+ path_component_nr++;
-+ p++;
-+ }
-- p = path + len - 1;
-- while (p > path) {
-- if (*p == '/')
-- bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
-- path_component_nr++,
-- revs->bloom_filter_settings);
-- p--;
-+ bloom_keyvec = bloom_keyvec_new(path_component_nr);
-+ revs->bloom_keyvecs[i] = bloom_keyvec;
-+
-+ bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
-+ revs->bloom_filter_settings);
-+ path_component_nr = 1;
-+
-+ p = path + len - 1;
-+ while (p > path) {
-+ if (*p == '/')
-+ bloom_keyvec_fill_key(path, p - path,
-+ bloom_keyvec,
-+ path_component_nr++,
-+ revs->bloom_filter_settings);
-+ p--;
-+ }
+- revs->bloom_keyvecs[0] =
+- bloom_keyvec_new(path, len, revs->bloom_filter_settings);
++ revs->bloom_keyvecs[i] =
++ bloom_keyvec_new(path, len, revs->bloom_filter_settings);
+ FREE_AND_NULL(path_alloc);
- }
++ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
-@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
- bloom_filter_atexit_registered = 1;
- }
-
-+ return;
-+
-+fail:
-+ revs->bloom_filter_settings = NULL;
- free(path_alloc);
-+ release_revisions_bloom_keyvecs(revs);
- }
-
- static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
-@@ revision.c: static void release_revisions_mailmap(struct string_list *mailmap)
-
- static void release_revisions_topo_walk_info(struct topo_walk_info *info);
-
-+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
-+{
-+ for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
-+ bloom_keyvec_free(revs->bloom_keyvecs[nr]);
-+ FREE_AND_NULL(revs->bloom_keyvecs);
-+ revs->bloom_keyvecs_nr = 0;
-+}
-+
- static void free_void_commit_list(void *list)
- {
- free_commit_list(list);
-@@ revision.c: void release_revisions(struct rev_info *revs)
- clear_decoration(&revs->treesame, free);
- line_log_free(revs);
- oidset_clear(&revs->missing_commits);
--
-- for (size_t i = 0; i < revs->bloom_keyvecs_nr; i++)
-- bloom_keyvec_free(revs->bloom_keyvecs[i]);
-- FREE_AND_NULL(revs->bloom_keyvecs);
-- revs->bloom_keyvecs_nr = 0;
-+ release_revisions_bloom_keyvecs(revs);
- }
-
- static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
+ atexit(trace2_bloom_filter_statistics_atexit);
## t/t4216-log-bloom.sh ##
@@ t/t4216-log-bloom.sh: sane_unset GIT_TRACE2_CONFIG_PARAMS
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply [flat|nested] 72+ messages in thread* [PATCH v5 1/4] bloom: add test helper to return murmur3 hash
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
@ 2025-07-10 8:48 ` Lidong Yan
2025-07-10 8:48 ` [PATCH v5 2/4] bloom: rename function operates on bloom_key Lidong Yan
` (4 subsequent siblings)
5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 8:48 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon
In bloom.h, murmur3_seeded_v2() is exported for the use of test murmur3
hash. To clarify that murmur3_seeded_v2() is exported solely for testing
purposes, a new helper function test_murmur3_seeded() was added instead
of exporting murmur3_seeded_v2() directly.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
bloom.c | 13 ++++++++++++-
bloom.h | 12 +++---------
t/helper/test-bloom.c | 4 ++--
3 files changed, 17 insertions(+), 12 deletions(-)
diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..946c5e8c98 100644
--- a/bloom.c
+++ b/bloom.c
@@ -107,7 +107,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
* Not considered to be cryptographically secure.
* Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
*/
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+static uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
{
const uint32_t c1 = 0xcc9e2d51;
const uint32_t c2 = 0x1b873593;
@@ -540,3 +540,14 @@ int bloom_filter_contains(const struct bloom_filter *filter,
return 1;
}
+
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+ int version)
+{
+ assert(version == 1 || version == 2);
+
+ if (version == 2)
+ return murmur3_seeded_v2(seed, data, len);
+ else
+ return murmur3_seeded_v1(seed, data, len);
+}
diff --git a/bloom.h b/bloom.h
index 6e46489a20..a9ded1822f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,15 +78,6 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
-/*
- * Calculate the murmur3 32-bit hash value for the given data
- * using the given seed.
- * Produces a uniformly distributed hash value.
- * Not considered to be cryptographically secure.
- * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
- */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
-
void fill_bloom_key(const char *data,
size_t len,
struct bloom_key *key,
@@ -137,4 +128,7 @@ int bloom_filter_contains(const struct bloom_filter *filter,
const struct bloom_key *key,
const struct bloom_filter_settings *settings);
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+ int version);
+
#endif
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9aa2c5a592..6a24b6e0a6 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -61,13 +61,13 @@ int cmd__bloom(int argc, const char **argv)
uint32_t hashed;
if (argc < 3)
usage(bloom_usage);
- hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+ hashed = test_bloom_murmur3_seeded(0, argv[2], strlen(argv[2]), 2);
printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
}
if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
uint32_t hashed;
- hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
+ hashed = test_bloom_murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7, 2);
printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
}
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v5 2/4] bloom: rename function operates on bloom_key
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
2025-07-10 8:48 ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
@ 2025-07-10 8:48 ` Lidong Yan
2025-07-10 8:48 ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
` (3 subsequent siblings)
5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 8:48 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon
git code style requires that functions operating on a struct S
should be named in the form S_verb. However, the functions operating
on struct bloom_key do not follow this convention. Therefore,
fill_bloom_key() and clear_bloom_key() are renamed to bloom_key_fill()
and bloom_key_clear(), respectively.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
blame.c | 2 +-
bloom.c | 10 ++++------
bloom.h | 6 ++----
line-log.c | 5 +++--
revision.c | 8 ++++----
t/helper/test-bloom.c | 4 ++--
6 files changed, 16 insertions(+), 19 deletions(-)
diff --git a/blame.c b/blame.c
index 57daa45e89..811c6d8f9f 100644
--- a/blame.c
+++ b/blame.c
@@ -1310,7 +1310,7 @@ static void add_bloom_key(struct blame_bloom_data *bd,
}
bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
- fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+ bloom_key_fill(bd->keys[bd->nr], path, strlen(path), bd->settings);
bd->nr++;
}
diff --git a/bloom.c b/bloom.c
index 946c5e8c98..5523d198c8 100644
--- a/bloom.c
+++ b/bloom.c
@@ -221,9 +221,7 @@ static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
return seed;
}
-void fill_bloom_key(const char *data,
- size_t len,
- struct bloom_key *key,
+void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings)
{
int i;
@@ -243,7 +241,7 @@ void fill_bloom_key(const char *data,
key->hashes[i] = hash0 + i * hash1;
}
-void clear_bloom_key(struct bloom_key *key)
+void bloom_key_clear(struct bloom_key *key)
{
FREE_AND_NULL(key->hashes);
}
@@ -500,9 +498,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
hashmap_for_each_entry(&pathmap, &iter, e, entry) {
struct bloom_key key;
- fill_bloom_key(e->path, strlen(e->path), &key, settings);
+ bloom_key_fill(&key, e->path, strlen(e->path), settings);
add_key_to_filter(&key, filter, settings);
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
}
cleanup:
diff --git a/bloom.h b/bloom.h
index a9ded1822f..603bc1f90f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,11 +78,9 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
-void fill_bloom_key(const char *data,
- size_t len,
- struct bloom_key *key,
+void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings);
-void clear_bloom_key(struct bloom_key *key);
+void bloom_key_clear(struct bloom_key *key);
void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
diff --git a/line-log.c b/line-log.c
index 628e3fe3ae..07f2154e84 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1172,12 +1172,13 @@ static int bloom_filter_check(struct rev_info *rev,
return 0;
while (!result && range) {
- fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
+ bloom_key_fill(&key, range->path, strlen(range->path),
+ rev->bloom_filter_settings);
if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
result = 1;
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
range = range->next;
}
diff --git a/revision.c b/revision.c
index afee111196..a7eadff0a5 100644
--- a/revision.c
+++ b/revision.c
@@ -739,15 +739,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
revs->bloom_keys_nr = path_component_nr;
ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
- fill_bloom_key(path, len, &revs->bloom_keys[0],
+ bloom_key_fill(&revs->bloom_keys[0], path, len,
revs->bloom_filter_settings);
path_component_nr = 1;
p = path + len - 1;
while (p > path) {
if (*p == '/')
- fill_bloom_key(path, p - path,
- &revs->bloom_keys[path_component_nr++],
+ bloom_key_fill(&revs->bloom_keys[path_component_nr++],
+ path, p - path,
revs->bloom_filter_settings);
p--;
}
@@ -3231,7 +3231,7 @@ void release_revisions(struct rev_info *revs)
oidset_clear(&revs->missing_commits);
for (int i = 0; i < revs->bloom_keys_nr; i++)
- clear_bloom_key(&revs->bloom_keys[i]);
+ bloom_key_clear(&revs->bloom_keys[i]);
FREE_AND_NULL(revs->bloom_keys);
revs->bloom_keys_nr = 0;
}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6a24b6e0a6..3283544bd3 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -12,13 +12,13 @@ static struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
struct bloom_key key;
- fill_bloom_key(data, strlen(data), &key, &settings);
+ bloom_key_fill(&key, data, strlen(data), &settings);
printf("Hashes:");
for (size_t i = 0; i < settings.num_hashes; i++)
printf("0x%08x|", key.hashes[i]);
printf("\n");
add_key_to_filter(&key, filter, &settings);
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
}
static void print_bloom_filter(struct bloom_filter *filter) {
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
2025-07-10 8:48 ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-10 8:48 ` [PATCH v5 2/4] bloom: rename function operates on bloom_key Lidong Yan
@ 2025-07-10 8:48 ` Lidong Yan
2025-07-10 16:17 ` Junio C Hamano
2025-07-10 8:48 ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
` (2 subsequent siblings)
5 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 8:48 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon
Previously, we stored bloom keys in a flat array and marked a commit
as NOT TREESAME if any key reported "definitely not changed".
To support multiple pathspec items, we now require that for each
pathspec item, there exists a bloom key reporting "definitely not
changed".
This "for every" condition makes a flat array insufficient, so we
introduce a new structure to group keys by a single pathspec item.
`struct bloom_keyvec` is introduced to replace `struct bloom_key *`
and `bloom_key_nr`. And because we want to support multiple pathspec
items, we added a bloom_keyvec * and a bloom_keyvec_nr field to
`struct rev_info` to represent an array of bloom_keyvecs. This commit
still optimize only one pathspec item, thus bloom_keyvec_nr can only
be 0 or 1.
New bloom_keyvec_* functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
contained in a bloom filter.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
bloom.c | 61 ++++++++++++++++++++++++++++++++++++++++++++
bloom.h | 38 +++++++++++++++++++++++++++
revision.c | 75 ++++++++++++++++++++++--------------------------------
revision.h | 6 ++---
4 files changed, 132 insertions(+), 48 deletions(-)
diff --git a/bloom.c b/bloom.c
index 5523d198c8..b86015f6d1 100644
--- a/bloom.c
+++ b/bloom.c
@@ -278,6 +278,55 @@ void deinit_bloom_filters(void)
deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
}
+struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
+ const struct bloom_filter_settings *settings)
+{
+ struct bloom_keyvec *vec;
+ const char *p;
+ size_t sz;
+ size_t nr = 1;
+
+ p = path;
+ while (*p) {
+ /*
+ * At this point, the path is normalized to use Unix-style
+ * path separators. This is required due to how the
+ * changed-path Bloom filters store the paths.
+ */
+ if (*p == '/')
+ nr++;
+ p++;
+ }
+
+ sz = sizeof(struct bloom_keyvec);
+ sz += nr * sizeof(struct bloom_key);
+ vec = (struct bloom_keyvec *)xcalloc(1, sz);
+ if (!vec)
+ return NULL;
+ vec->count = nr;
+
+ bloom_key_fill(&vec->key[0], path, len, settings);
+ nr = 1;
+ p = path + len - 1;
+ while (p > path) {
+ if (*p == '/') {
+ bloom_key_fill(&vec->key[nr++], path, p - path, settings);
+ }
+ p--;
+ }
+ assert(nr == vec->count);
+ return vec;
+}
+
+void bloom_keyvec_free(struct bloom_keyvec *vec)
+{
+ if (!vec)
+ return;
+ for (size_t nr = 0; nr < vec->count; nr++)
+ bloom_key_clear(&vec->key[nr]);
+ free(vec);
+}
+
static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
const struct hashmap_entry *eptr,
const struct hashmap_entry *entry_or_key,
@@ -539,6 +588,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
return 1;
}
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+ const struct bloom_keyvec *vec,
+ const struct bloom_filter_settings *settings)
+{
+ int ret = 1;
+
+ for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
+ ret = bloom_filter_contains(filter, &vec->key[nr], settings);
+
+ return ret;
+}
+
uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
int version)
{
diff --git a/bloom.h b/bloom.h
index 603bc1f90f..3602d32054 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,16 @@ struct bloom_key {
uint32_t *hashes;
};
+/*
+ * A bloom_keyvec is a vector of bloom_keys, which
+ * can be used to store multiple keys for a single
+ * pathspec item.
+ */
+struct bloom_keyvec {
+ size_t count;
+ struct bloom_key key[FLEX_ARRAY];
+};
+
int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
@@ -82,6 +92,23 @@ void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings);
void bloom_key_clear(struct bloom_key *key);
+/*
+ * bloom_keyvec_fill - Allocate and populate a bloom_keyvec with keys for the
+ * given path.
+ *
+ * This function splits the input path by '/' and generates a bloom key for each
+ * prefix, in reverse order of specificity. For example, given the input
+ * "a/b/c", it will generate bloom keys for:
+ * - "a/b/c"
+ * - "a/b"
+ * - "a"
+ *
+ * The resulting keys are stored in a newly allocated bloom_keyvec.
+ */
+struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
+ const struct bloom_filter_settings *settings);
+void bloom_keyvec_free(struct bloom_keyvec *vec);
+
void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
const struct bloom_filter_settings *settings);
@@ -126,6 +153,17 @@ int bloom_filter_contains(const struct bloom_filter *filter,
const struct bloom_key *key,
const struct bloom_filter_settings *settings);
+/*
+ * bloom_filter_contains_vec - Check if all keys in a key vector are in the
+ * Bloom filter.
+ *
+ * Returns 1 if **all** keys in the vector are present in the filter,
+ * 0 if **any** key is not present.
+ */
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+ const struct bloom_keyvec *v,
+ const struct bloom_filter_settings *settings);
+
uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
int version);
diff --git a/revision.c b/revision.c
index a7eadff0a5..22bcfab7f9 100644
--- a/revision.c
+++ b/revision.c
@@ -685,13 +685,15 @@ static int forbid_bloom_filters(struct pathspec *spec)
return 0;
}
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
+ struct bloom_keyvec *bloom_keyvec;
char *path_alloc = NULL;
const char *path, *p;
size_t len;
- int path_component_nr = 1;
if (!revs->commits)
return;
@@ -708,6 +710,8 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
+ revs->bloom_keyvecs_nr = 1;
+ CALLOC_ARRAY(revs->bloom_keyvecs, 1);
pi = &revs->pruning.pathspec.items[0];
/* remove single trailing slash from path, if needed */
@@ -718,53 +722,30 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
path = pi->match;
len = strlen(path);
- if (!len) {
- revs->bloom_filter_settings = NULL;
- free(path_alloc);
- return;
- }
-
- p = path;
- while (*p) {
- /*
- * At this point, the path is normalized to use Unix-style
- * path separators. This is required due to how the
- * changed-path Bloom filters store the paths.
- */
- if (*p == '/')
- path_component_nr++;
- p++;
- }
-
- revs->bloom_keys_nr = path_component_nr;
- ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+ if (!len)
+ goto fail;
- bloom_key_fill(&revs->bloom_keys[0], path, len,
- revs->bloom_filter_settings);
- path_component_nr = 1;
-
- p = path + len - 1;
- while (p > path) {
- if (*p == '/')
- bloom_key_fill(&revs->bloom_keys[path_component_nr++],
- path, p - path,
- revs->bloom_filter_settings);
- p--;
- }
+ revs->bloom_keyvecs[0] =
+ bloom_keyvec_new(path, len, revs->bloom_filter_settings);
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
bloom_filter_atexit_registered = 1;
}
+ return;
+
+fail:
+ revs->bloom_filter_settings = NULL;
free(path_alloc);
+ release_revisions_bloom_keyvecs(revs);
}
static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
struct commit *commit)
{
struct bloom_filter *filter;
- int result = 1, j;
+ int result = 0;
if (!revs->repo->objects->commit_graph)
return -1;
@@ -779,10 +760,10 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
return -1;
}
- for (j = 0; result && j < revs->bloom_keys_nr; j++) {
- result = bloom_filter_contains(filter,
- &revs->bloom_keys[j],
- revs->bloom_filter_settings);
+ for (size_t nr = 0; !result && nr < revs->bloom_keyvecs_nr; nr++) {
+ result = bloom_filter_contains_vec(filter,
+ revs->bloom_keyvecs[nr],
+ revs->bloom_filter_settings);
}
if (result)
@@ -823,7 +804,7 @@ static int rev_compare_tree(struct rev_info *revs,
return REV_TREE_SAME;
}
- if (revs->bloom_keys_nr && !nth_parent) {
+ if (revs->bloom_keyvecs_nr && !nth_parent) {
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
if (bloom_ret == 0)
@@ -850,7 +831,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
if (!t1)
return 0;
- if (!nth_parent && revs->bloom_keys_nr) {
+ if (!nth_parent && revs->bloom_keyvecs_nr) {
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
if (!bloom_ret)
return 1;
@@ -3201,6 +3182,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
static void release_revisions_topo_walk_info(struct topo_walk_info *info);
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+ for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
+ bloom_keyvec_free(revs->bloom_keyvecs[nr]);
+ FREE_AND_NULL(revs->bloom_keyvecs);
+ revs->bloom_keyvecs_nr = 0;
+}
+
static void free_void_commit_list(void *list)
{
free_commit_list(list);
@@ -3229,11 +3218,7 @@ void release_revisions(struct rev_info *revs)
clear_decoration(&revs->treesame, free);
line_log_free(revs);
oidset_clear(&revs->missing_commits);
-
- for (int i = 0; i < revs->bloom_keys_nr; i++)
- bloom_key_clear(&revs->bloom_keys[i]);
- FREE_AND_NULL(revs->bloom_keys);
- revs->bloom_keys_nr = 0;
+ release_revisions_bloom_keyvecs(revs);
}
static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..ac843f58d0 100644
--- a/revision.h
+++ b/revision.h
@@ -62,7 +62,7 @@ struct repository;
struct rev_info;
struct string_list;
struct saved_parents;
-struct bloom_key;
+struct bloom_keyvec;
struct bloom_filter_settings;
struct option;
struct parse_opt_ctx_t;
@@ -360,8 +360,8 @@ struct rev_info {
/* Commit graph bloom filter fields */
/* The bloom filter key(s) for the pathspec */
- struct bloom_key *bloom_keys;
- int bloom_keys_nr;
+ struct bloom_keyvec **bloom_keyvecs;
+ int bloom_keyvecs_nr;
/*
* The bloom filter settings used to generate the key.
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply related [flat|nested] 72+ messages in thread* Re: [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
2025-07-10 8:48 ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-10 16:17 ` Junio C Hamano
2025-07-11 12:46 ` Lidong Yan
0 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-07-10 16:17 UTC (permalink / raw)
To: Lidong Yan; +Cc: 502024330056, git, toon
Lidong Yan <yldhome2d2@gmail.com> writes:
> static void prepare_to_use_bloom_filter(struct rev_info *revs)
> {
> struct pathspec_item *pi;
> + struct bloom_keyvec *bloom_keyvec;
This new variable is no longer used, since the code to create a new
keyvec is in a helper function and its return value is directly
stored in the array of keyvecs.
> char *path_alloc = NULL;
> const char *path, *p;
And the "p" variable no longer is used, because the logic it used to
create a new keyvec is moved elsewhere.
^ permalink raw reply [flat|nested] 72+ messages in thread* Re: [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
2025-07-10 16:17 ` Junio C Hamano
@ 2025-07-11 12:46 ` Lidong Yan
2025-07-11 15:06 ` Junio C Hamano
0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-11 12:46 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, toon
Junio C Hamano <gitster@pobox.com> write:
>
> Lidong Yan <yldhome2d2@gmail.com> writes:
>
>> static void prepare_to_use_bloom_filter(struct rev_info *revs)
>> {
>> struct pathspec_item *pi;
>> + struct bloom_keyvec *bloom_keyvec;
>
> This new variable is no longer used, since the code to create a new
> keyvec is in a helper function and its return value is directly
> stored in the array of keyvecs.
>
>> char *path_alloc = NULL;
>> const char *path, *p;
>
> And the "p" variable no longer is used, because the logic it used to
> create a new keyvec is moved elsewhere.
Will fix in v6, Thanks,
Lidong
^ permalink raw reply [flat|nested] 72+ messages in thread* Re: [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec
2025-07-11 12:46 ` Lidong Yan
@ 2025-07-11 15:06 ` Junio C Hamano
0 siblings, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-11 15:06 UTC (permalink / raw)
To: Lidong Yan; +Cc: git, toon
Lidong Yan <502024330056@smail.nju.edu.cn> writes:
> Junio C Hamano <gitster@pobox.com> write:
>>
>> Lidong Yan <yldhome2d2@gmail.com> writes:
>>
>>> static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>> {
>>> struct pathspec_item *pi;
>>> + struct bloom_keyvec *bloom_keyvec;
>>
>> This new variable is no longer used, since the code to create a new
>> keyvec is in a helper function and its return value is directly
>> stored in the array of keyvecs.
>>
>>> char *path_alloc = NULL;
>>> const char *path, *p;
>>
>> And the "p" variable no longer is used, because the logic it used to
>> create a new keyvec is moved elsewhere.
>
> Will fix in v6, Thanks,
FWIW, what I queued have these two already removed from v5.
Thanks.
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
` (2 preceding siblings ...)
2025-07-10 8:48 ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-10 8:48 ` Lidong Yan
2025-07-10 13:51 ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
2025-07-10 13:55 ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
2025-07-10 13:49 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
5 siblings, 2 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 8:48 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.
Add new test cases in t/t4216-log-bloom.sh to ensure
- consistent results between the optimization for multiple pathspec
items using bloom filter and the case without bloom filter
optimization.
- does not use bloom filter if any pathspec item is not literal.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
revision.c | 38 ++++++++++++++++++++------------------
t/t4216-log-bloom.sh | 23 ++++++++++++++---------
2 files changed, 34 insertions(+), 27 deletions(-)
diff --git a/revision.c b/revision.c
index 22bcfab7f9..f25a61bb6c 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
{
if (spec->has_wildcard)
return 1;
- if (spec->nr > 1)
- return 1;
if (spec->magic & ~PATHSPEC_LITERAL)
return 1;
- if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
- return 1;
+ for (size_t nr = 0; nr < spec->nr; nr++)
+ if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+ return 1;
return 0;
}
@@ -710,23 +709,26 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
- pi = &revs->pruning.pathspec.items[0];
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+ pi = &revs->pruning.pathspec.items[i];
- /* remove single trailing slash from path, if needed */
- if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
- path_alloc = xmemdupz(pi->match, pi->len - 1);
- path = path_alloc;
- } else
- path = pi->match;
+ /* remove single trailing slash from path, if needed */
+ if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+ path_alloc = xmemdupz(pi->match, pi->len - 1);
+ path = path_alloc;
+ } else
+ path = pi->match;
- len = strlen(path);
- if (!len)
- goto fail;
+ len = strlen(path);
+ if (!len)
+ goto fail;
- revs->bloom_keyvecs[0] =
- bloom_keyvec_new(path, len, revs->bloom_filter_settings);
+ revs->bloom_keyvecs[i] =
+ bloom_keyvec_new(path, len, revs->bloom_filter_settings);
+ FREE_AND_NULL(path_alloc);
+ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
setup () {
rm -f "$TRASH_DIRECTORY/trace.perf" &&
- git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
- GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+ eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+ eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+ git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
}
test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
test_bloom_filters_not_used "--walk-reflogs -- A"
'
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
- test_bloom_filters_not_used "-- file4 A/file1"
-'
-
test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
test_bloom_filters_not_used "-- ."
'
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
test_bloom_filters_used "-- *renamed"
'
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
- test_bloom_filters_not_used "-- *" &&
- test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+ test_bloom_filters_used "-- file4 A/file1" &&
+ test_bloom_filters_used "-- *" &&
+ test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+ test_bloom_filters_not_used "-- file\*" &&
+ test_bloom_filters_not_used "-- A/\* file4" &&
+ test_bloom_filters_not_used "-- file4 A/\*" &&
+ test_bloom_filters_not_used "-- * A/\*"
'
test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
--
2.50.0.107.g33b6ec8c79
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key
2025-07-10 8:48 ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
@ 2025-07-10 13:51 ` Derrick Stolee
2025-07-10 15:42 ` Lidong Yan
2025-07-10 13:55 ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
1 sibling, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-10 13:51 UTC (permalink / raw)
To: Lidong Yan; +Cc: 502024330056, git, gitster, toon
On 7/10/2025 4:48 AM, Lidong Yan wrote:
> @@ -710,23 +709,26 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> if (!revs->pruning.pathspec.nr)
> return;
>
> - revs->bloom_keyvecs_nr = 1;
> - CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> - pi = &revs->pruning.pathspec.items[0];
> + revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> + CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> + for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> + pi = &revs->pruning.pathspec.items[i];
>
> - /* remove single trailing slash from path, if needed */
> - if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> - path_alloc = xmemdupz(pi->match, pi->len - 1);
> - path = path_alloc;
> - } else
> - path = pi->match;
> + /* remove single trailing slash from path, if needed */
> + if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> + path_alloc = xmemdupz(pi->match, pi->len - 1);
> + path = path_alloc;
> + } else
> + path = pi->match;
>
> - len = strlen(path);
> - if (!len)
> - goto fail;
> + len = strlen(path);
> + if (!len)
> + goto fail;
>
> - revs->bloom_keyvecs[0] =
> - bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> + revs->bloom_keyvecs[i] =
> + bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> + FREE_AND_NULL(path_alloc);
> + }
This diff is still bigger than I was hoping, so I'm sending a couple
of patches that simplify this code movement. Feel free to ignore
them as being too nit-picky.
--- >8 ---
From 69fa36dc615e140ae842b536f7da792beaebb272 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <stolee@gmail.com>
Date: Thu, 10 Jul 2025 08:06:29 -0400
Subject: [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key
When preparing to use bloom filters in a revision walk, Git populates a
boom_keyvec with an array of bloom keys for the components of a path.
Before we create the ability to map multiple pathspecs to multiple
bloom_keyvecs, extract the conversion from a pathspec to a bloom_keyvec
into its own helper method. This simplifies the state that persists in
prepare_to_use_bloom_filter() as well as makes the next change much
simpler.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
revision.c | 50 +++++++++++++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 19 deletions(-)
diff --git a/revision.c b/revision.c
index 22bcfab7f93..4c09b594c55 100644
--- a/revision.c
+++ b/revision.c
@@ -687,14 +687,37 @@ static int forbid_bloom_filters(struct pathspec *spec)
static void release_revisions_bloom_keyvecs(struct rev_info *revs);
-static void prepare_to_use_bloom_filter(struct rev_info *revs)
+static int convert_pathspec_to_filter(const struct pathspec_item *pi,
+ struct bloom_keyvec **bloom_keyvec,
+ const struct bloom_filter_settings *settings)
{
- struct pathspec_item *pi;
- struct bloom_keyvec *bloom_keyvec;
- char *path_alloc = NULL;
- const char *path, *p;
size_t len;
+ const char *path;
+ char *path_alloc = NULL;
+ int res = 0;
+
+ /* remove single trailing slash from path, if needed */
+ if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+ path_alloc = xmemdupz(pi->match, pi->len - 1);
+ path = path_alloc;
+ } else
+ path = pi->match;
+
+ len = strlen(path);
+ if (!len) {
+ res = -1;
+ goto cleanup;
+ }
+
+ *bloom_keyvec = bloom_keyvec_new(path, len, settings);
+cleanup:
+ FREE_AND_NULL(path_alloc);
+ return res;
+}
+
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
if (!revs->commits)
return;
@@ -712,22 +735,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
revs->bloom_keyvecs_nr = 1;
CALLOC_ARRAY(revs->bloom_keyvecs, 1);
- pi = &revs->pruning.pathspec.items[0];
- /* remove single trailing slash from path, if needed */
- if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
- path_alloc = xmemdupz(pi->match, pi->len - 1);
- path = path_alloc;
- } else
- path = pi->match;
-
- len = strlen(path);
- if (!len)
+ if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
+ &revs->bloom_keyvecs[0],
+ revs->bloom_filter_settings))
goto fail;
- revs->bloom_keyvecs[0] =
- bloom_keyvec_new(path, len, revs->bloom_filter_settings);
-
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
bloom_filter_atexit_registered = 1;
@@ -737,7 +750,6 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
fail:
revs->bloom_filter_settings = NULL;
- free(path_alloc);
release_revisions_bloom_keyvecs(revs);
}
--
2.47.2.vfs.0.2
^ permalink raw reply related [flat|nested] 72+ messages in thread* Re: [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key
2025-07-10 13:51 ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
@ 2025-07-10 15:42 ` Lidong Yan
0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 15:42 UTC (permalink / raw)
To: Derrick Stolee; +Cc: git, gitster, toon
Derrick Stolee <stolee@gmail.com> wrote:
>
> This diff is still bigger than I was hoping, so I'm sending a couple
> of patches that simplify this code movement. Feel free to ignore
> them as being too nit-picky.
>
> --- >8 ---
>
> From 69fa36dc615e140ae842b536f7da792beaebb272 Mon Sep 17 00:00:00 2001
> From: Derrick Stolee <stolee@gmail.com>
> Date: Thu, 10 Jul 2025 08:06:29 -0400
> Subject: [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key
>
> When preparing to use bloom filters in a revision walk, Git populates a
> boom_keyvec with an array of bloom keys for the components of a path.
> Before we create the ability to map multiple pathspecs to multiple
> bloom_keyvecs, extract the conversion from a pathspec to a bloom_keyvec
> into its own helper method. This simplifies the state that persists in
> prepare_to_use_bloom_filter() as well as makes the next change much
> simpler.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> revision.c | 50 +++++++++++++++++++++++++++++++-------------------
> 1 file changed, 31 insertions(+), 19 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index 22bcfab7f93..4c09b594c55 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -687,14 +687,37 @@ static int forbid_bloom_filters(struct pathspec *spec)
>
> static void release_revisions_bloom_keyvecs(struct rev_info *revs);
>
> -static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +static int convert_pathspec_to_filter(const struct pathspec_item *pi,
> + struct bloom_keyvec **bloom_keyvec,
> + const struct bloom_filter_settings *settings)
> {
> - struct pathspec_item *pi;
> - struct bloom_keyvec *bloom_keyvec;
> - char *path_alloc = NULL;
> - const char *path, *p;
> size_t len;
> + const char *path;
> + char *path_alloc = NULL;
> + int res = 0;
> +
> + /* remove single trailing slash from path, if needed */
> + if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> + path_alloc = xmemdupz(pi->match, pi->len - 1);
> + path = path_alloc;
> + } else
> + path = pi->match;
> +
> + len = strlen(path);
> + if (!len) {
> + res = -1;
> + goto cleanup;
> + }
> +
> + *bloom_keyvec = bloom_keyvec_new(path, len, settings);
>
> +cleanup:
> + FREE_AND_NULL(path_alloc);
I think we don’t need to NULL path_alloc here, but it doesn’t hurt.
> + return res;
> +}
> +
> +static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +{
> if (!revs->commits)
> return;
>
> @@ -712,22 +735,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
> revs->bloom_keyvecs_nr = 1;
> CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> - pi = &revs->pruning.pathspec.items[0];
>
> - /* remove single trailing slash from path, if needed */
> - if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> - path_alloc = xmemdupz(pi->match, pi->len - 1);
> - path = path_alloc;
> - } else
> - path = pi->match;
> -
> - len = strlen(path);
> - if (!len)
> + if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
> + &revs->bloom_keyvecs[0],
> + revs->bloom_filter_settings))
> goto fail;
>
> - revs->bloom_keyvecs[0] =
> - bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> -
> if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> atexit(trace2_bloom_filter_statistics_atexit);
> bloom_filter_atexit_registered = 1;
> @@ -737,7 +750,6 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
> fail:
> revs->bloom_filter_settings = NULL;
> - free(path_alloc);
> release_revisions_bloom_keyvecs(revs);
> }
>
> --
> 2.47.2.vfs.0.2
This looks perfect to me. I would squash patch 3.5 to patch 3 in v6.
Thanks for your patch,
Lidong
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision
2025-07-10 8:48 ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-07-10 13:51 ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
@ 2025-07-10 13:55 ` Derrick Stolee
2025-07-10 15:49 ` Lidong Yan
1 sibling, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-10 13:55 UTC (permalink / raw)
To: Lidong Yan; +Cc: 502024330056, git, gitster, toon
On 7/10/2025 4:48 AM, Lidong Yan wrote:
> @@ -710,23 +709,26 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> if (!revs->pruning.pathspec.nr)
> return;
>
> - revs->bloom_keyvecs_nr = 1;
> - CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> - pi = &revs->pruning.pathspec.items[0];
> + revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> + CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> + for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> + pi = &revs->pruning.pathspec.items[i];
>
> - /* remove single trailing slash from path, if needed */
> - if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> - path_alloc = xmemdupz(pi->match, pi->len - 1);
> - path = path_alloc;
> - } else
> - path = pi->match;
> + /* remove single trailing slash from path, if needed */
> + if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> + path_alloc = xmemdupz(pi->match, pi->len - 1);
> + path = path_alloc;
> + } else
> + path = pi->match;
>
> - len = strlen(path);
> - if (!len)
> - goto fail;
> + len = strlen(path);
> + if (!len)
> + goto fail;
>
> - revs->bloom_keyvecs[0] =
> - bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> + revs->bloom_keyvecs[i] =
> + bloom_keyvec_new(path, len, revs->bloom_filter_settings);
> + FREE_AND_NULL(path_alloc);
> + }
Focus on the change to this diff when the patch below is applied on
top of the 3.5/4 I sent earlier, resulting in this diff:
@@ -733,13 +732,14 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-
- if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
- &revs->bloom_keyvecs[0],
- revs->bloom_filter_settings))
- goto fail;
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+ if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[i],
+ &revs->bloom_keyvecs[i],
+ revs->bloom_filter_settings))
+ goto fail;
+ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
Also, I've included the hyperfine performance output in the commit
message.
Thanks,
-Stolee
--- >8 ---
From fe255b1acfbe90fa8e4c335435ae18ee95e6243c Mon Sep 17 00:00:00 2001
From: Lidong Yan <502024330056@smail.nju.edu.cn>
Date: Thu, 10 Jul 2025 08:04:34 -0400
Subject: [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision
traversal
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.
Add new test cases in t/t4216-log-bloom.sh to ensure
- consistent results between the optimization for multiple pathspec
items using bloom filter and the case without bloom filter
optimization.
- does not use bloom filter if any pathspec item is not literal.
With these optimizations, we get some improvements for multi-pathspec runs
of 'git log'. First, in the Git repository we see these modest results:
Benchmark 1: old
Time (mean ± σ): 73.1 ms ± 2.9 ms
Range (min … max): 69.9 ms … 84.5 ms 42 runs
Benchmark 2: new
Time (mean ± σ): 55.1 ms ± 2.9 ms
Range (min … max): 51.1 ms … 61.2 ms 52 runs
Summary
'new' ran
1.33 ± 0.09 times faster than 'old'
But in a larger repo, such as the LLVM project repo below, we get even
better results:
Benchmark 1: old
Time (mean ± σ): 1.974 s ± 0.006 s
Range (min … max): 1.960 s … 1.983 s 10 runs
Benchmark 2: new
Time (mean ± σ): 262.9 ms ± 2.4 ms
Range (min … max): 257.7 ms … 266.2 ms 11 runs
Summary
'new' ran
7.51 ± 0.07 times faster than 'old'
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
revision.c | 22 +++++++++++-----------
t/t4216-log-bloom.sh | 23 ++++++++++++++---------
2 files changed, 25 insertions(+), 20 deletions(-)
diff --git a/revision.c b/revision.c
index 4c09b594c55..ca8c1dde8ca 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
{
if (spec->has_wildcard)
return 1;
- if (spec->nr > 1)
- return 1;
if (spec->magic & ~PATHSPEC_LITERAL)
return 1;
- if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
- return 1;
+ for (size_t nr = 0; nr < spec->nr; nr++)
+ if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+ return 1;
return 0;
}
@@ -733,13 +732,14 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-
- if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
- &revs->bloom_keyvecs[0],
- revs->bloom_filter_settings))
- goto fail;
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+ if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[i],
+ &revs->bloom_keyvecs[i],
+ revs->bloom_filter_settings))
+ goto fail;
+ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac1..639868ac562 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
setup () {
rm -f "$TRASH_DIRECTORY/trace.perf" &&
- git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
- GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+ eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+ eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+ git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
}
test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
test_bloom_filters_not_used "--walk-reflogs -- A"
'
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
- test_bloom_filters_not_used "-- file4 A/file1"
-'
-
test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
test_bloom_filters_not_used "-- ."
'
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
test_bloom_filters_used "-- *renamed"
'
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
- test_bloom_filters_not_used "-- *" &&
- test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+ test_bloom_filters_used "-- file4 A/file1" &&
+ test_bloom_filters_used "-- *" &&
+ test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+ test_bloom_filters_not_used "-- file\*" &&
+ test_bloom_filters_not_used "-- A/\* file4" &&
+ test_bloom_filters_not_used "-- file4 A/\*" &&
+ test_bloom_filters_not_used "-- * A/\*"
'
test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
--
2.47.2.vfs.0.2
^ permalink raw reply related [flat|nested] 72+ messages in thread* Re: [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision
2025-07-10 13:55 ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
@ 2025-07-10 15:49 ` Lidong Yan
0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-10 15:49 UTC (permalink / raw)
To: Derrick Stolee; +Cc: git, gitster, toon
Derrick Stolee <stolee@gmail.com> wrote:
>
> --- >8 ---
>
> From fe255b1acfbe90fa8e4c335435ae18ee95e6243c Mon Sep 17 00:00:00 2001
> From: Lidong Yan <502024330056@smail.nju.edu.cn>
> Date: Thu, 10 Jul 2025 08:04:34 -0400
> Subject: [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision
> traversal
>
> To enable optimize multiple pathspec items in revision traversal,
> return 0 if all pathspec item is literal in forbid_bloom_filters().
> Add for loops to initialize and check each pathspec item's bloom_keyvec
> when optimization is possible.
>
> Add new test cases in t/t4216-log-bloom.sh to ensure
> - consistent results between the optimization for multiple pathspec
> items using bloom filter and the case without bloom filter
> optimization.
> - does not use bloom filter if any pathspec item is not literal.
>
> With these optimizations, we get some improvements for multi-pathspec runs
> of 'git log'. First, in the Git repository we see these modest results:
>
> Benchmark 1: old
> Time (mean ± σ): 73.1 ms ± 2.9 ms
> Range (min … max): 69.9 ms … 84.5 ms 42 runs
>
> Benchmark 2: new
> Time (mean ± σ): 55.1 ms ± 2.9 ms
> Range (min … max): 51.1 ms … 61.2 ms 52 runs
>
> Summary
> 'new' ran
> 1.33 ± 0.09 times faster than 'old'
>
> But in a larger repo, such as the LLVM project repo below, we get even
> better results:
>
> Benchmark 1: old
> Time (mean ± σ): 1.974 s ± 0.006 s
> Range (min … max): 1.960 s … 1.983 s 10 runs
>
> Benchmark 2: new
> Time (mean ± σ): 262.9 ms ± 2.4 ms
> Range (min … max): 257.7 ms … 266.2 ms 11 runs
>
> Summary
> 'new' ran
> 7.51 ± 0.07 times faster than 'old'
Hyperfine do looks better. I will put this into commit message and
cover letter in v6.
>
> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> revision.c | 22 +++++++++++-----------
> t/t4216-log-bloom.sh | 23 ++++++++++++++---------
> 2 files changed, 25 insertions(+), 20 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index 4c09b594c55..ca8c1dde8ca 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
> {
> if (spec->has_wildcard)
> return 1;
> - if (spec->nr > 1)
> - return 1;
> if (spec->magic & ~PATHSPEC_LITERAL)
> return 1;
> - if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> - return 1;
> + for (size_t nr = 0; nr < spec->nr; nr++)
> + if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
> + return 1;
>
> return 0;
> }
> @@ -733,13 +732,14 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> if (!revs->pruning.pathspec.nr)
> return;
>
> - revs->bloom_keyvecs_nr = 1;
> - CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> -
> - if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[0],
> - &revs->bloom_keyvecs[0],
> - revs->bloom_filter_settings))
> - goto fail;
> + revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> + CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> + for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> + if (convert_pathspec_to_filter(&revs->pruning.pathspec.items[i],
> + &revs->bloom_keyvecs[i],
> + revs->bloom_filter_settings))
> + goto fail;
> + }
>
> if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> atexit(trace2_bloom_filter_statistics_atexit);
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 8910d53cac1..639868ac562 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
>
> setup () {
> rm -f "$TRASH_DIRECTORY/trace.perf" &&
> - git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
> - GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
> + eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
> + eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
> + git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
> }
>
> test_bloom_filters_used () {
> @@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
> test_bloom_filters_not_used "--walk-reflogs -- A"
> '
>
> -test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
> - test_bloom_filters_not_used "-- file4 A/file1"
> -'
> -
> test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
> test_bloom_filters_not_used "-- ."
> '
> @@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
> test_bloom_filters_used "-- *renamed"
> '
>
> -test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
> - test_bloom_filters_not_used "-- *" &&
> - test_bloom_filters_not_used "-- file*"
> +test_expect_success 'git log with multiple literal paths uses Bloom filter' '
> + test_bloom_filters_used "-- file4 A/file1" &&
> + test_bloom_filters_used "-- *" &&
> + test_bloom_filters_used "-- file*"
> +'
> +
> +test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
> + test_bloom_filters_not_used "-- file\*" &&
> + test_bloom_filters_not_used "-- A/\* file4" &&
> + test_bloom_filters_not_used "-- file4 A/\*" &&
> + test_bloom_filters_not_used "-- * A/\*"
> '
>
> test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
> --
> 2.47.2.vfs.0.2
>
Looks great, I will apply this above patch 3.
Thanks,
Lidong
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
` (3 preceding siblings ...)
2025-07-10 8:48 ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
@ 2025-07-10 13:49 ` Derrick Stolee
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
5 siblings, 0 replies; 72+ messages in thread
From: Derrick Stolee @ 2025-07-10 13:49 UTC (permalink / raw)
To: Lidong Yan; +Cc: 502024330056, git, gitster, toon
On 7/10/2025 4:48 AM, Lidong Yan wrote:
> The difference from v4 is:
> - for the bloom_key_* functions, we now pass struct bloom_key *
> as the first parameter.
> - bloom_keyvec_fill_key() and bloom_keyvec_new() have been merged
> into a single function, so that each key is filled during the
> creation of the bloom_keyvec.
>
> Below is a comparison of the time taken to run git log on Git and
> LLVM repositories before and after applying this patch.
>
> Setup commit-graph:
> $ cd ~/src/git && git commit-graph write --split --reachable --changed-paths
> $ cd ~/src/llvm && git commit-graph write --split --reachable --changed-paths
>
> Run git log on Git repository
> $ cd ~/src/git
> $ hash -p /usr/bin/git git # my system git binary is 2.43.0
> $ time git log -100 -- commit.c commit-graph.c >/dev/null
> real 0m0.055s
> user 0m0.040s
> sys 0m0.015s
> $ hash -p ~/bin/git/bin/git git
> $ time git log -100 -- commit.c commit-graph.c >/dev/null
> real 0m0.039s
> user 0m0.020s
> sys 0m0.020s
>
> Run git log in LLVM repository
> $ cd ~/src/llvm
> $ hash -p /usr/bin/git git # my system git binary is 2.43.0
> $ time git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h >/dev/null
> real 0m2.365s
> user 0m2.252s
> sys 0m0.113s
> $ hash -p ~/bin/git/bin/git git
> $ time git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h >/dev/null
> real 0m0.240s
> user 0m0.158s
> sys 0m0.064s
Thanks for these stats, though I'd recommend trying hyperfine [1]
which with this setup would let you run these experiments as the
following:
$ $ hyperfine --warmup=3 \
> -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- commit.c commit-graph.c' \
> -n 'new' '~/_git/git/git log -100 -- commit.c commit-graph.c'
Benchmark 1: old
Time (mean ± σ): 73.1 ms ± 2.9 ms [User: 48.8 ms, System: 23.9 ms]
Range (min … max): 69.9 ms … 84.5 ms 42 runs
Benchmark 2: new
Time (mean ± σ): 55.1 ms ± 2.9 ms [User: 30.5 ms, System: 24.4 ms]
Range (min … max): 51.1 ms … 61.2 ms 52 runs
Summary
'new' ran
1.33 ± 0.09 times faster than 'old'
And for LLVM:
$ hyperfine --warmup=3 \
> -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h' \
> -n 'new' '~/_git/git/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h'
Benchmark 1: old
Time (mean ± σ): 1.974 s ± 0.006 s [User: 1.877 s, System: 0.097 s]
Range (min … max): 1.960 s … 1.983 s 10 runs
Benchmark 2: new
Time (mean ± σ): 262.9 ms ± 2.4 ms [User: 214.2 ms, System: 48.4 ms]
Range (min … max): 257.7 ms … 266.2 ms 11 runs
Summary
'new' ran
7.51 ± 0.07 times faster than 'old'
[1] https://github.com/sharkdp/hyperfine
Finally, putting these performance numbers in the commit message will
make the results permanently findable in the repo history.
> 3: c042907b92 ! 3: 60a3b16bbb bloom: replace struct bloom_key * with struct bloom_keyvec
> -+struct bloom_keyvec *bloom_keyvec_new(size_t count)
> ++struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
> ++ const struct bloom_filter_settings *settings)
> +{
> + struct bloom_keyvec *vec;
> -+ size_t sz = sizeof(struct bloom_keyvec);
> -+ sz += count * sizeof(struct bloom_key);
> ++ const char *p;
> ++ size_t sz;
> ++ size_t nr = 1;
> ++
> ++ p = path;
> ++ while (*p) {
> ++ /*
> ++ * At this point, the path is normalized to use Unix-style
> ++ * path separators. This is required due to how the
> ++ * changed-path Bloom filters store the paths.
> ++ */
> ++ if (*p == '/')
> ++ nr++;
> ++ p++;
> ++ }
> ++
> ++ sz = sizeof(struct bloom_keyvec);
> ++ sz += nr * sizeof(struct bloom_key);
> + vec = (struct bloom_keyvec *)xcalloc(1, sz);
> -+ vec->count = count;
> ++ if (!vec)
> ++ return NULL;
> ++ vec->count = nr;
> ++
> ++ bloom_key_fill(&vec->key[0], path, len, settings);
> ++ nr = 1;
> ++ p = path + len - 1;
> ++ while (p > path) {
> ++ if (*p == '/') {
> ++ bloom_key_fill(&vec->key[nr++], path, p - path, settings);
> ++ }
> ++ p--;
> ++ }
> ++ assert(nr == vec->count);
> + return vec;
> +}
These additions to bloom_keyvec_new() certainly help simplify some
of the code movement I was talking about, but there is more that
can be done to simplify patch 4. I'll send a couple example patches
in reply to patch 4 with what I mean.
Overall, the patch series is correct. My complaints are stylistic
more than anything.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 72+ messages in thread* [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
` (4 preceding siblings ...)
2025-07-10 13:49 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
@ 2025-07-12 9:35 ` Lidong Yan
2025-07-12 9:35 ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
` (5 more replies)
5 siblings, 6 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12 9:35 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee
The revision traversal limited by pathspec has optimization when
the pathspec has only one element, it does not use any pathspec
magic (other than literal), and there is no wildcard. The absence
of optimization for multiple pathspec elements in revision traversal
cause an issue raised by Kai Koponen at
https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@mail.gmail.com/
While it is much harder to lift the latter two limitations,
supporting a pathspec with multiple elements is relatively easy.
Just make sure we hash each of them separately and ask the bloom
filter about them, and if we see none of them can possibly be
affected by the commit, we can skip without tree comparison.
The difference from v5 is:
- extract convert pathspec item to bloom_keyvec logic to
a separate function, which simplifies the prepare_to_use_bloom_filter()
function.
- fix few bugs in v5.
Below is a comparison of the time taken to run git log on Git and
LLVM repositories before and after applying this patch. These statistics
are given by Derrick Stolee at
https://lore.kernel.org/git/afb68948-218b-4b56-9faa-29578ef9c73c@gmail.com/
Setup commit-graph:
$ cd ~/src/git && git commit-graph write --split --reachable --changed-paths
$ cd ~/src/llvm && git commit-graph write --split --reachable --changed-paths
Running hyperfine [1] on Git repository:
$ hyperfine --warmup=3 \
> -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- commit.c commit-graph.c' \
> -n 'new' '~/_git/git/git log -100 -- commit.c commit-graph.c'
Benchmark 1: old
Time (mean ± σ): 73.1 ms ± 2.9 ms [User: 48.8 ms, System: 23.9 ms]
Range (min … max): 69.9 ms … 84.5 ms 42 runs
Benchmark 2: new
Time (mean ± σ): 55.1 ms ± 2.9 ms [User: 30.5 ms, System: 24.4 ms]
Range (min … max): 51.1 ms … 61.2 ms 52 runs
Summary
'new' ran
1.33 ± 0.09 times faster than 'old'
And for LLVM:
$ hyperfine --warmup=3 \
> -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h' \
> -n 'new' '~/_git/git/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h'
Benchmark 1: old
Time (mean ± σ): 1.974 s ± 0.006 s [User: 1.877 s, System: 0.097 s]
Range (min … max): 1.960 s … 1.983 s 10 runs
Benchmark 2: new
Time (mean ± σ): 262.9 ms ± 2.4 ms [User: 214.2 ms, System: 48.4 ms]
Range (min … max): 257.7 ms … 266.2 ms 11 runs
Summary
'new' ran
7.51 ± 0.07 times faster than 'old'
[1] https://github.com/sharkdp/hyperfine
Lidong Yan (5):
bloom: add test helper to return murmur3 hash
bloom: rename function operates on bloom_key
bloom: replace struct bloom_key * with struct bloom_keyvec
revision: make helper for pathspec to bloom keyvec
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's
bloom_keyvec when optimization is possible.
blame.c | 2 +-
bloom.c | 84 ++++++++++++++++++++++++++---
bloom.h | 54 ++++++++++++++-----
line-log.c | 5 +-
revision.c | 122 +++++++++++++++++++++---------------------
revision.h | 6 +--
t/helper/test-bloom.c | 8 +--
t/t4216-log-bloom.sh | 23 ++++----
8 files changed, 204 insertions(+), 100 deletions(-)
Range-diff against v5:
1: 4d8f60e5ff = 1: f5ab19063d bloom: add test helper to return murmur3 hash
2: acee03e397 = 2: 51a180daa6 bloom: rename function operates on bloom_key
3: d7690bd02c ! 3: e17249ab4b bloom: replace struct bloom_key * with struct bloom_keyvec
@@ bloom.h: void bloom_key_fill(struct bloom_key *key, const char *data, size_t len
void bloom_key_clear(struct bloom_key *key);
+/*
-+ * bloom_keyvec_fill - Allocate and populate a bloom_keyvec with keys for the
++ * bloom_keyvec_new - Allocate and populate a bloom_keyvec with keys for the
+ * given path.
+ *
+ * This function splits the input path by '/' and generates a bloom key for each
@@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
-+ struct bloom_keyvec *bloom_keyvec;
char *path_alloc = NULL;
- const char *path, *p;
+- const char *path, *p;
++ const char *path;
size_t len;
- int path_component_nr = 1;
-: ---------- > 4: b3c1f5bcd1 revision: make helper for pathspec to bloom keyvec
4: e577aa1bfd ! 5: 785bd43674 bloom: optimize multiple pathspec items in revision traversal
@@ Metadata
Author: Lidong Yan <yldhome2d2@gmail.com>
## Commit message ##
- bloom: optimize multiple pathspec items in revision traversal
-
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.
Add new test cases in t/t4216-log-bloom.sh to ensure
- - consistent results between the optimization for multiple pathspec
- items using bloom filter and the case without bloom filter
- optimization.
- - does not use bloom filter if any pathspec item is not literal.
+ - consistent results between the optimization for multiple pathspec
+ items using bloom filter and the case without bloom filter
+ optimization.
+ - does not use bloom filter if any pathspec item is not literal.
+
+ With these optimizations, we get some improvements for multi-pathspec runs
+ of 'git log'. First, in the Git repository we see these modest results:
+
+ Benchmark 1: old
+ Time (mean ± σ): 73.1 ms ± 2.9 ms
+ Range (min … max): 69.9 ms … 84.5 ms 42 runs
+
+ Benchmark 2: new
+ Time (mean ± σ): 55.1 ms ± 2.9 ms
+ Range (min … max): 51.1 ms … 61.2 ms 52 runs
+
+ Summary
+ 'new' ran
+ 1.33 ± 0.09 times faster than 'old'
+
+ But in a larger repo, such as the LLVM project repo below, we get even
+ better results:
+
+ Benchmark 1: old
+ Time (mean ± σ): 1.974 s ± 0.006 s
+ Range (min … max): 1.960 s … 1.983 s 10 runs
+
+ Benchmark 2: new
+ Time (mean ± σ): 262.9 ms ± 2.4 ms
+ Range (min … max): 257.7 ms … 266.2 ms 11 runs
+
+ Summary
+ 'new' ran
+ 7.51 ± 0.07 times faster than 'old'
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
+ Signed-off-by: Derrick Stolee <stolee@gmail.com>
## revision.c ##
@@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-- pi = &revs->pruning.pathspec.items[0];
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
-+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
-+ pi = &revs->pruning.pathspec.items[i];
-- /* remove single trailing slash from path, if needed */
-- if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-- path_alloc = xmemdupz(pi->match, pi->len - 1);
-- path = path_alloc;
-- } else
-- path = pi->match;
-+ /* remove single trailing slash from path, if needed */
-+ if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-+ path_alloc = xmemdupz(pi->match, pi->len - 1);
-+ path = path_alloc;
-+ } else
-+ path = pi->match;
-
-- len = strlen(path);
-- if (!len)
+- if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
+- &revs->pruning.pathspec.items[0],
+- revs->bloom_filter_settings))
- goto fail;
-+ len = strlen(path);
-+ if (!len)
++ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
++ if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
++ &revs->pruning.pathspec.items[i],
++ revs->bloom_filter_settings))
+ goto fail;
-
-- revs->bloom_keyvecs[0] =
-- bloom_keyvec_new(path, len, revs->bloom_filter_settings);
-+ revs->bloom_keyvecs[i] =
-+ bloom_keyvec_new(path, len, revs->bloom_filter_settings);
-+ FREE_AND_NULL(path_alloc);
+ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
--
2.39.5 (Apple Git-154)
^ permalink raw reply [flat|nested] 72+ messages in thread* [PATCH v6 1/5] bloom: add test helper to return murmur3 hash
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
@ 2025-07-12 9:35 ` Lidong Yan
2025-07-12 9:35 ` [PATCH v6 2/5] bloom: rename function operates on bloom_key Lidong Yan
` (4 subsequent siblings)
5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12 9:35 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee
In bloom.h, murmur3_seeded_v2() is exported for the use of test murmur3
hash. To clarify that murmur3_seeded_v2() is exported solely for testing
purposes, a new helper function test_murmur3_seeded() was added instead
of exporting murmur3_seeded_v2() directly.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
bloom.c | 13 ++++++++++++-
bloom.h | 12 +++---------
t/helper/test-bloom.c | 4 ++--
3 files changed, 17 insertions(+), 12 deletions(-)
diff --git a/bloom.c b/bloom.c
index 0c8d2cebf9..946c5e8c98 100644
--- a/bloom.c
+++ b/bloom.c
@@ -107,7 +107,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
* Not considered to be cryptographically secure.
* Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
*/
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
+static uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len)
{
const uint32_t c1 = 0xcc9e2d51;
const uint32_t c2 = 0x1b873593;
@@ -540,3 +540,14 @@ int bloom_filter_contains(const struct bloom_filter *filter,
return 1;
}
+
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+ int version)
+{
+ assert(version == 1 || version == 2);
+
+ if (version == 2)
+ return murmur3_seeded_v2(seed, data, len);
+ else
+ return murmur3_seeded_v1(seed, data, len);
+}
diff --git a/bloom.h b/bloom.h
index 6e46489a20..a9ded1822f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,15 +78,6 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
-/*
- * Calculate the murmur3 32-bit hash value for the given data
- * using the given seed.
- * Produces a uniformly distributed hash value.
- * Not considered to be cryptographically secure.
- * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
- */
-uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len);
-
void fill_bloom_key(const char *data,
size_t len,
struct bloom_key *key,
@@ -137,4 +128,7 @@ int bloom_filter_contains(const struct bloom_filter *filter,
const struct bloom_key *key,
const struct bloom_filter_settings *settings);
+uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
+ int version);
+
#endif
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 9aa2c5a592..6a24b6e0a6 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -61,13 +61,13 @@ int cmd__bloom(int argc, const char **argv)
uint32_t hashed;
if (argc < 3)
usage(bloom_usage);
- hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2]));
+ hashed = test_bloom_murmur3_seeded(0, argv[2], strlen(argv[2]), 2);
printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
}
if (!strcmp(argv[1], "get_murmur3_seven_highbit")) {
uint32_t hashed;
- hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7);
+ hashed = test_bloom_murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7, 2);
printf("Murmur3 Hash with seed=0:0x%08x\n", hashed);
}
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v6 2/5] bloom: rename function operates on bloom_key
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
2025-07-12 9:35 ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
@ 2025-07-12 9:35 ` Lidong Yan
2025-07-12 9:35 ` [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
` (3 subsequent siblings)
5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12 9:35 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee
git code style requires that functions operating on a struct S
should be named in the form S_verb. However, the functions operating
on struct bloom_key do not follow this convention. Therefore,
fill_bloom_key() and clear_bloom_key() are renamed to bloom_key_fill()
and bloom_key_clear(), respectively.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
blame.c | 2 +-
bloom.c | 10 ++++------
bloom.h | 6 ++----
line-log.c | 5 +++--
revision.c | 8 ++++----
t/helper/test-bloom.c | 4 ++--
6 files changed, 16 insertions(+), 19 deletions(-)
diff --git a/blame.c b/blame.c
index 57daa45e89..811c6d8f9f 100644
--- a/blame.c
+++ b/blame.c
@@ -1310,7 +1310,7 @@ static void add_bloom_key(struct blame_bloom_data *bd,
}
bd->keys[bd->nr] = xmalloc(sizeof(struct bloom_key));
- fill_bloom_key(path, strlen(path), bd->keys[bd->nr], bd->settings);
+ bloom_key_fill(bd->keys[bd->nr], path, strlen(path), bd->settings);
bd->nr++;
}
diff --git a/bloom.c b/bloom.c
index 946c5e8c98..5523d198c8 100644
--- a/bloom.c
+++ b/bloom.c
@@ -221,9 +221,7 @@ static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
return seed;
}
-void fill_bloom_key(const char *data,
- size_t len,
- struct bloom_key *key,
+void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings)
{
int i;
@@ -243,7 +241,7 @@ void fill_bloom_key(const char *data,
key->hashes[i] = hash0 + i * hash1;
}
-void clear_bloom_key(struct bloom_key *key)
+void bloom_key_clear(struct bloom_key *key)
{
FREE_AND_NULL(key->hashes);
}
@@ -500,9 +498,9 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
hashmap_for_each_entry(&pathmap, &iter, e, entry) {
struct bloom_key key;
- fill_bloom_key(e->path, strlen(e->path), &key, settings);
+ bloom_key_fill(&key, e->path, strlen(e->path), settings);
add_key_to_filter(&key, filter, settings);
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
}
cleanup:
diff --git a/bloom.h b/bloom.h
index a9ded1822f..603bc1f90f 100644
--- a/bloom.h
+++ b/bloom.h
@@ -78,11 +78,9 @@ int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
-void fill_bloom_key(const char *data,
- size_t len,
- struct bloom_key *key,
+void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings);
-void clear_bloom_key(struct bloom_key *key);
+void bloom_key_clear(struct bloom_key *key);
void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
diff --git a/line-log.c b/line-log.c
index 628e3fe3ae..07f2154e84 100644
--- a/line-log.c
+++ b/line-log.c
@@ -1172,12 +1172,13 @@ static int bloom_filter_check(struct rev_info *rev,
return 0;
while (!result && range) {
- fill_bloom_key(range->path, strlen(range->path), &key, rev->bloom_filter_settings);
+ bloom_key_fill(&key, range->path, strlen(range->path),
+ rev->bloom_filter_settings);
if (bloom_filter_contains(filter, &key, rev->bloom_filter_settings))
result = 1;
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
range = range->next;
}
diff --git a/revision.c b/revision.c
index afee111196..a7eadff0a5 100644
--- a/revision.c
+++ b/revision.c
@@ -739,15 +739,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
revs->bloom_keys_nr = path_component_nr;
ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
- fill_bloom_key(path, len, &revs->bloom_keys[0],
+ bloom_key_fill(&revs->bloom_keys[0], path, len,
revs->bloom_filter_settings);
path_component_nr = 1;
p = path + len - 1;
while (p > path) {
if (*p == '/')
- fill_bloom_key(path, p - path,
- &revs->bloom_keys[path_component_nr++],
+ bloom_key_fill(&revs->bloom_keys[path_component_nr++],
+ path, p - path,
revs->bloom_filter_settings);
p--;
}
@@ -3231,7 +3231,7 @@ void release_revisions(struct rev_info *revs)
oidset_clear(&revs->missing_commits);
for (int i = 0; i < revs->bloom_keys_nr; i++)
- clear_bloom_key(&revs->bloom_keys[i]);
+ bloom_key_clear(&revs->bloom_keys[i]);
FREE_AND_NULL(revs->bloom_keys);
revs->bloom_keys_nr = 0;
}
diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
index 6a24b6e0a6..3283544bd3 100644
--- a/t/helper/test-bloom.c
+++ b/t/helper/test-bloom.c
@@ -12,13 +12,13 @@ static struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
static void add_string_to_filter(const char *data, struct bloom_filter *filter) {
struct bloom_key key;
- fill_bloom_key(data, strlen(data), &key, &settings);
+ bloom_key_fill(&key, data, strlen(data), &settings);
printf("Hashes:");
for (size_t i = 0; i < settings.num_hashes; i++)
printf("0x%08x|", key.hashes[i]);
printf("\n");
add_key_to_filter(&key, filter, &settings);
- clear_bloom_key(&key);
+ bloom_key_clear(&key);
}
static void print_bloom_filter(struct bloom_filter *filter) {
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
2025-07-12 9:35 ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-12 9:35 ` [PATCH v6 2/5] bloom: rename function operates on bloom_key Lidong Yan
@ 2025-07-12 9:35 ` Lidong Yan
2025-07-12 9:35 ` [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec Lidong Yan
` (2 subsequent siblings)
5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12 9:35 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee
Previously, we stored bloom keys in a flat array and marked a commit
as NOT TREESAME if any key reported "definitely not changed".
To support multiple pathspec items, we now require that for each
pathspec item, there exists a bloom key reporting "definitely not
changed".
This "for every" condition makes a flat array insufficient, so we
introduce a new structure to group keys by a single pathspec item.
`struct bloom_keyvec` is introduced to replace `struct bloom_key *`
and `bloom_key_nr`. And because we want to support multiple pathspec
items, we added a bloom_keyvec * and a bloom_keyvec_nr field to
`struct rev_info` to represent an array of bloom_keyvecs. This commit
still optimize only one pathspec item, thus bloom_keyvec_nr can only
be 0 or 1.
New bloom_keyvec_* functions are added to create and destroy a keyvec.
bloom_filter_contains_vec() is added to check if all key in keyvec is
contained in a bloom filter.
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
bloom.c | 61 +++++++++++++++++++++++++++++++++++++++++++
bloom.h | 38 +++++++++++++++++++++++++++
revision.c | 76 +++++++++++++++++++++---------------------------------
revision.h | 6 ++---
4 files changed, 132 insertions(+), 49 deletions(-)
diff --git a/bloom.c b/bloom.c
index 5523d198c8..b86015f6d1 100644
--- a/bloom.c
+++ b/bloom.c
@@ -278,6 +278,55 @@ void deinit_bloom_filters(void)
deep_clear_bloom_filter_slab(&bloom_filters, free_one_bloom_filter);
}
+struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
+ const struct bloom_filter_settings *settings)
+{
+ struct bloom_keyvec *vec;
+ const char *p;
+ size_t sz;
+ size_t nr = 1;
+
+ p = path;
+ while (*p) {
+ /*
+ * At this point, the path is normalized to use Unix-style
+ * path separators. This is required due to how the
+ * changed-path Bloom filters store the paths.
+ */
+ if (*p == '/')
+ nr++;
+ p++;
+ }
+
+ sz = sizeof(struct bloom_keyvec);
+ sz += nr * sizeof(struct bloom_key);
+ vec = (struct bloom_keyvec *)xcalloc(1, sz);
+ if (!vec)
+ return NULL;
+ vec->count = nr;
+
+ bloom_key_fill(&vec->key[0], path, len, settings);
+ nr = 1;
+ p = path + len - 1;
+ while (p > path) {
+ if (*p == '/') {
+ bloom_key_fill(&vec->key[nr++], path, p - path, settings);
+ }
+ p--;
+ }
+ assert(nr == vec->count);
+ return vec;
+}
+
+void bloom_keyvec_free(struct bloom_keyvec *vec)
+{
+ if (!vec)
+ return;
+ for (size_t nr = 0; nr < vec->count; nr++)
+ bloom_key_clear(&vec->key[nr]);
+ free(vec);
+}
+
static int pathmap_cmp(const void *hashmap_cmp_fn_data UNUSED,
const struct hashmap_entry *eptr,
const struct hashmap_entry *entry_or_key,
@@ -539,6 +588,18 @@ int bloom_filter_contains(const struct bloom_filter *filter,
return 1;
}
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+ const struct bloom_keyvec *vec,
+ const struct bloom_filter_settings *settings)
+{
+ int ret = 1;
+
+ for (size_t nr = 0; ret > 0 && nr < vec->count; nr++)
+ ret = bloom_filter_contains(filter, &vec->key[nr], settings);
+
+ return ret;
+}
+
uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
int version)
{
diff --git a/bloom.h b/bloom.h
index 603bc1f90f..92ab2100d3 100644
--- a/bloom.h
+++ b/bloom.h
@@ -74,6 +74,16 @@ struct bloom_key {
uint32_t *hashes;
};
+/*
+ * A bloom_keyvec is a vector of bloom_keys, which
+ * can be used to store multiple keys for a single
+ * pathspec item.
+ */
+struct bloom_keyvec {
+ size_t count;
+ struct bloom_key key[FLEX_ARRAY];
+};
+
int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
uint32_t graph_pos);
@@ -82,6 +92,23 @@ void bloom_key_fill(struct bloom_key *key, const char *data, size_t len,
const struct bloom_filter_settings *settings);
void bloom_key_clear(struct bloom_key *key);
+/*
+ * bloom_keyvec_new - Allocate and populate a bloom_keyvec with keys for the
+ * given path.
+ *
+ * This function splits the input path by '/' and generates a bloom key for each
+ * prefix, in reverse order of specificity. For example, given the input
+ * "a/b/c", it will generate bloom keys for:
+ * - "a/b/c"
+ * - "a/b"
+ * - "a"
+ *
+ * The resulting keys are stored in a newly allocated bloom_keyvec.
+ */
+struct bloom_keyvec *bloom_keyvec_new(const char *path, size_t len,
+ const struct bloom_filter_settings *settings);
+void bloom_keyvec_free(struct bloom_keyvec *vec);
+
void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
const struct bloom_filter_settings *settings);
@@ -126,6 +153,17 @@ int bloom_filter_contains(const struct bloom_filter *filter,
const struct bloom_key *key,
const struct bloom_filter_settings *settings);
+/*
+ * bloom_filter_contains_vec - Check if all keys in a key vector are in the
+ * Bloom filter.
+ *
+ * Returns 1 if **all** keys in the vector are present in the filter,
+ * 0 if **any** key is not present.
+ */
+int bloom_filter_contains_vec(const struct bloom_filter *filter,
+ const struct bloom_keyvec *v,
+ const struct bloom_filter_settings *settings);
+
uint32_t test_bloom_murmur3_seeded(uint32_t seed, const char *data, size_t len,
int version);
diff --git a/revision.c b/revision.c
index a7eadff0a5..e4e0c83b0c 100644
--- a/revision.c
+++ b/revision.c
@@ -685,13 +685,14 @@ static int forbid_bloom_filters(struct pathspec *spec)
return 0;
}
+static void release_revisions_bloom_keyvecs(struct rev_info *revs);
+
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
char *path_alloc = NULL;
- const char *path, *p;
+ const char *path;
size_t len;
- int path_component_nr = 1;
if (!revs->commits)
return;
@@ -708,6 +709,8 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
+ revs->bloom_keyvecs_nr = 1;
+ CALLOC_ARRAY(revs->bloom_keyvecs, 1);
pi = &revs->pruning.pathspec.items[0];
/* remove single trailing slash from path, if needed */
@@ -718,53 +721,30 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
path = pi->match;
len = strlen(path);
- if (!len) {
- revs->bloom_filter_settings = NULL;
- free(path_alloc);
- return;
- }
-
- p = path;
- while (*p) {
- /*
- * At this point, the path is normalized to use Unix-style
- * path separators. This is required due to how the
- * changed-path Bloom filters store the paths.
- */
- if (*p == '/')
- path_component_nr++;
- p++;
- }
-
- revs->bloom_keys_nr = path_component_nr;
- ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+ if (!len)
+ goto fail;
- bloom_key_fill(&revs->bloom_keys[0], path, len,
- revs->bloom_filter_settings);
- path_component_nr = 1;
-
- p = path + len - 1;
- while (p > path) {
- if (*p == '/')
- bloom_key_fill(&revs->bloom_keys[path_component_nr++],
- path, p - path,
- revs->bloom_filter_settings);
- p--;
- }
+ revs->bloom_keyvecs[0] =
+ bloom_keyvec_new(path, len, revs->bloom_filter_settings);
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
bloom_filter_atexit_registered = 1;
}
+ return;
+
+fail:
+ revs->bloom_filter_settings = NULL;
free(path_alloc);
+ release_revisions_bloom_keyvecs(revs);
}
static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
struct commit *commit)
{
struct bloom_filter *filter;
- int result = 1, j;
+ int result = 0;
if (!revs->repo->objects->commit_graph)
return -1;
@@ -779,10 +759,10 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
return -1;
}
- for (j = 0; result && j < revs->bloom_keys_nr; j++) {
- result = bloom_filter_contains(filter,
- &revs->bloom_keys[j],
- revs->bloom_filter_settings);
+ for (size_t nr = 0; !result && nr < revs->bloom_keyvecs_nr; nr++) {
+ result = bloom_filter_contains_vec(filter,
+ revs->bloom_keyvecs[nr],
+ revs->bloom_filter_settings);
}
if (result)
@@ -823,7 +803,7 @@ static int rev_compare_tree(struct rev_info *revs,
return REV_TREE_SAME;
}
- if (revs->bloom_keys_nr && !nth_parent) {
+ if (revs->bloom_keyvecs_nr && !nth_parent) {
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
if (bloom_ret == 0)
@@ -850,7 +830,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
if (!t1)
return 0;
- if (!nth_parent && revs->bloom_keys_nr) {
+ if (!nth_parent && revs->bloom_keyvecs_nr) {
bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
if (!bloom_ret)
return 1;
@@ -3201,6 +3181,14 @@ static void release_revisions_mailmap(struct string_list *mailmap)
static void release_revisions_topo_walk_info(struct topo_walk_info *info);
+static void release_revisions_bloom_keyvecs(struct rev_info *revs)
+{
+ for (size_t nr = 0; nr < revs->bloom_keyvecs_nr; nr++)
+ bloom_keyvec_free(revs->bloom_keyvecs[nr]);
+ FREE_AND_NULL(revs->bloom_keyvecs);
+ revs->bloom_keyvecs_nr = 0;
+}
+
static void free_void_commit_list(void *list)
{
free_commit_list(list);
@@ -3229,11 +3217,7 @@ void release_revisions(struct rev_info *revs)
clear_decoration(&revs->treesame, free);
line_log_free(revs);
oidset_clear(&revs->missing_commits);
-
- for (int i = 0; i < revs->bloom_keys_nr; i++)
- bloom_key_clear(&revs->bloom_keys[i]);
- FREE_AND_NULL(revs->bloom_keys);
- revs->bloom_keys_nr = 0;
+ release_revisions_bloom_keyvecs(revs);
}
static void add_child(struct rev_info *revs, struct commit *parent, struct commit *child)
diff --git a/revision.h b/revision.h
index 6d369cdad6..ac843f58d0 100644
--- a/revision.h
+++ b/revision.h
@@ -62,7 +62,7 @@ struct repository;
struct rev_info;
struct string_list;
struct saved_parents;
-struct bloom_key;
+struct bloom_keyvec;
struct bloom_filter_settings;
struct option;
struct parse_opt_ctx_t;
@@ -360,8 +360,8 @@ struct rev_info {
/* Commit graph bloom filter fields */
/* The bloom filter key(s) for the pathspec */
- struct bloom_key *bloom_keys;
- int bloom_keys_nr;
+ struct bloom_keyvec **bloom_keyvecs;
+ int bloom_keyvecs_nr;
/*
* The bloom filter settings used to generate the key.
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
` (2 preceding siblings ...)
2025-07-12 9:35 ` [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
@ 2025-07-12 9:35 ` Lidong Yan
2025-07-12 9:35 ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
2025-07-14 16:53 ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
5 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-12 9:35 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee
When preparing to use bloom filters in a revision walk, Git populates a
boom_keyvec with an array of bloom keys for the components of a path.
Before we create the ability to map multiple pathspecs to multiple
bloom_keyvecs, extract the conversion from a pathspec to a bloom_keyvec
into its own helper method. This simplifies the state that persists in
prepare_to_use_bloom_filter() as well as makes the future change much
simpler.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
revision.c | 45 +++++++++++++++++++++++++++++----------------
1 file changed, 29 insertions(+), 16 deletions(-)
diff --git a/revision.c b/revision.c
index e4e0c83b0c..1614c6ce0d 100644
--- a/revision.c
+++ b/revision.c
@@ -687,13 +687,37 @@ static int forbid_bloom_filters(struct pathspec *spec)
static void release_revisions_bloom_keyvecs(struct rev_info *revs);
-static void prepare_to_use_bloom_filter(struct rev_info *revs)
+static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out,
+ const struct pathspec_item *pi,
+ const struct bloom_filter_settings *settings)
{
- struct pathspec_item *pi;
char *path_alloc = NULL;
const char *path;
size_t len;
+ int res = 0;
+
+ /* remove single trailing slash from path, if needed */
+ if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
+ path_alloc = xmemdupz(pi->match, pi->len - 1);
+ path = path_alloc;
+ } else
+ path = pi->match;
+
+ len = strlen(path);
+ if (!len) {
+ res = -1;
+ goto cleanup;
+ }
+ *out = bloom_keyvec_new(path, len, settings);
+
+cleanup:
+ free(path_alloc);
+ return res;
+}
+
+static void prepare_to_use_bloom_filter(struct rev_info *revs)
+{
if (!revs->commits)
return;
@@ -711,22 +735,12 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
revs->bloom_keyvecs_nr = 1;
CALLOC_ARRAY(revs->bloom_keyvecs, 1);
- pi = &revs->pruning.pathspec.items[0];
- /* remove single trailing slash from path, if needed */
- if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
- path_alloc = xmemdupz(pi->match, pi->len - 1);
- path = path_alloc;
- } else
- path = pi->match;
-
- len = strlen(path);
- if (!len)
+ if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
+ &revs->pruning.pathspec.items[0],
+ revs->bloom_filter_settings))
goto fail;
- revs->bloom_keyvecs[0] =
- bloom_keyvec_new(path, len, revs->bloom_filter_settings);
-
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
bloom_filter_atexit_registered = 1;
@@ -736,7 +750,6 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
fail:
revs->bloom_filter_settings = NULL;
- free(path_alloc);
release_revisions_bloom_keyvecs(revs);
}
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 72+ messages in thread* [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible.
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
` (3 preceding siblings ...)
2025-07-12 9:35 ` [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec Lidong Yan
@ 2025-07-12 9:35 ` Lidong Yan
2025-07-12 9:47 ` Lidong Yan
2025-07-14 16:53 ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
5 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-12 9:35 UTC (permalink / raw)
To: yldhome2d2; +Cc: 502024330056, git, gitster, toon, stolee
Add new test cases in t/t4216-log-bloom.sh to ensure
- consistent results between the optimization for multiple pathspec
items using bloom filter and the case without bloom filter
optimization.
- does not use bloom filter if any pathspec item is not literal.
With these optimizations, we get some improvements for multi-pathspec runs
of 'git log'. First, in the Git repository we see these modest results:
Benchmark 1: old
Time (mean ± σ): 73.1 ms ± 2.9 ms
Range (min … max): 69.9 ms … 84.5 ms 42 runs
Benchmark 2: new
Time (mean ± σ): 55.1 ms ± 2.9 ms
Range (min … max): 51.1 ms … 61.2 ms 52 runs
Summary
'new' ran
1.33 ± 0.09 times faster than 'old'
But in a larger repo, such as the LLVM project repo below, we get even
better results:
Benchmark 1: old
Time (mean ± σ): 1.974 s ± 0.006 s
Range (min … max): 1.960 s … 1.983 s 10 runs
Benchmark 2: new
Time (mean ± σ): 262.9 ms ± 2.4 ms
Range (min … max): 257.7 ms … 266.2 ms 11 runs
Summary
'new' ran
7.51 ± 0.07 times faster than 'old'
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
revision.c | 21 +++++++++++----------
t/t4216-log-bloom.sh | 23 ++++++++++++++---------
2 files changed, 25 insertions(+), 19 deletions(-)
diff --git a/revision.c b/revision.c
index 1614c6ce0d..cf7198c0ea 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
{
if (spec->has_wildcard)
return 1;
- if (spec->nr > 1)
- return 1;
if (spec->magic & ~PATHSPEC_LITERAL)
return 1;
- if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
- return 1;
+ for (size_t nr = 0; nr < spec->nr; nr++)
+ if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+ return 1;
return 0;
}
@@ -733,13 +732,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
- if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
- &revs->pruning.pathspec.items[0],
- revs->bloom_filter_settings))
- goto fail;
+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+ if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
+ &revs->pruning.pathspec.items[i],
+ revs->bloom_filter_settings))
+ goto fail;
+ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
setup () {
rm -f "$TRASH_DIRECTORY/trace.perf" &&
- git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
- GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+ eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+ eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+ git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
}
test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
test_bloom_filters_not_used "--walk-reflogs -- A"
'
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
- test_bloom_filters_not_used "-- file4 A/file1"
-'
-
test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
test_bloom_filters_not_used "-- ."
'
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
test_bloom_filters_used "-- *renamed"
'
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
- test_bloom_filters_not_used "-- *" &&
- test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+ test_bloom_filters_used "-- file4 A/file1" &&
+ test_bloom_filters_used "-- *" &&
+ test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+ test_bloom_filters_not_used "-- file\*" &&
+ test_bloom_filters_not_used "-- A/\* file4" &&
+ test_bloom_filters_not_used "-- file4 A/\*" &&
+ test_bloom_filters_not_used "-- * A/\*"
'
test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 72+ messages in thread* Re: [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible.
2025-07-12 9:35 ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
@ 2025-07-12 9:47 ` Lidong Yan
2025-07-12 9:51 ` [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision Lidong Yan
0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-12 9:47 UTC (permalink / raw)
To: Lidong Yan; +Cc: git, gitster, stolee
Lidong Yan <yldhome2d2@gmail.com> writes:
>
> Add new test cases in t/t4216-log-bloom.sh to ensure
> - consistent results between the optimization for multiple pathspec
> items using bloom filter and the case without bloom filter
> optimization.
> - does not use bloom filter if any pathspec item is not literal.
>
> With these optimizations, we get some improvements for multi-pathspec runs
> of 'git log'. First, in the Git repository we see these modest results:
Sorry, seems like I wrote bad commit message, I will resend patch 5/5 soon.
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
2025-07-12 9:47 ` Lidong Yan
@ 2025-07-12 9:51 ` Lidong Yan
2025-07-14 16:51 ` Derrick Stolee
0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-12 9:51 UTC (permalink / raw)
To: yldhome2d2; +Cc: git, gitster, stolee, Lidong Yan
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.
Add new test cases in t/t4216-log-bloom.sh to ensure
- consistent results between the optimization for multiple pathspec
items using bloom filter and the case without bloom filter
optimization.
- does not use bloom filter if any pathspec item is not literal.
With these optimizations, we get some improvements for multi-pathspec runs
of 'git log'. First, in the Git repository we see these modest results:
Benchmark 1: old
Time (mean ± σ): 73.1 ms ± 2.9 ms
Range (min … max): 69.9 ms … 84.5 ms 42 runs
Benchmark 2: new
Time (mean ± σ): 55.1 ms ± 2.9 ms
Range (min … max): 51.1 ms … 61.2 ms 52 runs
Summary
'new' ran
1.33 ± 0.09 times faster than 'old'
But in a larger repo, such as the LLVM project repo below, we get even
better results:
Benchmark 1: old
Time (mean ± σ): 1.974 s ± 0.006 s
Range (min … max): 1.960 s … 1.983 s 10 runs
Benchmark 2: new
Time (mean ± σ): 262.9 ms ± 2.4 ms
Range (min … max): 257.7 ms … 266.2 ms 11 runs
Summary
'new' ran
7.51 ± 0.07 times faster than 'old'
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
revision.c | 21 +++++++++++----------
t/t4216-log-bloom.sh | 23 ++++++++++++++---------
2 files changed, 25 insertions(+), 19 deletions(-)
diff --git a/revision.c b/revision.c
index 1614c6ce0d..cf7198c0ea 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
{
if (spec->has_wildcard)
return 1;
- if (spec->nr > 1)
- return 1;
if (spec->magic & ~PATHSPEC_LITERAL)
return 1;
- if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
- return 1;
+ for (size_t nr = 0; nr < spec->nr; nr++)
+ if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+ return 1;
return 0;
}
@@ -733,13 +732,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
- if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
- &revs->pruning.pathspec.items[0],
- revs->bloom_filter_settings))
- goto fail;
+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+ if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
+ &revs->pruning.pathspec.items[i],
+ revs->bloom_filter_settings))
+ goto fail;
+ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
setup () {
rm -f "$TRASH_DIRECTORY/trace.perf" &&
- git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
- GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+ eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+ eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+ git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
}
test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
test_bloom_filters_not_used "--walk-reflogs -- A"
'
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
- test_bloom_filters_not_used "-- file4 A/file1"
-'
-
test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
test_bloom_filters_not_used "-- ."
'
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
test_bloom_filters_used "-- *renamed"
'
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
- test_bloom_filters_not_used "-- *" &&
- test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+ test_bloom_filters_used "-- file4 A/file1" &&
+ test_bloom_filters_used "-- *" &&
+ test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+ test_bloom_filters_not_used "-- file\*" &&
+ test_bloom_filters_not_used "-- A/\* file4" &&
+ test_bloom_filters_not_used "-- file4 A/\*" &&
+ test_bloom_filters_not_used "-- * A/\*"
'
test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 72+ messages in thread* Re: [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
2025-07-12 9:51 ` [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision Lidong Yan
@ 2025-07-14 16:51 ` Derrick Stolee
2025-07-14 17:01 ` Junio C Hamano
0 siblings, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-14 16:51 UTC (permalink / raw)
To: Lidong Yan; +Cc: git, gitster, Lidong Yan
On 7/12/2025 5:51 AM, Lidong Yan wrote:
> To enable optimize multiple pathspec items in revision traversal,
> return 0 if all pathspec item is literal in forbid_bloom_filters().
> Add for loops to initialize and check each pathspec item's bloom_keyvec
> when optimization is possible.
The patch itself is good.
> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
Here, I'll just point out that your sign-off should follow mine
because you were the last to touch the patch. In this way, the
sign-off gives a kind of timestamp to who made the most-recent
changes (and that those changes have that person's sign-off,
and may not have been vetted by previous signers).
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
2025-07-14 16:51 ` Derrick Stolee
@ 2025-07-14 17:01 ` Junio C Hamano
2025-07-15 1:37 ` Lidong Yan
0 siblings, 1 reply; 72+ messages in thread
From: Junio C Hamano @ 2025-07-14 17:01 UTC (permalink / raw)
To: Derrick Stolee; +Cc: Lidong Yan, git, Lidong Yan
Derrick Stolee <stolee@gmail.com> writes:
> On 7/12/2025 5:51 AM, Lidong Yan wrote:
>> To enable optimize multiple pathspec items in revision traversal,
>> return 0 if all pathspec item is literal in forbid_bloom_filters().
>> Add for loops to initialize and check each pathspec item's bloom_keyvec
>> when optimization is possible.
>
> The patch itself is good.
>
>> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>
> Here, I'll just point out that your sign-off should follow mine
> because you were the last to touch the patch. In this way, the
> sign-off gives a kind of timestamp to who made the most-recent
> changes (and that those changes have that person's sign-off,
> and may not have been vetted by previous signers).
Thanks for pointing it out. Also perhaps a single-liner attribution
to clarify who did what, e.g.
Signed-off-by: Derrick
[ly: did this and that to derrick's code to adjust]
Signed-off-by: Lidong
would be more helpful.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
2025-07-14 17:01 ` Junio C Hamano
@ 2025-07-15 1:37 ` Lidong Yan
2025-07-15 2:56 ` [RESEND][PATCH " Lidong Yan
0 siblings, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-15 1:37 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Derrick Stolee, git
Junio C Hamano <gitster@pobox.com> writes:
>
> Derrick Stolee <stolee@gmail.com> writes:
>
>> On 7/12/2025 5:51 AM, Lidong Yan wrote:
>>> To enable optimize multiple pathspec items in revision traversal,
>>> return 0 if all pathspec item is literal in forbid_bloom_filters().
>>> Add for loops to initialize and check each pathspec item's bloom_keyvec
>>> when optimization is possible.
>>
>> The patch itself is good.
>>
>>> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
>>> Signed-off-by: Derrick Stolee <stolee@gmail.com>
>>
>> Here, I'll just point out that your sign-off should follow mine
>> because you were the last to touch the patch. In this way, the
>> sign-off gives a kind of timestamp to who made the most-recent
>> changes (and that those changes have that person's sign-off,
>> and may not have been vetted by previous signers).
>
> Thanks for pointing it out. Also perhaps a single-liner attribution
> to clarify who did what, e.g.
>
> Signed-off-by: Derrick
> [ly: did this and that to derrick's code to adjust]
> Signed-off-by: Lidong
I will fix the order of the sign-offs, add the attributes, and resend
PATCH v6 5/5.
Thanks,
Lidong
^ permalink raw reply [flat|nested] 72+ messages in thread
* [RESEND][PATCH v6 5/5] bloom: optimize multiple pathspec items in revision
2025-07-15 1:37 ` Lidong Yan
@ 2025-07-15 2:56 ` Lidong Yan
0 siblings, 0 replies; 72+ messages in thread
From: Lidong Yan @ 2025-07-15 2:56 UTC (permalink / raw)
To: 502024330056; +Cc: git, gitster, stolee, Lidong Yan
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.
Add new test cases in t/t4216-log-bloom.sh to ensure
- consistent results between the optimization for multiple pathspec
items using bloom filter and the case without bloom filter
optimization.
- does not use bloom filter if any pathspec item is not literal.
With these optimizations, we get some improvements for multi-pathspec runs
of 'git log'. First, in the Git repository we see these modest results:
Benchmark 1: old
Time (mean ± σ): 73.1 ms ± 2.9 ms
Range (min … max): 69.9 ms … 84.5 ms 42 runs
Benchmark 2: new
Time (mean ± σ): 55.1 ms ± 2.9 ms
Range (min … max): 51.1 ms … 61.2 ms 52 runs
Summary
'new' ran
1.33 ± 0.09 times faster than 'old'
But in a larger repo, such as the LLVM project repo below, we get even
better results:
Benchmark 1: old
Time (mean ± σ): 1.974 s ± 0.006 s
Range (min … max): 1.960 s … 1.983 s 10 runs
Benchmark 2: new
Time (mean ± σ): 262.9 ms ± 2.4 ms
Range (min … max): 257.7 ms … 266.2 ms 11 runs
Summary
'new' ran
7.51 ± 0.07 times faster than 'old'
Signed-off-by: Derrick Stolee <stolee@gmail.com>
[ly: rename convert_pathspec_to_filter() to convert_pathspec_to_bloom_keyvec()]
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
---
revision.c | 21 +++++++++++----------
t/t4216-log-bloom.sh | 23 ++++++++++++++---------
2 files changed, 25 insertions(+), 19 deletions(-)
diff --git a/revision.c b/revision.c
index 1614c6ce0d..cf7198c0ea 100644
--- a/revision.c
+++ b/revision.c
@@ -675,12 +675,11 @@ static int forbid_bloom_filters(struct pathspec *spec)
{
if (spec->has_wildcard)
return 1;
- if (spec->nr > 1)
- return 1;
if (spec->magic & ~PATHSPEC_LITERAL)
return 1;
- if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
- return 1;
+ for (size_t nr = 0; nr < spec->nr; nr++)
+ if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
+ return 1;
return 0;
}
@@ -733,13 +732,15 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
if (!revs->pruning.pathspec.nr)
return;
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
- if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
- &revs->pruning.pathspec.items[0],
- revs->bloom_filter_settings))
- goto fail;
+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
+ if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
+ &revs->pruning.pathspec.items[i],
+ revs->bloom_filter_settings))
+ goto fail;
+ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
atexit(trace2_bloom_filter_statistics_atexit);
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 8910d53cac..639868ac56 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -66,8 +66,9 @@ sane_unset GIT_TRACE2_CONFIG_PARAMS
setup () {
rm -f "$TRASH_DIRECTORY/trace.perf" &&
- git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom &&
- GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom
+ eval git -c core.commitGraph=false log --pretty="format:%s" "$1" >log_wo_bloom &&
+ eval "GIT_TRACE2_PERF=\"$TRASH_DIRECTORY/trace.perf\"" \
+ git -c core.commitGraph=true log --pretty="format:%s" "$1" >log_w_bloom
}
test_bloom_filters_used () {
@@ -138,10 +139,6 @@ test_expect_success 'git log with --walk-reflogs does not use Bloom filters' '
test_bloom_filters_not_used "--walk-reflogs -- A"
'
-test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
- test_bloom_filters_not_used "-- file4 A/file1"
-'
-
test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
test_bloom_filters_not_used "-- ."
'
@@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
test_bloom_filters_used "-- *renamed"
'
-test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
- test_bloom_filters_not_used "-- *" &&
- test_bloom_filters_not_used "-- file*"
+test_expect_success 'git log with multiple literal paths uses Bloom filter' '
+ test_bloom_filters_used "-- file4 A/file1" &&
+ test_bloom_filters_used "-- *" &&
+ test_bloom_filters_used "-- file*"
+'
+
+test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
+ test_bloom_filters_not_used "-- file\*" &&
+ test_bloom_filters_not_used "-- A/\* file4" &&
+ test_bloom_filters_not_used "-- file4 A/\*" &&
+ test_bloom_filters_not_used "-- * A/\*"
'
test_expect_success 'setup - add commit-graph to the chain without Bloom filters' '
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
` (4 preceding siblings ...)
2025-07-12 9:35 ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
@ 2025-07-14 16:53 ` Derrick Stolee
2025-07-14 17:02 ` Junio C Hamano
2025-07-15 1:34 ` Lidong Yan
5 siblings, 2 replies; 72+ messages in thread
From: Derrick Stolee @ 2025-07-14 16:53 UTC (permalink / raw)
To: Lidong Yan; +Cc: 502024330056, git, gitster, toon
On 7/12/2025 5:35 AM, Lidong Yan wrote:
> The difference from v5 is:
> - extract convert pathspec item to bloom_keyvec logic to
> a separate function, which simplifies the prepare_to_use_bloom_filter()
> function.
> - fix few bugs in v5.
Thanks for making these changes. Including your fixed patch 5, this
version looks ready to me.
I wouldn't say "fix a few bugs" but instead "fix some compile-time
linting complaints when using DEVELOPER=1" to be clear that the
functionality hasn't changed but the code is cleaner.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 72+ messages in thread* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
2025-07-14 16:53 ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
@ 2025-07-14 17:02 ` Junio C Hamano
2025-07-15 1:34 ` Lidong Yan
1 sibling, 0 replies; 72+ messages in thread
From: Junio C Hamano @ 2025-07-14 17:02 UTC (permalink / raw)
To: Derrick Stolee; +Cc: Lidong Yan, 502024330056, git, toon
Derrick Stolee <stolee@gmail.com> writes:
> On 7/12/2025 5:35 AM, Lidong Yan wrote:
>
>> The difference from v5 is:
>> - extract convert pathspec item to bloom_keyvec logic to
>> a separate function, which simplifies the prepare_to_use_bloom_filter()
>> function.
>> - fix few bugs in v5.
>
> Thanks for making these changes. Including your fixed patch 5, this
> version looks ready to me.
>
> I wouldn't say "fix a few bugs" but instead "fix some compile-time
> linting complaints when using DEVELOPER=1" to be clear that the
> functionality hasn't changed but the code is cleaner.
Thanks both, for polishing and reviewing.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
2025-07-14 16:53 ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
2025-07-14 17:02 ` Junio C Hamano
@ 2025-07-15 1:34 ` Lidong Yan
2025-07-15 2:48 ` Derrick Stolee
1 sibling, 1 reply; 72+ messages in thread
From: Lidong Yan @ 2025-07-15 1:34 UTC (permalink / raw)
To: Derrick Stolee; +Cc: git, gitster, toon
Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/12/2025 5:35 AM, Lidong Yan wrote:
>
>> The difference from v5 is:
>> - extract convert pathspec item to bloom_keyvec logic to
>> a separate function, which simplifies the prepare_to_use_bloom_filter()
>> function.
>> - fix few bugs in v5.
>
> Thanks for making these changes. Including your fixed patch 5, this
> version looks ready to me.
>
> I wouldn't say "fix a few bugs" but instead "fix some compile-time
> linting complaints when using DEVELOPER=1" to be clear that the
> functionality hasn't changed but the code is cleaner.
I just learned that `make DEVELOPER=1` treats warnings as errors.
Since this is just a cover letter issue, I feel it might not be worth rerolling
the patch again.
Thanks,
Lidong
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
2025-07-15 1:34 ` Lidong Yan
@ 2025-07-15 2:48 ` Derrick Stolee
2025-07-15 15:09 ` Junio C Hamano
0 siblings, 1 reply; 72+ messages in thread
From: Derrick Stolee @ 2025-07-15 2:48 UTC (permalink / raw)
To: Lidong Yan; +Cc: git, gitster, toon
On 7/14/2025 9:34 PM, Lidong Yan wrote:
> Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 7/12/2025 5:35 AM, Lidong Yan wrote:
>>
>>> The difference from v5 is:
>>> - extract convert pathspec item to bloom_keyvec logic to
>>> a separate function, which simplifies the prepare_to_use_bloom_filter()
>>> function.
>>> - fix few bugs in v5.
>>
>> Thanks for making these changes. Including your fixed patch 5, this
>> version looks ready to me.
>>
>> I wouldn't say "fix a few bugs" but instead "fix some compile-time
>> linting complaints when using DEVELOPER=1" to be clear that the
>> functionality hasn't changed but the code is cleaner.
>
> I just learned that `make DEVELOPER=1` treats warnings as errors.
> Since this is just a cover letter issue, I feel it might not be worth rerolling
> the patch again.
No need to reroll anything, I think. Junio's got the right
fixups in place.
This was just a comment to help you next time.
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 72+ messages in thread