From: Lidong Yan <yldhome2d2@gmail.com>
To: yldhome2d2@gmail.com
Cc: 502024330056@smail.nju.edu.cn, git@vger.kernel.org,
gitster@pobox.com, toon@iotcl.com, stolee@gmail.com
Subject: [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal
Date: Sat, 12 Jul 2025 17:35:12 +0800 [thread overview]
Message-ID: <20250712093517.17907-1-yldhome2d2@gmail.com> (raw)
In-Reply-To: <20250710084829.2171855-1-502024330056@smail.nju.edu.cn>
The revision traversal limited by pathspec has optimization when
the pathspec has only one element, it does not use any pathspec
magic (other than literal), and there is no wildcard. The absence
of optimization for multiple pathspec elements in revision traversal
cause an issue raised by Kai Koponen at
https://lore.kernel.org/git/CADYQcGqaMC=4jgbmnF9Q11oC11jfrqyvH8EuiRRHytpMXd4wYA@mail.gmail.com/
While it is much harder to lift the latter two limitations,
supporting a pathspec with multiple elements is relatively easy.
Just make sure we hash each of them separately and ask the bloom
filter about them, and if we see none of them can possibly be
affected by the commit, we can skip without tree comparison.
The difference from v5 is:
- extract convert pathspec item to bloom_keyvec logic to
a separate function, which simplifies the prepare_to_use_bloom_filter()
function.
- fix few bugs in v5.
Below is a comparison of the time taken to run git log on Git and
LLVM repositories before and after applying this patch. These statistics
are given by Derrick Stolee at
https://lore.kernel.org/git/afb68948-218b-4b56-9faa-29578ef9c73c@gmail.com/
Setup commit-graph:
$ cd ~/src/git && git commit-graph write --split --reachable --changed-paths
$ cd ~/src/llvm && git commit-graph write --split --reachable --changed-paths
Running hyperfine [1] on Git repository:
$ hyperfine --warmup=3 \
> -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- commit.c commit-graph.c' \
> -n 'new' '~/_git/git/git log -100 -- commit.c commit-graph.c'
Benchmark 1: old
Time (mean ± σ): 73.1 ms ± 2.9 ms [User: 48.8 ms, System: 23.9 ms]
Range (min … max): 69.9 ms … 84.5 ms 42 runs
Benchmark 2: new
Time (mean ± σ): 55.1 ms ± 2.9 ms [User: 30.5 ms, System: 24.4 ms]
Range (min … max): 51.1 ms … 61.2 ms 52 runs
Summary
'new' ran
1.33 ± 0.09 times faster than 'old'
And for LLVM:
$ hyperfine --warmup=3 \
> -n 'old' '~/_git/git-sparse-checkout-clean/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h' \
> -n 'new' '~/_git/git/git log -100 -- llvm/lib/Support/CommandLine.cpp llvm/lib/Support/CommandLine.h'
Benchmark 1: old
Time (mean ± σ): 1.974 s ± 0.006 s [User: 1.877 s, System: 0.097 s]
Range (min … max): 1.960 s … 1.983 s 10 runs
Benchmark 2: new
Time (mean ± σ): 262.9 ms ± 2.4 ms [User: 214.2 ms, System: 48.4 ms]
Range (min … max): 257.7 ms … 266.2 ms 11 runs
Summary
'new' ran
7.51 ± 0.07 times faster than 'old'
[1] https://github.com/sharkdp/hyperfine
Lidong Yan (5):
bloom: add test helper to return murmur3 hash
bloom: rename function operates on bloom_key
bloom: replace struct bloom_key * with struct bloom_keyvec
revision: make helper for pathspec to bloom keyvec
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's
bloom_keyvec when optimization is possible.
blame.c | 2 +-
bloom.c | 84 ++++++++++++++++++++++++++---
bloom.h | 54 ++++++++++++++-----
line-log.c | 5 +-
revision.c | 122 +++++++++++++++++++++---------------------
revision.h | 6 +--
t/helper/test-bloom.c | 8 +--
t/t4216-log-bloom.sh | 23 ++++----
8 files changed, 204 insertions(+), 100 deletions(-)
Range-diff against v5:
1: 4d8f60e5ff = 1: f5ab19063d bloom: add test helper to return murmur3 hash
2: acee03e397 = 2: 51a180daa6 bloom: rename function operates on bloom_key
3: d7690bd02c ! 3: e17249ab4b bloom: replace struct bloom_key * with struct bloom_keyvec
@@ bloom.h: void bloom_key_fill(struct bloom_key *key, const char *data, size_t len
void bloom_key_clear(struct bloom_key *key);
+/*
-+ * bloom_keyvec_fill - Allocate and populate a bloom_keyvec with keys for the
++ * bloom_keyvec_new - Allocate and populate a bloom_keyvec with keys for the
+ * given path.
+ *
+ * This function splits the input path by '/' and generates a bloom key for each
@@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
-+ struct bloom_keyvec *bloom_keyvec;
char *path_alloc = NULL;
- const char *path, *p;
+- const char *path, *p;
++ const char *path;
size_t len;
- int path_component_nr = 1;
-: ---------- > 4: b3c1f5bcd1 revision: make helper for pathspec to bloom keyvec
4: e577aa1bfd ! 5: 785bd43674 bloom: optimize multiple pathspec items in revision traversal
@@ Metadata
Author: Lidong Yan <yldhome2d2@gmail.com>
## Commit message ##
- bloom: optimize multiple pathspec items in revision traversal
-
To enable optimize multiple pathspec items in revision traversal,
return 0 if all pathspec item is literal in forbid_bloom_filters().
Add for loops to initialize and check each pathspec item's bloom_keyvec
when optimization is possible.
Add new test cases in t/t4216-log-bloom.sh to ensure
- - consistent results between the optimization for multiple pathspec
- items using bloom filter and the case without bloom filter
- optimization.
- - does not use bloom filter if any pathspec item is not literal.
+ - consistent results between the optimization for multiple pathspec
+ items using bloom filter and the case without bloom filter
+ optimization.
+ - does not use bloom filter if any pathspec item is not literal.
+
+ With these optimizations, we get some improvements for multi-pathspec runs
+ of 'git log'. First, in the Git repository we see these modest results:
+
+ Benchmark 1: old
+ Time (mean ± σ): 73.1 ms ± 2.9 ms
+ Range (min … max): 69.9 ms … 84.5 ms 42 runs
+
+ Benchmark 2: new
+ Time (mean ± σ): 55.1 ms ± 2.9 ms
+ Range (min … max): 51.1 ms … 61.2 ms 52 runs
+
+ Summary
+ 'new' ran
+ 1.33 ± 0.09 times faster than 'old'
+
+ But in a larger repo, such as the LLVM project repo below, we get even
+ better results:
+
+ Benchmark 1: old
+ Time (mean ± σ): 1.974 s ± 0.006 s
+ Range (min … max): 1.960 s … 1.983 s 10 runs
+
+ Benchmark 2: new
+ Time (mean ± σ): 262.9 ms ± 2.4 ms
+ Range (min … max): 257.7 ms … 266.2 ms 11 runs
+
+ Summary
+ 'new' ran
+ 7.51 ± 0.07 times faster than 'old'
Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
+ Signed-off-by: Derrick Stolee <stolee@gmail.com>
## revision.c ##
@@ revision.c: static int forbid_bloom_filters(struct pathspec *spec)
@@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
- revs->bloom_keyvecs_nr = 1;
- CALLOC_ARRAY(revs->bloom_keyvecs, 1);
-- pi = &revs->pruning.pathspec.items[0];
+ revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
+ CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
-+ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
-+ pi = &revs->pruning.pathspec.items[i];
-- /* remove single trailing slash from path, if needed */
-- if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-- path_alloc = xmemdupz(pi->match, pi->len - 1);
-- path = path_alloc;
-- } else
-- path = pi->match;
-+ /* remove single trailing slash from path, if needed */
-+ if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
-+ path_alloc = xmemdupz(pi->match, pi->len - 1);
-+ path = path_alloc;
-+ } else
-+ path = pi->match;
-
-- len = strlen(path);
-- if (!len)
+- if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[0],
+- &revs->pruning.pathspec.items[0],
+- revs->bloom_filter_settings))
- goto fail;
-+ len = strlen(path);
-+ if (!len)
++ for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
++ if (convert_pathspec_to_bloom_keyvec(&revs->bloom_keyvecs[i],
++ &revs->pruning.pathspec.items[i],
++ revs->bloom_filter_settings))
+ goto fail;
-
-- revs->bloom_keyvecs[0] =
-- bloom_keyvec_new(path, len, revs->bloom_filter_settings);
-+ revs->bloom_keyvecs[i] =
-+ bloom_keyvec_new(path, len, revs->bloom_filter_settings);
-+ FREE_AND_NULL(path_alloc);
+ }
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
--
2.39.5 (Apple Git-154)
next prev parent reply other threads:[~2025-07-12 9:35 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
2025-06-25 12:55 ` [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-06-25 17:43 ` Junio C Hamano
2025-06-26 3:44 ` Lidong Yan
2025-06-25 12:55 ` [PATCH 2/2] bloom: enable multiple pathspec bloom keys Lidong Yan
2025-06-27 13:50 ` Junio C Hamano
2025-06-27 14:24 ` Lidong Yan
2025-06-27 18:09 ` Junio C Hamano
2025-07-01 5:52 ` Lidong Yan
2025-07-01 15:19 ` Junio C Hamano
2025-07-02 7:14 ` Lidong Yan
2025-07-02 15:48 ` Junio C Hamano
2025-07-03 1:52 ` Lidong Yan
2025-07-04 12:09 ` Lidong Yan
2025-07-01 8:50 ` SZEDER Gábor
2025-07-01 11:40 ` Lidong Yan
2025-07-01 15:43 ` Junio C Hamano
2025-06-27 20:39 ` Junio C Hamano
2025-06-28 2:54 ` Lidong Yan
2025-06-25 17:32 ` [PATCH 0/2] bloom: use bloom filter given multiple pathspec Junio C Hamano
2025-06-26 3:34 ` Lidong Yan
2025-06-26 14:15 ` Junio C Hamano
2025-06-27 6:21 ` [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Lidong Yan
2025-06-28 4:21 ` [PATCH v3 " Lidong Yan
2025-07-04 11:14 ` [PATCH v4 0/4] " Lidong Yan
2025-07-04 11:14 ` [PATCH v4 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-04 11:14 ` [PATCH v4 2/4] bloom: rename function operates on bloom_key Lidong Yan
2025-07-04 11:14 ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-07 11:35 ` Derrick Stolee
2025-07-07 14:14 ` Lidong Yan
2025-07-04 11:14 ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-07-07 11:43 ` Derrick Stolee
2025-07-07 14:18 ` Lidong Yan
2025-07-07 15:14 ` Junio C Hamano
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
2025-07-10 8:48 ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-10 8:48 ` [PATCH v5 2/4] bloom: rename function operates on bloom_key Lidong Yan
2025-07-10 8:48 ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-10 16:17 ` Junio C Hamano
2025-07-11 12:46 ` Lidong Yan
2025-07-11 15:06 ` Junio C Hamano
2025-07-10 8:48 ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-07-10 13:51 ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
2025-07-10 15:42 ` Lidong Yan
2025-07-10 13:55 ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
2025-07-10 15:49 ` Lidong Yan
2025-07-10 13:49 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
2025-07-12 9:35 ` Lidong Yan [this message]
2025-07-12 9:35 ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-12 9:35 ` [PATCH v6 2/5] bloom: rename function operates on bloom_key Lidong Yan
2025-07-12 9:35 ` [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-12 9:35 ` [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec Lidong Yan
2025-07-12 9:35 ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
2025-07-12 9:47 ` Lidong Yan
2025-07-12 9:51 ` [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision Lidong Yan
2025-07-14 16:51 ` Derrick Stolee
2025-07-14 17:01 ` Junio C Hamano
2025-07-15 1:37 ` Lidong Yan
2025-07-15 2:56 ` [RESEND][PATCH " Lidong Yan
2025-07-14 16:53 ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
2025-07-14 17:02 ` Junio C Hamano
2025-07-15 1:34 ` Lidong Yan
2025-07-15 2:48 ` Derrick Stolee
2025-07-15 15:09 ` Junio C Hamano
2025-06-28 4:21 ` [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-02 15:08 ` Patrick Steinhardt
2025-07-02 15:49 ` Lidong Yan
2025-07-02 18:28 ` Junio C Hamano
2025-07-03 1:41 ` Lidong Yan
2025-06-28 4:21 ` [PATCH v3 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-06-27 6:21 ` [PATCH v2 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-06-27 6:21 ` [PATCH v2 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250712093517.17907-1-yldhome2d2@gmail.com \
--to=yldhome2d2@gmail.com \
--cc=502024330056@smail.nju.edu.cn \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=stolee@gmail.com \
--cc=toon@iotcl.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).