From: Derrick Stolee <stolee@gmail.com>
To: Lidong Yan <yldhome2d2@gmail.com>
Cc: 502024330056@smail.nju.edu.cn, git@vger.kernel.org, gitster@pobox.com
Subject: Re: [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal
Date: Mon, 7 Jul 2025 07:43:19 -0400 [thread overview]
Message-ID: <ea144a72-0975-4ac9-b2e4-ae0f7fcb6837@gmail.com> (raw)
In-Reply-To: <20250704111437.2660251-5-502024330056@smail.nju.edu.cn>
On 7/4/2025 7:14 AM, Lidong Yan wrote:
> To enable optimize multiple pathspec items in revision traversal,
> return 0 if all pathspec item is literal in forbid_bloom_filters().
> Add code to initialize and check each pathspec item's bloom_keyvec.
>
> Add new function release_revisions_bloom_keyvecs() to free all bloom
> keyvec owned by rev_info.
>
> Add new test cases in t/t4216-log-bloom.sh to ensure
> - consistent results between the optimization for multiple pathspec
> items using bloom filter and the case without bloom filter
> optimization.
> - does not use bloom filter if any pathspec item is not literal.
This would be a great time to add some performance statistics when
using this feature with multiple pathspecs on some standard repos (git
and the Linux kernel repo are two good examples).
We don't have a great performance script for this, since each test
repo will have different paths to use for comparisons, but you can
use 'hyperfine' to assemble your own comparisons before and after
this change and report them here (and in your cover letter).
> Signed-off-by: Lidong Yan <502024330056@smail.nju.edu.cn>
> ---
> revision.c | 118 ++++++++++++++++++++++++-------------------
> t/t4216-log-bloom.sh | 23 +++++----
> 2 files changed, 79 insertions(+), 62 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index 7cbb49617d..9a77c0d0bc 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -675,16 +675,17 @@ static int forbid_bloom_filters(struct pathspec *spec)
> {
> if (spec->has_wildcard)
> return 1;
> - if (spec->nr > 1)
> - return 1;
> if (spec->magic & ~PATHSPEC_LITERAL)
> return 1;
> - if (spec->nr && (spec->items[0].magic & ~PATHSPEC_LITERAL))
> - return 1;
> + for (size_t nr = 0; nr < spec->nr; nr++)
> + if (spec->items[nr].magic & ~PATHSPEC_LITERAL)
> + return 1;
This is a good check: if any is non-literal, then we can't use bloom
filters.
> +static void release_revisions_bloom_keyvecs(struct rev_info *revs);
> +
> static void prepare_to_use_bloom_filter(struct rev_info *revs)
> {
> struct pathspec_item *pi;
> @@ -692,7 +693,7 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> char *path_alloc = NULL;
> const char *path, *p;
> size_t len;
> - int path_component_nr = 1;
> + int path_component_nr;
We can move this into the interior of the loop, right?
> if (!revs->commits)
> return;
> @@ -709,50 +710,52 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> if (!revs->pruning.pathspec.nr)
> return;
>
> - pi = &revs->pruning.pathspec.items[0];
> -
> - /* remove single trailing slash from path, if needed */
> - if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> - path_alloc = xmemdupz(pi->match, pi->len - 1);
> - path = path_alloc;
> - } else
> - path = pi->match;
> -
> - len = strlen(path);
> - if (!len) {
> - revs->bloom_filter_settings = NULL;
> - free(path_alloc);
> - return;
> - }
> -
> - p = path;
> - while (*p) {
> - /*
> - * At this point, the path is normalized to use Unix-style
> - * path separators. This is required due to how the
> - * changed-path Bloom filters store the paths.
> - */
> - if (*p == '/')
> - path_component_nr++;
> - p++;
> - }
> -
> - revs->bloom_keyvecs_nr = 1;
> - CALLOC_ARRAY(revs->bloom_keyvecs, 1);
> - bloom_keyvec = bloom_keyvec_new(path_component_nr);
> - revs->bloom_keyvecs[0] = bloom_keyvec;
> -
> - bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
> - revs->bloom_filter_settings);
> - path_component_nr = 1;
The size of this diff is unfortunate. I wonder if there could first
be an extraction of this logic to operate on a single pathspec and
bloom_keyvec in a way that would be an obvious code move, then this
patch could call that method in a loop now that we have an array of
bloom_keyvecs.
I think it would make a cleaner patch and a cleaner final result.
> + revs->bloom_keyvecs_nr = revs->pruning.pathspec.nr;
> + CALLOC_ARRAY(revs->bloom_keyvecs, revs->bloom_keyvecs_nr);
> + for (int i = 0; i < revs->pruning.pathspec.nr; i++) {
> + pi = &revs->pruning.pathspec.items[i];
> + path_component_nr = 1;
> +
> + /* remove single trailing slash from path, if needed */
> + if (pi->len > 0 && pi->match[pi->len - 1] == '/') {
> + path_alloc = xmemdupz(pi->match, pi->len - 1);
> + path = path_alloc;
> + } else
> + path = pi->match;
> +
> + len = strlen(path);
> + if (!len)
> + goto fail;
> +
> + p = path;
> + while (*p) {
> + /*
> + * At this point, the path is normalized to use
> + * Unix-style path separators. This is required due to
> + * how the changed-path Bloom filters store the paths.
> + */
> + if (*p == '/')
> + path_component_nr++;
> + p++;
> + }
>
> - p = path + len - 1;
> - while (p > path) {
> - if (*p == '/')
> - bloom_keyvec_fill_key(path, p - path, bloom_keyvec,
> - path_component_nr++,
> - revs->bloom_filter_settings);
> - p--;
> + bloom_keyvec = bloom_keyvec_new(path_component_nr);
> + revs->bloom_keyvecs[i] = bloom_keyvec;
> +
> + bloom_keyvec_fill_key(path, len, bloom_keyvec, 0,
> + revs->bloom_filter_settings);
> + path_component_nr = 1;
> +
> + p = path + len - 1;
> + while (p > path) {
> + if (*p == '/')
> + bloom_keyvec_fill_key(path, p - path,
> + bloom_keyvec,
> + path_component_nr++,
> + revs->bloom_filter_settings);
> + p--;
> + }
> + FREE_AND_NULL(path_alloc);
...
> -test_expect_success 'git log -- multiple path specs does not use Bloom filters' '
> - test_bloom_filters_not_used "-- file4 A/file1"
> -'
> -
Love to see this.
> test_expect_success 'git log -- "." pathspec at root does not use Bloom filters' '
> test_bloom_filters_not_used "-- ."
> '
> @@ -151,9 +148,17 @@ test_expect_success 'git log with wildcard that resolves to a single path uses B
> test_bloom_filters_used "-- *renamed"
> '
>
> -test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses Bloom filters' '
> - test_bloom_filters_not_used "-- *" &&
> - test_bloom_filters_not_used "-- file*"
> +test_expect_success 'git log with multiple literal paths uses Bloom filter' '
> + test_bloom_filters_used "-- file4 A/file1" &&
> + test_bloom_filters_used "-- *" &&
> + test_bloom_filters_used "-- file*"
> +'
> +
> +test_expect_success 'git log with path contains a wildcard does not use Bloom filter' '
> + test_bloom_filters_not_used "-- file\*" &&
> + test_bloom_filters_not_used "-- A/\* file4" &&
> + test_bloom_filters_not_used "-- file4 A/\*" &&
> + test_bloom_filters_not_used "-- * A/\*"
> '
And these new test cases are great.
Thanks for this work. I'm happy to see the feature be added and my suggestions
are purely cosmetic as the proof is in your tests.
Thanks,
-Stolee
next prev parent reply other threads:[~2025-07-07 11:43 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-25 12:55 [PATCH 0/2] bloom: use bloom filter given multiple pathspec Lidong Yan
2025-06-25 12:55 ` [PATCH 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-06-25 17:43 ` Junio C Hamano
2025-06-26 3:44 ` Lidong Yan
2025-06-25 12:55 ` [PATCH 2/2] bloom: enable multiple pathspec bloom keys Lidong Yan
2025-06-27 13:50 ` Junio C Hamano
2025-06-27 14:24 ` Lidong Yan
2025-06-27 18:09 ` Junio C Hamano
2025-07-01 5:52 ` Lidong Yan
2025-07-01 15:19 ` Junio C Hamano
2025-07-02 7:14 ` Lidong Yan
2025-07-02 15:48 ` Junio C Hamano
2025-07-03 1:52 ` Lidong Yan
2025-07-04 12:09 ` Lidong Yan
2025-07-01 8:50 ` SZEDER Gábor
2025-07-01 11:40 ` Lidong Yan
2025-07-01 15:43 ` Junio C Hamano
2025-06-27 20:39 ` Junio C Hamano
2025-06-28 2:54 ` Lidong Yan
2025-06-25 17:32 ` [PATCH 0/2] bloom: use bloom filter given multiple pathspec Junio C Hamano
2025-06-26 3:34 ` Lidong Yan
2025-06-26 14:15 ` Junio C Hamano
2025-06-27 6:21 ` [PATCH v2 0/2] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Lidong Yan
2025-06-28 4:21 ` [PATCH v3 " Lidong Yan
2025-07-04 11:14 ` [PATCH v4 0/4] " Lidong Yan
2025-07-04 11:14 ` [PATCH v4 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-04 11:14 ` [PATCH v4 2/4] bloom: rename function operates on bloom_key Lidong Yan
2025-07-04 11:14 ` [PATCH v4 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-07 11:35 ` Derrick Stolee
2025-07-07 14:14 ` Lidong Yan
2025-07-04 11:14 ` [PATCH v4 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-07-07 11:43 ` Derrick Stolee [this message]
2025-07-07 14:18 ` Lidong Yan
2025-07-07 15:14 ` Junio C Hamano
2025-07-10 8:48 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements " Lidong Yan
2025-07-10 8:48 ` [PATCH v5 1/4] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-10 8:48 ` [PATCH v5 2/4] bloom: rename function operates on bloom_key Lidong Yan
2025-07-10 8:48 ` [PATCH v5 3/4] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-10 16:17 ` Junio C Hamano
2025-07-11 12:46 ` Lidong Yan
2025-07-11 15:06 ` Junio C Hamano
2025-07-10 8:48 ` [PATCH v5 4/4] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-07-10 13:51 ` [PATCH v5.1 3.5/4] revision: make helper for pathspec to bloom key Derrick Stolee
2025-07-10 15:42 ` Lidong Yan
2025-07-10 13:55 ` [PATCH v5.1 4/4] bloom: optimize multiple pathspec items in revision Derrick Stolee
2025-07-10 15:49 ` Lidong Yan
2025-07-10 13:49 ` [PATCH v5 0/4] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
2025-07-12 9:35 ` [PATCH v6 0/5] " Lidong Yan
2025-07-12 9:35 ` [PATCH v6 1/5] bloom: add test helper to return murmur3 hash Lidong Yan
2025-07-12 9:35 ` [PATCH v6 2/5] bloom: rename function operates on bloom_key Lidong Yan
2025-07-12 9:35 ` [PATCH v6 3/5] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-12 9:35 ` [PATCH v6 4/5] revision: make helper for pathspec to bloom keyvec Lidong Yan
2025-07-12 9:35 ` [PATCH v6 5/5] To enable optimize multiple pathspec items in revision traversal, return 0 if all pathspec item is literal in forbid_bloom_filters(). Add for loops to initialize and check each pathspec item's bloom_keyvec when optimization is possible Lidong Yan
2025-07-12 9:47 ` Lidong Yan
2025-07-12 9:51 ` [PATCH v6 5/5] bloom: optimize multiple pathspec items in revision Lidong Yan
2025-07-14 16:51 ` Derrick Stolee
2025-07-14 17:01 ` Junio C Hamano
2025-07-15 1:37 ` Lidong Yan
2025-07-15 2:56 ` [RESEND][PATCH " Lidong Yan
2025-07-14 16:53 ` [PATCH v6 0/5] bloom: enable bloom filter optimization for multiple pathspec elements in revision traversal Derrick Stolee
2025-07-14 17:02 ` Junio C Hamano
2025-07-15 1:34 ` Lidong Yan
2025-07-15 2:48 ` Derrick Stolee
2025-07-15 15:09 ` Junio C Hamano
2025-06-28 4:21 ` [PATCH v3 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-07-02 15:08 ` Patrick Steinhardt
2025-07-02 15:49 ` Lidong Yan
2025-07-02 18:28 ` Junio C Hamano
2025-07-03 1:41 ` Lidong Yan
2025-06-28 4:21 ` [PATCH v3 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
2025-06-27 6:21 ` [PATCH v2 1/2] bloom: replace struct bloom_key * with struct bloom_keyvec Lidong Yan
2025-06-27 6:21 ` [PATCH v2 2/2] bloom: optimize multiple pathspec items in revision traversal Lidong Yan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ea144a72-0975-4ac9-b2e4-ae0f7fcb6837@gmail.com \
--to=stolee@gmail.com \
--cc=502024330056@smail.nju.edu.cn \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=yldhome2d2@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).