* [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal @ 2025-08-07 5:12 Lidong Yan 2025-08-07 6:49 ` Patrick Steinhardt 2025-08-08 6:58 ` [PATCH v2] " Lidong Yan 0 siblings, 2 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-07 5:12 UTC (permalink / raw) To: git; +Cc: stolee, gitster, ttaylorr, Lidong Yan When traversing commits, a pathspec item can be used to limit the traversal to commits that modify the specified paths. And the commit-graph includes a Bloom filter to exclude commits that definitely did not modify a given pathspec item. During commit traversal, the Bloom filter can significantly improve performance. However, it is disabled if the specified pathspec item contains wildcard characters or magic signatures. Enable Bloom filter even if a pathspec item contains wildcard characters by filter only the non-wildcard part of the pathspec item. Also Enable Bloom filter if magic signature is not "exclude" or "icase". With this optimization, we get some improvements for pathspec with wildcard and magic signature. First, in the Git repository we see these modest results: git log -100 -- "t/*" Benchmark 1: new Time (mean ± σ): 20.7 ms ± 0.5 ms Range (min … max): 19.8 ms … 21.8 ms Benchmark 2: old Time (mean ± σ): 25.4 ms ± 0.6 ms Range (min … max): 24.1 ms … 26.8 ms git log -100 -- ":(top)t" Benchmark 1: new Time (mean ± σ): 15.3 ms ± 0.3 ms Range (min … max): 14.5 ms … 16.1 ms Benchmark 2: old Time (mean ± σ): 19.5 ms ± 0.5 ms Range (min … max): 18.7 ms … 20.7 ms But in a larger repo, such as the LLVM project repo below, we get even better results: git log -100 -- "libc/*" Benchmark 1: new Time (mean ± σ): 10.1 ms ± 0.7 ms Range (min … max): 8.7 ms … 11.5 ms Benchmark 2: old Time (mean ± σ): 26.1 ms ± 0.7 ms Range (min … max): 24.6 ms … 27.4 ms git log -100 -- ":(top)libc" Benchmark 1: new Time (mean ± σ): 11.0 ms ± 0.8 ms Range (min … max): 9.6 ms … 13.9 ms Benchmark 2: old Time (mean ± σ): 20.7 ms ± 0.8 ms Range (min … max): 18.8 ms … 21.8 ms Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> --- revision.c | 26 ++++++++++++++++++++------ t/t4216-log-bloom.sh | 31 +++++++++++++++++++++++++++---- 2 files changed, 47 insertions(+), 10 deletions(-) diff --git a/revision.c b/revision.c index 18f300d455..ef8c0b6eca 100644 --- a/revision.c +++ b/revision.c @@ -671,12 +671,13 @@ static void trace2_bloom_filter_statistics_atexit(void) static int forbid_bloom_filters(struct pathspec *spec) { - if (spec->has_wildcard) - return 1; - if (spec->magic & ~PATHSPEC_LITERAL) + int forbid_mask = + PATHSPEC_EXCLUDE | PATHSPEC_ICASE; + + if (spec->magic & forbid_mask) return 1; for (size_t nr = 0; nr < spec->nr; nr++) - if (spec->items[nr].magic & ~PATHSPEC_LITERAL) + if (spec->items[nr].magic & forbid_mask) return 1; return 0; @@ -693,9 +694,22 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, size_t len; int res = 0; + len = pi->nowildcard_len; /* remove single trailing slash from path, if needed */ - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { - path_alloc = xmemdupz(pi->match, pi->len - 1); + if (len > 0 && pi->match[len - 1] == '/') + len--; + else if (len != pi->len) { + /* + * for path like "/dir/file*", nowildcard part would be + * "/dir/file", but only "/dir" should be used for the + * bloom filter + */ + while (len > 0 && pi->match[len - 1] != '/') + len--; + } + + if (len != pi->len) { + path_alloc = xmemdupz(pi->match, len); path = path_alloc; } else path = pi->match; diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index 639868ac56..d8200e4dcb 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -154,11 +154,34 @@ test_expect_success 'git log with multiple literal paths uses Bloom filter' ' test_bloom_filters_used "-- file*" ' -test_expect_success 'git log with path contains a wildcard does not use Bloom filter' ' +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' ' + test_bloom_filters_used "-- A/\* file4" && + test_bloom_filters_used "-- file4 A/\*" && + test_bloom_filters_used "-- * A/\*" +' + +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' ' test_bloom_filters_not_used "-- file\*" && - test_bloom_filters_not_used "-- A/\* file4" && - test_bloom_filters_not_used "-- file4 A/\*" && - test_bloom_filters_not_used "-- * A/\*" + test_bloom_filters_not_used "-- file\* A/\*" && + test_bloom_filters_not_used "-- file\* *" && + test_bloom_filters_not_used "-- \*" +' + +test_expect_success 'git log with path contains various magic signatures' ' + cd A && + test_bloom_filters_used "-- \:\(top\)B" && + cd .. && + + test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" && + test_bloom_filters_not_used "-- \:\(icase\)FILE4" && + test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" && + + cat >.gitattributes <<-EOF && + A/file1 text + A/B/file2 -text + EOF + test_bloom_filters_used "-- \:\(attr\:text\)A" && + rm .gitattributes ' test_expect_success 'setup - add commit-graph to the chain without Bloom filters' ' -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-07 5:12 [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal Lidong Yan @ 2025-08-07 6:49 ` Patrick Steinhardt 2025-08-07 8:59 ` Lidong Yan 2025-08-07 16:15 ` Junio C Hamano 2025-08-08 6:58 ` [PATCH v2] " Lidong Yan 1 sibling, 2 replies; 16+ messages in thread From: Patrick Steinhardt @ 2025-08-07 6:49 UTC (permalink / raw) To: Lidong Yan; +Cc: git, stolee, gitster, ttaylorr On Thu, Aug 07, 2025 at 01:12:43PM +0800, Lidong Yan wrote: In the subject it should be s/enable bloom/enable Bloom. > When traversing commits, a pathspec item can be used to limit the > traversal to commits that modify the specified paths. And the > commit-graph includes a Bloom filter to exclude commits that definitely > did not modify a given pathspec item. During commit traversal, the > Bloom filter can significantly improve performance. However, it is > disabled if the specified pathspec item contains wildcard characters > or magic signatures. Let's add a paragraph here, as we now switch into the "what is being done mode". > Enable Bloom filter even if a pathspec item contains wildcard > characters by filter only the non-wildcard part of the pathspec item. s/by filter/by filtering/ > Also Enable Bloom filter if magic signature is not "exclude" or > "icase". This explains what is done, but not why this is safe to do. > With this optimization, we get some improvements for pathspec with > wildcard and magic signature. First, in the Git repository we see these "for pathspecs with wildcards or magic signatures". > diff --git a/revision.c b/revision.c > index 18f300d455..ef8c0b6eca 100644 > --- a/revision.c > +++ b/revision.c > @@ -671,12 +671,13 @@ static void trace2_bloom_filter_statistics_atexit(void) > > static int forbid_bloom_filters(struct pathspec *spec) > { > - if (spec->has_wildcard) > - return 1; > - if (spec->magic & ~PATHSPEC_LITERAL) > + int forbid_mask = The mask should be `unsigned`. > + PATHSPEC_EXCLUDE | PATHSPEC_ICASE; I think instead of a forbid-mask we should use an allow-mask. Otherwise it can happen quite easily that we add new magic that isn't compatible with Bloom filters but forget to update this part here. I'd rather be slow but correct than fast but incorrect. > + if (spec->magic & forbid_mask) > return 1; > for (size_t nr = 0; nr < spec->nr; nr++) > - if (spec->items[nr].magic & ~PATHSPEC_LITERAL) > + if (spec->items[nr].magic & forbid_mask) > return 1; > > return 0; > @@ -693,9 +694,22 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, > size_t len; > int res = 0; > > + len = pi->nowildcard_len; > /* remove single trailing slash from path, if needed */ > - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { > - path_alloc = xmemdupz(pi->match, pi->len - 1); > + if (len > 0 && pi->match[len - 1] == '/') > + len--; > + else if (len != pi->len) { > + /* > + * for path like "/dir/file*", nowildcard part would be > + * "/dir/file", but only "/dir" should be used for the > + * bloom filter > + */ > + while (len > 0 && pi->match[len - 1] != '/') > + len--; > + } > + > + if (len != pi->len) { > + path_alloc = xmemdupz(pi->match, len); > path = path_alloc; > } else > path = pi->match; Okay, this matches what I've expected: if we have a wildcard we cannot match on the component that contains the wildcard itself. But what we _can_ do is to match on all the components leading to that wildcard component. One thing I did wonder though: what happens if the first component contains the wildcard? We cannot really make any use of the Bloom filter in that case as the path we match against becomes empty. I expect that we'll handle this just fine. But is it still more performant than not even trying Bloom filters in the first place? > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh > index 639868ac56..d8200e4dcb 100755 > --- a/t/t4216-log-bloom.sh > +++ b/t/t4216-log-bloom.sh > @@ -154,11 +154,34 @@ test_expect_success 'git log with multiple literal paths uses Bloom filter' ' > test_bloom_filters_used "-- file*" > ' > > -test_expect_success 'git log with path contains a wildcard does not use Bloom filter' ' > +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' ' > + test_bloom_filters_used "-- A/\* file4" && > + test_bloom_filters_used "-- file4 A/\*" && > + test_bloom_filters_used "-- * A/\*" > +' > + > +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' ' > test_bloom_filters_not_used "-- file\*" && > - test_bloom_filters_not_used "-- A/\* file4" && > - test_bloom_filters_not_used "-- file4 A/\*" && > - test_bloom_filters_not_used "-- * A/\*" > + test_bloom_filters_not_used "-- file\* A/\*" && > + test_bloom_filters_not_used "-- file\* *" && > + test_bloom_filters_not_used "-- \*" > +' > + > +test_expect_success 'git log with path contains various magic signatures' ' > + cd A && > + test_bloom_filters_used "-- \:\(top\)B" && > + cd .. && > + > + test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" && > + test_bloom_filters_not_used "-- \:\(icase\)FILE4" && > + test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" && > + > + cat >.gitattributes <<-EOF && > + A/file1 text > + A/B/file2 -text We typically indent the heredoc text to the same level as the command. > + EOF > + test_bloom_filters_used "-- \:\(attr\:text\)A" && > + rm .gitattributes You can use `test_when_finished" instead to clean up after yourself even in case the test fails. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-07 6:49 ` Patrick Steinhardt @ 2025-08-07 8:59 ` Lidong Yan 2025-08-07 16:15 ` Junio C Hamano 1 sibling, 0 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-07 8:59 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: git, stolee, gitster, ttaylorr Patrick Steinhardt <ps@pks.im> wrote: > > On Thu, Aug 07, 2025 at 01:12:43PM +0800, Lidong Yan wrote: > > In the subject it should be s/enable bloom/enable Bloom. > >> When traversing commits, a pathspec item can be used to limit the >> traversal to commits that modify the specified paths. And the >> commit-graph includes a Bloom filter to exclude commits that definitely >> did not modify a given pathspec item. During commit traversal, the >> Bloom filter can significantly improve performance. However, it is >> disabled if the specified pathspec item contains wildcard characters >> or magic signatures. > > Let's add a paragraph here, as we now switch into the "what is being > done mode". > >> Enable Bloom filter even if a pathspec item contains wildcard >> characters by filter only the non-wildcard part of the pathspec item. > > s/by filter/by filtering/ > >> With this optimization, we get some improvements for pathspec with >> wildcard and magic signature. First, in the Git repository we see these > > "for pathspecs with wildcards or magic signatures". Will fix all the grammatical errors in next version. >> Also Enable Bloom filter if magic signature is not "exclude" or >> "icase". > > This explains what is done, but not why this is safe to do. I forgot to mention this — I’ll make sure to include it in the next commit message. > >> diff --git a/revision.c b/revision.c >> index 18f300d455..ef8c0b6eca 100644 >> --- a/revision.c >> +++ b/revision.c >> @@ -671,12 +671,13 @@ static void trace2_bloom_filter_statistics_atexit(void) >> >> static int forbid_bloom_filters(struct pathspec *spec) >> { >> - if (spec->has_wildcard) >> - return 1; >> - if (spec->magic & ~PATHSPEC_LITERAL) >> + int forbid_mask = > > The mask should be `unsigned`. > >> + PATHSPEC_EXCLUDE | PATHSPEC_ICASE; > > I think instead of a forbid-mask we should use an allow-mask. Otherwise > it can happen quite easily that we add new magic that isn't compatible > with Bloom filters but forget to update this part here. I'd rather be > slow but correct than fast but incorrect. Interesting and reasonable point — will fix. > > One thing I did wonder though: what happens if the first component > contains the wildcard? We cannot really make any use of the Bloom filter > in that case as the path we match against becomes empty. I expect that > we'll handle this just fine. But is it still more performant than not > even trying Bloom filters in the first place? If the first component contains the wildcard, convert_pathspec_to_bloom_keyvec() failed and returns -1, which lead to prepare_to_use_bloom_filter() cleanup (free and NULLing) all bloom filter related field. So the performant would be as same as bloom filter is not even tried. I actually find a bug in my code and I will add test and fix it, the code should be if (len != pi->len) while (len > 0 && pi->match[len - 1]) len—; if (len > 0 && pi->match[len - 1] == ‘/‘) len—; > We typically indent the heredoc text to the same level as the command. Got it, will fix. > >> + EOF >> + test_bloom_filters_used "-- \:\(attr\:text\)A" && >> + rm .gitattributes > > You can use `test_when_finished" instead to clean up after yourself even > in case the test fails. I see — I never knew how to deal with the issue where one failing test case causes all subsequent tests to fail. Thanks, Lidong ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-07 6:49 ` Patrick Steinhardt 2025-08-07 8:59 ` Lidong Yan @ 2025-08-07 16:15 ` Junio C Hamano 2025-08-08 6:40 ` Lidong Yan 1 sibling, 1 reply; 16+ messages in thread From: Junio C Hamano @ 2025-08-07 16:15 UTC (permalink / raw) To: Patrick Steinhardt; +Cc: Lidong Yan, git, stolee, ttaylorr Patrick Steinhardt <ps@pks.im> writes: > On Thu, Aug 07, 2025 at 01:12:43PM +0800, Lidong Yan wrote: > > In the subject it should be s/enable bloom/enable Bloom. > >> When traversing commits, a pathspec item can be used to limit the >> traversal to commits that modify the specified paths. And the >> commit-graph includes a Bloom filter to exclude commits that definitely >> did not modify a given pathspec item. During commit traversal, the >> Bloom filter can significantly improve performance. However, it is >> disabled if the specified pathspec item contains wildcard characters >> or magic signatures. > > Let's add a paragraph here, as we now switch into the "what is being > done mode". I agree, a paragraph break is very welcome here. I was confused when I read this first that you are suggesting to add a new paragraph with unspecified contents. >> Also Enable Bloom filter if magic signature is not "exclude" or >> "icase". > > This explains what is done, but not why this is safe to do. And for that, the enumeration needs to be done inclusively, e.g. "we know :(literal) is OK because we can do X; we know :(glob) is OK because we can do Y". The numbers are impressive ;-) >> diff --git a/revision.c b/revision.c >> index 18f300d455..ef8c0b6eca 100644 >> --- a/revision.c >> +++ b/revision.c >> @@ -671,12 +671,13 @@ static void trace2_bloom_filter_statistics_atexit(void) >> >> static int forbid_bloom_filters(struct pathspec *spec) >> { >> - if (spec->has_wildcard) >> - return 1; >> - if (spec->magic & ~PATHSPEC_LITERAL) >> + int forbid_mask = > > The mask should be `unsigned`. > >> + PATHSPEC_EXCLUDE | PATHSPEC_ICASE; > > I think instead of a forbid-mask we should use an allow-mask. Otherwise > it can happen quite easily that we add new magic that isn't compatible > with Bloom filters but forget to update this part here. I'd rather be > slow but correct than fast but incorrect. Good suggestion. From the use of &-, it is obvious it is a mask, but "allow-mask" is not clear among what kind of things some are allowed, so how about calling it: unsigned allowed_magic; Thanks. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-07 16:15 ` Junio C Hamano @ 2025-08-08 6:40 ` Lidong Yan 0 siblings, 0 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-08 6:40 UTC (permalink / raw) To: Junio C Hamano; +Cc: Patrick Steinhardt, git, stolee, ttaylorr Junio C Hamano <gitster@pobox.com> writes: > > The numbers are impressive ;-) The bad news is that the Git versions I tested are 2.39 and 2.51, and I used git commit-graph write --split to build the commit graph, which resulted in a large value for filter_not_present in trace.perf. The good news is that after rebuilding the commit-graph, I repeated the experiment on HEAD and HEAD~1, and the results on HEAD were still better. Anyway, I will update the experimental results in the next patch. Thanks, Lidong ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v2] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-07 5:12 [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal Lidong Yan 2025-08-07 6:49 ` Patrick Steinhardt @ 2025-08-08 6:58 ` Lidong Yan 2025-08-08 15:50 ` Junio C Hamano 1 sibling, 1 reply; 16+ messages in thread From: Lidong Yan @ 2025-08-08 6:58 UTC (permalink / raw) To: yldhome2d2; +Cc: git, gitster, stolee, ttaylorr When traversing commits, a pathspec item can be used to limit the traversal to commits that modify the specified paths. And the commit-graph includes a Bloom filter to exclude commits that definitely did not modify a given pathspec item. During commit traversal, the Bloom filter can significantly improve performance. However, it is disabled if the specified pathspec item contains wildcard characters or magic signatures. For performance reason, enable Bloom filter even if a pathspec item contains wildcard characters by filtering only the non-wildcard part of the pathspec item. The function of pathspec magic signature is generally to narrow down the path specified by the pathspecs. So, enable Bloom filter when the magic signature is "top", "glob", "attr", "--depth" or "literal". "exclude" is used to select paths other than the specified path, rather than serving as a filtering function, so it cannot be used together with the Bloom filter. Since Bloom filter is not case insensitive even in case insensitive system (e.g. MacOS), it cannot be used together with "icase" magic. With this optimization, we get some improvements for pathspecs with wildcards or magic signatures. First, in the Git repository we see these modest results: git log -100 -- "t/*" Benchmark 1: new Time (mean ± σ): 20.4 ms ± 0.6 ms Range (min … max): 19.3 ms … 24.4 ms Benchmark 2: old Time (mean ± σ): 23.4 ms ± 0.5 ms Range (min … max): 22.5 ms … 24.7 ms git log -100 -- ":(top)t" Benchmark 1: new Time (mean ± σ): 16.2 ms ± 0.4 ms Range (min … max): 15.3 ms … 17.2 ms Benchmark 2: old Time (mean ± σ): 18.6 ms ± 0.5 ms Range (min … max): 17.6 ms … 20.4 ms But in a larger repo, such as the LLVM project repo below, we get even better results: git log -100 -- "libc/*" Benchmark 1: new Time (mean ± σ): 16.0 ms ± 0.6 ms Range (min … max): 14.7 ms … 17.8 ms Benchmark 2: old Time (mean ± σ): 26.7 ms ± 0.5 ms Range (min … max): 25.4 ms … 27.8 ms git log -100 -- ":(top)libc" Benchmark 1: new Time (mean ± σ): 15.6 ms ± 0.6 ms Range (min … max): 14.4 ms … 17.7 ms Benchmark 2: old Time (mean ± σ): 19.6 ms ± 0.5 ms Range (min … max): 18.6 ms … 20.6 ms Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> --- revision.c | 30 ++++++++++++++++++++++++------ t/t4216-log-bloom.sh | 31 +++++++++++++++++++++++++++---- 2 files changed, 51 insertions(+), 10 deletions(-) diff --git a/revision.c b/revision.c index 18f300d455..2a5b98390e 100644 --- a/revision.c +++ b/revision.c @@ -671,12 +671,17 @@ static void trace2_bloom_filter_statistics_atexit(void) static int forbid_bloom_filters(struct pathspec *spec) { - if (spec->has_wildcard) - return 1; - if (spec->magic & ~PATHSPEC_LITERAL) + unsigned int allowed_magic = + PATHSPEC_FROMTOP | + PATHSPEC_MAXDEPTH | + PATHSPEC_LITERAL | + PATHSPEC_GLOB | + PATHSPEC_ATTR; + + if (spec->magic & ~allowed_magic) return 1; for (size_t nr = 0; nr < spec->nr; nr++) - if (spec->items[nr].magic & ~PATHSPEC_LITERAL) + if (spec->items[nr].magic & ~allowed_magic) return 1; return 0; @@ -693,9 +698,22 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, size_t len; int res = 0; + len = pi->nowildcard_len; + if (len != pi->len) { + /* + * for path like "/dir/file*", nowildcard part would be + * "/dir/file", but only "/dir" should be used for the + * bloom filter + */ + while (len > 0 && pi->match[len - 1] != '/') + len--; + } /* remove single trailing slash from path, if needed */ - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { - path_alloc = xmemdupz(pi->match, pi->len - 1); + if (len > 0 && pi->match[len - 1] == '/') + len--; + + if (len != pi->len) { + path_alloc = xmemdupz(pi->match, len); path = path_alloc; } else path = pi->match; diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index 639868ac56..1064990de3 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -154,11 +154,34 @@ test_expect_success 'git log with multiple literal paths uses Bloom filter' ' test_bloom_filters_used "-- file*" ' -test_expect_success 'git log with path contains a wildcard does not use Bloom filter' ' +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' ' + test_bloom_filters_used "-- A/\* file4" && + test_bloom_filters_used "-- A/file\*" && + test_bloom_filters_used "-- * A/\*" +' + +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' ' test_bloom_filters_not_used "-- file\*" && - test_bloom_filters_not_used "-- A/\* file4" && - test_bloom_filters_not_used "-- file4 A/\*" && - test_bloom_filters_not_used "-- * A/\*" + test_bloom_filters_not_used "-- file\* A/\*" && + test_bloom_filters_not_used "-- file\* *" && + test_bloom_filters_not_used "-- \*" +' + +test_expect_success 'git log with path contains various magic signatures' ' + cd A && + test_bloom_filters_used "-- \:\(top\)B" && + cd .. && + + test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" && + test_bloom_filters_not_used "-- \:\(icase\)FILE4" && + test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" && + + test_when_finished "rm -f .gitattributes" && + cat >.gitattributes <<-EOF && + A/file1 text + A/B/file2 -text + EOF + test_bloom_filters_used "-- \:\(attr\:text\)A" ' test_expect_success 'setup - add commit-graph to the chain without Bloom filters' ' -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-08 6:58 ` [PATCH v2] " Lidong Yan @ 2025-08-08 15:50 ` Junio C Hamano 2025-08-09 2:06 ` Lidong Yan 2025-08-09 2:16 ` [PATCH v3] " Lidong Yan 0 siblings, 2 replies; 16+ messages in thread From: Junio C Hamano @ 2025-08-08 15:50 UTC (permalink / raw) To: Lidong Yan; +Cc: git, stolee, ttaylorr Lidong Yan <yldhome2d2@gmail.com> writes: > static int forbid_bloom_filters(struct pathspec *spec) > { > - if (spec->has_wildcard) > - return 1; > - if (spec->magic & ~PATHSPEC_LITERAL) > + unsigned int allowed_magic = > + PATHSPEC_FROMTOP | > + PATHSPEC_MAXDEPTH | > + PATHSPEC_LITERAL | > + PATHSPEC_GLOB | > + PATHSPEC_ATTR; > + > + if (spec->magic & ~allowed_magic) > return 1; > for (size_t nr = 0; nr < spec->nr; nr++) > - if (spec->items[nr].magic & ~PATHSPEC_LITERAL) > + if (spec->items[nr].magic & ~allowed_magic) > return 1; Quite straight-forward and easy to see that this is a simple enhancement of the existing code. Good. > @@ -693,9 +698,22 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, > size_t len; > int res = 0; > > + len = pi->nowildcard_len; > + if (len != pi->len) { > + /* > + * for path like "/dir/file*", nowildcard part would be > + * "/dir/file", but only "/dir" should be used for the Leading "/" makes it look as if the pathspec element can begin with a slash, but it can not, can it? > + * bloom filter > + */ > + while (len > 0 && pi->match[len - 1] != '/') > + len--; In a tree that has both "builtin/" directory and "builtin.h" file, what pathspec_element do we get here when we run $ git log "builtin*" We need to be able to catch commits that touch anything in "builtin/" and also anything at the top-level that begins with "builtin", like "builtin.h" and other things that might have existed ever in the history. I think len goes down to 0 and that is the correct behaviour. The code does deal with the case where len is reduced down to zero, but a bit poorly. It would allocate a zero-length string, then realize it frees it, and returns -1. As it must know how long the resulting path is before it allocated, and by definition len is pi->len if it did not have to allocate, it should be able to take the "goto cleanup" code path before it even attempts to allocate, and it should not have to do strlen(). But the current code is not incorrect. > + } > /* remove single trailing slash from path, if needed */ > - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { > - path_alloc = xmemdupz(pi->match, pi->len - 1); > + if (len > 0 && pi->match[len - 1] == '/') > + len--; > + > + if (len != pi->len) { > + path_alloc = xmemdupz(pi->match, len); > path = path_alloc; > } else > path = pi->match; > +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' ' > + test_bloom_filters_used "-- A/\* file4" && We'd ask the Bloom filter: skip commits that you know that can never be touching either "A/(anything)" or "file4". > + test_bloom_filters_used "-- A/file\*" && We'd ask the Bloom filter: skip commits that you know that can never be touching either "A/". > + test_bloom_filters_used "-- * A/\*" What do we ask? The second one says that a commit that might touch "A/(something)" is worth investigating, but what about the first one? Ahhh, that one is not quoted, so the shell expands to existing files (which presumably do not have any wildcard characters). OK. If the lone * were quoted, we shouldn't be using the Bloom filter. > +' > + > +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' ' > test_bloom_filters_not_used "-- file\*" && OK, this is exactly the case I wondered about in the above wrt "builtin*" and I agree with the expectation of this test. > - test_bloom_filters_not_used "-- A/\* file4" && > - test_bloom_filters_not_used "-- file4 A/\*" && > - test_bloom_filters_not_used "-- * A/\*" > + test_bloom_filters_not_used "-- file\* A/\*" && This one I understand. The first one reduces len down to 0 in the wildcard stripping loop, and makes us say "a commit that might touch any path is worth investigating", which amounts to the same thing as not using the Bloom filter at all. > + test_bloom_filters_not_used "-- file\* *" && Ditto. Even if the unquoted * may expand to many concrete paths, "file\*" that can match any path that begins with "file" that ever have existed in the history would not help skipping any commit. > + test_bloom_filters_not_used "-- \*" Ditto. > +' > + > +test_expect_success 'git log with path contains various magic signatures' ' > + cd A && > + test_bloom_filters_used "-- \:\(top\)B" && > + cd .. && > + > + test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" && > + test_bloom_filters_not_used "-- \:\(icase\)FILE4" && > + test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" && > + > + test_when_finished "rm -f .gitattributes" && > + cat >.gitattributes <<-EOF && > + A/file1 text > + A/B/file2 -text > + EOF > + test_bloom_filters_used "-- \:\(attr\:text\)A" > ' OK. Good to see negative test cases here, which we sometimes forget to test when showing off our shiny new toy. Taking what I suggested above, here is a possible improvement. revision.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git i/revision.c w/revision.c index 2a5b98390e..2a92bdda84 100644 --- i/revision.c +++ w/revision.c @@ -696,14 +696,14 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, char *path_alloc = NULL; const char *path; size_t len; - int res = 0; + int res = -1; /* be pessimistic */ len = pi->nowildcard_len; if (len != pi->len) { /* - * for path like "/dir/file*", nowildcard part would be - * "/dir/file", but only "/dir" should be used for the - * bloom filter + * for path like "dir/file*", nowildcard part would be + * "dir/file", but only "dir" should be used for the + * bloom filter. */ while (len > 0 && pi->match[len - 1] != '/') len--; @@ -712,19 +712,17 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, if (len > 0 && pi->match[len - 1] == '/') len--; + if (!len) + goto cleanup; + if (len != pi->len) { path_alloc = xmemdupz(pi->match, len); path = path_alloc; } else path = pi->match; - len = strlen(path); - if (!len) { - res = -1; - goto cleanup; - } - *out = bloom_keyvec_new(path, len, settings); + res = 0; cleanup: free(path_alloc); ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-08 15:50 ` Junio C Hamano @ 2025-08-09 2:06 ` Lidong Yan 2025-08-09 2:16 ` [PATCH v3] " Lidong Yan 1 sibling, 0 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-09 2:06 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, stolee, ttaylorr Junio C Hamano <gitster@pobox.com> writes: > >> @@ -693,9 +698,22 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, >> size_t len; >> int res = 0; >> >> + len = pi->nowildcard_len; >> + if (len != pi->len) { >> + /* >> + * for path like "/dir/file*", nowildcard part would be >> + * "/dir/file", but only "/dir" should be used for the > > Leading "/" makes it look as if the pathspec element can begin with > a slash, but it can not, can it? Yes, seems like if we pass a absolute path "/path/to/repository/dir/file”, git will automatically move "/path/to/repository” (in setup.c abspath_part_inside_repo()) So I should remove leading slash in my comment. > Taking what I suggested above, here is a possible improvement. > > revision.c | 18 ++++++++---------- > 1 file changed, 8 insertions(+), 10 deletions(-) > > diff --git i/revision.c w/revision.c > index 2a5b98390e..2a92bdda84 100644 > --- i/revision.c > +++ w/revision.c > @@ -696,14 +696,14 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, > char *path_alloc = NULL; > const char *path; > size_t len; > - int res = 0; > + int res = -1; /* be pessimistic */ > > len = pi->nowildcard_len; > if (len != pi->len) { > /* > - * for path like "/dir/file*", nowildcard part would be > - * "/dir/file", but only "/dir" should be used for the > - * bloom filter > + * for path like "dir/file*", nowildcard part would be > + * "dir/file", but only "dir" should be used for the > + * bloom filter. > */ > while (len > 0 && pi->match[len - 1] != '/') > len--; > @@ -712,19 +712,17 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, > if (len > 0 && pi->match[len - 1] == '/') > len--; > > + if (!len) > + goto cleanup; > + > if (len != pi->len) { > path_alloc = xmemdupz(pi->match, len); > path = path_alloc; > } else > path = pi->match; > > - len = strlen(path); > - if (!len) { > - res = -1; > - goto cleanup; > - } > - > *out = bloom_keyvec_new(path, len, settings); > + res = 0; > > cleanup: > free(path_alloc); Thanks, I will apply this and add your signed-off. Lidong ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v3] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-08 15:50 ` Junio C Hamano 2025-08-09 2:06 ` Lidong Yan @ 2025-08-09 2:16 ` Lidong Yan 2025-08-09 4:22 ` [PATCH v4] " Lidong Yan 2025-08-09 23:51 ` [PATCH v3] " Junio C Hamano 1 sibling, 2 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-09 2:16 UTC (permalink / raw) To: gitster; +Cc: git, stolee, ttaylorr, yldhome2d2 When traversing commits, a pathspec item can be used to limit the traversal to commits that modify the specified paths. And the commit-graph includes a Bloom filter to exclude commits that definitely did not modify a given pathspec item. During commit traversal, the Bloom filter can significantly improve performance. However, it is disabled if the specified pathspec item contains wildcard characters or magic signatures. For performance reason, enable Bloom filter even if a pathspec item contains wildcard characters by filtering only the non-wildcard part of the pathspec item. The function of pathspec magic signature is generally to narrow down the path specified by the pathspecs. So, enable Bloom filter when the magic signature is "top", "glob", "attr", "--depth" or "literal". "exclude" is used to select paths other than the specified path, rather than serving as a filtering function, so it cannot be used together with the Bloom filter. Since Bloom filter is not case insensitive even in case insensitive system (e.g. MacOS), it cannot be used together with "icase" magic. With this optimization, we get some improvements for pathspecs with wildcards or magic signatures. First, in the Git repository we see these modest results: git log -100 -- "t/*" Benchmark 1: new Time (mean ± σ): 20.4 ms ± 0.6 ms Range (min … max): 19.3 ms … 24.4 ms Benchmark 2: old Time (mean ± σ): 23.4 ms ± 0.5 ms Range (min … max): 22.5 ms … 24.7 ms git log -100 -- ":(top)t" Benchmark 1: new Time (mean ± σ): 16.2 ms ± 0.4 ms Range (min … max): 15.3 ms … 17.2 ms Benchmark 2: old Time (mean ± σ): 18.6 ms ± 0.5 ms Range (min … max): 17.6 ms … 20.4 ms But in a larger repo, such as the LLVM project repo below, we get even better results: git log -100 -- "libc/*" Benchmark 1: new Time (mean ± σ): 16.0 ms ± 0.6 ms Range (min … max): 14.7 ms … 17.8 ms Benchmark 2: old Time (mean ± σ): 26.7 ms ± 0.5 ms Range (min … max): 25.4 ms … 27.8 ms git log -100 -- ":(top)libc" Benchmark 1: new Time (mean ± σ): 15.6 ms ± 0.6 ms Range (min … max): 14.4 ms … 17.7 ms Benchmark 2: old Time (mean ± σ): 19.6 ms ± 0.5 ms Range (min … max): 18.6 ms … 20.6 ms Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> [jc: avoid allocating zero length path in convert_pathspec_to_bloom_keyvec()] Signed-off-by: Junio C Hamano <gitster@pobox.com> --- revision.c | 37 +++++++++++++++++++++++++++---------- t/t4216-log-bloom.sh | 31 +++++++++++++++++++++++++++---- 2 files changed, 54 insertions(+), 14 deletions(-) diff --git a/revision.c b/revision.c index 18f300d455..386c52aba1 100644 --- a/revision.c +++ b/revision.c @@ -671,12 +671,17 @@ static void trace2_bloom_filter_statistics_atexit(void) static int forbid_bloom_filters(struct pathspec *spec) { - if (spec->has_wildcard) - return 1; - if (spec->magic & ~PATHSPEC_LITERAL) + unsigned int allowed_magic = + PATHSPEC_FROMTOP | + PATHSPEC_MAXDEPTH | + PATHSPEC_LITERAL | + PATHSPEC_GLOB | + PATHSPEC_ATTR; + + if (spec->magic & ~allowed_magic) return 1; for (size_t nr = 0; nr < spec->nr; nr++) - if (spec->items[nr].magic & ~PATHSPEC_LITERAL) + if (spec->items[nr].magic & ~allowed_magic) return 1; return 0; @@ -693,19 +698,31 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, size_t len; int res = 0; + len = pi->nowildcard_len; + if (len != pi->len) { + /* + * for path like "dir/file*", nowildcard part would be + * "dir/file", but only "dir" should be used for the + * bloom filter + */ + while (len > 0 && pi->match[len - 1] != '/') + len--; + } /* remove single trailing slash from path, if needed */ - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { - path_alloc = xmemdupz(pi->match, pi->len - 1); - path = path_alloc; - } else - path = pi->match; + if (len > 0 && pi->match[len - 1] == '/') + len--; - len = strlen(path); if (!len) { res = -1; goto cleanup; } + if (len != pi->len) { + path_alloc = xmemdupz(pi->match, len); + path = path_alloc; + } else + path = pi->match; + *out = bloom_keyvec_new(path, len, settings); cleanup: diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index 639868ac56..1064990de3 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -154,11 +154,34 @@ test_expect_success 'git log with multiple literal paths uses Bloom filter' ' test_bloom_filters_used "-- file*" ' -test_expect_success 'git log with path contains a wildcard does not use Bloom filter' ' +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' ' + test_bloom_filters_used "-- A/\* file4" && + test_bloom_filters_used "-- A/file\*" && + test_bloom_filters_used "-- * A/\*" +' + +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' ' test_bloom_filters_not_used "-- file\*" && - test_bloom_filters_not_used "-- A/\* file4" && - test_bloom_filters_not_used "-- file4 A/\*" && - test_bloom_filters_not_used "-- * A/\*" + test_bloom_filters_not_used "-- file\* A/\*" && + test_bloom_filters_not_used "-- file\* *" && + test_bloom_filters_not_used "-- \*" +' + +test_expect_success 'git log with path contains various magic signatures' ' + cd A && + test_bloom_filters_used "-- \:\(top\)B" && + cd .. && + + test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" && + test_bloom_filters_not_used "-- \:\(icase\)FILE4" && + test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" && + + test_when_finished "rm -f .gitattributes" && + cat >.gitattributes <<-EOF && + A/file1 text + A/B/file2 -text + EOF + test_bloom_filters_used "-- \:\(attr\:text\)A" ' test_expect_success 'setup - add commit-graph to the chain without Bloom filters' ' -- 2.51.0.rc0.49.gea0dd4b6b4.dirty ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v4] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-09 2:16 ` [PATCH v3] " Lidong Yan @ 2025-08-09 4:22 ` Lidong Yan 2025-08-09 7:40 ` Lidong Yan 2025-08-11 6:01 ` [PATCH v5] " Lidong Yan 2025-08-09 23:51 ` [PATCH v3] " Junio C Hamano 1 sibling, 2 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-09 4:22 UTC (permalink / raw) To: yldhome2d2; +Cc: git, gitster, stolee, ttaylorr When traversing commits, a pathspec item can be used to limit the traversal to commits that modify the specified paths. And the commit-graph includes a Bloom filter to exclude commits that definitely did not modify a given pathspec item. During commit traversal, the Bloom filter can significantly improve performance. However, it is disabled if the specified pathspec item contains wildcard characters or magic signatures. For performance reason, enable Bloom filter even if a pathspec item contains wildcard characters by filtering only the non-wildcard part of the pathspec item. The function of pathspec magic signature is generally to narrow down the path specified by the pathspecs. So, enable Bloom filter when the magic signature is "top", "glob", "attr", "--depth" or "literal". "exclude" is used to select paths other than the specified path, rather than serving as a filtering function, so it cannot be used together with the Bloom filter. Since Bloom filter is not case insensitive even in case insensitive system (e.g. MacOS), it cannot be used together with "icase" magic. With this optimization, we get some improvements for pathspecs with wildcards or magic signatures. First, in the Git repository we see these modest results: git log -100 -- "t/*" Benchmark 1: new Time (mean ± σ): 20.4 ms ± 0.6 ms Range (min … max): 19.3 ms … 24.4 ms Benchmark 2: old Time (mean ± σ): 23.4 ms ± 0.5 ms Range (min … max): 22.5 ms … 24.7 ms git log -100 -- ":(top)t" Benchmark 1: new Time (mean ± σ): 16.2 ms ± 0.4 ms Range (min … max): 15.3 ms … 17.2 ms Benchmark 2: old Time (mean ± σ): 18.6 ms ± 0.5 ms Range (min … max): 17.6 ms … 20.4 ms But in a larger repo, such as the LLVM project repo below, we get even better results: git log -100 -- "libc/*" Benchmark 1: new Time (mean ± σ): 16.0 ms ± 0.6 ms Range (min … max): 14.7 ms … 17.8 ms Benchmark 2: old Time (mean ± σ): 26.7 ms ± 0.5 ms Range (min … max): 25.4 ms … 27.8 ms git log -100 -- ":(top)libc" Benchmark 1: new Time (mean ± σ): 15.6 ms ± 0.6 ms Range (min … max): 14.4 ms … 17.7 ms Benchmark 2: old Time (mean ± σ): 19.6 ms ± 0.5 ms Range (min … max): 18.6 ms … 20.6 ms Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> [jc: avoid allocating zero length path in convert_pathspec_to_bloom_keyvec()] Signed-off-by: Junio C Hamano <gitster@pobox.com> --- revision.c | 45 +++++++++++++++++++++++++++----------------- t/t4216-log-bloom.sh | 31 ++++++++++++++++++++++++++---- 2 files changed, 55 insertions(+), 21 deletions(-) diff --git a/revision.c b/revision.c index 18f300d455..79372fd483 100644 --- a/revision.c +++ b/revision.c @@ -671,12 +671,17 @@ static void trace2_bloom_filter_statistics_atexit(void) static int forbid_bloom_filters(struct pathspec *spec) { - if (spec->has_wildcard) - return 1; - if (spec->magic & ~PATHSPEC_LITERAL) + unsigned int allowed_magic = + PATHSPEC_FROMTOP | + PATHSPEC_MAXDEPTH | + PATHSPEC_LITERAL | + PATHSPEC_GLOB | + PATHSPEC_ATTR; + + if (spec->magic & ~allowed_magic) return 1; for (size_t nr = 0; nr < spec->nr; nr++) - if (spec->items[nr].magic & ~PATHSPEC_LITERAL) + if (spec->items[nr].magic & ~allowed_magic) return 1; return 0; @@ -691,26 +696,32 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, char *path_alloc = NULL; const char *path; size_t len; - int res = 0; + len = pi->nowildcard_len; + if (len != pi->len) { + /* + * for path like "dir/file*", nowildcard part would be + * "dir/file", but only "dir" should be used for the + * bloom filter + */ + while (len > 0 && pi->match[len - 1] != '/') + len--; + } /* remove single trailing slash from path, if needed */ - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { - path_alloc = xmemdupz(pi->match, pi->len - 1); + if (len > 0 && pi->match[len - 1] == '/') + len--; + + if (!len) + return -1; + + if (len != pi->len) { + path_alloc = xmemdupz(pi->match, len); path = path_alloc; } else path = pi->match; - len = strlen(path); - if (!len) { - res = -1; - goto cleanup; - } - *out = bloom_keyvec_new(path, len, settings); - -cleanup: - free(path_alloc); - return res; + return 0; } static void prepare_to_use_bloom_filter(struct rev_info *revs) diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index 639868ac56..1064990de3 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -154,11 +154,34 @@ test_expect_success 'git log with multiple literal paths uses Bloom filter' ' test_bloom_filters_used "-- file*" ' -test_expect_success 'git log with path contains a wildcard does not use Bloom filter' ' +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' ' + test_bloom_filters_used "-- A/\* file4" && + test_bloom_filters_used "-- A/file\*" && + test_bloom_filters_used "-- * A/\*" +' + +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' ' test_bloom_filters_not_used "-- file\*" && - test_bloom_filters_not_used "-- A/\* file4" && - test_bloom_filters_not_used "-- file4 A/\*" && - test_bloom_filters_not_used "-- * A/\*" + test_bloom_filters_not_used "-- file\* A/\*" && + test_bloom_filters_not_used "-- file\* *" && + test_bloom_filters_not_used "-- \*" +' + +test_expect_success 'git log with path contains various magic signatures' ' + cd A && + test_bloom_filters_used "-- \:\(top\)B" && + cd .. && + + test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" && + test_bloom_filters_not_used "-- \:\(icase\)FILE4" && + test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" && + + test_when_finished "rm -f .gitattributes" && + cat >.gitattributes <<-EOF && + A/file1 text + A/B/file2 -text + EOF + test_bloom_filters_used "-- \:\(attr\:text\)A" ' test_expect_success 'setup - add commit-graph to the chain without Bloom filters' ' -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v4] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-09 4:22 ` [PATCH v4] " Lidong Yan @ 2025-08-09 7:40 ` Lidong Yan 2025-08-11 6:01 ` [PATCH v5] " Lidong Yan 1 sibling, 0 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-09 7:40 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, gitster, stolee, ttaylorr Lidong Yan <yldhome2d2@gmail.com> writes: > > When traversing commits, a pathspec item can be used to limit the > traversal to commits that modify the specified paths. And the > commit-graph includes a Bloom filter to exclude commits that definitely > did not modify a given pathspec item. During commit traversal, the > Bloom filter can significantly improve performance. However, it is > disabled if the specified pathspec item contains wildcard characters > or magic signatures. > > For performance reason, enable Bloom filter even if a pathspec item > contains wildcard characters by filtering only the non-wildcard part of > the pathspec item. > > The function of pathspec magic signature is generally to narrow down > the path specified by the pathspecs. So, enable Bloom filter when > the magic signature is "top", "glob", "attr", "--depth" or "literal". > "exclude" is used to select paths other than the specified path, rather > than serving as a filtering function, so it cannot be used together with > the Bloom filter. Since Bloom filter is not case insensitive even in > case insensitive system (e.g. MacOS), it cannot be used together with > "icase" magic. > > With this optimization, we get some improvements for pathspecs with > wildcards or magic signatures. First, in the Git repository we see these > modest results: > > git log -100 -- "t/*" > > Benchmark 1: new > Time (mean ± σ): 20.4 ms ± 0.6 ms > Range (min … max): 19.3 ms … 24.4 ms > > Benchmark 2: old > Time (mean ± σ): 23.4 ms ± 0.5 ms > Range (min … max): 22.5 ms … 24.7 ms > > git log -100 -- ":(top)t" > > Benchmark 1: new > Time (mean ± σ): 16.2 ms ± 0.4 ms > Range (min … max): 15.3 ms … 17.2 ms > > Benchmark 2: old > Time (mean ± σ): 18.6 ms ± 0.5 ms > Range (min … max): 17.6 ms … 20.4 ms > > But in a larger repo, such as the LLVM project repo below, we get even > better results: > > git log -100 -- "libc/*" > > Benchmark 1: new > Time (mean ± σ): 16.0 ms ± 0.6 ms > Range (min … max): 14.7 ms … 17.8 ms > > Benchmark 2: old > Time (mean ± σ): 26.7 ms ± 0.5 ms > Range (min … max): 25.4 ms … 27.8 ms > > git log -100 -- ":(top)libc" > > Benchmark 1: new > Time (mean ± σ): 15.6 ms ± 0.6 ms > Range (min … max): 14.4 ms … 17.7 ms > > Benchmark 2: old > Time (mean ± σ): 19.6 ms ± 0.5 ms > Range (min … max): 18.6 ms … 20.6 ms > > Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> > [jc: avoid allocating zero length path in > convert_pathspec_to_bloom_keyvec()] > Signed-off-by: Junio C Hamano <gitster@pobox.com> > --- > revision.c | 45 +++++++++++++++++++++++++++----------------- > t/t4216-log-bloom.sh | 31 ++++++++++++++++++++++++++---- > 2 files changed, 55 insertions(+), 21 deletions(-) > > diff --git a/revision.c b/revision.c > index 18f300d455..79372fd483 100644 > --- a/revision.c > +++ b/revision.c > @@ -671,12 +671,17 @@ static void trace2_bloom_filter_statistics_atexit(void) > > static int forbid_bloom_filters(struct pathspec *spec) > { > - if (spec->has_wildcard) > - return 1; > - if (spec->magic & ~PATHSPEC_LITERAL) > + unsigned int allowed_magic = > + PATHSPEC_FROMTOP | > + PATHSPEC_MAXDEPTH | > + PATHSPEC_LITERAL | > + PATHSPEC_GLOB | > + PATHSPEC_ATTR; > + > + if (spec->magic & ~allowed_magic) > return 1; > for (size_t nr = 0; nr < spec->nr; nr++) > - if (spec->items[nr].magic & ~PATHSPEC_LITERAL) > + if (spec->items[nr].magic & ~allowed_magic) > return 1; > > return 0; > @@ -691,26 +696,32 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, > char *path_alloc = NULL; > const char *path; > size_t len; > - int res = 0; > > + len = pi->nowildcard_len; > + if (len != pi->len) { > + /* > + * for path like "dir/file*", nowildcard part would be > + * "dir/file", but only "dir" should be used for the > + * bloom filter > + */ > + while (len > 0 && pi->match[len - 1] != '/') > + len--; > + } > /* remove single trailing slash from path, if needed */ > - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { > - path_alloc = xmemdupz(pi->match, pi->len - 1); > + if (len > 0 && pi->match[len - 1] == '/') > + len--; > + > + if (!len) > + return -1; > + > + if (len != pi->len) { > + path_alloc = xmemdupz(pi->match, len); > path = path_alloc; > } else > path = pi->match; > > - len = strlen(path); > - if (!len) { > - res = -1; > - goto cleanup; > - } > - > *out = bloom_keyvec_new(path, len, settings); > - > -cleanup: > - free(path_alloc); > - return res; > + return 0; > } I realized that I shouldn’t delete free(path_alloc) part in the patch. I wonder if Junio could help remove that part from the patch. Thanks, Lidong ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v5] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-09 4:22 ` [PATCH v4] " Lidong Yan 2025-08-09 7:40 ` Lidong Yan @ 2025-08-11 6:01 ` Lidong Yan 2025-08-11 15:56 ` Junio C Hamano 1 sibling, 1 reply; 16+ messages in thread From: Lidong Yan @ 2025-08-11 6:01 UTC (permalink / raw) To: yldhome2d2; +Cc: git, gitster, stolee, ttaylorr When traversing commits, a pathspec item can be used to limit the traversal to commits that modify the specified paths. And the commit-graph includes a Bloom filter to exclude commits that definitely did not modify a given pathspec item. During commit traversal, the Bloom filter can significantly improve performance. However, it is disabled if the specified pathspec item contains wildcard characters or magic signatures. For performance reason, enable Bloom filter even if a pathspec item contains wildcard characters by filtering only the non-wildcard part of the pathspec item. The function of pathspec magic signature is generally to narrow down the path specified by the pathspecs. So, enable Bloom filter when the magic signature is "top", "glob", "attr", "--depth" or "literal". "exclude" is used to select paths other than the specified path, rather than serving as a filtering function, so it cannot be used together with the Bloom filter. Since Bloom filter is not case insensitive even in case insensitive system (e.g. MacOS), it cannot be used together with "icase" magic. With this optimization, we get some improvements for pathspecs with wildcards or magic signatures. First, in the Git repository we see these modest results: git log -100 -- "t/*" Benchmark 1: new Time (mean ± σ): 20.4 ms ± 0.6 ms Range (min … max): 19.3 ms … 24.4 ms Benchmark 2: old Time (mean ± σ): 23.4 ms ± 0.5 ms Range (min … max): 22.5 ms … 24.7 ms git log -100 -- ":(top)t" Benchmark 1: new Time (mean ± σ): 16.2 ms ± 0.4 ms Range (min … max): 15.3 ms … 17.2 ms Benchmark 2: old Time (mean ± σ): 18.6 ms ± 0.5 ms Range (min … max): 17.6 ms … 20.4 ms But in a larger repo, such as the LLVM project repo below, we get even better results: git log -100 -- "libc/*" Benchmark 1: new Time (mean ± σ): 16.0 ms ± 0.6 ms Range (min … max): 14.7 ms … 17.8 ms Benchmark 2: old Time (mean ± σ): 26.7 ms ± 0.5 ms Range (min … max): 25.4 ms … 27.8 ms git log -100 -- ":(top)libc" Benchmark 1: new Time (mean ± σ): 15.6 ms ± 0.6 ms Range (min … max): 14.4 ms … 17.7 ms Benchmark 2: old Time (mean ± σ): 19.6 ms ± 0.5 ms Range (min … max): 18.6 ms … 20.6 ms Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> [jc: avoid allocating zero length path in convert_pathspec_to_bloom_keyvec()] Signed-off-by: Junio C Hamano <gitster@pobox.com> --- revision.c | 42 +++++++++++++++++++++++++++++------------- t/t4216-log-bloom.sh | 31 +++++++++++++++++++++++++++---- 2 files changed, 56 insertions(+), 17 deletions(-) diff --git a/revision.c b/revision.c index 18f300d455..7449064def 100644 --- a/revision.c +++ b/revision.c @@ -671,12 +671,17 @@ static void trace2_bloom_filter_statistics_atexit(void) static int forbid_bloom_filters(struct pathspec *spec) { - if (spec->has_wildcard) - return 1; - if (spec->magic & ~PATHSPEC_LITERAL) + unsigned int allowed_magic = + PATHSPEC_FROMTOP | + PATHSPEC_MAXDEPTH | + PATHSPEC_LITERAL | + PATHSPEC_GLOB | + PATHSPEC_ATTR; + + if (spec->magic & ~allowed_magic) return 1; for (size_t nr = 0; nr < spec->nr; nr++) - if (spec->items[nr].magic & ~PATHSPEC_LITERAL) + if (spec->items[nr].magic & ~allowed_magic) return 1; return 0; @@ -691,23 +696,34 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, char *path_alloc = NULL; const char *path; size_t len; - int res = 0; + int res = -1; + len = pi->nowildcard_len; + if (len != pi->len) { + /* + * for path like "dir/file*", nowildcard part would be + * "dir/file", but only "dir" should be used for the + * bloom filter. + */ + while (len > 0 && pi->match[len - 1] != '/') + len--; + } /* remove single trailing slash from path, if needed */ - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { - path_alloc = xmemdupz(pi->match, pi->len - 1); + if (len > 0 && pi->match[len - 1] == '/') + len--; + + if (!len) + goto cleanup; + + if (len != pi->len) { + path_alloc = xmemdupz(pi->match, len); path = path_alloc; } else path = pi->match; - len = strlen(path); - if (!len) { - res = -1; - goto cleanup; - } - *out = bloom_keyvec_new(path, len, settings); + res = 0; cleanup: free(path_alloc); return res; diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index 639868ac56..1064990de3 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -154,11 +154,34 @@ test_expect_success 'git log with multiple literal paths uses Bloom filter' ' test_bloom_filters_used "-- file*" ' -test_expect_success 'git log with path contains a wildcard does not use Bloom filter' ' +test_expect_success 'git log with paths all contain non-wildcard part uses Bloom filter' ' + test_bloom_filters_used "-- A/\* file4" && + test_bloom_filters_used "-- A/file\*" && + test_bloom_filters_used "-- * A/\*" +' + +test_expect_success 'git log with path only contains wildcard part does not use Bloom filter' ' test_bloom_filters_not_used "-- file\*" && - test_bloom_filters_not_used "-- A/\* file4" && - test_bloom_filters_not_used "-- file4 A/\*" && - test_bloom_filters_not_used "-- * A/\*" + test_bloom_filters_not_used "-- file\* A/\*" && + test_bloom_filters_not_used "-- file\* *" && + test_bloom_filters_not_used "-- \*" +' + +test_expect_success 'git log with path contains various magic signatures' ' + cd A && + test_bloom_filters_used "-- \:\(top\)B" && + cd .. && + + test_bloom_filters_used "-- \:\(glob\)A/\*\*/C" && + test_bloom_filters_not_used "-- \:\(icase\)FILE4" && + test_bloom_filters_not_used "-- \:\(exclude\)A/B/C" && + + test_when_finished "rm -f .gitattributes" && + cat >.gitattributes <<-EOF && + A/file1 text + A/B/file2 -text + EOF + test_bloom_filters_used "-- \:\(attr\:text\)A" ' test_expect_success 'setup - add commit-graph to the chain without Bloom filters' ' -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v5] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-11 6:01 ` [PATCH v5] " Lidong Yan @ 2025-08-11 15:56 ` Junio C Hamano 2025-08-11 16:08 ` Lidong Yan 0 siblings, 1 reply; 16+ messages in thread From: Junio C Hamano @ 2025-08-11 15:56 UTC (permalink / raw) To: Lidong Yan; +Cc: git, stolee, ttaylorr Lidong Yan <yldhome2d2@gmail.com> writes: > Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> > [jc: avoid allocating zero length path in > convert_pathspec_to_bloom_keyvec()] > Signed-off-by: Junio C Hamano <gitster@pobox.com> Instead just do Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> here. [who: comment] followed by a sign-off from that person is done by the person who is signing off the tweak, not by the original author. No need to resend; I'll fix it up locally. Thanks. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v5] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-11 15:56 ` Junio C Hamano @ 2025-08-11 16:08 ` Lidong Yan 0 siblings, 0 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-11 16:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, stolee, ttaylorr Junio C Hamano <gitster@pobox.com> writes: > > Lidong Yan <yldhome2d2@gmail.com> writes: > >> Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> >> [jc: avoid allocating zero length path in >> convert_pathspec_to_bloom_keyvec()] >> Signed-off-by: Junio C Hamano <gitster@pobox.com> > > Instead just do > > Helped-by: Junio C Hamano <gitster@pobox.com> > Signed-off-by: Lidong Yan <yldhome2d2@gmail.com> > > here. [who: comment] followed by a sign-off from that person is > done by the person who is signing off the tweak, not by the original > author. I see — when someone makes additions to another person’s commit, they’ll also modify the log message in the process. Thanks, Lidong ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-09 2:16 ` [PATCH v3] " Lidong Yan 2025-08-09 4:22 ` [PATCH v4] " Lidong Yan @ 2025-08-09 23:51 ` Junio C Hamano 2025-08-10 1:57 ` Lidong Yan 1 sibling, 1 reply; 16+ messages in thread From: Junio C Hamano @ 2025-08-09 23:51 UTC (permalink / raw) To: Lidong Yan; +Cc: git, stolee, ttaylorr Lidong Yan <yldhome2d2@gmail.com> writes: > [jc: avoid allocating zero length path in > convert_pathspec_to_bloom_keyvec()] This is different from what I did, though. > @@ -693,19 +698,31 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, > size_t len; > int res = 0; > > + len = pi->nowildcard_len; > + if (len != pi->len) { > + /* > + * for path like "dir/file*", nowildcard part would be > + * "dir/file", but only "dir" should be used for the > + * bloom filter > + */ A missing full-stop. > + while (len > 0 && pi->match[len - 1] != '/') > + len--; > + } > /* remove single trailing slash from path, if needed */ > - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { > - path_alloc = xmemdupz(pi->match, pi->len - 1); > - path = path_alloc; > - } else > - path = pi->match; > + if (len > 0 && pi->match[len - 1] == '/') > + len--; > > - len = strlen(path); > if (!len) { > res = -1; > goto cleanup; > } > > + if (len != pi->len) { > + path_alloc = xmemdupz(pi->match, len); > + path = path_alloc; > + } else > + path = pi->match; > + > *out = bloom_keyvec_new(path, len, settings); > > cleanup: Two comments. * For a function that finds an error condition in the middle and jumps to the "cleanup:" label at the end, it is more future-proof to start pessimistic (i.e. initialize 'res' to error(-1)) and flip 'res' to success(0) at the very end when everything went well. It would simplify the change necessary when we need to add _more_ early error return code paths to the function in the future. But this flip from "assume success" to "assume failure" is something that should be not be done as part of this patch; perhaps doing it a separate preliminary clean-up patch is a better way to do so. * I think the change from v3 (this one) to v4 makes the function worse; we found that it is a good practice to have a single place to release any resources we temporarily acquired and arrange exception handling code to just jump there during the course of this project. The current implementation may happen to have only one such early return (i.e. "len has become 0; we realize that we cannot use the Bloom filter"), but adding a new early return in the future would be easier if you kept the original arrangement. The new early return condition may have to be computed after we have acquired resources we need to release, so it may need more than a simple "return -1". ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] bloom: enable bloom filter with wildcard pathspec in revision traversal 2025-08-09 23:51 ` [PATCH v3] " Junio C Hamano @ 2025-08-10 1:57 ` Lidong Yan 0 siblings, 0 replies; 16+ messages in thread From: Lidong Yan @ 2025-08-10 1:57 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, stolee, ttaylorr Junio C Hamano <gitster@pobox.com> writes: > > Lidong Yan <yldhome2d2@gmail.com> writes: > >> [jc: avoid allocating zero length path in >> convert_pathspec_to_bloom_keyvec()] > > This is different from what I did, though. Sorry, I don’t fully understand what you mean — should I remove this line or rewrite it? > >> @@ -693,19 +698,31 @@ static int convert_pathspec_to_bloom_keyvec(struct bloom_keyvec **out, >> size_t len; >> int res = 0; >> >> + len = pi->nowildcard_len; >> + if (len != pi->len) { >> + /* >> + * for path like "dir/file*", nowildcard part would be >> + * "dir/file", but only "dir" should be used for the >> + * bloom filter >> + */ > > A missing full-stop. Will fix. > >> + while (len > 0 && pi->match[len - 1] != '/') >> + len--; >> + } >> /* remove single trailing slash from path, if needed */ >> - if (pi->len > 0 && pi->match[pi->len - 1] == '/') { >> - path_alloc = xmemdupz(pi->match, pi->len - 1); >> - path = path_alloc; >> - } else >> - path = pi->match; >> + if (len > 0 && pi->match[len - 1] == '/') >> + len--; >> >> - len = strlen(path); >> if (!len) { >> res = -1; >> goto cleanup; >> } >> >> + if (len != pi->len) { >> + path_alloc = xmemdupz(pi->match, len); >> + path = path_alloc; >> + } else >> + path = pi->match; >> + >> *out = bloom_keyvec_new(path, len, settings); >> >> cleanup: > > Two comments. > > * For a function that finds an error condition in the middle and > jumps to the "cleanup:" label at the end, it is more future-proof > to start pessimistic (i.e. initialize 'res' to error(-1)) and > flip 'res' to success(0) at the very end when everything went > well. It would simplify the change necessary when we need to add > _more_ early error return code paths to the function in the > future. > > But this flip from "assume success" to "assume failure" is > something that should be not be done as part of this patch; > perhaps doing it a separate preliminary clean-up patch is a > better way to do so. Ah, I’ve always thought that `ret = -1; goto cleanup;` was a kind of the everybody-should-use pattern. So when I saw you write `int ret = -1;` first, I was a bit puzzled. Now I understand what you mean, and I’ll add a cleanup patch. > > * I think the change from v3 (this one) to v4 makes the function > worse; we found that it is a good practice to have a single place > to release any resources we temporarily acquired and arrange > exception handling code to just jump there during the course of > this project. > > The current implementation may happen to have only one such early > return (i.e. "len has become 0; we realize that we cannot use the > Bloom filter"), but adding a new early return in the future would > be easier if you kept the original arrangement. The new early > return condition may have to be computed after we have acquired > resources we need to release, so it may need more than a simple > "return -1”. Understand. I was thinking about less code is better. I will add the cleanup part back. Thanks, Lidong ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2025-08-11 16:08 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-07 5:12 [PATCH] bloom: enable bloom filter with wildcard pathspec in revision traversal Lidong Yan 2025-08-07 6:49 ` Patrick Steinhardt 2025-08-07 8:59 ` Lidong Yan 2025-08-07 16:15 ` Junio C Hamano 2025-08-08 6:40 ` Lidong Yan 2025-08-08 6:58 ` [PATCH v2] " Lidong Yan 2025-08-08 15:50 ` Junio C Hamano 2025-08-09 2:06 ` Lidong Yan 2025-08-09 2:16 ` [PATCH v3] " Lidong Yan 2025-08-09 4:22 ` [PATCH v4] " Lidong Yan 2025-08-09 7:40 ` Lidong Yan 2025-08-11 6:01 ` [PATCH v5] " Lidong Yan 2025-08-11 15:56 ` Junio C Hamano 2025-08-11 16:08 ` Lidong Yan 2025-08-09 23:51 ` [PATCH v3] " Junio C Hamano 2025-08-10 1:57 ` Lidong Yan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).