* Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph
@ 2025-06-23 17:58 Kai Koponen
2025-06-23 18:04 ` Kai Koponen
2025-06-23 19:36 ` Junio C Hamano
0 siblings, 2 replies; 7+ messages in thread
From: Kai Koponen @ 2025-06-23 17:58 UTC (permalink / raw)
To: git, Kai Koponen
Reproduce steps:
```
git clone https://github.com/golang/go.git
cd go
git config core.commitGraph true
git commit-graph write --split --reachable --changed-paths # Without
this, all calls equally slow (~1s)
time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
src/clean.bash > /dev/null # ~90ms
time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
src/Make.dist > /dev/null # ~100ms
time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
src/clean.bash src/Make.dist > /dev/null # ~650ms
```
The rev-list call with multiple paths takes over 3x longer than the
sum of individual calls to it for the same files.
Expectation: rev-list with multiple paths should take <= the sum of
the time it takes to call it with each path individually (ideally <,
since with the count limit it should be able to early-exit and search
less commits for either path).
Also reproduces without the -10 arg, or with a lower count (double
instead of triple w/ -1), but these results are perhaps most
surprising with a count present.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph
2025-06-23 17:58 Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph Kai Koponen
@ 2025-06-23 18:04 ` Kai Koponen
2025-06-23 19:36 ` Junio C Hamano
1 sibling, 0 replies; 7+ messages in thread
From: Kai Koponen @ 2025-06-23 18:04 UTC (permalink / raw)
To: git, Kai Koponen
Re: rev-list perf bug, some `git bugreport` version information:
[System Info]
git version:
git version 2.50.0.714.g196bf9f422-goog
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
libcurl: 8.13.0
OpenSSL: OpenSSL 3.4.1 11 Feb 2025
zlib: 1.3.1
SHA-1: SHA1_DC
SHA-256: SHA256_BLK
compiler info: gnuc: 14.2
libc info: glibc: 2.41
On Mon, Jun 23, 2025 at 1:58 PM Kai Koponen <kaikoponen@google.com> wrote:
>
> Reproduce steps:
> ```
> git clone https://github.com/golang/go.git
> cd go
> git config core.commitGraph true
> git commit-graph write --split --reachable --changed-paths # Without
> this, all calls equally slow (~1s)
> time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> src/clean.bash > /dev/null # ~90ms
> time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> src/Make.dist > /dev/null # ~100ms
> time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> src/clean.bash src/Make.dist > /dev/null # ~650ms
> ```
>
> The rev-list call with multiple paths takes over 3x longer than the
> sum of individual calls to it for the same files.
>
> Expectation: rev-list with multiple paths should take <= the sum of
> the time it takes to call it with each path individually (ideally <,
> since with the count limit it should be able to early-exit and search
> less commits for either path).
>
> Also reproduces without the -10 arg, or with a lower count (double
> instead of triple w/ -1), but these results are perhaps most
> surprising with a count present.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph
2025-06-23 17:58 Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph Kai Koponen
2025-06-23 18:04 ` Kai Koponen
@ 2025-06-23 19:36 ` Junio C Hamano
2025-06-23 20:19 ` Kai Koponen
1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2025-06-23 19:36 UTC (permalink / raw)
To: Kai Koponen; +Cc: git
Kai Koponen <kaikoponen@google.com> writes:
> Reproduce steps:
> ```
> git clone https://github.com/golang/go.git
> cd go
> git config core.commitGraph true
> git commit-graph write --split --reachable --changed-paths # Without
> this, all calls equally slow (~1s)
> time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> src/clean.bash > /dev/null # ~90ms
> time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> src/Make.dist > /dev/null # ~100ms
> time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> src/clean.bash src/Make.dist > /dev/null # ~650ms
> ```
>
> The rev-list call with multiple paths takes over 3x longer than the
> sum of individual calls to it for the same files.
>
> Expectation: rev-list with multiple paths should take <= the sum of
> the time it takes to call it with each path individually (ideally <,
> since with the count limit it should be able to early-exit and search
> less commits for either path).
>
> Also reproduces without the -10 arg, or with a lower count (double
> instead of triple w/ -1), but these results are perhaps most
> surprising with a count present.
I asked
How does "git log -- path" use the changed-paths bloom filter
stored in the commit-graph file?
to https://deepwiki.com/git/git (there is a text field in the bottom
of the page), and an early part of its answer explains why in a
fairly convincing way ;-)
When you run git log -- path, Git first prepares to use bloom
filters in the prepare_to_use_bloom_filter function. This function:
1. Validates the pathspec - It calls forbid_bloom_filters to check
if bloom filters can be used revision.c:674-686 . Bloom filters
are disabled for wildcards, multiple paths, or complex pathspec
magic.
...
In short, the changed-path filter is used only when following
pathspec with a single element that is not a wildcard. So the
observed result is (unfortunately) quite expected.
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph
2025-06-23 19:36 ` Junio C Hamano
@ 2025-06-23 20:19 ` Kai Koponen
2025-06-23 21:00 ` Junio C Hamano
0 siblings, 1 reply; 7+ messages in thread
From: Kai Koponen @ 2025-06-23 20:19 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
I see, more of a perf FR than a bug then.
I don't have much expertise here, but on the surface of it, it doesn't
seem to me like there would be any reason the algorithm couldn't check
each path's bloom filter in turn while searching, other than that this
would be a large and annoying change.
On Mon, Jun 23, 2025 at 3:36 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Kai Koponen <kaikoponen@google.com> writes:
>
> > Reproduce steps:
> > ```
> > git clone https://github.com/golang/go.git
> > cd go
> > git config core.commitGraph true
> > git commit-graph write --split --reachable --changed-paths # Without
> > this, all calls equally slow (~1s)
> > time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> > src/clean.bash > /dev/null # ~90ms
> > time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> > src/Make.dist > /dev/null # ~100ms
> > time git rev-list -10 3730814f2f2bf24550920c39a16841583de2dac1 --
> > src/clean.bash src/Make.dist > /dev/null # ~650ms
> > ```
> >
> > The rev-list call with multiple paths takes over 3x longer than the
> > sum of individual calls to it for the same files.
> >
> > Expectation: rev-list with multiple paths should take <= the sum of
> > the time it takes to call it with each path individually (ideally <,
> > since with the count limit it should be able to early-exit and search
> > less commits for either path).
> >
> > Also reproduces without the -10 arg, or with a lower count (double
> > instead of triple w/ -1), but these results are perhaps most
> > surprising with a count present.
>
> I asked
>
> How does "git log -- path" use the changed-paths bloom filter
> stored in the commit-graph file?
>
> to https://deepwiki.com/git/git (there is a text field in the bottom
> of the page), and an early part of its answer explains why in a
> fairly convincing way ;-)
>
> When you run git log -- path, Git first prepares to use bloom
> filters in the prepare_to_use_bloom_filter function. This function:
>
> 1. Validates the pathspec - It calls forbid_bloom_filters to check
> if bloom filters can be used revision.c:674-686 . Bloom filters
> are disabled for wildcards, multiple paths, or complex pathspec
> magic.
>
> ...
>
> In short, the changed-path filter is used only when following
> pathspec with a single element that is not a wildcard. So the
> observed result is (unfortunately) quite expected.
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph
2025-06-23 20:19 ` Kai Koponen
@ 2025-06-23 21:00 ` Junio C Hamano
2025-06-24 3:16 ` Lidong Yan
0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2025-06-23 21:00 UTC (permalink / raw)
To: Kai Koponen; +Cc: git
Kai Koponen <kaikoponen@google.com> writes:
> I see, more of a perf FR than a bug then.
> I don't have much expertise here, but on the surface of it, it doesn't
> seem to me like there would be any reason the algorithm couldn't check
> each path's bloom filter in turn while searching, other than that this
> would be a large and annoying change.
It looks like that the necessary changes are probably fairly well
isolated to two functions, i.e., prepare_to_use_bloom_filter() and
forbid_bloom_filters(). Right now, for a pathspec that has one
element "dir/file", the code uses two bloom keys for "dir" and
"dir/file", but if we have "dir1/file1" as well, then it does look
like a matter of using two more (and the bloom_keys[] array is
designed to be variable length).
But those who have more intimate knowledge in the area than I do may
point out what is missing in my "it looks like" gut feeling.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph
2025-06-23 21:00 ` Junio C Hamano
@ 2025-06-24 3:16 ` Lidong Yan
2025-06-24 13:32 ` Junio C Hamano
0 siblings, 1 reply; 7+ messages in thread
From: Lidong Yan @ 2025-06-24 3:16 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Kai Koponen, git
Junio C Hamano <gitster@pobox.com> writes:
>
> Kai Koponen <kaikoponen@google.com> writes:
>
>> I see, more of a perf FR than a bug then.
>> I don't have much expertise here, but on the surface of it, it doesn't
>> seem to me like there would be any reason the algorithm couldn't check
>> each path's bloom filter in turn while searching, other than that this
>> would be a large and annoying change.
>
> It looks like that the necessary changes are probably fairly well
> isolated to two functions, i.e., prepare_to_use_bloom_filter() and
> forbid_bloom_filters(). Right now, for a pathspec that has one
> element "dir/file", the code uses two bloom keys for "dir" and
> "dir/file", but if we have "dir1/file1" as well, then it does look
> like a matter of using two more (and the bloom_keys[] array is
> designed to be variable length).
I believe the issue here is that revs->bloom_keys[] represents an
AND condition, whereas what we actually want is an OR. In Kai’s example,
we’re trying to identify commits that modified either src/Make.dist or
src/clean.bash. However, by adding src, Make.dist, and clean.bash to the
bloom_keys, we end up filtering for commits that modified all of these, rather
than any of them.
> But those who have more intimate knowledge in the area than I do may
> point out what is missing in my "it looks like" gut feeling.
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph
2025-06-24 3:16 ` Lidong Yan
@ 2025-06-24 13:32 ` Junio C Hamano
0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2025-06-24 13:32 UTC (permalink / raw)
To: Lidong Yan; +Cc: Kai Koponen, git
Lidong Yan <yldhome2d2@gmail.com> writes:
>> It looks like that the necessary changes are probably fairly well
>> isolated to two functions, i.e., prepare_to_use_bloom_filter() and
>> forbid_bloom_filters(). Right now, for a pathspec that has one
>> element "dir/file", the code uses two bloom keys for "dir" and
>> "dir/file", but if we have "dir1/file1" as well, then it does look
>> like a matter of using two more (and the bloom_keys[] array is
>> designed to be variable length).
>
> I believe the issue here is that revs->bloom_keys[] represents an
> AND condition, whereas what we actually want is an OR.
Yeah, you're right. bloom.c:bloom_filter_contains() is called repeatedly
by check_maybe_different_in_bloom_filter() to see if all the bloom_keys[]
appear to judge if it is possible that the path is changed by the commit.
So if we wanted to extend in the way we discussed in the message you
are respoinding to, revs->bloom_keys[] needs to become an array of
bloom_keys[], one for each literal pathspec element, and then we can
extend check_maybe_different_in_bloom_filter() to run the current
logic for each literal pathspec element, and combine the results by
ORing them. The way revision.c:release_revisions() releases the
bloom keys also need to be updated.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2025-06-24 13:32 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-23 17:58 Perf bug: rev-list w/ 2+ paths relatively slow with commit-graph Kai Koponen
2025-06-23 18:04 ` Kai Koponen
2025-06-23 19:36 ` Junio C Hamano
2025-06-23 20:19 ` Kai Koponen
2025-06-23 21:00 ` Junio C Hamano
2025-06-24 3:16 ` Lidong Yan
2025-06-24 13:32 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox