From: Derrick Stolee <stolee@gmail.com>
To: Kristofer Karlsson <krka@spotify.com>, git@vger.kernel.org
Subject: Re: [RFC] commit-reach: terminate merge-base walk when one paint side is exhausted
Date: Fri, 12 Jun 2026 08:52:50 -0400 [thread overview]
Message-ID: <0b3f7429-a4fb-4f7a-bf7b-5a0edeb1db52@gmail.com> (raw)
In-Reply-To: <CAL71e4Mp7ewv0UGS8j=iTq6quyxLXzrr0uNDbWR8JKaOsTSVyA@mail.gmail.com>
On 6/12/2026 7:15 AM, Kristofer Karlsson wrote:
> I have an idea to make it faster for fetching all merge-bases for
> common flows in large repos, as long as the commit graph is
> reasonably up to date.
>
> The key part is the exit condition in paint_down_to_common.
> Instead of waiting for the queue to only contain stale entries,
> it is enough to wait for one of the sides to be exhausted,
> i.e. side 1 is exhausted if no more commits exist in the
> traversal queue flagged with only PARENT1. For example, if
> the two sides are origin/HEAD and a small PR branch, the PR
> branch will quickly become exhausted at the merge-base, while
> the main side will continue.
Generally, you'd replace the queue_has_nonstale() condition
with a more generic queue_can_halt() condition.
> Now you may ask: why is that a safe condition?
>
> The traversal in paint_down_to_common has two logical phases
> due to the priority queue ordering:
>
> 1. Process all commits with infinite generation numbers.
> This includes all commits when there is no commit-graph.
> 2. Process all commits with finite generation numbers.
>
> These happen in strict order -- all INFINITY commits are popped
> before any finite-generation commit.
>
> The optimization only applies after the walk enters the second phase.
> In the first phase, the traversal behaves exactly as today
> and uses the existing termination condition.
This would mean that queue_can_halt() would need to know the
following:
1. If we peek at the top of the queue, is that a commit with
infinite generation number? If so, then we can only use
queue_has_nonstale().
2. Otherwise, we know that all commits in the queue are ordered
topologically and can use a different, faster check. To start,
we need to keep going as long as at least one commit has only
one side
> In the second phase, traversal follows strict topological
> order -- descendants are processed before ancestors. Paint flags
> propagate from each processed commit to its parents, which have
> strictly lower generation and are therefore not yet examined.
>
> A new merge-base candidate can only form when a PARENT1-only path
> meets a PARENT2-only path. Once a commit acquires both paint flags
> in this phase, any descendant carrying both paint flags would
> already have been processed.
>
> Once one side is exhausted from the queue, no new meeting between
> pure sides can occur. Any commit that subsequently acquires both
> paint flags must inherit them from a commit that already had both
> flags -- it is deeper in the graph and cannot affect the final
> merge-base set. We can stop.
The important thing to realize at this point is that commits in the
queue have received flags from their children (children were walked
topologically and "push" flags to their parents).
The STALE bit is pushed from commits that have bits for both sides
of the merge. This isn't something that we can learn from just
walking each side: we need some amount of walking within the
intersection.
This doesn't matter if we are looking for a single merge base, but
when we want the full set of independent merge bases, then the STALE
bit becomes very important.
> On a large monorepo with previously expensive merge-base and
> merge-tree queries, I observed speedups ranging from roughly 300x
> to 1000x. The nice thing is that this works for merge-base --all
> and every internal caller of paint_down_to_common -- we no longer
> have to restrict the optimization to finding just the first merge-base.
>
> Does the correctness argument above hold?
I think that it doesn't work when trying to get all merge bases. It
requires
A X
/| __/|
| |/ |
| B |
| | |
..........
| | __/
\|/
C
In this example, B can reach C through some long list of commits.
This makes B (and X) have much higher generation number than C.
After exhausting both sides of A...X, we have B and C in the queue
with both side bits and neither are stale. But we need to walk
from B to C to discover that C should be stale.
> Happy to come back with a patch later if the logic holds and the
> overall approach is wanted.
You are identifying a point where optimizations are possible, based
on your measurements of the time spent in this walk waiting for
queue_has_nonstale() to end the loop. Specifically, the cost of using
a BFS approach is costing time.
One place that I would recommend here is to take the work you are
doing to investigate the behavior of tips_reachable_from_bases()
or get_reachable_subset() to see if we can use a DFS-based approach
_in this case_ where we have exhausted both side and are only caring
about the STALE bit checking these cases.
Remember that the DFS idea only helps in the case where we find a
path between commits (B to C in this case) without walking all of the
commits above the minimum generation (generation of C). In an alternate
case where B and C are truly independent, this would not save any time.
But these "they are mutually unreachable" cases always require walking
the full set based on the generation number. The good news is that the
vast majority of cases do not actually have multiple independent merge
bases, so there is potential here.
Thanks,
-Stolee
next prev parent reply other threads:[~2026-06-12 12:52 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-12 11:15 [RFC] commit-reach: terminate merge-base walk when one paint side is exhausted Kristofer Karlsson
2026-06-12 12:52 ` Derrick Stolee [this message]
2026-06-12 14:32 ` Kristofer Karlsson
2026-06-12 15:04 ` Derrick Stolee
2026-06-12 15:21 ` Kristofer Karlsson
2026-06-12 15:48 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0b3f7429-a4fb-4f7a-bf7b-5a0edeb1db52@gmail.com \
--to=stolee@gmail.com \
--cc=git@vger.kernel.org \
--cc=krka@spotify.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox