Re: [PATCH] ref-filter: reuse --contains traversal results

From: Jeff King <peff@peff.net>
To: Tamir Duberstein <tamird@gmail.com>
Cc: git@vger.kernel.org, Karthik Nayak <karthik.188@gmail.com>,
	Junio C Hamano <gitster@pobox.com>,
	Victoria Dye <vdye@github.com>, Derrick Stolee <stolee@gmail.com>,
	Elijah Newren <newren@gmail.com>,
	Kristofer Karlsson <krka@spotify.com>
Subject: Re: [PATCH] ref-filter: reuse --contains traversal results
Date: Mon, 8 Jun 2026 18:34:30 -0400
Message-ID: <20260608223430.GA340696@coredump.intra.peff.net>
In-Reply-To: <20260607-ref-filter-memoized-contains-v1-1-a1972dde9c76@gmail.com>

On Sun, Jun 07, 2026 at 08:33:29PM -0700, Tamir Duberstein wrote:

> git branch and git for-each-ref call repo_is_descendant_of() for each
> candidate selected by --contains or --no-contains. Each call starts a
> new graph walk, so refs with shared history repeatedly traverse the same
> commits.
> 
> ffc4b8012d (tag: speed up --contains calculation, 2011-06-11) introduced
> the tag traversal that caches positive and negative answers across
> candidates. ee2bd06b0f (ref-filter: implement '--contains' option,
> 2015-07-07) preserved the branch and tag implementations when ref-filter
> learned --contains. 008ed7df930 (tag.c: use the correct algorithm for
> the '--contains' option, 2015-10-18) noted that they should be unified.
> 
> Use the memoized traversal for every ref-filter contains check and
> remove the implementation selector. The cache records answers for one
> fixed target list, so document that callers must clear it before
> changing the list.

The subject line obfuscated the intent here (at least for me). I think a
clearer subject would just be: "ref-filter: always use
contains_tag_algo" or something.

But more importantly, I think the analysis above is missing a key point
about why we didn't make the tag algo the default in the first place: it
is depth first, and thus slower when the merge base can be found quickly
by the breadth-first traversal. For tags, you tend to have to look at
all of history anyway (because you have at least one old tag that
requires walking back that far), but that is often not true for
branches.

We are able to get the best of both worlds if we can cut off the
depth-first traversal early using generation numbers.
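
To make the cutoff concrete, here is a toy Python sketch (not git's C
code). It assumes a single target commit and a hypothetical `generation`
field computed as one more than the largest parent generation;
`make_commit`, `contains`, and the `walked` set are names invented for
the illustration. Recursion keeps it short, so it only suits small
graphs.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    name: str
    parents: list = field(default_factory=list)
    generation: int = 0  # hypothetical: 1 + max(parent generations)


def make_commit(name, parents=()):
    """Create a commit and derive its generation number from its parents."""
    gen = 1 + max((p.generation for p in parents), default=0)
    return Commit(name, list(parents), gen)


def contains(commit, target, cache, walked, use_generations=True):
    """Memoized depth-first check: is `target` reachable from `commit`?

    `cache` (name -> bool) is shared across candidates, so it is only
    valid for one fixed target. `walked` records commits actually examined.
    """
    if commit.name in cache:
        return cache[commit.name]
    walked.add(commit.name)
    if commit is target:
        result = True
    elif use_generations and commit.generation <= target.generation:
        # An ancestor always has a strictly smaller generation than its
        # descendants, so this commit cannot reach the target.
        result = False
    else:
        result = any(
            contains(p, target, cache, walked, use_generations)
            for p in commit.parents
        )
    cache[commit.name] = result
    return result
```

On a ten-commit chain c0..c9 with the target at c7, a side branch forked
from c2 is answered after examining one commit with the cutoff, and four
without it.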

So I think a better rule here is to tweak the selection in
commit_contains() to select the depth-first algorithm when we have
generation numbers enabled. There's a patch in an old thread, which was
revived a week or two ago by Kristofer (cc'd):

  https://lore.kernel.org/git/20260527070510.3510836-1-krka@spotify.com/

> The memoized depth-first walk assumes acyclic ancestry, but replacement
> refs can create cycles. Track commits while they are on the walk. If a
> cycle is found, discard partial cache entries and use
> repo_is_descendant_of() for that candidate.

I can believe that the depth-first code doesn't handle cycles well. But
if that's the case, then it's already a problem for "git tag
--contains". And we should fix it as a separate patch from enabling that
algorithm in more cases.

I'm not quite sure how ancestry should be defined in a cycle. How does
the algorithm behave now when it sees a cycle? If it loops infinitely,
we definitely would want to fix that. If not, then to some degree I
don't care too much what answer is provided, since the input is somewhat
nonsense in the first place. And if it is expensive to track, it might
not be worth inflicting that penalty on the sane cases. But it looks
like your solution is just setting an extra flag value in the slab,
which should be pretty cheap.
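
A minimal sketch of that kind of marker, in Python rather than git's C
and with a plain dict standing in for the commit slab. All names here
(`State`, `CycleFound`, `contains_memo`, `contains_with_fallback`,
`plain_reachable`) are hypothetical; the fallback mirrors the behavior
the quoted patch describes (drop partial entries, answer with an
uncached walk), not its actual code.

```python
from collections import deque
from enum import Enum


class State(Enum):
    ON_WALK = 1  # currently on the depth-first path
    NO = 2
    YES = 3


class CycleFound(Exception):
    pass


def contains_memo(commit, target, parents, state):
    """Memoized depth-first walk; raises CycleFound on a back edge."""
    s = state.get(commit)
    if s is State.ON_WALK:
        raise CycleFound(commit)
    if s is not None:
        return s is State.YES
    if commit == target:
        state[commit] = State.YES
        return True
    state[commit] = State.ON_WALK
    found = any(
        contains_memo(p, target, parents, state)
        for p in parents.get(commit, [])
    )
    state[commit] = State.YES if found else State.NO
    return found


def plain_reachable(start, target, parents):
    """Uncached breadth-first walk; a seen set makes it safe on cycles."""
    seen, queue = {start}, deque([start])
    while queue:
        c = queue.popleft()
        if c == target:
            return True
        for p in parents.get(c, []):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return False


def contains_with_fallback(candidate, target, parents, state):
    """Use the memoized walk; on a cycle, restore the cache and fall back."""
    snapshot = dict(state)
    try:
        return contains_memo(candidate, target, parents, state)
    except CycleFound:
        state.clear()
        state.update(snapshot)  # discard entries written by the failed walk
        return plain_reachable(candidate, target, parents)
```

The only extra cost on acyclic history is one more state value per
commit; the snapshot copy is a shortcut for the sketch and would not be
how a real slab is rolled back.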

> The branch and for-each-ref path passed repo_is_descendant_of() through
> a Boolean interface. In configurations where it returned -1 for missing
> ancestry, ref-filter treated the error as "contains". The memoized path
> instead fails when ancestry cannot be parsed, as git tag already did.
> During review of the 2018 reachability series, making parse failures
> fatal was explicitly deferred because that series was intended to
> preserve behavior. Unifying the implementations now makes all callers
> fail consistently instead of preserving that accidental Boolean
> interpretation.

I think that's a good outcome.

> The added p1500 case uses up to 8,192 packed refs along one first-parent
> history. It improves from 0.68 to 0.03 seconds.
> 
> On a checkout with 62,174 remote-tracking refs, I ran:
> 
>     hyperfine --warmup 0 --runs 3 \
>         --command-name parent \
>         '"$parent" branch -r --contains c78ae85f3ce7e >/dev/null' \
>         --command-name this-commit \
>         '"$this" branch -r --contains c78ae85f3ce7e >/dev/null'
> 
> The results were:
> 
>              parent       this commit
>   elapsed    104.365 s     467.7 ms
>   user        93.702 s     220.2 ms
>   system       0.723 s     182.7 ms

I didn't time it, but the probable regression case is something like
this: a very deep history with a small number of branches diverging only
a few commits away. Without a commit-graph file (or with one that lacks
generation numbers), that probably makes "git branch --contains" slower.
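
The shape of that case can be shown with a toy Python model that counts
visited commits. It is a sketch under assumptions: `two_sided_visits` is
a simplified date-ordered walk from both commits that stops once
everything queued is past a common ancestor (loosely in the spirit of a
merge-base search, not git's actual code), `depth_first_visits` is a
depth-first walk with no generation cutoff and no cache, and integer
`date` values stand in for commit dates.

```python
import heapq

P1, P2, STALE = 1, 2, 4  # reached from target, reached from tip, past a common ancestor


def build_history(depth, fork_back, branch_len):
    """Linear chain c0..c{depth-1} plus a short side branch s0..s{branch_len-1}."""
    parents, date = {}, {}
    for i in range(depth):
        parents[f"c{i}"] = [f"c{i - 1}"] if i else []
        date[f"c{i}"] = i
    target = f"c{depth - 1}"
    prev = f"c{depth - 1 - fork_back}"
    for j in range(branch_len):
        name = f"s{j}"
        parents[name] = [prev]
        date[name] = date[prev] + 1
        prev = name
    return parents, date, target, prev


def depth_first_visits(tip, target, parents):
    """Walk from tip until target is found or history runs out."""
    seen, stack = set(), [tip]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        if c == target:
            return True, len(seen)
        stack.extend(parents[c])
    return False, len(seen)


def two_sided_visits(tip, target, parents, date):
    """Paint from both commits, newest first; stop when only stale entries remain."""
    flags = {target: P1}
    flags[tip] = flags.get(tip, 0) | P2
    heap = [(-date[c], c) for c in flags]
    heapq.heapify(heap)
    popped = 0
    while any(not (flags[c] & STALE) for _, c in heap):
        _, c = heapq.heappop(heap)
        popped += 1
        f = flags[c]
        if (f & (P1 | P2)) == (P1 | P2):
            f |= STALE  # common ancestor: everything below it is uninteresting
            flags[c] = f
        for p in parents[c]:
            if (flags.get(p, 0) & f) == f:
                continue
            flags[p] = flags.get(p, 0) | f
            heapq.heappush(heap, (-date[p], p))
    return bool(flags[target] & P2), popped
```

With a 1,000-commit chain, the target at the tip, and a two-commit
branch forked three commits back, both walks answer "does not contain",
but the two-sided walk pops 6 commits while the depth-first walk touches
999.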

-Peff

Thread overview: 7+ messages
2026-06-08  3:33 [PATCH] ref-filter: reuse --contains traversal results Tamir Duberstein
2026-06-08 21:18 ` Karthik Nayak
2026-06-08 22:30   ` Tamir Duberstein
2026-06-08 22:34 ` Jeff King [this message]
2026-06-08 23:35   ` Tamir Duberstein
2026-06-08 23:52     ` Jeff King
2026-06-08 23:56       ` Tamir Duberstein
