Git development
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Reuse --contains traversal results
@ 2026-06-09  2:36 Tamir Duberstein
  2026-06-09  2:36 ` [PATCH v2 1/2] commit-reach: handle cycles in contains walk Tamir Duberstein
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-09  2:36 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Tamir Duberstein

The memoized traversal used by git tag avoids repeating graph walks for
refs with shared history. Extend it to the other ref-filter users after
making the existing traversal safe for cycles introduced by replacement
refs.

Signed-off-by: Tamir Duberstein <tamird@gmail.com>

---
Changes in v2:
- Split cycle handling into a preparatory patch.
- Exercise cycle handling through the existing git tag path.
- Move perf result verification out of setup.
- Link to v1: https://patch.msgid.link/20260607-ref-filter-memoized-contains-v1-1-a1972dde9c76@gmail.com

---
Tamir Duberstein (2):
      commit-reach: handle cycles in contains walk
      ref-filter: memoize --contains with generations

 commit-reach.c                 | 43 ++++++++++++++++++++++++++++++------
 commit-reach.h                 | 10 ++++++++-
 t/perf/p1500-graph-walks.sh    | 49 +++++++++++++++++++++++++++++++++++++++++-
 t/t6301-for-each-ref-errors.sh | 22 +++++++++++++++++++
 t/t7004-tag.sh                 | 21 ++++++++++++++++++
 5 files changed, 137 insertions(+), 8 deletions(-)
---
base-commit: 9ac3f193c05c2237e2b14ebaa1149e9fc8a1abe0
change-id: 20260607-ref-filter-memoized-contains-7cb6b3bccad1

Best regards,
--  
Tamir Duberstein <tamird@gmail.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/2] commit-reach: handle cycles in contains walk
  2026-06-09  2:36 [PATCH v2 0/2] Reuse --contains traversal results Tamir Duberstein
@ 2026-06-09  2:36 ` Tamir Duberstein
  2026-06-11  7:29   ` Jeff King
  2026-06-09  2:36 ` [PATCH v2 2/2] ref-filter: memoize --contains with generations Tamir Duberstein
  2026-06-12  3:00 ` [PATCH v3 0/3] Reuse --contains traversal results Tamir Duberstein
  2 siblings, 1 reply; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-09  2:36 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Tamir Duberstein

git tag --contains uses a memoized traversal that assumes commit
ancestry is acyclic. Replacement refs can violate that assumption,
causing the traversal to revisit a commit already on its stack
indefinitely.

Mark commits while they are active. If the traversal encounters an
active commit, discard the cache because it cannot distinguish answers
produced by the interrupted walk. Then fall back to the cycle-safe
reachability walk for that candidate.

Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
 commit-reach.c | 30 ++++++++++++++++++++++++++----
 commit-reach.h |  3 ++-
 t/t7004-tag.sh | 21 +++++++++++++++++++++
 3 files changed, 49 insertions(+), 5 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 9b3ea46d6f..65b618959b 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -708,7 +708,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 
 /*
  * Test whether the candidate is contained in the list.
- * Do not recurse to find out, though, but return -1 if inconclusive.
+ * Do not recurse to find out, though, but return CONTAINS_UNKNOWN if
+ * inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
@@ -744,7 +745,7 @@ static void push_to_contains_stack(struct commit *candidate, struct contains_sta
 }
 
 static enum contains_result contains_tag_algo(struct commit *candidate,
-					      const struct commit_list *want,
+					      struct commit_list *want,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
@@ -765,6 +766,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
+	*contains_cache_at(cache, candidate) = CONTAINS_IN_PROGRESS;
 	push_to_contains_stack(candidate, &contains_stack);
 	while (contains_stack.nr) {
 		struct contains_stack_entry *entry = &contains_stack.contains_stack[contains_stack.nr - 1];
@@ -776,8 +778,8 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 			contains_stack.nr--;
 		}
 		/*
-		 * If we just popped the stack, parents->item has been marked,
-		 * therefore contains_test will return a meaningful yes/no.
+		 * A parent may have just been popped and marked, or may still
+		 * be active when replacement refs create a cycle.
 		 */
 		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
@@ -787,13 +789,33 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		case CONTAINS_NO:
 			entry->parents = parents->next;
 			break;
+		case CONTAINS_IN_PROGRESS:
+			/*
+			 * Partial negative answers are not safe across a cycle.
+			 * Discard them and use the cycle-safe reachability walk.
+			 */
+			goto cycle;
 		case CONTAINS_UNKNOWN:
+			*contains_cache_at(cache, parents->item) =
+				CONTAINS_IN_PROGRESS;
 			push_to_contains_stack(parents->item, &contains_stack);
 			break;
 		}
 	}
 	free(contains_stack.contains_stack);
 	return contains_test(candidate, want, cache, cutoff);
+
+cycle:
+	free(contains_stack.contains_stack);
+	clear_contains_cache(cache);
+	init_contains_cache(cache);
+
+	result = repo_is_descendant_of(the_repository, candidate, want);
+	if (result < 0)
+		exit(128);
+	*contains_cache_at(cache, candidate) =
+		result ? CONTAINS_YES : CONTAINS_NO;
+	return result ? CONTAINS_YES : CONTAINS_NO;
 }
 
 int commit_contains(struct ref_filter *filter, struct commit *commit,
diff --git a/commit-reach.h b/commit-reach.h
index 3f3a563d8a..f908d305b1 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -73,7 +73,8 @@ int ref_newer(const struct object_id *new_oid, const struct object_id *old_oid);
 enum contains_result {
 	CONTAINS_UNKNOWN = 0,
 	CONTAINS_NO,
-	CONTAINS_YES
+	CONTAINS_YES,
+	CONTAINS_IN_PROGRESS
 };
 
 define_commit_slab(contains_cache, enum contains_result);
diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
index d918005dd9..1ed91bb66e 100755
--- a/t/t7004-tag.sh
+++ b/t/t7004-tag.sh
@@ -1611,6 +1611,27 @@ test_expect_success 'checking that first commit is in all tags (hash)' '
 	test_cmp expected actual
 '
 
+test_expect_success 'tag --contains handles cyclic replacement histories' '
+	first=$(git rev-parse HEAD~2) &&
+	second=$(git rev-parse HEAD~) &&
+	third=$(git rev-parse HEAD) &&
+	test_when_finished "
+		git replace -d $first
+		git replace -d $third
+		git tag -d cycle-a cycle-b
+	" &&
+	git tag cycle-a "$first" &&
+	git tag cycle-b "$third" &&
+	git replace --graft "$first" "$third" "$second" &&
+	git replace --graft "$third" "$first" &&
+	cat >expected <<-\EOF &&
+	cycle-a
+	cycle-b
+	EOF
+	git tag --contains="$second" --list "cycle-*" >actual &&
+	test_cmp expected actual
+'
+
 # other ways of specifying the commit
 test_expect_success 'checking that first commit is in all tags (tag)' '
 	cat >expected <<-\EOF &&

-- 
2.54.0.501.g0fb508de08


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 2/2] ref-filter: memoize --contains with generations
  2026-06-09  2:36 [PATCH v2 0/2] Reuse --contains traversal results Tamir Duberstein
  2026-06-09  2:36 ` [PATCH v2 1/2] commit-reach: handle cycles in contains walk Tamir Duberstein
@ 2026-06-09  2:36 ` Tamir Duberstein
  2026-06-10 11:47   ` Karthik Nayak
  2026-06-11  8:22   ` Jeff King
  2026-06-12  3:00 ` [PATCH v3 0/3] Reuse --contains traversal results Tamir Duberstein
  2 siblings, 2 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-09  2:36 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Tamir Duberstein

git branch and git for-each-ref call repo_is_descendant_of() for
each candidate selected by --contains or --no-contains. Each call
starts a new graph walk, so refs with shared history repeatedly
traverse the same commits.

ffc4b8012d (tag: speed up --contains calculation, 2011-06-11)
introduced a depth-first walk for git tag that caches positive and
negative answers across candidates. ee2bd06b0f (ref-filter: implement
'--contains' option, 2015-07-07) preserved both implementations when
ref-filter learned --contains.

The memoized walk is not always faster. Without generation numbers,
a negative check can walk to the root even when the breadth-first
merge-base walk finds a nearby divergence. With generation numbers,
the depth-first walk can stop below the oldest target while still
reusing answers across candidates.

Keep the existing memoized selection for git tag. Select it for other
ref-filter callers when generation numbers are enabled, and retain
the breadth-first walk otherwise.

When generation numbers are unavailable, repo_is_descendant_of() can
return -1 if ancestry cannot be read. The ref-filter Boolean interface
treated that error as a match. Check it and exit instead. The memoized
path already dies on the same parse failure, so both selected paths now
fail rather than return a result.

Add p1500 cases for up to 8,192 packed refs along one first-parent
history and for sibling refs near the tip with generation numbers
forced off.

On a checkout with 62,174 remote-tracking refs and generation numbers
enabled, I ran:

    hyperfine --warmup 0 --runs 3 \
        --command-name parent \
        '"$parent" branch -r --contains c78ae85f3ce7e >/dev/null' \
        --command-name this-commit \
        '"$this" branch -r --contains c78ae85f3ce7e >/dev/null'

The results were:

             parent       this commit
  elapsed    104.365 s     467.7 ms
  user        93.702 s     220.2 ms
  system       0.723 s     182.7 ms

The wall-time standard deviations were 11.356 seconds and 133.8
milliseconds, respectively. Separate runs without redirection produced
the same output with SHA-256
2466f6e2b72aa16b1a2126eddb81c8a1b2764ee251204ac034c191a925aa896f.

Both revisions were built with the default -O2 flags using Apple
clang 21.0.0 on macOS 26.5. The machine was a MacBook Pro (Mac16,6)
with a 16-core Apple M4 Max (12 performance and four efficiency
cores) and 128 GB RAM.

Link: https://lore.kernel.org/git/1445163904-24611-1-git-send-email-Karthik.188@gmail.com/
Link: https://lore.kernel.org/r/20230324191009.GA536967@coredump.intra.peff.net
Link: https://lore.kernel.org/git/20260527070510.3510836-1-krka@spotify.com/
Link: https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
 commit-reach.c                 | 13 +++++++++--
 commit-reach.h                 |  7 ++++++
 t/perf/p1500-graph-walks.sh    | 49 +++++++++++++++++++++++++++++++++++++++++-
 t/t6301-for-each-ref-errors.sh | 22 +++++++++++++++++++
 4 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 65b618959b..83a48004ef 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -821,9 +821,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 int commit_contains(struct ref_filter *filter, struct commit *commit,
 		    struct commit_list *list, struct contains_cache *cache)
 {
-	if (filter->with_commit_tag_algo)
+	int result;
+
+	if (!list)
+		return 1;
+	if (filter->with_commit_tag_algo ||
+	    generation_numbers_enabled(the_repository))
 		return contains_tag_algo(commit, list, cache) == CONTAINS_YES;
-	return repo_is_descendant_of(the_repository, commit, list);
+
+	result = repo_is_descendant_of(the_repository, commit, list);
+	if (result < 0)
+		exit(128);
+	return result;
 }
 
 int can_all_from_reach_with_flag(struct object_array *from,
diff --git a/commit-reach.h b/commit-reach.h
index f908d305b1..da6796a354 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -79,6 +79,13 @@ enum contains_result {
 
 define_commit_slab(contains_cache, enum contains_result);
 
+/*
+ * Return whether "commit" is a descendant of any commit in "list". An empty
+ * list matches.
+ *
+ * The memoized traversal records answers in "cache" for one fixed "list".
+ * Clear it before changing the list.
+ */
 int commit_contains(struct ref_filter *filter, struct commit *commit,
 		    struct commit_list *list, struct contains_cache *cache);
 
diff --git a/t/perf/p1500-graph-walks.sh b/t/perf/p1500-graph-walks.sh
index 5b23ce5db9..99b54e274b 100755
--- a/t/perf/p1500-graph-walks.sh
+++ b/t/perf/p1500-graph-walks.sh
@@ -32,12 +32,47 @@ test_expect_success 'setup' '
 		echo "X:$line" >>test-tool-tags || return 1
 	done &&
 
-	commit=$(git commit-tree $(git rev-parse HEAD^{tree})) &&
+	git rev-list --first-parent --max-count=8192 HEAD >contains-commits &&
+	test_file_not_empty contains-commits &&
+	git update-ref refs/contains-perf-base "$(tail -n 1 contains-commits)" &&
+	awk "{
+		printf \"update refs/contains-perf/%04d %s\\n\", NR, \$1
+	}" contains-commits |
+		git update-ref --stdin &&
+	git pack-refs --include "refs/contains-perf/*" &&
+
+	tree=$(git rev-parse HEAD^{tree}) &&
+	base=$(git rev-parse HEAD) &&
+	target=$(echo target | git commit-tree "$tree" -p "$base") &&
+	git update-ref refs/contains-diverged/target "$target" &&
+	for i in $(test_seq 1 4)
+	do
+		commit=$(echo candidate-$i |
+			git commit-tree "$tree" -p "$base") &&
+		git update-ref refs/contains-diverged/candidate-$i "$commit" ||
+		return 1
+	done &&
+
+	commit=$(git commit-tree "$tree") &&
 	git update-ref refs/heads/disjoint-base $commit &&
 
 	git commit-graph write --reachable
 '
 
+test_expect_success 'verify contains results' '
+	git for-each-ref --contains=refs/contains-perf-base \
+		refs/contains-perf/ >actual &&
+	test_line_count = $(wc -l <contains-commits) actual &&
+
+	echo refs/contains-diverged/target >expect &&
+	GIT_TEST_COMMIT_GRAPH=0 \
+		git -c core.commitGraph=false for-each-ref \
+			--format="%(refname)" \
+			--contains=refs/contains-diverged/target \
+			refs/contains-diverged/ >actual &&
+	test_cmp expect actual
+'
+
 test_perf 'ahead-behind counts: git for-each-ref' '
 	git for-each-ref --format="%(ahead-behind:HEAD)" --stdin <refs
 '
@@ -62,6 +97,18 @@ test_perf 'contains: git tag --merged' '
 	xargs git tag --merged=HEAD <tags
 '
 
+test_perf 'contains: git for-each-ref --contains' '
+	git for-each-ref --contains=refs/contains-perf-base \
+		refs/contains-perf/ >/dev/null
+'
+
+test_perf 'contains without generations: divergent refs' '
+	GIT_TEST_COMMIT_GRAPH=0 \
+		git -c core.commitGraph=false for-each-ref \
+			--contains=refs/contains-diverged/target \
+			refs/contains-diverged/ >/dev/null
+'
+
 test_perf 'is-base check: test-tool reach (refs)' '
 	test-tool reach get_branch_base_for_tip <test-tool-refs
 '
diff --git a/t/t6301-for-each-ref-errors.sh b/t/t6301-for-each-ref-errors.sh
index e06feb06e9..72b27c8be3 100755
--- a/t/t6301-for-each-ref-errors.sh
+++ b/t/t6301-for-each-ref-errors.sh
@@ -52,6 +52,28 @@ test_expect_success 'Missing objects are reported correctly' '
 	test_must_be_empty brief-err
 '
 
+test_expect_success 'missing ancestors are reported by contains filters' '
+	test_when_finished "git update-ref -d refs/heads/missing-parent" &&
+	{
+		echo "tree $(git rev-parse HEAD^{tree})" &&
+		echo "parent $MISSING" &&
+		git cat-file commit HEAD |
+			sed -n -e "/^author /p" -e "/^committer /p" &&
+		echo &&
+		echo "missing parent"
+	} >commit &&
+	broken=$(git hash-object -t commit -w commit) &&
+	git update-ref refs/heads/missing-parent "$broken" &&
+	for option in --contains --no-contains
+	do
+		test_must_fail git for-each-ref "$option=HEAD" \
+			refs/heads/missing-parent >out 2>err &&
+		test_must_be_empty out &&
+		test_grep "parse commit $MISSING" err ||
+		return 1
+	done
+'
+
 test_expect_success 'ahead-behind requires an argument' '
 	test_must_fail git for-each-ref \
 		--format="%(ahead-behind)" 2>err &&

-- 
2.54.0.501.g0fb508de08


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/2] ref-filter: memoize --contains with generations
  2026-06-09  2:36 ` [PATCH v2 2/2] ref-filter: memoize --contains with generations Tamir Duberstein
@ 2026-06-10 11:47   ` Karthik Nayak
  2026-06-10 12:20     ` Tamir Duberstein
  2026-06-11  8:22   ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: Karthik Nayak @ 2026-06-10 11:47 UTC (permalink / raw)
  To: Tamir Duberstein, git
  Cc: Jeff King, Junio C Hamano, Victoria Dye, Derrick Stolee,
	Elijah Newren

[-- Attachment #1: Type: text/plain, Size: 4109 bytes --]

Tamir Duberstein <tamird@gmail.com> writes:

> git branch and git for-each-ref call repo_is_descendant_of() for
> each candidate selected by --contains or --no-contains. Each call
> starts a new graph walk, so refs with shared history repeatedly
> traverse the same commits.
>
> ffc4b8012d (tag: speed up --contains calculation, 2011-06-11)
> introduced a depth-first walk for git tag that caches positive and
> negative answers across candidates. ee2bd06b0f (ref-filter: implement
> '--contains' option, 2015-07-07) preserved both implementations when
> ref-filter learned --contains.
>
> The memoized walk is not always faster. Without generation numbers,
> a negative check can walk to the root even when the breadth-first
> merge-base walk finds a nearby divergence. With generation numbers,
> the depth-first walk can stop below the oldest target while still
> reusing answers across candidates.
>
> Keep the existing memoized selection for git tag. Select it for other
> ref-filter callers when generation numbers are enabled, and retain
> the breadth-first walk otherwise.
>
> When generation numbers are unavailable, repo_is_descendant_of() can
> return -1 if ancestry cannot be read. The ref-filter Boolean interface
> treated that error as a match. Check it and exit instead. The memoized
> path already dies on the same parse failure, so both selected paths now
> fail rather than return a result.
>
> Add p1500 cases for up to 8,192 packed refs along one first-parent
> history and for sibling refs near the tip with generation numbers
> forced off.
>
> On a checkout with 62,174 remote-tracking refs and generation numbers
> enabled, I ran:
>
>     hyperfine --warmup 0 --runs 3 \
>         --command-name parent \
>         '"$parent" branch -r --contains c78ae85f3ce7e >/dev/null' \
>         --command-name this-commit \
>         '"$this" branch -r --contains c78ae85f3ce7e >/dev/null'
>
> The results were:
>
>              parent       this commit
>   elapsed    104.365 s     467.7 ms
>   user        93.702 s     220.2 ms
>   system       0.723 s     182.7 ms
>
> The wall-time standard deviations were 11.356 seconds and 133.8
> milliseconds, respectively. Separate runs without redirection produced
> the same output with SHA-256
> 2466f6e2b72aa16b1a2126eddb81c8a1b2764ee251204ac034c191a925aa896f.
>
> Both revisions were built with the default -O2 flags using Apple
> clang 21.0.0 on macOS 26.5. The machine was a MacBook Pro (Mac16,6)
> with a 16-core Apple M4 Max (12 performance and four efficiency
> cores) and 128 GB RAM.
>
> Link: https://lore.kernel.org/git/1445163904-24611-1-git-send-email-Karthik.188@gmail.com/
> Link: https://lore.kernel.org/r/20230324191009.GA536967@coredump.intra.peff.net
> Link: https://lore.kernel.org/git/20260527070510.3510836-1-krka@spotify.com/
> Link: https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net
> Suggested-by: Jeff King <peff@peff.net>
> Signed-off-by: Tamir Duberstein <tamird@gmail.com>
> ---
>  commit-reach.c                 | 13 +++++++++--
>  commit-reach.h                 |  7 ++++++
>  t/perf/p1500-graph-walks.sh    | 49 +++++++++++++++++++++++++++++++++++++++++-
>  t/t6301-for-each-ref-errors.sh | 22 +++++++++++++++++++
>  4 files changed, 88 insertions(+), 3 deletions(-)
>
> diff --git a/commit-reach.c b/commit-reach.c
> index 65b618959b..83a48004ef 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -821,9 +821,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>  int commit_contains(struct ref_filter *filter, struct commit *commit,
>  		    struct commit_list *list, struct contains_cache *cache)
>  {
> -	if (filter->with_commit_tag_algo)
> +	int result;
> +
> +	if (!list)
> +		return 1;
> +	if (filter->with_commit_tag_algo ||
> +	    generation_numbers_enabled(the_repository))

What's stopping us from dropping `filter->with_commit_tag_algo`
completely and then doing?

  if (generation_numbers_enabled(the_repository))
     return contains_algo(commit, list, cache) == CONTAINS_YES;
  return repo_is_descendant_of(the_repository, commit, list);

[snip]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 690 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/2] ref-filter: memoize --contains with generations
  2026-06-10 11:47   ` Karthik Nayak
@ 2026-06-10 12:20     ` Tamir Duberstein
  2026-06-11  8:16       ` Karthik Nayak
  0 siblings, 1 reply; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-10 12:20 UTC (permalink / raw)
  To: Karthik Nayak
  Cc: git, Jeff King, Junio C Hamano, Victoria Dye, Derrick Stolee,
	Elijah Newren

On Wed, Jun 10, 2026 at 4:47 AM Karthik Nayak <karthik.188@gmail.com> wrote:
>
> Tamir Duberstein <tamird@gmail.com> writes:
>
> > git branch and git for-each-ref call repo_is_descendant_of() for
> > each candidate selected by --contains or --no-contains. Each call
> > starts a new graph walk, so refs with shared history repeatedly
> > traverse the same commits.
> >
> > ffc4b8012d (tag: speed up --contains calculation, 2011-06-11)
> > introduced a depth-first walk for git tag that caches positive and
> > negative answers across candidates. ee2bd06b0f (ref-filter: implement
> > '--contains' option, 2015-07-07) preserved both implementations when
> > ref-filter learned --contains.
> >
> > The memoized walk is not always faster. Without generation numbers,
> > a negative check can walk to the root even when the breadth-first
> > merge-base walk finds a nearby divergence. With generation numbers,
> > the depth-first walk can stop below the oldest target while still
> > reusing answers across candidates.
> >
> > Keep the existing memoized selection for git tag. Select it for other
> > ref-filter callers when generation numbers are enabled, and retain
> > the breadth-first walk otherwise.
> >
> > When generation numbers are unavailable, repo_is_descendant_of() can
> > return -1 if ancestry cannot be read. The ref-filter Boolean interface
> > treated that error as a match. Check it and exit instead. The memoized
> > path already dies on the same parse failure, so both selected paths now
> > fail rather than return a result.
> >
> > Add p1500 cases for up to 8,192 packed refs along one first-parent
> > history and for sibling refs near the tip with generation numbers
> > forced off.
> >
> > On a checkout with 62,174 remote-tracking refs and generation numbers
> > enabled, I ran:
> >
> >     hyperfine --warmup 0 --runs 3 \
> >         --command-name parent \
> >         '"$parent" branch -r --contains c78ae85f3ce7e >/dev/null' \
> >         --command-name this-commit \
> >         '"$this" branch -r --contains c78ae85f3ce7e >/dev/null'
> >
> > The results were:
> >
> >              parent       this commit
> >   elapsed    104.365 s     467.7 ms
> >   user        93.702 s     220.2 ms
> >   system       0.723 s     182.7 ms
> >
> > The wall-time standard deviations were 11.356 seconds and 133.8
> > milliseconds, respectively. Separate runs without redirection produced
> > the same output with SHA-256
> > 2466f6e2b72aa16b1a2126eddb81c8a1b2764ee251204ac034c191a925aa896f.
> >
> > Both revisions were built with the default -O2 flags using Apple
> > clang 21.0.0 on macOS 26.5. The machine was a MacBook Pro (Mac16,6)
> > with a 16-core Apple M4 Max (12 performance and four efficiency
> > cores) and 128 GB RAM.
> >
> > Link: https://lore.kernel.org/git/1445163904-24611-1-git-send-email-Karthik.188@gmail.com/
> > Link: https://lore.kernel.org/r/20230324191009.GA536967@coredump.intra.peff.net
> > Link: https://lore.kernel.org/git/20260527070510.3510836-1-krka@spotify.com/
> > Link: https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net
> > Suggested-by: Jeff King <peff@peff.net>
> > Signed-off-by: Tamir Duberstein <tamird@gmail.com>
> > ---
> >  commit-reach.c                 | 13 +++++++++--
> >  commit-reach.h                 |  7 ++++++
> >  t/perf/p1500-graph-walks.sh    | 49 +++++++++++++++++++++++++++++++++++++++++-
> >  t/t6301-for-each-ref-errors.sh | 22 +++++++++++++++++++
> >  4 files changed, 88 insertions(+), 3 deletions(-)
> >
> > diff --git a/commit-reach.c b/commit-reach.c
> > index 65b618959b..83a48004ef 100644
> > --- a/commit-reach.c
> > +++ b/commit-reach.c
> > @@ -821,9 +821,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
> >  int commit_contains(struct ref_filter *filter, struct commit *commit,
> >                   struct commit_list *list, struct contains_cache *cache)
> >  {
> > -     if (filter->with_commit_tag_algo)
> > +     int result;
> > +
> > +     if (!list)
> > +             return 1;
> > +     if (filter->with_commit_tag_algo ||
> > +         generation_numbers_enabled(the_repository))
>
> What's stopping us from dropping `filter->with_commit_tag_algo`
> completely and then doing?
>
>   if (generation_numbers_enabled(the_repository))
>      return contains_algo(commit, list, cache) == CONTAINS_YES;
>   return repo_is_descendant_of(the_repository, commit, list);

Jeff raised this distinction during the v1 review:

https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net/

`with_commit_tag_algo` preserves the existing behavior of `git tag` when
generation numbers are unavailable. `git tag --contains` has used the
memoized walk since ffc4b8012d (tag: speed up --contains calculation,
2011-06-11). Dropping the flag would send it back through repeated
`repo_is_descendant_of()` walks in repositories without usable generation
numbers.

The condition in v2 implements the rule discussed there: retain the
existing memoized path for `git tag`, and use it for other ref-filter
callers when generation numbers make the depth-first walk reliably
advantageous.

This is probably my fault for breaking the threading between this and
v1. Sorry about that.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/2] commit-reach: handle cycles in contains walk
  2026-06-09  2:36 ` [PATCH v2 1/2] commit-reach: handle cycles in contains walk Tamir Duberstein
@ 2026-06-11  7:29   ` Jeff King
  2026-06-12  2:40     ` Tamir Duberstein
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2026-06-11  7:29 UTC (permalink / raw)
  To: Tamir Duberstein
  Cc: git, Karthik Nayak, Junio C Hamano, Victoria Dye, Derrick Stolee,
	Elijah Newren

On Mon, Jun 08, 2026 at 07:36:34PM -0700, Tamir Duberstein wrote:

> @@ -744,7 +745,7 @@ static void push_to_contains_stack(struct commit *candidate, struct contains_sta
>  }
>  
>  static enum contains_result contains_tag_algo(struct commit *candidate,
> -					      const struct commit_list *want,
> +					      struct commit_list *want,
>  					      struct contains_cache *cache)

OK, we must lose the const here because repo_is_descendant_of() does not
have it. We could add const to that function, though that cascades down
to a few other helpers (see below). I'm not sure if that is making the
world a better place, or if it is just const pedantry.

diff --git a/commit-reach.c b/commit-reach.c
index 5df471a313..8cede01f01 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -563,7 +563,7 @@ int repo_get_merge_bases(struct repository *r,
  */
 int repo_is_descendant_of(struct repository *r,
 			  struct commit *commit,
-			  struct commit_list *with_commit)
+			  const struct commit_list *with_commit)
 {
 	if (!with_commit)
 		return 1;
@@ -955,11 +955,12 @@ int can_all_from_reach_with_flag(struct object_array *from,
 	return result;
 }
 
-int can_all_from_reach(struct commit_list *from, struct commit_list *to,
+int can_all_from_reach(const struct commit_list *from,
+		       const struct commit_list *to,
 		       int cutoff_by_min_date)
 {
 	struct object_array from_objs = OBJECT_ARRAY_INIT;
-	struct commit_list *from_iter = from, *to_iter = to;
+	const struct commit_list *from_iter = from, *to_iter = to;
 	int result;
 	timestamp_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
 	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
diff --git a/commit-reach.h b/commit-reach.h
index 3f3a563d8a..76e82f827e 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -37,7 +37,7 @@ int get_octopus_merge_bases(struct commit_list *in, struct commit_list **result)
 
 int repo_is_descendant_of(struct repository *r,
 			  struct commit *commit,
-			  struct commit_list *with_commit);
+			  const struct commit_list *with_commit);
 int repo_in_merge_bases(struct repository *r,
 			struct commit *commit,
 			struct commit *reference);
@@ -93,7 +93,8 @@ int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int assign_flag,
 				 timestamp_t min_commit_date,
 				 timestamp_t min_generation);
-int can_all_from_reach(struct commit_list *from, struct commit_list *to,
+int can_all_from_reach(const struct commit_list *from,
+		       const struct commit_list *to,
 		       int commit_date_cutoff);
 
 
> +cycle:
> +	free(contains_stack.contains_stack);
> +	clear_contains_cache(cache);
> +	init_contains_cache(cache);
> +
> +	result = repo_is_descendant_of(the_repository, candidate, want);
> +	if (result < 0)
> +		exit(128);

We are feeding the whole initial "want" list, so we should get a correct
answer regardless of how far we got into the cycle, which would run into
problems (e.g., if the cycle existed only on some branch of the
history). But going back to the initial list will always be correct.
Good.

Two small points, though.

One, the call to init_contains_cache() is redundant here; the clear
function is documented as making things ready for use (it's a little
hard to grep for, due to macros, but the docs are in commit-slab.h).
It's probably not hurting anything.

Two, the call to exit(128) is unusual for our code base (I'd guess it
was cribbed off of the top-level exits in builtin/pull.c). We'd usually
die() instead. Even if repo_is_descendant_of() produced its own error
message, it may be useful to mention that we were falling back to it due
to a cycle.

But even better is if we can return the error up the stack. We do not
return errors from contains_tag_algo() currently, but it has only one
caller. And that caller may also directly return the result of
repo_is_descendant_of(). So could we just pass that along?

Perhaps not. Looking at the callers of commit_contains(), they treat the
result as a pure boolean. So probably calling die() is reasonable, and
we already do so via parse_commit_or_die() elsewhere in the algorithm.
That does leave a potential lurking bug for the non-tag-algo code path.

> +	*contains_cache_at(cache, candidate) =
> +		result ? CONTAINS_YES : CONTAINS_NO;
> +	return result ? CONTAINS_YES : CONTAINS_NO;

So we actually cache our discovered value. Cute, and it might save us
from hitting the cycle again, though not always. E.g., two candidates A
and B share a parent P, and the cycle starts at P but does not include A
or B. We discover the cycle and cache the value for A, but discover it
again for B.

We do lose all of the existing non-cycle cached values when we call
clear_contains_cache(). But we have to at least clear out all of the
IN_PROGRESS commits. It is hard to care too much about optimizing the
outcome for this case which we expect to happen approximately never.
So I think doing the simplest correct thing is OK.

> +test_expect_success 'tag --contains handles cyclic replacement histories' '
> +	first=$(git rev-parse HEAD~2) &&
> +	second=$(git rev-parse HEAD~) &&
> +	third=$(git rev-parse HEAD) &&
> +	test_when_finished "
> +		git replace -d $first
> +		git replace -d $third
> +		git tag -d cycle-a cycle-b
> +	" &&

We usually &&-chain the commands inside test_when_finished. If they
fail, the test harness will note this and complain (if the test was not
otherwise failing). It's usually not a big deal either way, though
sometimes it can catch silly mistakes (e.g., if you wrote $second
instead of $third and the "replace -d" is quietly doing nothing at all).

I'm a little surprised that the chainlint checker doesn't catch this,
but I guess it doesn't know to recurse into the snippet handed to
test_when_finished. It probably is not really worth the trouble to teach
it to do so.

Otherwise the test looks good to me.

-Peff

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/2] ref-filter: memoize --contains with generations
  2026-06-10 12:20     ` Tamir Duberstein
@ 2026-06-11  8:16       ` Karthik Nayak
  2026-06-11 20:10         ` Tamir Duberstein
  0 siblings, 1 reply; 22+ messages in thread
From: Karthik Nayak @ 2026-06-11  8:16 UTC (permalink / raw)
  To: Tamir Duberstein
  Cc: git, Jeff King, Junio C Hamano, Victoria Dye, Derrick Stolee,
	Elijah Newren

[-- Attachment #1: Type: text/plain, Size: 5749 bytes --]

Tamir Duberstein <tamird@gmail.com> writes:

> On Wed, Jun 10, 2026 at 4:47 AM Karthik Nayak <karthik.188@gmail.com> wrote:
>>
>> Tamir Duberstein <tamird@gmail.com> writes:
>>
>> > git branch and git for-each-ref call repo_is_descendant_of() for
>> > each candidate selected by --contains or --no-contains. Each call
>> > starts a new graph walk, so refs with shared history repeatedly
>> > traverse the same commits.
>> >
>> > ffc4b8012d (tag: speed up --contains calculation, 2011-06-11)
>> > introduced a depth-first walk for git tag that caches positive and
>> > negative answers across candidates. ee2bd06b0f (ref-filter: implement
>> > '--contains' option, 2015-07-07) preserved both implementations when
>> > ref-filter learned --contains.
>> >
>> > The memoized walk is not always faster. Without generation numbers,
>> > a negative check can walk to the root even when the breadth-first
>> > merge-base walk finds a nearby divergence. With generation numbers,
>> > the depth-first walk can stop below the oldest target while still
>> > reusing answers across candidates.
>> >
>> > Keep the existing memoized selection for git tag. Select it for other
>> > ref-filter callers when generation numbers are enabled, and retain
>> > the breadth-first walk otherwise.
>> >
>> > When generation numbers are unavailable, repo_is_descendant_of() can
>> > return -1 if ancestry cannot be read. The ref-filter Boolean interface
>> > treated that error as a match. Check it and exit instead. The memoized
>> > path already dies on the same parse failure, so both selected paths now
>> > fail rather than return a result.
>> >
>> > Add p1500 cases for up to 8,192 packed refs along one first-parent
>> > history and for sibling refs near the tip with generation numbers
>> > forced off.
>> >
>> > On a checkout with 62,174 remote-tracking refs and generation numbers
>> > enabled, I ran:
>> >
>> >     hyperfine --warmup 0 --runs 3 \
>> >         --command-name parent \
>> >         '"$parent" branch -r --contains c78ae85f3ce7e >/dev/null' \
>> >         --command-name this-commit \
>> >         '"$this" branch -r --contains c78ae85f3ce7e >/dev/null'
>> >
>> > The results were:
>> >
>> >              parent       this commit
>> >   elapsed    104.365 s     467.7 ms
>> >   user        93.702 s     220.2 ms
>> >   system       0.723 s     182.7 ms
>> >
>> > The wall-time standard deviations were 11.356 seconds and 133.8
>> > milliseconds, respectively. Separate runs without redirection produced
>> > the same output with SHA-256
>> > 2466f6e2b72aa16b1a2126eddb81c8a1b2764ee251204ac034c191a925aa896f.
>> >
>> > Both revisions were built with the default -O2 flags using Apple
>> > clang 21.0.0 on macOS 26.5. The machine was a MacBook Pro (Mac16,6)
>> > with a 16-core Apple M4 Max (12 performance and four efficiency
>> > cores) and 128 GB RAM.
>> >
>> > Link: https://lore.kernel.org/git/1445163904-24611-1-git-send-email-Karthik.188@gmail.com/
>> > Link: https://lore.kernel.org/r/20230324191009.GA536967@coredump.intra.peff.net
>> > Link: https://lore.kernel.org/git/20260527070510.3510836-1-krka@spotify.com/
>> > Link: https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net
>> > Suggested-by: Jeff King <peff@peff.net>
>> > Signed-off-by: Tamir Duberstein <tamird@gmail.com>
>> > ---
>> >  commit-reach.c                 | 13 +++++++++--
>> >  commit-reach.h                 |  7 ++++++
>> >  t/perf/p1500-graph-walks.sh    | 49 +++++++++++++++++++++++++++++++++++++++++-
>> >  t/t6301-for-each-ref-errors.sh | 22 +++++++++++++++++++
>> >  4 files changed, 88 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/commit-reach.c b/commit-reach.c
>> > index 65b618959b..83a48004ef 100644
>> > --- a/commit-reach.c
>> > +++ b/commit-reach.c
>> > @@ -821,9 +821,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
>> >  int commit_contains(struct ref_filter *filter, struct commit *commit,
>> >                   struct commit_list *list, struct contains_cache *cache)
>> >  {
>> > -     if (filter->with_commit_tag_algo)
>> > +     int result;
>> > +
>> > +     if (!list)
>> > +             return 1;
>> > +     if (filter->with_commit_tag_algo ||
>> > +         generation_numbers_enabled(the_repository))
>>
>> What's stopping us from dropping `filter->with_commit_tag_algo`
>> completely and then doing?
>>
>>   if (generation_numbers_enabled(the_repository))
>>      return contains_algo(commit, list, cache) == CONTAINS_YES;
>>   return repo_is_descendant_of(the_repository, commit, list);
>
> Jeff raised this distinction during the v1 review:
>
> https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net/
>
> `with_commit_tag_algo` preserves the existing behavior of `git tag` when
> generation numbers are unavailable. `git tag --contains` has used the
> memoized walk since ffc4b8012d (tag: speed up --contains calculation,
> 2011-06-11). Dropping the flag would send it back through repeated
> `repo_is_descendant_of()` walks in repositories without usable generation
> numbers.
>

I did read that, my question is on top of that. Do we also want to use
the non-memoized walk for 'git tag' when there are no generation numbers
available or does that not work? If not, we should mention that too in
the commit message.

> The condition in v2 implements the rule discussed there: retain the
> existing memoized path for `git tag`, and use it for other ref-filter
> callers when generation numbers make the depth-first walk reliably
> advantageous.
>
> This is probably my fault for breaking the threading between this and
> v1. Sorry about that.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 690 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/2] ref-filter: memoize --contains with generations
  2026-06-09  2:36 ` [PATCH v2 2/2] ref-filter: memoize --contains with generations Tamir Duberstein
  2026-06-10 11:47   ` Karthik Nayak
@ 2026-06-11  8:22   ` Jeff King
  2026-06-12  2:40     ` Tamir Duberstein
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff King @ 2026-06-11  8:22 UTC (permalink / raw)
  To: Tamir Duberstein
  Cc: git, Karthik Nayak, Junio C Hamano, Victoria Dye, Derrick Stolee,
	Elijah Newren

On Mon, Jun 08, 2026 at 07:36:35PM -0700, Tamir Duberstein wrote:

> The wall-time standard deviations were 11.356 seconds and 133.8
> milliseconds, respectively. Separate runs without redirection produced
> the same output with SHA-256
> 2466f6e2b72aa16b1a2126eddb81c8a1b2764ee251204ac034c191a925aa896f.

Heh. Without the original repo, this sha256 hash is meaningless to us,
isn't it? Ditto for the sha1 the earlier command.

>  int commit_contains(struct ref_filter *filter, struct commit *commit,
>  		    struct commit_list *list, struct contains_cache *cache)
>  {
> -	if (filter->with_commit_tag_algo)
> +	int result;
> +
> +	if (!list)
> +		return 1;
> +	if (filter->with_commit_tag_algo ||
> +	    generation_numbers_enabled(the_repository))
>  		return contains_tag_algo(commit, list, cache) == CONTAINS_YES;
> -	return repo_is_descendant_of(the_repository, commit, list);
> +
> +	result = repo_is_descendant_of(the_repository, commit, list);
> +	if (result < 0)
> +		exit(128);
> +	return result;

There's a little more going on here than I expected from the commit
message. Is it important for us to short-circuit the empty list and just
return 1? Or did the existing helper functions already handle that?

Looking at contains_tag_algo(), I think it would actually return
CONTAINS_NO here (though I didn't test it). So this is actually a change
in behavior for "git tag" if that's correct. I doubt it is triggerable
in practice, though, as we would simply never call commit_contains() in
the first place with an empty list. But if we are going to add in this
logic, I think it makes sense to do so as a separate commit (describing
what it is doing and why it's not (yet) a triggerable bug).

Checking the result of repo_is_descendant_of() makes sense, as discussed
earlier. But probably that should come as its own patch, since it's an
independent bug-fix. I'm also tempted to say it should call die()
instead of a direct exit, though it does look like the error exit paths
from repo_is_descendant_of() would all have produced their own messages.


And one side note. While looking at the implementation of
repo_is_descendant_of(), I did notice something curious: it also
switches algorithms based on the presence of generation numbers! So it
should also be cutting off the traversal early when possible. But I
guess its main problem is that we call it independently for each
candidate, so it may traverse the same (useful) stretch of history
multiple times.

So probably an alternative approach to this patch would be feeding all
of the candidates at once, the way we do with reach_filter() via
filter_refs(). I'm not sure if we have the right functions available for
that (naively, --contains and --merged are inversions of each other, so
swapping the arguments to tips_reachable_from_bases() might work, but I
didn't think very hard on it).

I wonder if that might perform better or worse. I'm content to leave it
for another day, though, as switching to the memoizing depth-first algo
here is a pretty easy change.

> -	commit=$(git commit-tree $(git rev-parse HEAD^{tree})) &&
> +	git rev-list --first-parent --max-count=8192 HEAD >contains-commits &&
> +	test_file_not_empty contains-commits &&
> +	git update-ref refs/contains-perf-base "$(tail -n 1 contains-commits)" &&
> +	awk "{
> +		printf \"update refs/contains-perf/%04d %s\\n\", NR, \$1
> +	}" contains-commits |
> +		git update-ref --stdin &&
> +	git pack-refs --include "refs/contains-perf/*" &&

My head almost exploded reading the embedded quoting in that awk
invocation. But I can't think offhand of a better way to do it. You
can't use test_seq because it needs both the number and the original
string. You can do it with sed, but it probably ends up even more
unreadable.

But OK, we are making a bunch of refs based on first-parent history.

> +	tree=$(git rev-parse HEAD^{tree}) &&
> +	base=$(git rev-parse HEAD) &&
> +	target=$(echo target | git commit-tree "$tree" -p "$base") &&
> +	git update-ref refs/contains-diverged/target "$target" &&
> +	for i in $(test_seq 1 4)
> +	do
> +		commit=$(echo candidate-$i |
> +			git commit-tree "$tree" -p "$base") &&
> +		git update-ref refs/contains-diverged/candidate-$i "$commit" ||
> +		return 1
> +	done &&

And then a few candidate refs that are not reachable from other refs, or
from each other. OK.

I think you could just write:

  git commit-tree HEAD^{tree} -p HEAD

instead of doing separate rev-parses, but it's probably not a big deal
either way.

> +test_expect_success 'verify contains results' '
> +	git for-each-ref --contains=refs/contains-perf-base \
> +		refs/contains-perf/ >actual &&
> +	test_line_count = $(wc -l <contains-commits) actual &&
> +
> +	echo refs/contains-diverged/target >expect &&
> +	GIT_TEST_COMMIT_GRAPH=0 \
> +		git -c core.commitGraph=false for-each-ref \
> +			--format="%(refname)" \
> +			--contains=refs/contains-diverged/target \
> +			refs/contains-diverged/ >actual &&
> +	test_cmp expect actual
> +'

This is a funny test to have in the middle of a perf script (which
hardly anybody ever runs). If we are concerned about the correctness,
should this be in a non-perf test script? Though I'd imagine something
like it is already covered there.

There's a lot of subtlety in what we're verifying, too. In the first
half, we are checking that all of the commits in contains-perf contain
the base.  And that base is the final element of the contains-commits
list. Which made me wonder what happens in a branch history, since that
list is linearized. But because we used --first-parent to generate it,
it _is_ linear, and the results work out. So OK, I don't think it's
wrong, but I am struggling to understand the meaning of the test.

The second half is just checking that...the other refs which are not
contained in "target" are not mentioned? OK, but why do it only with
commit graphs off. Why not both off and on? Again, I'm not sure I
understand what we're trying to focus on here.

> +test_perf 'contains: git for-each-ref --contains' '
> +	git for-each-ref --contains=refs/contains-perf-base \
> +		refs/contains-perf/ >/dev/null
> +'

Yay, actual perf tests. Here we have a ton of matches, and they all walk
over the same chunk of history. Should get much faster, though it's
mostly a synthetic test.

For --merged, we already have separate tests with each of for-each-ref,
branch, and tag. Should we have the same here for --contains? And should
we be using the input repo data, rather than our synthetic test? It is
nice to show off the performance with the synthetic test, but ultimately
the point of the perf suite is feeding it real workloads and looking for
regressions.

> +test_perf 'contains without generations: divergent refs' '
> +	GIT_TEST_COMMIT_GRAPH=0 \
> +		git -c core.commitGraph=false for-each-ref \
> +			--contains=refs/contains-diverged/target \
> +			refs/contains-diverged/ >/dev/null
> +'

OK, and this one should find that most of them are not contained, but
the depth-first algorithm could walk all the way down to the roots. But
we don't run it at all, since we disable commit graphs!

So what are we trying to measure here? If it left commit graphs enabled,
I think we could demonstrate that using the depth-first algorithm with
generation numbers does not make anything _worse_. I.e., that
for-each-ref and branch did not regress from the change.

> +test_expect_success 'missing ancestors are reported by contains filters' '
> +	test_when_finished "git update-ref -d refs/heads/missing-parent" &&
> +	{
> +		echo "tree $(git rev-parse HEAD^{tree})" &&
> +		echo "parent $MISSING" &&
> +		git cat-file commit HEAD |
> +			sed -n -e "/^author /p" -e "/^committer /p" &&
> +		echo &&
> +		echo "missing parent"
> +	} >commit &&
> +	broken=$(git hash-object -t commit -w commit) &&
> +	git update-ref refs/heads/missing-parent "$broken" &&
> +	for option in --contains --no-contains
> +	do
> +		test_must_fail git for-each-ref "$option=HEAD" \
> +			refs/heads/missing-parent >out 2>err &&
> +		test_must_be_empty out &&
> +		test_grep "parse commit $MISSING" err ||
> +		return 1
> +	done
> +'

This is a great thing to test, but probably should be pulled out into
a separate patch along with the fix to check the return code.

The commit construction looks OK, and is nicer than corrupting the
repository by deleting a real object. Given that we are pulling the
idents from an existing commit, it might be simpler to just use the
whole commit as a template, like:

  git cat-file commit HEAD |
  sed "s/^parent /parent $MISSING/"

but it may be a matter of taste.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/2] ref-filter: memoize --contains with generations
  2026-06-11  8:16       ` Karthik Nayak
@ 2026-06-11 20:10         ` Tamir Duberstein
  0 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-11 20:10 UTC (permalink / raw)
  To: Karthik Nayak
  Cc: git, Jeff King, Junio C Hamano, Victoria Dye, Derrick Stolee,
	Elijah Newren

On Thu, Jun 11, 2026 at 1:16 AM Karthik Nayak <karthik.188@gmail.com> wrote:
>
> Tamir Duberstein <tamird@gmail.com> writes:
>
> > On Wed, Jun 10, 2026 at 4:47 AM Karthik Nayak <karthik.188@gmail.com> wrote:
> >>
> >> Tamir Duberstein <tamird@gmail.com> writes:
> >>
> >> > git branch and git for-each-ref call repo_is_descendant_of() for
> >> > each candidate selected by --contains or --no-contains. Each call
> >> > starts a new graph walk, so refs with shared history repeatedly
> >> > traverse the same commits.
> >> >
> >> > ffc4b8012d (tag: speed up --contains calculation, 2011-06-11)
> >> > introduced a depth-first walk for git tag that caches positive and
> >> > negative answers across candidates. ee2bd06b0f (ref-filter: implement
> >> > '--contains' option, 2015-07-07) preserved both implementations when
> >> > ref-filter learned --contains.
> >> >
> >> > The memoized walk is not always faster. Without generation numbers,
> >> > a negative check can walk to the root even when the breadth-first
> >> > merge-base walk finds a nearby divergence. With generation numbers,
> >> > the depth-first walk can stop below the oldest target while still
> >> > reusing answers across candidates.
> >> >
> >> > Keep the existing memoized selection for git tag. Select it for other
> >> > ref-filter callers when generation numbers are enabled, and retain
> >> > the breadth-first walk otherwise.
> >> >
> >> > When generation numbers are unavailable, repo_is_descendant_of() can
> >> > return -1 if ancestry cannot be read. The ref-filter Boolean interface
> >> > treated that error as a match. Check it and exit instead. The memoized
> >> > path already dies on the same parse failure, so both selected paths now
> >> > fail rather than return a result.
> >> >
> >> > Add p1500 cases for up to 8,192 packed refs along one first-parent
> >> > history and for sibling refs near the tip with generation numbers
> >> > forced off.
> >> >
> >> > On a checkout with 62,174 remote-tracking refs and generation numbers
> >> > enabled, I ran:
> >> >
> >> >     hyperfine --warmup 0 --runs 3 \
> >> >         --command-name parent \
> >> >         '"$parent" branch -r --contains c78ae85f3ce7e >/dev/null' \
> >> >         --command-name this-commit \
> >> >         '"$this" branch -r --contains c78ae85f3ce7e >/dev/null'
> >> >
> >> > The results were:
> >> >
> >> >              parent       this commit
> >> >   elapsed    104.365 s     467.7 ms
> >> >   user        93.702 s     220.2 ms
> >> >   system       0.723 s     182.7 ms
> >> >
> >> > The wall-time standard deviations were 11.356 seconds and 133.8
> >> > milliseconds, respectively. Separate runs without redirection produced
> >> > the same output with SHA-256
> >> > 2466f6e2b72aa16b1a2126eddb81c8a1b2764ee251204ac034c191a925aa896f.
> >> >
> >> > Both revisions were built with the default -O2 flags using Apple
> >> > clang 21.0.0 on macOS 26.5. The machine was a MacBook Pro (Mac16,6)
> >> > with a 16-core Apple M4 Max (12 performance and four efficiency
> >> > cores) and 128 GB RAM.
> >> >
> >> > Link: https://lore.kernel.org/git/1445163904-24611-1-git-send-email-Karthik.188@gmail.com/
> >> > Link: https://lore.kernel.org/r/20230324191009.GA536967@coredump.intra.peff.net
> >> > Link: https://lore.kernel.org/git/20260527070510.3510836-1-krka@spotify.com/
> >> > Link: https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net
> >> > Suggested-by: Jeff King <peff@peff.net>
> >> > Signed-off-by: Tamir Duberstein <tamird@gmail.com>
> >> > ---
> >> >  commit-reach.c                 | 13 +++++++++--
> >> >  commit-reach.h                 |  7 ++++++
> >> >  t/perf/p1500-graph-walks.sh    | 49 +++++++++++++++++++++++++++++++++++++++++-
> >> >  t/t6301-for-each-ref-errors.sh | 22 +++++++++++++++++++
> >> >  4 files changed, 88 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/commit-reach.c b/commit-reach.c
> >> > index 65b618959b..83a48004ef 100644
> >> > --- a/commit-reach.c
> >> > +++ b/commit-reach.c
> >> > @@ -821,9 +821,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
> >> >  int commit_contains(struct ref_filter *filter, struct commit *commit,
> >> >                   struct commit_list *list, struct contains_cache *cache)
> >> >  {
> >> > -     if (filter->with_commit_tag_algo)
> >> > +     int result;
> >> > +
> >> > +     if (!list)
> >> > +             return 1;
> >> > +     if (filter->with_commit_tag_algo ||
> >> > +         generation_numbers_enabled(the_repository))
> >>
> >> What's stopping us from dropping `filter->with_commit_tag_algo`
> >> completely and then doing?
> >>
> >>   if (generation_numbers_enabled(the_repository))
> >>      return contains_algo(commit, list, cache) == CONTAINS_YES;
> >>   return repo_is_descendant_of(the_repository, commit, list);
> >
> > Jeff raised this distinction during the v1 review:
> >
> > https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net/
> >
> > `with_commit_tag_algo` preserves the existing behavior of `git tag` when
> > generation numbers are unavailable. `git tag --contains` has used the
> > memoized walk since ffc4b8012d (tag: speed up --contains calculation,
> > 2011-06-11). Dropping the flag would send it back through repeated
> > `repo_is_descendant_of()` walks in repositories without usable generation
> > numbers.
> >
>
> I did read that, my question is on top of that. Do we also want to use
> the non-memoized walk for 'git tag' when there are no generation numbers
> available or does that not work? If not, we should mention that too in
> the commit message.

We should keep the memoized walk for git tag. In git.git, with commit
graphs disabled, hyperfine measured:

    git -c core.commitGraph=false tag --contains HEAD~200
    git -c core.commitGraph=false for-each-ref \
        --contains HEAD~200 refs/tags/

at 478.9 ms and 4.861 s, respectively. The second command takes the
non-memoized path, so memoization is about 10 times faster for this
workload. I added the rationale to the commit message in v3.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/2] commit-reach: handle cycles in contains walk
  2026-06-11  7:29   ` Jeff King
@ 2026-06-12  2:40     ` Tamir Duberstein
  0 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12  2:40 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Karthik Nayak, Junio C Hamano, Victoria Dye, Derrick Stolee,
	Elijah Newren

On Thu, Jun 11, 2026 at 12:29 AM Jeff King <peff@peff.net> wrote:
>
> On Mon, Jun 08, 2026 at 07:36:34PM -0700, Tamir Duberstein wrote:
>
> > @@ -744,7 +745,7 @@ static void push_to_contains_stack(struct commit *candidate, struct contains_sta
> >  }
> >
> >  static enum contains_result contains_tag_algo(struct commit *candidate,
> > -                                           const struct commit_list *want,
> > +                                           struct commit_list *want,
> >                                             struct contains_cache *cache)
>
> OK, we must lose the const here because repo_is_descendant_of() does not
> have it. We could add const to that function, though that cascades down
> to a few other helpers (see below). I'm not sure if that is making the
> world a better place, or if it is just const pedantry.

I left the signature change local rather than propagating const
through the other reachability helpers.

>
> diff --git a/commit-reach.c b/commit-reach.c
> index 5df471a313..8cede01f01 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -563,7 +563,7 @@ int repo_get_merge_bases(struct repository *r,
>   */
>  int repo_is_descendant_of(struct repository *r,
>                           struct commit *commit,
> -                         struct commit_list *with_commit)
> +                         const struct commit_list *with_commit)
>  {
>         if (!with_commit)
>                 return 1;
> @@ -955,11 +955,12 @@ int can_all_from_reach_with_flag(struct object_array *from,
>         return result;
>  }
>
> -int can_all_from_reach(struct commit_list *from, struct commit_list *to,
> +int can_all_from_reach(const struct commit_list *from,
> +                      const struct commit_list *to,
>                        int cutoff_by_min_date)
>  {
>         struct object_array from_objs = OBJECT_ARRAY_INIT;
> -       struct commit_list *from_iter = from, *to_iter = to;
> +       const struct commit_list *from_iter = from, *to_iter = to;
>         int result;
>         timestamp_t min_commit_date = cutoff_by_min_date ? from->item->date : 0;
>         timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
> diff --git a/commit-reach.h b/commit-reach.h
> index 3f3a563d8a..76e82f827e 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -37,7 +37,7 @@ int get_octopus_merge_bases(struct commit_list *in, struct commit_list **result)
>
>  int repo_is_descendant_of(struct repository *r,
>                           struct commit *commit,
> -                         struct commit_list *with_commit);
> +                         const struct commit_list *with_commit);
>  int repo_in_merge_bases(struct repository *r,
>                         struct commit *commit,
>                         struct commit *reference);
> @@ -93,7 +93,8 @@ int can_all_from_reach_with_flag(struct object_array *from,
>                                  unsigned int assign_flag,
>                                  timestamp_t min_commit_date,
>                                  timestamp_t min_generation);
> -int can_all_from_reach(struct commit_list *from, struct commit_list *to,
> +int can_all_from_reach(const struct commit_list *from,
> +                      const struct commit_list *to,
>                        int commit_date_cutoff);
>
>
> > +cycle:
> > +     free(contains_stack.contains_stack);
> > +     clear_contains_cache(cache);
> > +     init_contains_cache(cache);
> > +
> > +     result = repo_is_descendant_of(the_repository, candidate, want);
> > +     if (result < 0)
> > +             exit(128);
>
> We are feeding the whole initial "want" list, so we should get a correct
> answer regardless of how far we got into the cycle, which would run into
> problems (e.g., if the cycle existed only on some branch of the
> history). But going back to the initial list will always be correct.
> Good.
>
> Two small points, though.
>
> One, the call to init_contains_cache() is redundant here; the clear
> function is documented as making things ready for use (it's a little
> hard to grep for, due to macros, but the docs are in commit-slab.h).
> It's probably not hurting anything.
>
> Two, the call to exit(128) is unusual for our code base (I'd guess it
> was cribbed off of the top-level exits in builtin/pull.c). We'd usually
> die() instead. Even if repo_is_descendant_of() produced its own error
> message, it may be useful to mention that we were falling back to it due
> to a cycle.

I removed the redundant initialization and replaced exit(128) with
die(), adding context that the failure occurred after detecting a
cycle.

>
> But even better is if we can return the error up the stack. We do not
> return errors from contains_tag_algo() currently, but it has only one
> caller. And that caller may also directly return the result of
> repo_is_descendant_of(). So could we just pass that along?
>
> Perhaps not. Looking at the callers of commit_contains(), they treat the
> result as a pure boolean. So probably calling die() is reasonable, and
> we already do so via parse_commit_or_die() elsewhere in the algorithm.
> That does leave a potential lurking bug for the non-tag-algo code path.

I traced the callers. Returning an error from commit_contains() would
only move the fatal check into apply_ref_filter(): its NULL return
already means “filtered out”, filter_and_format_refs() returns void, and
the branch caller ignores filter_refs()'s return value. Propagating the
error to the command would require changing that whole chain, and none
of the commands can recover from an unreadable commit.

The series therefore makes both cases fail explicitly. Patch 1 calls
die() if the cycle fallback cannot read the ancestry. Patch 3 calls
die() when the ordinary non-memoized walk returns -1.

>
> > +     *contains_cache_at(cache, candidate) =
> > +             result ? CONTAINS_YES : CONTAINS_NO;
> > +     return result ? CONTAINS_YES : CONTAINS_NO;
>
> So we actually cache our discovered value. Cute, and it might save us
> from hitting the cycle again, though not always. E.g., two candidates A
> and B share a parent P, and the cycle starts at P but does not include A
> or B. We discover the cycle and cache the value for A, but discover it
> again for B.
>
> We do lose all of the existing non-cycle cached values when we call
> clear_contains_cache(). But we have to at least clear out all of the
> IN_PROGRESS commits. It is hard to care too much about optimizing the
> outcome for this case which we expect to happen approximately never.
> So I think doing the simplest correct thing is OK.
>
> > +test_expect_success 'tag --contains handles cyclic replacement histories' '
> > +     first=$(git rev-parse HEAD~2) &&
> > +     second=$(git rev-parse HEAD~) &&
> > +     third=$(git rev-parse HEAD) &&
> > +     test_when_finished "
> > +             git replace -d $first
> > +             git replace -d $third
> > +             git tag -d cycle-a cycle-b
> > +     " &&
>
> We usually &&-chain the commands inside test_when_finished. If they
> fail, the test harness will note this and complain (if the test was not
> otherwise failing). It's usually not a big deal either way, though
> sometimes it can catch silly mistakes (e.g., if you wrote $second
> instead of $third and the "replace -d" is quietly doing nothing at all).

Fixed in v3.




>
> I'm a little surprised that the chainlint checker doesn't catch this,
> but I guess it doesn't know to recurse into the snippet handed to
> test_when_finished. It probably is not really worth the trouble to teach
> it to do so.
>
> Otherwise the test looks good to me.
>
> -Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/2] ref-filter: memoize --contains with generations
  2026-06-11  8:22   ` Jeff King
@ 2026-06-12  2:40     ` Tamir Duberstein
  0 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12  2:40 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Karthik Nayak, Junio C Hamano, Victoria Dye, Derrick Stolee,
	Elijah Newren

On Thu, Jun 11, 2026 at 1:22 AM Jeff King <peff@peff.net> wrote:
>
> On Mon, Jun 08, 2026 at 07:36:35PM -0700, Tamir Duberstein wrote:
>
> > The wall-time standard deviations were 11.356 seconds and 133.8
> > milliseconds, respectively. Separate runs without redirection produced
> > the same output with SHA-256
> > 2466f6e2b72aa16b1a2126eddb81c8a1b2764ee251204ac034c191a925aa896f.
>
> Heh. Without the original repo, this sha256 hash is meaningless to us,
> isn't it? Ditto for the sha1 the earlier command.

Yeah, AI slop. Removed.

>
> >  int commit_contains(struct ref_filter *filter, struct commit *commit,
> >                   struct commit_list *list, struct contains_cache *cache)
> >  {
> > -     if (filter->with_commit_tag_algo)
> > +     int result;
> > +
> > +     if (!list)
> > +             return 1;
> > +     if (filter->with_commit_tag_algo ||
> > +         generation_numbers_enabled(the_repository))
> >               return contains_tag_algo(commit, list, cache) == CONTAINS_YES;
> > -     return repo_is_descendant_of(the_repository, commit, list);
> > +
> > +     result = repo_is_descendant_of(the_repository, commit, list);
> > +     if (result < 0)
> > +             exit(128);
> > +     return result;
>
> There's a little more going on here than I expected from the commit
> message. Is it important for us to short-circuit the empty list and just
> return 1? Or did the existing helper functions already handle that?
>
> Looking at contains_tag_algo(), I think it would actually return
> CONTAINS_NO here (though I didn't test it). So this is actually a change
> in behavior for "git tag" if that's correct. I doubt it is triggerable
> in practice, though, as we would simply never call commit_contains() in
> the first place with an empty list. But if we are going to add in this
> logic, I think it makes sense to do so as a separate commit (describing
> what it is doing and why it's not (yet) a triggerable bug).
>
> Checking the result of repo_is_descendant_of() makes sense, as discussed
> earlier. But probably that should come as its own patch, since it's an
> independent bug-fix. I'm also tempted to say it should call die()
> instead of a direct exit, though it does look like the error exit paths
> from repo_is_descendant_of() would all have produced their own messages.

I dropped the empty-list change. The error check is now a separate
patch and uses die().

>
>
> And one side note. While looking at the implementation of
> repo_is_descendant_of(), I did notice something curious: it also
> switches algorithms based on the presence of generation numbers! So it
> should also be cutting off the traversal early when possible. But I
> guess its main problem is that we call it independently for each
> candidate, so it may traverse the same (useful) stretch of history
> multiple times.
>
> So probably an alternative approach to this patch would be feeding all
> of the candidates at once, the way we do with reach_filter() via
> filter_refs(). I'm not sure if we have the right functions available for
> that (naively, --contains and --merged are inversions of each other, so
> swapping the arguments to tips_reachable_from_bases() might work, but I
> didn't think very hard on it).

I tried the suggested argument swap, but tips_reachable_from_bases()
only reports whether a tip is reachable from any base. It cannot report
which candidate refs contain a target, which is what --contains needs.
I did not find an existing batched reachability helper that returns
those per-candidate answers.

>
> I wonder if that might perform better or worse. I'm content to leave it
> for another day, though, as switching to the memoizing depth-first algo
> here is a pretty easy change.
>
> > -     commit=$(git commit-tree $(git rev-parse HEAD^{tree})) &&
> > +     git rev-list --first-parent --max-count=8192 HEAD >contains-commits &&
> > +     test_file_not_empty contains-commits &&
> > +     git update-ref refs/contains-perf-base "$(tail -n 1 contains-commits)" &&
> > +     awk "{
> > +             printf \"update refs/contains-perf/%04d %s\\n\", NR, \$1
> > +     }" contains-commits |
> > +             git update-ref --stdin &&
> > +     git pack-refs --include "refs/contains-perf/*" &&
>
> My head almost exploded reading the embedded quoting in that awk
> invocation. But I can't think offhand of a better way to do it. You
> can't use test_seq because it needs both the number and the original
> string. You can do it with sed, but it probably ends up even more
> unreadable.
>
> But OK, we are making a bunch of refs based on first-parent history.
>
> > +     tree=$(git rev-parse HEAD^{tree}) &&
> > +     base=$(git rev-parse HEAD) &&
> > +     target=$(echo target | git commit-tree "$tree" -p "$base") &&
> > +     git update-ref refs/contains-diverged/target "$target" &&
> > +     for i in $(test_seq 1 4)
> > +     do
> > +             commit=$(echo candidate-$i |
> > +                     git commit-tree "$tree" -p "$base") &&
> > +             git update-ref refs/contains-diverged/candidate-$i "$commit" ||
> > +             return 1
> > +     done &&
>
> And then a few candidate refs that are not reachable from other refs, or
> from each other. OK.
>
> I think you could just write:
>
>   git commit-tree HEAD^{tree} -p HEAD
>
> instead of doing separate rev-parses, but it's probably not a big deal
> either way.
>
> > +test_expect_success 'verify contains results' '
> > +     git for-each-ref --contains=refs/contains-perf-base \
> > +             refs/contains-perf/ >actual &&
> > +     test_line_count = $(wc -l <contains-commits) actual &&
> > +
> > +     echo refs/contains-diverged/target >expect &&
> > +     GIT_TEST_COMMIT_GRAPH=0 \
> > +             git -c core.commitGraph=false for-each-ref \
> > +                     --format="%(refname)" \
> > +                     --contains=refs/contains-diverged/target \
> > +                     refs/contains-diverged/ >actual &&
> > +     test_cmp expect actual
> > +'
>
> This is a funny test to have in the middle of a perf script (which
> hardly anybody ever runs). If we are concerned about the correctness,
> should this be in a non-perf test script? Though I'd imagine something
> like it is already covered there.

I deleted that block rather than moving it. It only rechecked ordinary
--contains semantics already covered by t3201, t6302, and t7004; with
GIT_TEST_COMMIT_GRAPH=1, those tests exercise the newly selected
memoized path for branch and for-each-ref.

The series adds functional tests for the behavior that is actually new:
t7004 covers cyclic replacement histories, and t6301 covers unreadable
ancestry. The p1500 additions now measure performance only.

>
> There's a lot of subtlety in what we're verifying, too. In the first
> half, we are checking that all of the commits in contains-perf contain
> the base.  And that base is the final element of the contains-commits
> list. Which made me wonder what happens in a branch history, since that
> list is linearized. But because we used --first-parent to generate it,
> it _is_ linear, and the results work out. So OK, I don't think it's
> wrong, but I am struggling to understand the meaning of the test.
>
> The second half is just checking that...the other refs which are not
> contained in "target" are not mentioned? OK, but why do it only with
> commit graphs off. Why not both off and on? Again, I'm not sure I
> understand what we're trying to focus on here.
>
> > +test_perf 'contains: git for-each-ref --contains' '
> > +     git for-each-ref --contains=refs/contains-perf-base \
> > +             refs/contains-perf/ >/dev/null
> > +'
>
> Yay, actual perf tests. Here we have a ton of matches, and they all walk
> over the same chunk of history. Should get much faster, though it's
> mostly a synthetic test.
>
> For --merged, we already have separate tests with each of for-each-ref,
> branch, and tag. Should we have the same here for --contains? And should
> we be using the input repo data, rather than our synthetic test? It is
> nice to show off the performance with the synthetic test, but ultimately
> the point of the perf suite is feeding it real workloads and looking for
> regressions.

I added p1500 cases for all three frontends using refs from the input
repository, while retaining the synthetic shared-history case.

>
> > +test_perf 'contains without generations: divergent refs' '
> > +     GIT_TEST_COMMIT_GRAPH=0 \
> > +             git -c core.commitGraph=false for-each-ref \
> > +                     --contains=refs/contains-diverged/target \
> > +                     refs/contains-diverged/ >/dev/null
> > +'
>
> OK, and this one should find that most of them are not contained, but
> the depth-first algorithm could walk all the way down to the roots. But
> we don't run it at all, since we disable commit graphs!
>
> So what are we trying to measure here? If it left commit graphs enabled,
> I think we could demonstrate that using the depth-first algorithm with
> generation numbers does not make anything _worse_. I.e., that
> for-each-ref and branch did not regress from the change.

The divergent-ref test did not exercise the changed path, so I removed
it.

>
> > +test_expect_success 'missing ancestors are reported by contains filters' '
> > +     test_when_finished "git update-ref -d refs/heads/missing-parent" &&
> > +     {
> > +             echo "tree $(git rev-parse HEAD^{tree})" &&
> > +             echo "parent $MISSING" &&
> > +             git cat-file commit HEAD |
> > +                     sed -n -e "/^author /p" -e "/^committer /p" &&
> > +             echo &&
> > +             echo "missing parent"
> > +     } >commit &&
> > +     broken=$(git hash-object -t commit -w commit) &&
> > +     git update-ref refs/heads/missing-parent "$broken" &&
> > +     for option in --contains --no-contains
> > +     do
> > +             test_must_fail git for-each-ref "$option=HEAD" \
> > +                     refs/heads/missing-parent >out 2>err &&
> > +             test_must_be_empty out &&
> > +             test_grep "parse commit $MISSING" err ||
> > +             return 1
> > +     done
> > +'
>
> This is a great thing to test, but probably should be pulled out into
> a separate patch along with the fix to check the return code.

Done in v3.




>
> The commit construction looks OK, and is nicer than corrupting the
> repository by deleting a real object. Given that we are pulling the
> idents from an existing commit, it might be simpler to just use the
> whole commit as a template, like:
>
>   git cat-file commit HEAD |
>   sed "s/^parent /parent $MISSING/"
>
> but it may be a matter of taste.
>
> -Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 0/3] Reuse --contains traversal results
  2026-06-09  2:36 [PATCH v2 0/2] Reuse --contains traversal results Tamir Duberstein
  2026-06-09  2:36 ` [PATCH v2 1/2] commit-reach: handle cycles in contains walk Tamir Duberstein
  2026-06-09  2:36 ` [PATCH v2 2/2] ref-filter: memoize --contains with generations Tamir Duberstein
@ 2026-06-12  3:00 ` Tamir Duberstein
  2026-06-12  3:00   ` [PATCH v3 1/3] commit-reach: handle cycles in contains walk Tamir Duberstein
                     ` (3 more replies)
  2 siblings, 4 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12  3:00 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Kristofer Karlsson,
	Tamir Duberstein

git tag uses a memoized traversal for --contains, while git branch
and git for-each-ref repeat a reachability walk for each ref. Reuse
the memoized traversal when generation numbers can bound the walk.

The first patch makes the memoized traversal handle replacement
cycles. The last makes the non-memoized path report reachability
errors.

Signed-off-by: Tamir Duberstein <tamird@gmail.com>

---
Changes in v3:
- Split missing-ancestor error handling into its own patch.
- Use die() for reachability errors, remove redundant cache setup, and
  chain cycle-test cleanup.
- Drop the unrelated empty-target-list behavior change.
- Explain why git tag retains memoization without generation numbers.
- Add p1500 coverage for all three frontends and a shared-history
  case.
- Remove correctness checks from p1500 and drop output hashes.
- Link to v2: https://patch.msgid.link/20260608-ref-filter-memoized-contains-v2-0-e72720344a7c@gmail.com

Changes in v2:
- Split cycle handling into a preparatory patch.
- Exercise cycle handling through the existing git tag path.
- Move perf result verification out of setup.
- Link to v1: https://patch.msgid.link/20260607-ref-filter-memoized-contains-v1-1-a1972dde9c76@gmail.com

---
Tamir Duberstein (3):
      commit-reach: handle cycles in contains walk
      ref-filter: memoize --contains with generations
      commit-reach: die on contains walk errors

 commit-reach.c                 | 40 ++++++++++++++++++++++++++++++++++------
 commit-reach.h                 |  3 ++-
 t/perf/p1500-graph-walks.sh    | 28 +++++++++++++++++++++++++++-
 t/t6301-for-each-ref-errors.sh | 22 ++++++++++++++++++++++
 t/t7004-tag.sh                 | 21 +++++++++++++++++++++
 5 files changed, 106 insertions(+), 8 deletions(-)
---
base-commit: 9ac3f193c05c2237e2b14ebaa1149e9fc8a1abe0
change-id: 20260607-ref-filter-memoized-contains-7cb6b3bccad1

Best regards,
--  
Tamir Duberstein <tamird@gmail.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 1/3] commit-reach: handle cycles in contains walk
  2026-06-12  3:00 ` [PATCH v3 0/3] Reuse --contains traversal results Tamir Duberstein
@ 2026-06-12  3:00   ` Tamir Duberstein
  2026-06-12  6:53     ` Kristofer Karlsson
  2026-06-12  3:00   ` [PATCH v3 2/3] ref-filter: memoize --contains with generations Tamir Duberstein
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12  3:00 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Kristofer Karlsson,
	Tamir Duberstein

The memoized contains traversal used by git tag assumes that commit
ancestry is acyclic. Replacement refs can violate that assumption,
causing it to keep pushing an already active commit until memory is
exhausted.

Mark commits while they are active. If the traversal encounters an
active commit, discard the cache and retry the candidate with the
cycle-safe reachability walk. Cache the candidate's result so a later
walk that reaches it can reuse the answer. Die if the fallback cannot
read ancestry.

Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
 commit-reach.c | 29 +++++++++++++++++++++++++----
 commit-reach.h |  3 ++-
 t/t7004-tag.sh | 21 +++++++++++++++++++++
 3 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 9b3ea46d6f..1d34d66fe8 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -708,7 +708,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 
 /*
  * Test whether the candidate is contained in the list.
- * Do not recurse to find out, though, but return -1 if inconclusive.
+ * Do not recurse to find out, though, but return CONTAINS_UNKNOWN if
+ * inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
@@ -744,7 +745,7 @@ static void push_to_contains_stack(struct commit *candidate, struct contains_sta
 }
 
 static enum contains_result contains_tag_algo(struct commit *candidate,
-					      const struct commit_list *want,
+					      struct commit_list *want,
 					      struct contains_cache *cache)
 {
 	struct contains_stack contains_stack = { 0, 0, NULL };
@@ -765,6 +766,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
+	*contains_cache_at(cache, candidate) = CONTAINS_IN_PROGRESS;
 	push_to_contains_stack(candidate, &contains_stack);
 	while (contains_stack.nr) {
 		struct contains_stack_entry *entry = &contains_stack.contains_stack[contains_stack.nr - 1];
@@ -776,8 +778,8 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 			contains_stack.nr--;
 		}
 		/*
-		 * If we just popped the stack, parents->item has been marked,
-		 * therefore contains_test will return a meaningful yes/no.
+		 * A parent may have just been popped and marked, or may still
+		 * be active when replacement refs create a cycle.
 		 */
 		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
@@ -787,13 +789,32 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		case CONTAINS_NO:
 			entry->parents = parents->next;
 			break;
+		case CONTAINS_IN_PROGRESS:
+			/*
+			 * Partial negative answers are not safe across a cycle.
+			 * Discard them and use the cycle-safe reachability walk.
+			 */
+			goto cycle;
 		case CONTAINS_UNKNOWN:
+			*contains_cache_at(cache, parents->item) =
+				CONTAINS_IN_PROGRESS;
 			push_to_contains_stack(parents->item, &contains_stack);
 			break;
 		}
 	}
 	free(contains_stack.contains_stack);
 	return contains_test(candidate, want, cache, cutoff);
+
+cycle:
+	free(contains_stack.contains_stack);
+	clear_contains_cache(cache);
+
+	result = repo_is_descendant_of(the_repository, candidate, want);
+	if (result < 0)
+		die(_("failed to check reachability after detecting a cycle"));
+	*contains_cache_at(cache, candidate) =
+		result ? CONTAINS_YES : CONTAINS_NO;
+	return result ? CONTAINS_YES : CONTAINS_NO;
 }
 
 int commit_contains(struct ref_filter *filter, struct commit *commit,
diff --git a/commit-reach.h b/commit-reach.h
index 3f3a563d8a..f908d305b1 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -73,7 +73,8 @@ int ref_newer(const struct object_id *new_oid, const struct object_id *old_oid);
 enum contains_result {
 	CONTAINS_UNKNOWN = 0,
 	CONTAINS_NO,
-	CONTAINS_YES
+	CONTAINS_YES,
+	CONTAINS_IN_PROGRESS
 };
 
 define_commit_slab(contains_cache, enum contains_result);
diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
index d918005dd9..4044bab006 100755
--- a/t/t7004-tag.sh
+++ b/t/t7004-tag.sh
@@ -1611,6 +1611,27 @@ test_expect_success 'checking that first commit is in all tags (hash)' '
 	test_cmp expected actual
 '
 
+test_expect_success 'tag --contains handles cyclic replacement histories' '
+	first=$(git rev-parse HEAD~2) &&
+	second=$(git rev-parse HEAD~) &&
+	third=$(git rev-parse HEAD) &&
+	test_when_finished "
+		git replace -d $first &&
+		git replace -d $third &&
+		git tag -d cycle-a cycle-b
+	" &&
+	git tag cycle-a "$first" &&
+	git tag cycle-b "$third" &&
+	git replace --graft "$first" "$third" "$second" &&
+	git replace --graft "$third" "$first" &&
+	cat >expected <<-\EOF &&
+	cycle-a
+	cycle-b
+	EOF
+	git tag --contains="$second" --list "cycle-*" >actual &&
+	test_cmp expected actual
+'
+
 # other ways of specifying the commit
 test_expect_success 'checking that first commit is in all tags (tag)' '
 	cat >expected <<-\EOF &&

-- 
2.54.0.548.gbe7bb2469c


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 2/3] ref-filter: memoize --contains with generations
  2026-06-12  3:00 ` [PATCH v3 0/3] Reuse --contains traversal results Tamir Duberstein
  2026-06-12  3:00   ` [PATCH v3 1/3] commit-reach: handle cycles in contains walk Tamir Duberstein
@ 2026-06-12  3:00   ` Tamir Duberstein
  2026-06-12  3:00   ` [PATCH v3 3/3] commit-reach: die on contains walk errors Tamir Duberstein
  2026-06-12 21:49   ` [PATCH v4 0/3] Reuse --contains traversal results Tamir Duberstein
  3 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12  3:00 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Kristofer Karlsson,
	Tamir Duberstein

git branch and git for-each-ref run a separate reachability walk for
each ref considered by --contains and --no-contains. Refs with shared
history therefore traverse the same commits repeatedly.

git tag instead uses a depth-first walk that caches results across
refs. That walk can perform poorly without generation numbers: a
negative check may walk to the root instead of stopping at a nearby
divergence. Generation numbers let it stop below the oldest target.

Use the memoized walk for all ref-filter callers when generation
numbers are available. Keep git tag on its existing path without
generations. Caching still helps when many tags share deep history:
ffc4b8012d (tag: speed up --contains calculation, 2011-06-11) reduced
git tag --contains HEAD~200 in linux-2.6 from 15.417 to 5.329 seconds.

The new shared-history perf test improves from 0.72 to 0.03 seconds. In
a repository with 62,174 remote-tracking refs, running:

    git branch -r --contains c78ae85f3ce7e

improves from 104.365 seconds to 468 milliseconds.

Link: https://lore.kernel.org/git/1445163904-24611-1-git-send-email-Karthik.188@gmail.com/
Link: https://lore.kernel.org/r/20230324191009.GA536967@coredump.intra.peff.net
Link: https://lore.kernel.org/git/20260527070510.3510836-1-krka@spotify.com/
Link: https://lore.kernel.org/r/20260608223430.GA340696@coredump.intra.peff.net
Link: https://lore.kernel.org/r/CAOLa=ZSezQOj56-TezVaAcisUyczxhJmu4VghyFBHcBB_mKJ2A@mail.gmail.com
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
 commit-reach.c              |  3 ++-
 t/perf/p1500-graph-walks.sh | 28 +++++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 1d34d66fe8..572d2d47ff 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -820,7 +820,8 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 int commit_contains(struct ref_filter *filter, struct commit *commit,
 		    struct commit_list *list, struct contains_cache *cache)
 {
-	if (filter->with_commit_tag_algo)
+	if (filter->with_commit_tag_algo ||
+	    generation_numbers_enabled(the_repository))
 		return contains_tag_algo(commit, list, cache) == CONTAINS_YES;
 	return repo_is_descendant_of(the_repository, commit, list);
 }
diff --git a/t/perf/p1500-graph-walks.sh b/t/perf/p1500-graph-walks.sh
index 5b23ce5db9..d167b4f7e1 100755
--- a/t/perf/p1500-graph-walks.sh
+++ b/t/perf/p1500-graph-walks.sh
@@ -32,7 +32,16 @@ test_expect_success 'setup' '
 		echo "X:$line" >>test-tool-tags || return 1
 	done &&
 
-	commit=$(git commit-tree $(git rev-parse HEAD^{tree})) &&
+	git rev-list --first-parent --max-count=8192 HEAD >contains-commits &&
+	test_file_not_empty contains-commits &&
+	git update-ref refs/contains-perf-base "$(tail -n 1 contains-commits)" &&
+	awk "{
+		printf \"update refs/contains-perf/%04d %s\\n\", NR, \$1
+	}" contains-commits |
+		git update-ref --stdin &&
+	git pack-refs --include "refs/contains-perf/*" &&
+
+	commit=$(git commit-tree HEAD^{tree}) &&
 	git update-ref refs/heads/disjoint-base $commit &&
 
 	git commit-graph write --reachable
@@ -62,6 +71,23 @@ test_perf 'contains: git tag --merged' '
 	xargs git tag --merged=HEAD <tags
 '
 
+test_perf 'contains: git for-each-ref' '
+	git for-each-ref --contains=refs/contains-perf-base --stdin <refs
+'
+
+test_perf 'contains: git branch' '
+	xargs git branch --contains=refs/contains-perf-base <branches
+'
+
+test_perf 'contains: git tag' '
+	xargs git tag --contains=refs/contains-perf-base <tags
+'
+
+test_perf 'contains: synthetic shared history' '
+	git for-each-ref --contains=refs/contains-perf-base \
+		refs/contains-perf/ >/dev/null
+'
+
 test_perf 'is-base check: test-tool reach (refs)' '
 	test-tool reach get_branch_base_for_tip <test-tool-refs
 '

-- 
2.54.0.548.gbe7bb2469c


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 3/3] commit-reach: die on contains walk errors
  2026-06-12  3:00 ` [PATCH v3 0/3] Reuse --contains traversal results Tamir Duberstein
  2026-06-12  3:00   ` [PATCH v3 1/3] commit-reach: handle cycles in contains walk Tamir Duberstein
  2026-06-12  3:00   ` [PATCH v3 2/3] ref-filter: memoize --contains with generations Tamir Duberstein
@ 2026-06-12  3:00   ` Tamir Duberstein
  2026-06-12 21:49   ` [PATCH v4 0/3] Reuse --contains traversal results Tamir Duberstein
  3 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12  3:00 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Kristofer Karlsson,
	Tamir Duberstein

Without generation numbers, repo_is_descendant_of() can return -1 when
it cannot read commit ancestry. commit_contains() exposes that result
through a Boolean interface, so ref-filter treats it as true. This can
include a ref for --contains or exclude it for --no-contains without
failing the command.

Die when repo_is_descendant_of() reports an error. The memoized walk
already dies when it cannot parse a commit, so callers of the
non-memoized path no longer turn a failed walk into a match.

Reported-by: Jeff King <peff@peff.net>
Link: https://lore.kernel.org/r/20260611072942.GG2191159@coredump.intra.peff.net
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
 commit-reach.c                 |  8 +++++++-
 t/t6301-for-each-ref-errors.sh | 22 ++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/commit-reach.c b/commit-reach.c
index 572d2d47ff..af5563d70f 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -820,10 +820,16 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 int commit_contains(struct ref_filter *filter, struct commit *commit,
 		    struct commit_list *list, struct contains_cache *cache)
 {
+	int result;
+
 	if (filter->with_commit_tag_algo ||
 	    generation_numbers_enabled(the_repository))
 		return contains_tag_algo(commit, list, cache) == CONTAINS_YES;
-	return repo_is_descendant_of(the_repository, commit, list);
+
+	result = repo_is_descendant_of(the_repository, commit, list);
+	if (result < 0)
+		die(_("failed to check reachability"));
+	return result;
 }
 
 int can_all_from_reach_with_flag(struct object_array *from,
diff --git a/t/t6301-for-each-ref-errors.sh b/t/t6301-for-each-ref-errors.sh
index e06feb06e9..72b27c8be3 100755
--- a/t/t6301-for-each-ref-errors.sh
+++ b/t/t6301-for-each-ref-errors.sh
@@ -52,6 +52,28 @@ test_expect_success 'Missing objects are reported correctly' '
 	test_must_be_empty brief-err
 '
 
+test_expect_success 'missing ancestors are reported by contains filters' '
+	test_when_finished "git update-ref -d refs/heads/missing-parent" &&
+	{
+		echo "tree $(git rev-parse HEAD^{tree})" &&
+		echo "parent $MISSING" &&
+		git cat-file commit HEAD |
+			sed -n -e "/^author /p" -e "/^committer /p" &&
+		echo &&
+		echo "missing parent"
+	} >commit &&
+	broken=$(git hash-object -t commit -w commit) &&
+	git update-ref refs/heads/missing-parent "$broken" &&
+	for option in --contains --no-contains
+	do
+		test_must_fail git for-each-ref "$option=HEAD" \
+			refs/heads/missing-parent >out 2>err &&
+		test_must_be_empty out &&
+		test_grep "parse commit $MISSING" err ||
+		return 1
+	done
+'
+
 test_expect_success 'ahead-behind requires an argument' '
 	test_must_fail git for-each-ref \
 		--format="%(ahead-behind)" 2>err &&

-- 
2.54.0.548.gbe7bb2469c


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/3] commit-reach: handle cycles in contains walk
  2026-06-12  3:00   ` [PATCH v3 1/3] commit-reach: handle cycles in contains walk Tamir Duberstein
@ 2026-06-12  6:53     ` Kristofer Karlsson
  2026-06-12 21:26       ` Tamir Duberstein
  0 siblings, 1 reply; 22+ messages in thread
From: Kristofer Karlsson @ 2026-06-12  6:53 UTC (permalink / raw)
  To: Tamir Duberstein
  Cc: git, Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren

On Fri, 12 Jun 2026 at 05:00, Tamir Duberstein <tamird@gmail.com> wrote:
>
> The memoized contains traversal used by git tag assumes that commit
> ancestry is acyclic. Replacement refs can violate that assumption,
> causing it to keep pushing an already active commit until memory is
> exhausted.
>

The cycle detection itself makes sense, but would it be simpler to
just die() when a cycle is found rather than falling back to a
second reachability walk?

A cycle in the commit graph means replacement refs are
misconfigured.  The existing code already loops forever when it
hits one, so detecting and dying is strictly an improvement.  The
fallback adds a second codepath through the function, discards all
cached results (so later candidates redo work), and papers over
what is really a broken invariant.

do_lookup_replace_object() already dies when replacement refs
chain deeper than MAXREPLACEDEPTH (which covers cycles), so the
existing contract treats this as a fatal configuration error.
parse_commit_or_die() sets the same precedent within the walk
itself.

Kristofer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/3] commit-reach: handle cycles in contains walk
  2026-06-12  6:53     ` Kristofer Karlsson
@ 2026-06-12 21:26       ` Tamir Duberstein
  0 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12 21:26 UTC (permalink / raw)
  To: Kristofer Karlsson
  Cc: git, Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren

On Fri, Jun 12, 2026 at 2:53 AM Kristofer Karlsson <krka@spotify.com> wrote:
>
> On Fri, 12 Jun 2026 at 05:00, Tamir Duberstein <tamird@gmail.com> wrote:
> >
> > The memoized contains traversal used by git tag assumes that commit
> > ancestry is acyclic. Replacement refs can violate that assumption,
> > causing it to keep pushing an already active commit until memory is
> > exhausted.
> >
>
> The cycle detection itself makes sense, but would it be simpler to
> just die() when a cycle is found rather than falling back to a
> second reachability walk?
>
> A cycle in the commit graph means replacement refs are
> misconfigured.  The existing code already loops forever when it
> hits one, so detecting and dying is strictly an improvement.  The
> fallback adds a second codepath through the function, discards all
> cached results (so later candidates redo work), and papers over
> what is really a broken invariant.
>
> do_lookup_replace_object() already dies when replacement refs
> chain deeper than MAXREPLACEDEPTH (which covers cycles), so the
> existing contract treats this as a fatal configuration error.
> parse_commit_or_die() sets the same precedent within the walk
> itself.

Yes. The test creates an ancestry cycle through replacement commit
parents, so MAXREPLACEDEPTH does not catch this particular cycle. But I
agree with the design conclusion: the history is malformed and the
fallback only adds complexity.

Done in v4.

Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v4 0/3] Reuse --contains traversal results
  2026-06-12  3:00 ` [PATCH v3 0/3] Reuse --contains traversal results Tamir Duberstein
                     ` (2 preceding siblings ...)
  2026-06-12  3:00   ` [PATCH v3 3/3] commit-reach: die on contains walk errors Tamir Duberstein
@ 2026-06-12 21:49   ` Tamir Duberstein
  2026-06-12 21:49     ` [PATCH v4 1/3] commit-reach: reject cycles in contains walk Tamir Duberstein
                       ` (3 more replies)
  3 siblings, 4 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12 21:49 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Kristofer Karlsson,
	Tamir Duberstein

git tag uses a memoized traversal for --contains, while git branch
and git for-each-ref repeat a reachability walk for each ref. Reuse
the memoized traversal when generation numbers can bound the walk.

The first patch makes the memoized traversal reject cyclic replacement
histories. The last makes the non-memoized path report reachability
errors.

Signed-off-by: Tamir Duberstein <tamird@gmail.com>

---
Changes in v4:
- Die on cyclic ancestry instead of retrying another reachability walk.
- Update the cycle test and credit Kristofer Karlsson.
- Remove unexplained links to review messages.
- Link to v3: https://patch.msgid.link/20260611-ref-filter-memoized-contains-v3-0-b26af3dba285@gmail.com

Changes in v3:
- Split missing-ancestor error handling into its own patch.
- Use die() for reachability errors, remove redundant cache setup, and
  chain cycle-test cleanup.
- Drop the unrelated empty-target-list behavior change.
- Explain why git tag retains memoization without generation numbers.
- Add p1500 coverage for all three frontends and a shared-history
  case.
- Remove correctness checks from p1500 and drop output hashes.
- Link to v2: https://patch.msgid.link/20260608-ref-filter-memoized-contains-v2-0-e72720344a7c@gmail.com

Changes in v2:
- Split cycle handling into a preparatory patch.
- Exercise cycle handling through the existing git tag path.
- Move perf result verification out of setup.
- Link to v1: https://patch.msgid.link/20260607-ref-filter-memoized-contains-v1-1-a1972dde9c76@gmail.com

---
Tamir Duberstein (3):
      commit-reach: reject cycles in contains walk
      ref-filter: memoize --contains with generations
      commit-reach: die on contains walk errors

 commit-reach.c                 | 23 ++++++++++++++++++-----
 commit-reach.h                 |  3 ++-
 t/perf/p1500-graph-walks.sh    | 28 +++++++++++++++++++++++++++-
 t/t6301-for-each-ref-errors.sh | 22 ++++++++++++++++++++++
 t/t7004-tag.sh                 | 18 ++++++++++++++++++
 5 files changed, 87 insertions(+), 7 deletions(-)
---
base-commit: 9ac3f193c05c2237e2b14ebaa1149e9fc8a1abe0
change-id: 20260607-ref-filter-memoized-contains-7cb6b3bccad1

Best regards,
--  
Tamir Duberstein <tamird@gmail.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v4 1/3] commit-reach: reject cycles in contains walk
  2026-06-12 21:49   ` [PATCH v4 0/3] Reuse --contains traversal results Tamir Duberstein
@ 2026-06-12 21:49     ` Tamir Duberstein
  2026-06-12 21:49     ` [PATCH v4 2/3] ref-filter: memoize --contains with generations Tamir Duberstein
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12 21:49 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Kristofer Karlsson,
	Tamir Duberstein

The memoized contains traversal used by git tag assumes that commit
ancestry is acyclic. Replacement refs can violate that assumption,
causing it to keep pushing an already active commit until memory is
exhausted.

Mark commits while they are active and die if the traversal encounters
an active commit. Other failures in this walk already die through
parse_commit_or_die(); using a second reachability walk would only add
a separate policy for malformed history.

Suggested-by: Kristofer Karlsson <krka@spotify.com>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
 commit-reach.c | 12 +++++++++---
 commit-reach.h |  3 ++-
 t/t7004-tag.sh | 18 ++++++++++++++++++
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 9b3ea46d6f..e1bedc596d 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -708,7 +708,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 
 /*
  * Test whether the candidate is contained in the list.
- * Do not recurse to find out, though, but return -1 if inconclusive.
+ * Do not recurse to find out, though, but return CONTAINS_UNKNOWN if
+ * inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
 					  const struct commit_list *want,
@@ -765,6 +766,7 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 	if (result != CONTAINS_UNKNOWN)
 		return result;
 
+	*contains_cache_at(cache, candidate) = CONTAINS_IN_PROGRESS;
 	push_to_contains_stack(candidate, &contains_stack);
 	while (contains_stack.nr) {
 		struct contains_stack_entry *entry = &contains_stack.contains_stack[contains_stack.nr - 1];
@@ -776,8 +778,8 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 			contains_stack.nr--;
 		}
 		/*
-		 * If we just popped the stack, parents->item has been marked,
-		 * therefore contains_test will return a meaningful yes/no.
+		 * A parent may have just been popped and marked, or may still
+		 * be active when replacement refs create a cycle.
 		 */
 		else switch (contains_test(parents->item, want, cache, cutoff)) {
 		case CONTAINS_YES:
@@ -787,7 +789,11 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 		case CONTAINS_NO:
 			entry->parents = parents->next;
 			break;
+		case CONTAINS_IN_PROGRESS:
+			die(_("commit ancestry contains a cycle"));
 		case CONTAINS_UNKNOWN:
+			*contains_cache_at(cache, parents->item) =
+				CONTAINS_IN_PROGRESS;
 			push_to_contains_stack(parents->item, &contains_stack);
 			break;
 		}
diff --git a/commit-reach.h b/commit-reach.h
index 3f3a563d8a..f908d305b1 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -73,7 +73,8 @@ int ref_newer(const struct object_id *new_oid, const struct object_id *old_oid);
 enum contains_result {
 	CONTAINS_UNKNOWN = 0,
 	CONTAINS_NO,
-	CONTAINS_YES
+	CONTAINS_YES,
+	CONTAINS_IN_PROGRESS
 };
 
 define_commit_slab(contains_cache, enum contains_result);
diff --git a/t/t7004-tag.sh b/t/t7004-tag.sh
index d918005dd9..67309494d2 100755
--- a/t/t7004-tag.sh
+++ b/t/t7004-tag.sh
@@ -1611,6 +1611,24 @@ test_expect_success 'checking that first commit is in all tags (hash)' '
 	test_cmp expected actual
 '
 
+test_expect_success 'tag --contains rejects cyclic replacement histories' '
+	first=$(git rev-parse HEAD~2) &&
+	second=$(git rev-parse HEAD~) &&
+	third=$(git rev-parse HEAD) &&
+	test_when_finished "
+		git replace -d $first &&
+		git replace -d $third &&
+		git tag -d cycle-a cycle-b
+	" &&
+	git tag cycle-a "$first" &&
+	git tag cycle-b "$third" &&
+	git replace --graft "$first" "$third" "$second" &&
+	git replace --graft "$third" "$first" &&
+	test_must_fail git tag --contains="$second" --list "cycle-*" \
+		>/dev/null 2>err &&
+	test_grep "fatal: commit ancestry contains a cycle" err
+'
+
 # other ways of specifying the commit
 test_expect_success 'checking that first commit is in all tags (tag)' '
 	cat >expected <<-\EOF &&

-- 
2.54.0.548.gbe7bb2469c


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 2/3] ref-filter: memoize --contains with generations
  2026-06-12 21:49   ` [PATCH v4 0/3] Reuse --contains traversal results Tamir Duberstein
  2026-06-12 21:49     ` [PATCH v4 1/3] commit-reach: reject cycles in contains walk Tamir Duberstein
@ 2026-06-12 21:49     ` Tamir Duberstein
  2026-06-12 21:49     ` [PATCH v4 3/3] commit-reach: die on contains walk errors Tamir Duberstein
  2026-06-29 20:40     ` [PATCH v4 0/3] Reuse --contains traversal results Junio C Hamano
  3 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12 21:49 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Kristofer Karlsson,
	Tamir Duberstein

git branch and git for-each-ref run a separate reachability walk for
each ref considered by --contains and --no-contains. Refs with shared
history therefore traverse the same commits repeatedly.

git tag instead uses a depth-first walk that caches results across
refs. That walk can perform poorly without generation numbers: a
negative check may walk to the root instead of stopping at a nearby
divergence. Generation numbers let it stop below the oldest target.

Use the memoized walk for all ref-filter callers when generation
numbers are available. Keep git tag on its existing path without
generations. Caching still helps when many tags share deep history:
ffc4b8012d (tag: speed up --contains calculation, 2011-06-11) reduced
git tag --contains HEAD~200 in linux-2.6 from 15.417 to 5.329 seconds.

The new shared-history perf test improves from 0.72 to 0.03 seconds. In
a repository with 62,174 remote-tracking refs, running:

    git branch -r --contains c78ae85f3ce7e

improves from 104.365 seconds to 468 milliseconds.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
 commit-reach.c              |  3 ++-
 t/perf/p1500-graph-walks.sh | 28 +++++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index e1bedc596d..18fcd69113 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -805,7 +805,8 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 int commit_contains(struct ref_filter *filter, struct commit *commit,
 		    struct commit_list *list, struct contains_cache *cache)
 {
-	if (filter->with_commit_tag_algo)
+	if (filter->with_commit_tag_algo ||
+	    generation_numbers_enabled(the_repository))
 		return contains_tag_algo(commit, list, cache) == CONTAINS_YES;
 	return repo_is_descendant_of(the_repository, commit, list);
 }
diff --git a/t/perf/p1500-graph-walks.sh b/t/perf/p1500-graph-walks.sh
index 5b23ce5db9..d167b4f7e1 100755
--- a/t/perf/p1500-graph-walks.sh
+++ b/t/perf/p1500-graph-walks.sh
@@ -32,7 +32,16 @@ test_expect_success 'setup' '
 		echo "X:$line" >>test-tool-tags || return 1
 	done &&
 
-	commit=$(git commit-tree $(git rev-parse HEAD^{tree})) &&
+	git rev-list --first-parent --max-count=8192 HEAD >contains-commits &&
+	test_file_not_empty contains-commits &&
+	git update-ref refs/contains-perf-base "$(tail -n 1 contains-commits)" &&
+	awk "{
+		printf \"update refs/contains-perf/%04d %s\\n\", NR, \$1
+	}" contains-commits |
+		git update-ref --stdin &&
+	git pack-refs --include "refs/contains-perf/*" &&
+
+	commit=$(git commit-tree HEAD^{tree}) &&
 	git update-ref refs/heads/disjoint-base $commit &&
 
 	git commit-graph write --reachable
@@ -62,6 +71,23 @@ test_perf 'contains: git tag --merged' '
 	xargs git tag --merged=HEAD <tags
 '
 
+test_perf 'contains: git for-each-ref' '
+	git for-each-ref --contains=refs/contains-perf-base --stdin <refs
+'
+
+test_perf 'contains: git branch' '
+	xargs git branch --contains=refs/contains-perf-base <branches
+'
+
+test_perf 'contains: git tag' '
+	xargs git tag --contains=refs/contains-perf-base <tags
+'
+
+test_perf 'contains: synthetic shared history' '
+	git for-each-ref --contains=refs/contains-perf-base \
+		refs/contains-perf/ >/dev/null
+'
+
 test_perf 'is-base check: test-tool reach (refs)' '
 	test-tool reach get_branch_base_for_tip <test-tool-refs
 '

-- 
2.54.0.548.gbe7bb2469c


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v4 3/3] commit-reach: die on contains walk errors
  2026-06-12 21:49   ` [PATCH v4 0/3] Reuse --contains traversal results Tamir Duberstein
  2026-06-12 21:49     ` [PATCH v4 1/3] commit-reach: reject cycles in contains walk Tamir Duberstein
  2026-06-12 21:49     ` [PATCH v4 2/3] ref-filter: memoize --contains with generations Tamir Duberstein
@ 2026-06-12 21:49     ` Tamir Duberstein
  2026-06-29 20:40     ` [PATCH v4 0/3] Reuse --contains traversal results Junio C Hamano
  3 siblings, 0 replies; 22+ messages in thread
From: Tamir Duberstein @ 2026-06-12 21:49 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Karthik Nayak, Junio C Hamano, Victoria Dye,
	Derrick Stolee, Elijah Newren, Kristofer Karlsson,
	Tamir Duberstein

Without generation numbers, repo_is_descendant_of() can return -1 when
it cannot read commit ancestry. commit_contains() exposes that result
through a Boolean interface, so ref-filter treats it as true. This can
include a ref for --contains or exclude it for --no-contains without
failing the command.

Die when repo_is_descendant_of() reports an error. The memoized walk
already dies when it cannot parse a commit, so callers of the
non-memoized path no longer turn a failed walk into a match.

Reported-by: Jeff King <peff@peff.net>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
 commit-reach.c                 |  8 +++++++-
 t/t6301-for-each-ref-errors.sh | 22 ++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/commit-reach.c b/commit-reach.c
index 18fcd69113..37b66b6b21 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -805,10 +805,16 @@ static enum contains_result contains_tag_algo(struct commit *candidate,
 int commit_contains(struct ref_filter *filter, struct commit *commit,
 		    struct commit_list *list, struct contains_cache *cache)
 {
+	int result;
+
 	if (filter->with_commit_tag_algo ||
 	    generation_numbers_enabled(the_repository))
 		return contains_tag_algo(commit, list, cache) == CONTAINS_YES;
-	return repo_is_descendant_of(the_repository, commit, list);
+
+	result = repo_is_descendant_of(the_repository, commit, list);
+	if (result < 0)
+		die(_("failed to check reachability"));
+	return result;
 }
 
 int can_all_from_reach_with_flag(struct object_array *from,
diff --git a/t/t6301-for-each-ref-errors.sh b/t/t6301-for-each-ref-errors.sh
index e06feb06e9..72b27c8be3 100755
--- a/t/t6301-for-each-ref-errors.sh
+++ b/t/t6301-for-each-ref-errors.sh
@@ -52,6 +52,28 @@ test_expect_success 'Missing objects are reported correctly' '
 	test_must_be_empty brief-err
 '
 
+test_expect_success 'missing ancestors are reported by contains filters' '
+	test_when_finished "git update-ref -d refs/heads/missing-parent" &&
+	{
+		echo "tree $(git rev-parse HEAD^{tree})" &&
+		echo "parent $MISSING" &&
+		git cat-file commit HEAD |
+			sed -n -e "/^author /p" -e "/^committer /p" &&
+		echo &&
+		echo "missing parent"
+	} >commit &&
+	broken=$(git hash-object -t commit -w commit) &&
+	git update-ref refs/heads/missing-parent "$broken" &&
+	for option in --contains --no-contains
+	do
+		test_must_fail git for-each-ref "$option=HEAD" \
+			refs/heads/missing-parent >out 2>err &&
+		test_must_be_empty out &&
+		test_grep "parse commit $MISSING" err ||
+		return 1
+	done
+'
+
 test_expect_success 'ahead-behind requires an argument' '
 	test_must_fail git for-each-ref \
 		--format="%(ahead-behind)" 2>err &&

-- 
2.54.0.548.gbe7bb2469c


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 0/3] Reuse --contains traversal results
  2026-06-12 21:49   ` [PATCH v4 0/3] Reuse --contains traversal results Tamir Duberstein
                       ` (2 preceding siblings ...)
  2026-06-12 21:49     ` [PATCH v4 3/3] commit-reach: die on contains walk errors Tamir Duberstein
@ 2026-06-29 20:40     ` Junio C Hamano
  3 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2026-06-29 20:40 UTC (permalink / raw)
  To: Tamir Duberstein
  Cc: git, Jeff King, Karthik Nayak, Victoria Dye, Derrick Stolee,
	Elijah Newren, Kristofer Karlsson

Tamir Duberstein <tamird@gmail.com> writes:

> git tag uses a memoized traversal for --contains, while git branch
> and git for-each-ref repeat a reachability walk for each ref. Reuse
> the memoized traversal when generation numbers can bound the walk.
>
> The first patch makes the memoized traversal reject cyclic replacement
> histories. The last makes the non-memoized path report reachability
> errors.

This unfortunately hasn't heard any responses since June 12th.  Are
there remaining issues with it?  Or do people fundamentally have
objections against this change?  Or things are too busy in general
that there are more patches than there are folks willing to review
them?


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2026-06-29 20:40 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-09  2:36 [PATCH v2 0/2] Reuse --contains traversal results Tamir Duberstein
2026-06-09  2:36 ` [PATCH v2 1/2] commit-reach: handle cycles in contains walk Tamir Duberstein
2026-06-11  7:29   ` Jeff King
2026-06-12  2:40     ` Tamir Duberstein
2026-06-09  2:36 ` [PATCH v2 2/2] ref-filter: memoize --contains with generations Tamir Duberstein
2026-06-10 11:47   ` Karthik Nayak
2026-06-10 12:20     ` Tamir Duberstein
2026-06-11  8:16       ` Karthik Nayak
2026-06-11 20:10         ` Tamir Duberstein
2026-06-11  8:22   ` Jeff King
2026-06-12  2:40     ` Tamir Duberstein
2026-06-12  3:00 ` [PATCH v3 0/3] Reuse --contains traversal results Tamir Duberstein
2026-06-12  3:00   ` [PATCH v3 1/3] commit-reach: handle cycles in contains walk Tamir Duberstein
2026-06-12  6:53     ` Kristofer Karlsson
2026-06-12 21:26       ` Tamir Duberstein
2026-06-12  3:00   ` [PATCH v3 2/3] ref-filter: memoize --contains with generations Tamir Duberstein
2026-06-12  3:00   ` [PATCH v3 3/3] commit-reach: die on contains walk errors Tamir Duberstein
2026-06-12 21:49   ` [PATCH v4 0/3] Reuse --contains traversal results Tamir Duberstein
2026-06-12 21:49     ` [PATCH v4 1/3] commit-reach: reject cycles in contains walk Tamir Duberstein
2026-06-12 21:49     ` [PATCH v4 2/3] ref-filter: memoize --contains with generations Tamir Duberstein
2026-06-12 21:49     ` [PATCH v4 3/3] commit-reach: die on contains walk errors Tamir Duberstein
2026-06-29 20:40     ` [PATCH v4 0/3] Reuse --contains traversal results Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox