[PATCH 0/6] odb: track commit graphs via object source

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] odb: track commit graphs via object source
@ 2025-09-04 12:49 Patrick Steinhardt
  2025-09-04 12:49 ` [PATCH 1/6] blame: drop explicit check for commit graph Patrick Steinhardt
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-04 12:49 UTC (permalink / raw)
  To: git

Hi,

commit graphs are currently stored on the object database level. This
doesn't really make much sense conceptually, given that commit graphs
are specific to one object source. Furthermore, with the upcoming
pluggable object database effort, an object source's backend may not
evene have a commit graph in the first place but store that information
in a different format altogether.

This patch series prepares for that by moving the commit graph from
`struct object_database` into `struct odb_source`.

There's a trivial conflict with tc/last-modified that can be solved like
this:

diff --cc commit-graph.c
index 9929c1ed87,2f20f66cfd..0000000000
--- a/commit-graph.c
+++ b/commit-graph.c
@@@ -823,7 -812,12 +823,11 @@@ int corrected_commit_dates_enabled(stru
  
  struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
  {
 -	struct commit_graph *g;
 +	struct commit_graph *g = prepare_commit_graph(r);
+ 
 -	if (!prepare_commit_graph(r))
++	if (!g)
+ 	       return NULL;
+ 
 -	g = r->objects->commit_graph;
  	while (g) {
  		if (g->bloom_filter_settings)
  			return g->bloom_filter_settings;

Thanks!

Patrick

---
Patrick Steinhardt (6):
      blame: drop explicit check for commit graph
      revision: drop explicit check for commit graph
      commit-graph: return the prepared commit graph from `prepare_commit_graph()`
      commit-graph: return commit graph from `repo_find_commit_pos_in_graph()`
      commit-graph: pass graphs that are to be merged as parameter
      odb: move commit-graph into the object sources

 blame.c        |   3 -
 bloom.c        |   8 ++-
 commit-graph.c | 177 +++++++++++++++++++++++++++++++--------------------------
 commit-graph.h |  14 ++---
 odb.c          |   9 +--
 odb.h          |   6 +-
 packfile.c     |   3 +-
 revision.c     |   3 -
 8 files changed, 118 insertions(+), 105 deletions(-)


---
base-commit: 2462961280690837670d997bde64bd4ebf8ae66d
change-id: 20250904-b4-pks-commit-graph-via-source-b4e1a1edd214


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/6] blame: drop explicit check for commit graph
  2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
@ 2025-09-04 12:49 ` Patrick Steinhardt
  2025-09-11 22:09   ` Taylor Blau
  2025-09-04 12:49 ` [PATCH 2/6] revision: " Patrick Steinhardt
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-04 12:49 UTC (permalink / raw)
  To: git

Our blaming subsystem knows to use bloom filters from commit graphs to
speed up the whole computation. The setup of this happens in
`setup_blame_bloom_data()`, where we first verify that we even have a
commit graph in the first place. This check is redundant though, as we
call `get_bloom_filter_settings()` immediately afterwards which, which
already knows to return a `NULL` pointer in case we don't have a commit
graph.

Drop the redundant check.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 blame.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/blame.c b/blame.c
index f1c0670144..cb0b083423 100644
--- a/blame.c
+++ b/blame.c
@@ -2909,9 +2909,6 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb)
 	struct blame_bloom_data *bd;
 	struct bloom_filter_settings *bs;

-	if (!sb->repo->objects->commit_graph)
-		return;
-
 	bs = get_bloom_filter_settings(sb->repo);
 	if (!bs)
 		return;

-- 
2.51.0.417.g1ba7204a04.dirty

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/6] revision: drop explicit check for commit graph
  2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
  2025-09-04 12:49 ` [PATCH 1/6] blame: drop explicit check for commit graph Patrick Steinhardt
@ 2025-09-04 12:49 ` Patrick Steinhardt
  2025-09-11 22:16   ` Taylor Blau
  2025-09-04 12:49 ` [PATCH 3/6] commit-graph: return the prepared commit graph from `prepare_commit_graph()` Patrick Steinhardt
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-04 12:49 UTC (permalink / raw)
  To: git

When filtering down revisions by paths we know to use bloom filters from
the commit graph, if we have any. The entry point for this is in
`check_maybe_different_in_bloom_filter()`, where we first verify that:

  - We do have a commit graph.

  - That the commit is contained therein by checking that we have a
    proper generation number.

  - And that the graph contains a bloom filter.

The first check is somewhat redundant though: if we don't have a commit
graph, then the second check would already tell us that we don't have a
generation number for the specific commit.

In theory this could be seen as a performance optimization to
short-circuit for scenarios where there is no commit graph. But in
practice this shouldn't matter: if there is no commit graph, then the
commit graph data slab would also be unpopulated and thus a lookup of
the commit should happen in constant time.

Drop the unnecessary check.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 revision.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/revision.c b/revision.c
index 6ba8f67054..6018f30a99 100644
--- a/revision.c
+++ b/revision.c
@@ -774,9 +774,6 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 	struct bloom_filter *filter;
 	int result = 0;

-	if (!revs->repo->objects->commit_graph)
-		return -1;
-
 	if (commit_graph_generation(commit) == GENERATION_NUMBER_INFINITY)
 		return -1;

-- 
2.51.0.417.g1ba7204a04.dirty

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/6] commit-graph: return the prepared commit graph from `prepare_commit_graph()`
  2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
  2025-09-04 12:49 ` [PATCH 1/6] blame: drop explicit check for commit graph Patrick Steinhardt
  2025-09-04 12:49 ` [PATCH 2/6] revision: " Patrick Steinhardt
@ 2025-09-04 12:49 ` Patrick Steinhardt
  2025-09-11 22:25   ` Taylor Blau
  2025-09-04 12:49 ` [PATCH 4/6] commit-graph: return commit graph from `repo_find_commit_pos_in_graph()` Patrick Steinhardt
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-04 12:49 UTC (permalink / raw)
  To: git

When making use of commit graphs, one needs to first prepare them by
calling `prepare_commit_graph()`. Once that function was called and the
commit graph was prepared successfully, the caller is now expected to
access the graph directly via `struct object_database::commit_graph`.

In a subsequent change, we're going to move the commit graph pointer
from `struct object_database` into `struct odb_source`. With this
change, semantics will change so that we use the commit graph of the
first source that has one. Consequently, all callers that currently
deference the `commit_graph` pointer would now have to loop around the
list of sources to find the commit graph.

This would become quite unwieldy. So instead of shifting the burden onto
such callers, adapt `prepare_commit_graph()` to return the prepared
commit graph, if any. Like this, callers are expected to call that
function and then use the returned commit graph.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 commit-graph.c | 82 +++++++++++++++++++++++-----------------------------------
 1 file changed, 32 insertions(+), 50 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 3cd9e73e2a..62260a2026 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -735,7 +735,7 @@ struct commit_graph *read_commit_graph_one(struct odb_source *source)
  * On the first invocation, this function attempts to load the commit
  * graph if the repository is configured to have one.
  */
-static int prepare_commit_graph(struct repository *r)
+static struct commit_graph *prepare_commit_graph(struct repository *r)
 {
 	struct odb_source *source;
 
@@ -747,10 +747,10 @@ static int prepare_commit_graph(struct repository *r)
 	 * we want to disable even an already-loaded graph file.
 	 */
 	if (!r->gitdir || r->commit_graph_disabled)
-		return 0;
+		return NULL;
 
 	if (r->objects->commit_graph_attempted)
-		return !!r->objects->commit_graph;
+		return r->objects->commit_graph;
 	r->objects->commit_graph_attempted = 1;
 
 	prepare_repo_settings(r);
@@ -763,10 +763,10 @@ static int prepare_commit_graph(struct repository *r)
 		 * so that commit graph loading is not attempted again for this
 		 * repository.)
 		 */
-		return 0;
+		return NULL;
 
 	if (!commit_graph_compatible(r))
-		return 0;
+		return NULL;
 
 	odb_prepare_alternates(r->objects);
 	for (source = r->objects->sources; source; source = source->next) {
@@ -775,20 +775,17 @@ static int prepare_commit_graph(struct repository *r)
 			break;
 	}
 
-	return !!r->objects->commit_graph;
+	return r->objects->commit_graph;
 }
 
 int generation_numbers_enabled(struct repository *r)
 {
 	uint32_t first_generation;
 	struct commit_graph *g;
-	if (!prepare_commit_graph(r))
-	       return 0;
 
-	g = r->objects->commit_graph;
-
-	if (!g->num_commits)
-		return 0;
+	g = prepare_commit_graph(r);
+	if (!g || !g->num_commits)
+	       return 0;
 
 	first_generation = get_be32(g->chunk_commit_data +
 				    g->hash_algo->rawsz + 8) >> 2;
@@ -799,12 +796,9 @@ int generation_numbers_enabled(struct repository *r)
 int corrected_commit_dates_enabled(struct repository *r)
 {
 	struct commit_graph *g;
-	if (!prepare_commit_graph(r))
-		return 0;
 
-	g = r->objects->commit_graph;
-
-	if (!g->num_commits)
+	g = prepare_commit_graph(r);
+	if (!g || !g->num_commits)
 		return 0;
 
 	return g->read_generation_data;
@@ -1012,23 +1006,26 @@ static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g,
 int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
 				  uint32_t *pos)
 {
-	if (!prepare_commit_graph(r))
+	struct commit_graph *g = prepare_commit_graph(r);
+	if (!g)
 		return 0;
-	return find_commit_pos_in_graph(c, r->objects->commit_graph, pos);
+	return find_commit_pos_in_graph(c, g, pos);
 }
 
 struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
 {
 	static int commit_graph_paranoia = -1;
+	struct commit_graph *g;
 	struct commit *commit;
 	uint32_t pos;
 
 	if (commit_graph_paranoia == -1)
 		commit_graph_paranoia = git_env_bool(GIT_COMMIT_GRAPH_PARANOIA, 0);
 
-	if (!prepare_commit_graph(repo))
+	g = prepare_commit_graph(repo);
+	if (!g)
 		return NULL;
-	if (!search_commit_pos_in_graph(id, repo->objects->commit_graph, &pos))
+	if (!search_commit_pos_in_graph(id, g, &pos))
 		return NULL;
 	if (commit_graph_paranoia && !odb_has_object(repo->objects, id, 0))
 		return NULL;
@@ -1039,7 +1036,7 @@ struct commit *lookup_commit_in_graph(struct repository *repo, const struct obje
 	if (commit->object.parsed)
 		return commit;
 
-	if (!fill_commit_in_graph(commit, repo->objects->commit_graph, pos))
+	if (!fill_commit_in_graph(commit, g, pos))
 		return NULL;
 
 	return commit;
@@ -1062,6 +1059,7 @@ static int parse_commit_in_graph_one(struct commit_graph *g,
 int parse_commit_in_graph(struct repository *r, struct commit *item)
 {
 	static int checked_env = 0;
+	struct commit_graph *g;
 
 	if (!checked_env &&
 	    git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE, 0))
@@ -1069,9 +1067,10 @@ int parse_commit_in_graph(struct repository *r, struct commit *item)
 		    GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE);
 	checked_env = 1;
 
-	if (!prepare_commit_graph(r))
+	g = prepare_commit_graph(r);
+	if (!g)
 		return 0;
-	return parse_commit_in_graph_one(r->objects->commit_graph, item);
+	return parse_commit_in_graph_one(g, item);
 }
 
 void load_commit_graph_info(struct repository *r, struct commit *item)
@@ -2519,6 +2518,7 @@ int write_commit_graph(struct odb_source *source,
 	int replace = 0;
 	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	struct topo_level_slab topo_levels;
+	struct commit_graph *g;
 
 	prepare_repo_settings(r);
 	if (!r->settings.core_commit_graph) {
@@ -2547,23 +2547,13 @@ int write_commit_graph(struct odb_source *source,
 	init_topo_level_slab(&topo_levels);
 	ctx.topo_levels = &topo_levels;
 
-	prepare_commit_graph(ctx.r);
-	if (ctx.r->objects->commit_graph) {
-		struct commit_graph *g = ctx.r->objects->commit_graph;
-
-		while (g) {
-			g->topo_levels = &topo_levels;
-			g = g->base_graph;
-		}
-	}
+	g = prepare_commit_graph(ctx.r);
+	for (struct commit_graph *chain = g; chain; chain = chain->base_graph)
+		g->topo_levels = &topo_levels;
 
 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
 		ctx.changed_paths = 1;
 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
-		struct commit_graph *g;
-
-		g = ctx.r->objects->commit_graph;
-
 		/* We have changed-paths already. Keep them in the next graph */
 		if (g && g->bloom_filter_settings) {
 			ctx.changed_paths = 1;
@@ -2580,22 +2570,15 @@ int write_commit_graph(struct odb_source *source,
 	bloom_settings.hash_version = bloom_settings.hash_version == 2 ? 2 : 1;
 
 	if (ctx.split) {
-		struct commit_graph *g = ctx.r->objects->commit_graph;
-
-		while (g) {
+		for (struct commit_graph *chain = g; chain; chain = chain->base_graph)
 			ctx.num_commit_graphs_before++;
-			g = g->base_graph;
-		}
 
 		if (ctx.num_commit_graphs_before) {
 			ALLOC_ARRAY(ctx.commit_graph_filenames_before, ctx.num_commit_graphs_before);
 			i = ctx.num_commit_graphs_before;
-			g = ctx.r->objects->commit_graph;
 
-			while (g) {
-				ctx.commit_graph_filenames_before[--i] = xstrdup(g->filename);
-				g = g->base_graph;
-			}
+			for (struct commit_graph *chain = g; chain; chain = chain->base_graph)
+				ctx.commit_graph_filenames_before[--i] = xstrdup(chain->filename);
 		}
 
 		if (ctx.opts)
@@ -2604,8 +2587,7 @@ int write_commit_graph(struct odb_source *source,
 
 	ctx.approx_nr_objects = repo_approximate_object_count(r);
 
-	if (ctx.append && ctx.r->objects->commit_graph) {
-		struct commit_graph *g = ctx.r->objects->commit_graph;
+	if (ctx.append && g) {
 		for (i = 0; i < g->num_commits; i++) {
 			struct object_id oid;
 			oidread(&oid, g->chunk_oid_lookup + st_mult(g->hash_algo->rawsz, i),
@@ -2651,7 +2633,7 @@ int write_commit_graph(struct odb_source *source,
 	} else
 		ctx.num_commit_graphs_after = 1;
 
-	ctx.trust_generation_numbers = validate_mixed_generation_chain(ctx.r->objects->commit_graph);
+	ctx.trust_generation_numbers = validate_mixed_generation_chain(g);
 
 	compute_topological_levels(&ctx);
 	if (ctx.write_generation_data)

-- 
2.51.0.417.g1ba7204a04.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/6] commit-graph: return commit graph from `repo_find_commit_pos_in_graph()`
  2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
                   ` (2 preceding siblings ...)
  2025-09-04 12:49 ` [PATCH 3/6] commit-graph: return the prepared commit graph from `prepare_commit_graph()` Patrick Steinhardt
@ 2025-09-04 12:49 ` Patrick Steinhardt
  2025-09-11 22:54   ` Taylor Blau
  2025-09-04 12:49 ` [PATCH 5/6] commit-graph: pass graphs that are to be merged as parameter Patrick Steinhardt
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-04 12:49 UTC (permalink / raw)
  To: git

The function `repo_find_commit_pos_in_graph()` takes a commit as input
and tries to figure out whether the given repository has a commit graph
that contains that specific commit. If so, it returns the corresponding
position of that commit inside the graph.

Right now though we only return the position, but not the actual graph
that the commit has been found in. This is sensible as repositories
always have the graph in `struct repository::objects::commit_graph`.
Consequently, the caller always knows where to find it.

But in a subsequent change we're going to move the graph into the object
sources. This would require callers of the function to loop through all
sources to find the relevant commit graph.

Refactor the code so that we instead return the commit-graph that the
commit has been found with.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 bloom.c        |  8 +++++---
 commit-graph.c | 18 ++++++++++++------
 commit-graph.h | 12 ++++++------
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/bloom.c b/bloom.c
index b86015f6d1..2d7b951e5b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -452,10 +452,12 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	filter = bloom_filter_slab_at(&bloom_filters, c);
 
 	if (!filter->data) {
+		struct commit_graph *g;
 		uint32_t graph_pos;
-		if (repo_find_commit_pos_in_graph(r, c, &graph_pos))
-			load_bloom_filter_from_graph(r->objects->commit_graph,
-						     filter, graph_pos);
+
+		g = repo_find_commit_pos_in_graph(r, c, &graph_pos);
+		if (g)
+			load_bloom_filter_from_graph(g, filter, graph_pos);
 	}
 
 	if (filter->data && filter->len) {
diff --git a/commit-graph.c b/commit-graph.c
index 62260a2026..16dfe58229 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1003,13 +1003,16 @@ static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g,
 	}
 }
 
-int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
-				  uint32_t *pos)
+struct commit_graph *repo_find_commit_pos_in_graph(struct repository *r,
+						   struct commit *c,
+						   uint32_t *pos)
 {
 	struct commit_graph *g = prepare_commit_graph(r);
 	if (!g)
-		return 0;
-	return find_commit_pos_in_graph(c, g, pos);
+		return NULL;
+	if (!find_commit_pos_in_graph(c, g, pos))
+		return NULL;
+	return g;
 }
 
 struct commit *lookup_commit_in_graph(struct repository *repo, const struct object_id *id)
@@ -1075,9 +1078,12 @@ int parse_commit_in_graph(struct repository *r, struct commit *item)
 
 void load_commit_graph_info(struct repository *r, struct commit *item)
 {
+	struct commit_graph *g;
 	uint32_t pos;
-	if (repo_find_commit_pos_in_graph(r, item, &pos))
-		fill_commit_graph_info(item, r->objects->commit_graph, pos);
+
+	g = repo_find_commit_pos_in_graph(r, item, &pos);
+	if (g)
+		fill_commit_graph_info(item, g, pos);
 }
 
 static struct tree *load_tree_for_commit(struct commit_graph *g,
diff --git a/commit-graph.h b/commit-graph.h
index 4899b54ef8..f6a5433641 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -48,10 +48,9 @@ int open_commit_graph_chain(const char *chain_file, int *fd, struct stat *st,
 int parse_commit_in_graph(struct repository *r, struct commit *item);
 
 /*
- * Fills `*pos` with the graph position of `c`, and returns 1 if `c` is
- * found in the commit-graph belonging to `r`, or 0 otherwise.
- * Initializes the commit-graph belonging to `r` if it hasn't been
- * already.
+ * Fills `*pos` with the graph position of `c`, and returns the graph `c` is
+ * found in, or NULL otherwise. Initializes the commit-graphs belonging to
+ * `r` if it hasn't been already.
  *
  * Note: this is a low-level helper that does not alter any slab data
  * associated with `c`. Useful in circumstances where the slab data is
@@ -59,8 +58,9 @@ int parse_commit_in_graph(struct repository *r, struct commit *item);
  *
  * In most cases, callers should use `parse_commit_in_graph()` instead.
  */
-int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
-				  uint32_t *pos);
+struct commit_graph *repo_find_commit_pos_in_graph(struct repository *r,
+						   struct commit *c,
+						   uint32_t *pos);
 
 /*
  * Look up the given commit ID in the commit-graph. This will only return a

-- 
2.51.0.417.g1ba7204a04.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/6] commit-graph: pass graphs that are to be merged as parameter
  2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
                   ` (3 preceding siblings ...)
  2025-09-04 12:49 ` [PATCH 4/6] commit-graph: return commit graph from `repo_find_commit_pos_in_graph()` Patrick Steinhardt
@ 2025-09-04 12:49 ` Patrick Steinhardt
  2025-09-04 12:50 ` [PATCH 6/6] odb: move commit-graph into the object sources Patrick Steinhardt
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-04 12:49 UTC (permalink / raw)
  To: git

When determining whether or not we want to merge a commit graph chain we
retrieve the graph that is to be merged via the context's repository.
With an upcoming change though it will become a bit more complex to
figure out the commit graph, which would lead to code duplication.

Prepare for this change by passing the graph that is to be merged as a
parameter.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 commit-graph.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 16dfe58229..0e25b14076 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2226,7 +2226,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	return 0;
 }
 
-static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
+static void split_graph_merge_strategy(struct write_commit_graph_context *ctx,
+				       struct commit_graph *graph_to_merge)
 {
 	struct commit_graph *g;
 	uint32_t num_commits;
@@ -2245,7 +2246,7 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		flags = ctx->opts->split_flags;
 	}
 
-	g = ctx->r->objects->commit_graph;
+	g = graph_to_merge;
 	num_commits = ctx->commits.nr;
 	if (flags == COMMIT_GRAPH_SPLIT_REPLACE)
 		ctx->num_commit_graphs_after = 1;
@@ -2297,7 +2298,7 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		ctx->commit_graph_filenames_after[i] = xstrdup(ctx->commit_graph_filenames_before[i]);
 
 	i = ctx->num_commit_graphs_before - 1;
-	g = ctx->r->objects->commit_graph;
+	g = graph_to_merge;
 
 	while (g) {
 		if (i < ctx->num_commit_graphs_after)
@@ -2395,9 +2396,9 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 	stop_progress(&ctx->progress);
 }
 
-static void merge_commit_graphs(struct write_commit_graph_context *ctx)
+static void merge_commit_graphs(struct write_commit_graph_context *ctx,
+				struct commit_graph *g)
 {
-	struct commit_graph *g = ctx->r->objects->commit_graph;
 	uint32_t current_graph_number = ctx->num_commit_graphs_before;
 
 	while (g && current_graph_number >= ctx->num_commit_graphs_after) {
@@ -2632,12 +2633,13 @@ int write_commit_graph(struct odb_source *source,
 		goto cleanup;
 
 	if (ctx.split) {
-		split_graph_merge_strategy(&ctx);
+		split_graph_merge_strategy(&ctx, g);
 
 		if (!replace)
-			merge_commit_graphs(&ctx);
-	} else
+			merge_commit_graphs(&ctx, g);
+	} else {
 		ctx.num_commit_graphs_after = 1;
+	}
 
 	ctx.trust_generation_numbers = validate_mixed_generation_chain(g);
 

-- 
2.51.0.417.g1ba7204a04.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/6] odb: move commit-graph into the object sources
  2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
                   ` (4 preceding siblings ...)
  2025-09-04 12:49 ` [PATCH 5/6] commit-graph: pass graphs that are to be merged as parameter Patrick Steinhardt
@ 2025-09-04 12:50 ` Patrick Steinhardt
  2025-09-11 23:00   ` Taylor Blau
  2025-09-04 23:12 ` [PATCH 0/6] odb: track commit graphs via object source Junio C Hamano
  2025-09-04 23:27 ` Junio C Hamano
  7 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-04 12:50 UTC (permalink / raw)
  To: git

Commit graphs are inherently tied to one specific object source.
Furthermore, with the upcoming pluggable object sources, it is not even
guaranteed that an object source may even have a commit graph as these
are specific to the actual on-disk data format.

Prepare for this future by moving the commit-graph pointer from `struct
object_database` to `struct odb_source`. Eventually, this will allow us
to make commit graphs an implementation detail of an object source's
backend.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 commit-graph.c | 65 +++++++++++++++++++++++++++++++++++++++++-----------------
 commit-graph.h |  2 +-
 odb.c          |  9 ++++----
 odb.h          |  6 +++---
 packfile.c     |  3 +--
 5 files changed, 56 insertions(+), 29 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 0e25b14076..9929c1ed87 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -721,11 +721,15 @@ static struct commit_graph *load_commit_graph_chain(struct odb_source *source)
 
 struct commit_graph *read_commit_graph_one(struct odb_source *source)
 {
-	struct commit_graph *g = load_commit_graph_v1(source);
+	struct commit_graph *g;
+
+	if (source->commit_graph_attempted)
+		return NULL;
+	source->commit_graph_attempted = true;
 
+	g = load_commit_graph_v1(source);
 	if (!g)
 		g = load_commit_graph_chain(source);
-
 	return g;
 }
 
@@ -737,6 +741,7 @@ struct commit_graph *read_commit_graph_one(struct odb_source *source)
  */
 static struct commit_graph *prepare_commit_graph(struct repository *r)
 {
+	bool all_attempted = true;
 	struct odb_source *source;
 
 	/*
@@ -749,9 +754,19 @@ static struct commit_graph *prepare_commit_graph(struct repository *r)
 	if (!r->gitdir || r->commit_graph_disabled)
 		return NULL;
 
-	if (r->objects->commit_graph_attempted)
-		return r->objects->commit_graph;
-	r->objects->commit_graph_attempted = 1;
+	odb_prepare_alternates(r->objects);
+	for (source = r->objects->sources; source; source = source->next) {
+		all_attempted &= source->commit_graph_attempted;
+		if (source->commit_graph)
+			return source->commit_graph;
+	}
+
+	/*
+	 * There is no point in re-trying to load commit graphs if we already
+	 * tried loading all of them beforehand.
+	 */
+	if (all_attempted)
+		return NULL;
 
 	prepare_repo_settings(r);
 
@@ -768,14 +783,16 @@ static struct commit_graph *prepare_commit_graph(struct repository *r)
 	if (!commit_graph_compatible(r))
 		return NULL;
 
-	odb_prepare_alternates(r->objects);
 	for (source = r->objects->sources; source; source = source->next) {
-		r->objects->commit_graph = read_commit_graph_one(source);
-		if (r->objects->commit_graph)
-			break;
+		if (source->commit_graph_attempted)
+			continue;
+
+		source->commit_graph = read_commit_graph_one(source);
+		if (source->commit_graph)
+			return source->commit_graph;
 	}
 
-	return r->objects->commit_graph;
+	return NULL;
 }
 
 int generation_numbers_enabled(struct repository *r)
@@ -806,7 +823,7 @@ int corrected_commit_dates_enabled(struct repository *r)
 
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
 {
-	struct commit_graph *g = r->objects->commit_graph;
+	struct commit_graph *g = prepare_commit_graph(r);
 	while (g) {
 		if (g->bloom_filter_settings)
 			return g->bloom_filter_settings;
@@ -815,15 +832,16 @@ struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
 	return NULL;
 }
 
-void close_commit_graph(struct object_database *o)
+void close_commit_graph(struct odb_source *source)
 {
-	if (!o->commit_graph)
+	if (!source->commit_graph)
 		return;
 
 	clear_commit_graph_data_slab(&commit_graph_data_slab);
 	deinit_bloom_filters();
-	free_commit_graph(o->commit_graph);
-	o->commit_graph = NULL;
+	free_commit_graph(source->commit_graph);
+	source->commit_graph = NULL;
+	source->commit_graph_attempted = 0;
 }
 
 static int bsearch_graph(struct commit_graph *g, const struct object_id *oid, uint32_t *pos)
@@ -1119,7 +1137,15 @@ static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
 
 struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit *c)
 {
-	return get_commit_tree_in_graph_one(r->objects->commit_graph, c);
+	struct odb_source *source;
+
+	for (source = r->objects->sources; source; source = source->next) {
+		if (!source->commit_graph)
+			continue;
+		return get_commit_tree_in_graph_one(source->commit_graph, c);
+	}
+
+	return NULL;
 }
 
 struct packed_commit_list {
@@ -2165,7 +2191,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		ctx->commit_graph_hash_after[ctx->num_commit_graphs_after - 2] = new_base_hash;
 	}
 
-	close_commit_graph(ctx->r->objects);
+	for (struct odb_source *s = ctx->r->objects->sources; s; s = s->next)
+		close_commit_graph(s);
 	finalize_hashfile(f, file_hash, FSYNC_COMPONENT_COMMIT_GRAPH,
 			  CSUM_HASH_IN_STREAM | CSUM_FSYNC);
 	free_chunkfile(cf);
@@ -2667,8 +2694,8 @@ int write_commit_graph(struct odb_source *source,
 	oid_array_clear(&ctx.oids);
 	clear_topo_level_slab(&topo_levels);
 
-	if (ctx.r->objects->commit_graph) {
-		struct commit_graph *g = ctx.r->objects->commit_graph;
+	if (source->commit_graph) {
+		struct commit_graph *g = source->commit_graph;
 
 		while (g) {
 			g->topo_levels = NULL;
diff --git a/commit-graph.h b/commit-graph.h
index f6a5433641..33cb6a7577 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -185,7 +185,7 @@ int write_commit_graph(struct odb_source *source,
 
 int verify_commit_graph(struct commit_graph *g, int flags);
 
-void close_commit_graph(struct object_database *);
+void close_commit_graph(struct odb_source *);
 void free_commit_graph(struct commit_graph *);
 
 /*
diff --git a/odb.c b/odb.c
index 2a92a018c4..fdcc6849a8 100644
--- a/odb.c
+++ b/odb.c
@@ -363,6 +363,11 @@ static void free_object_directory(struct odb_source *source)
 	free(source->path);
 	odb_clear_loose_cache(source);
 	loose_object_map_clear(&source->loose_map);
+
+	free_commit_graph(source->commit_graph);
+	source->commit_graph = NULL;
+	source->commit_graph_attempted = 0;
+
 	free(source);
 }
 
@@ -1023,10 +1028,6 @@ void odb_clear(struct object_database *o)
 	oidmap_clear(&o->replace_map, 1);
 	pthread_mutex_destroy(&o->replace_mutex);
 
-	free_commit_graph(o->commit_graph);
-	o->commit_graph = NULL;
-	o->commit_graph_attempted = 0;
-
 	free_object_directories(o);
 	o->sources_tail = NULL;
 	o->loaded_alternates = 0;
diff --git a/odb.h b/odb.h
index 3dfc66d75a..a4835db685 100644
--- a/odb.h
+++ b/odb.h
@@ -63,6 +63,9 @@ struct odb_source {
 	 */
 	struct multi_pack_index *midx;
 
+	struct commit_graph *commit_graph;
+	bool commit_graph_attempted; /* if loading has been attempted */
+
 	/*
 	 * This is a temporary object store created by the tmp_objdir
 	 * facility. Disable ref updates since the objects in the store
@@ -120,9 +123,6 @@ struct object_database {
 	unsigned replace_map_initialized : 1;
 	pthread_mutex_t replace_mutex; /* protect object replace functions */
 
-	struct commit_graph *commit_graph;
-	unsigned commit_graph_attempted : 1; /* if loading has been attempted */
-
 	/*
 	 * private data
 	 *
diff --git a/packfile.c b/packfile.c
index 5d73932f50..5d3a25816f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -374,9 +374,8 @@ void close_object_store(struct object_database *o)
 		if (source->midx)
 			close_midx(source->midx);
 		source->midx = NULL;
+		close_commit_graph(source);
 	}
-
-	close_commit_graph(o);
 }
 
 void unlink_pack_path(const char *pack_name, int force_delete)

-- 
2.51.0.417.g1ba7204a04.dirty


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
                   ` (5 preceding siblings ...)
  2025-09-04 12:50 ` [PATCH 6/6] odb: move commit-graph into the object sources Patrick Steinhardt
@ 2025-09-04 23:12 ` Junio C Hamano
  2025-09-05 18:29   ` Derrick Stolee
  2025-09-04 23:27 ` Junio C Hamano
  7 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2025-09-04 23:12 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> commit graphs are currently stored on the object database level. This
> doesn't really make much sense conceptually, given that commit graphs
> are specific to one object source. Furthermore, with the upcoming
> pluggable object database effort, an object source's backend may not
> evene have a commit graph in the first place but store that information
> in a different format altogether.
>
> This patch series prepares for that by moving the commit graph from
> `struct object_database` into `struct odb_source`.

Hmph, I am finding the above hard to agree with at the conceptual
level.  In some future, we may use multiple object stores in a
single repository.  Perhaps we would be storing older parts of
history in semi-online storage while newer parts are stored in
readily available storage.  But the side data structure that allows
us to quickly learn who are parents of one commit is without having
to go to the object store in order to parse the actualy commit
object can be stored for the entire history if we wanted to, or more
recent part of the history but not limited to the "readily available
storage" part.  IOW, where the boundary between the older and the
newer parts of the history lies and which commits the commit graph
covers should be pretty much independent.

So moving from object_database (i.e. the whole world) to individual
odb_source (i.e. where one particular subset of the history is
stored) feels like totally backwards to me.  Surely, a commit graph
file may be defined over a set of packfiles and remaining loose
object files, but it is not like an instance of the commit-graph
file is tied to packfiles in the sense that it uses the index into
some packfile instead of the actual object names to refer to
commits, or anything like that (this is quite different from other
files that are very specific to a single object store, like midx
that is tied to the packfiles it describes).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
                   ` (6 preceding siblings ...)
  2025-09-04 23:12 ` [PATCH 0/6] odb: track commit graphs via object source Junio C Hamano
@ 2025-09-04 23:27 ` Junio C Hamano
  2025-09-05  6:18   ` Patrick Steinhardt
  7 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2025-09-04 23:27 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

Patrick Steinhardt <ps@pks.im> writes:

> There's a trivial conflict with tc/last-modified that can be solved like
> this:
>
> diff --cc commit-graph.c
> index 9929c1ed87,2f20f66cfd..0000000000
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@@ -823,7 -812,12 +823,11 @@@ int corrected_commit_dates_enabled(stru
>   
>   struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
>   {
>  -	struct commit_graph *g;
>  +	struct commit_graph *g = prepare_commit_graph(r);
> + 
>  -	if (!prepare_commit_graph(r))
> ++	if (!g)
> + 	       return NULL;
> + 

The while (g) loop will be entirely skipped when g==NULL, and then
the function returns NULL after iterating the loop, so there is not
much reason to have these three lines for early-return, no?

>  -	g = r->objects->commit_graph;
>   	while (g) {
>   		if (g->bloom_filter_settings)
>   			return g->bloom_filter_settings;

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-04 23:27 ` Junio C Hamano
@ 2025-09-05  6:18   ` Patrick Steinhardt
  0 siblings, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-05  6:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Sep 04, 2025 at 04:27:41PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > diff --cc commit-graph.c
> > index 9929c1ed87,2f20f66cfd..0000000000
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@@ -823,7 -812,12 +823,11 @@@ int corrected_commit_dates_enabled(stru
> >   
> >   struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
> >   {
> >  -	struct commit_graph *g;
> >  +	struct commit_graph *g = prepare_commit_graph(r);
> > + 
> >  -	if (!prepare_commit_graph(r))
> > ++	if (!g)
> > + 	       return NULL;
> > + 
> 
> The while (g) loop will be entirely skipped when g==NULL, and then
> the function returns NULL after iterating the loop, so there is not
> much reason to have these three lines for early-return, no?

Ah, true. That's even better.

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-04 23:12 ` [PATCH 0/6] odb: track commit graphs via object source Junio C Hamano
@ 2025-09-05 18:29   ` Derrick Stolee
  2025-09-08 11:17     ` Patrick Steinhardt
  0 siblings, 1 reply; 26+ messages in thread
From: Derrick Stolee @ 2025-09-05 18:29 UTC (permalink / raw)
  To: Junio C Hamano, Patrick Steinhardt; +Cc: git

On 9/4/2025 7:12 PM, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
>> commit graphs are currently stored on the object database level. This
>> doesn't really make much sense conceptually, given that commit graphs
>> are specific to one object source. Furthermore, with the upcoming
>> pluggable object database effort, an object source's backend may not
>> evene have a commit graph in the first place but store that information
>> in a different format altogether.
>>
>> This patch series prepares for that by moving the commit graph from
>> `struct object_database` into `struct odb_source`.
> 
> Hmph, I am finding the above hard to agree with at the conceptual
> level.  In some future, we may use multiple object stores in a
> single repository.  Perhaps we would be storing older parts of
> history in semi-online storage while newer parts are stored in
> readily available storage.  But the side data structure that allows
> us to quickly learn who are parents of one commit is without having
> to go to the object store in order to parse the actualy commit
> object can be stored for the entire history if we wanted to, or more
> recent part of the history but not limited to the "readily available
> storage" part.  IOW, where the boundary between the older and the
> newer parts of the history lies and which commits the commit graph
> covers should be pretty much independent.
> 
> So moving from object_database (i.e. the whole world) to individual
> odb_source (i.e. where one particular subset of the history is
> stored) feels like totally backwards to me.  Surely, a commit graph
> file may be defined over a set of packfiles and remaining loose
> object files, but it is not like an instance of the commit-graph
> file is tied to packfiles in the sense that it uses the index into
> some packfile instead of the actual object names to refer to
> commits, or anything like that (this is quite different from other
> files that are very specific to a single object store, like midx
> that is tied to the packfiles it describes).

This is an interesting aspect to things, where the commit-graph file
is a "structured cache" of certain commit information. It happens to
be located within the object stores (either local or in an alternate)
but is conceptually different in a few ways.

The biggest difference is that you can only open one commit-graph
(or chain of commit-graphs). Having multiple files across different
object stores will not accumulate additional context. Instead, we
have a "first one wins" approach.

This does seem to be something that you are attempting to change
by including the ability to load a commit-graph for each odb (and
closing them in sequence as we close a repo).

So in this sense, the commit-graph lives at the repository level,
not an object store level. When doing I/O to write or read a graph,
we use a specific object store at a time.

The other direction to consider is what context we have when we
interact with a commit-graph. We generally are parsing commits from
a repository or loading Bloom filter data during file history walks.
Each of these do not have a predictable nature of which object store
will "own" the commit we are inspecting, so it wouldn't make sense
to restrict things like odb_parse_commit() over repo_parse_commit().

With these thoughts in mind, I have these big-picture thoughts:

1. Patches 1-5 are great. Nice cleanups.

2. Some of Patch 6 is great, including having the I/O methods use
   an odb_source to help focus the specific location of the files
   being read or written. However, the movement of the struct into
   the odb_source makes less sense and should still exist at the
   object_database level.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-05 18:29   ` Derrick Stolee
@ 2025-09-08 11:17     ` Patrick Steinhardt
  2025-09-08 14:46       ` Derrick Stolee
  2025-09-11 23:08       ` Taylor Blau
  0 siblings, 2 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-08 11:17 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, git

On Fri, Sep 05, 2025 at 02:29:50PM -0400, Derrick Stolee wrote:
> On 9/4/2025 7:12 PM, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> > 
> >> commit graphs are currently stored on the object database level. This
> >> doesn't really make much sense conceptually, given that commit graphs
> >> are specific to one object source. Furthermore, with the upcoming
> >> pluggable object database effort, an object source's backend may not
> >> evene have a commit graph in the first place but store that information
> >> in a different format altogether.
> >>
> >> This patch series prepares for that by moving the commit graph from
> >> `struct object_database` into `struct odb_source`.
> > 
> > Hmph, I am finding the above hard to agree with at the conceptual
> > level.  In some future, we may use multiple object stores in a
> > single repository.  Perhaps we would be storing older parts of
> > history in semi-online storage while newer parts are stored in
> > readily available storage.  But the side data structure that allows
> > us to quickly learn who are parents of one commit is without having
> > to go to the object store in order to parse the actualy commit
> > object can be stored for the entire history if we wanted to, or more
> > recent part of the history but not limited to the "readily available
> > storage" part.  IOW, where the boundary between the older and the
> > newer parts of the history lies and which commits the commit graph
> > covers should be pretty much independent.
> > 
> > So moving from object_database (i.e. the whole world) to individual
> > odb_source (i.e. where one particular subset of the history is
> > stored) feels like totally backwards to me.  Surely, a commit graph
> > file may be defined over a set of packfiles and remaining loose
> > object files, but it is not like an instance of the commit-graph
> > file is tied to packfiles in the sense that it uses the index into
> > some packfile instead of the actual object names to refer to
> > commits, or anything like that (this is quite different from other
> > files that are very specific to a single object store, like midx
> > that is tied to the packfiles it describes).
> 
> This is an interesting aspect to things, where the commit-graph file
> is a "structured cache" of certain commit information. It happens to
> be located within the object stores (either local or in an alternate)
> but is conceptually different in a few ways.
> 
> The biggest difference is that you can only open one commit-graph
> (or chain of commit-graphs). Having multiple files across different
> object stores will not accumulate additional context. Instead, we
> have a "first one wins" approach.
> 
> This does seem to be something that you are attempting to change
> by including the ability to load a commit-graph for each odb (and
> closing them in sequence as we close a repo).
> 
> So in this sense, the commit-graph lives at the repository level,
> not an object store level. When doing I/O to write or read a graph,
> we use a specific object store at a time.
> 
> The other direction to consider is what context we have when we
> interact with a commit-graph. We generally are parsing commits from
> a repository or loading Bloom filter data during file history walks.
> Each of these do not have a predictable nature of which object store
> will "own" the commit we are inspecting, so it wouldn't make sense
> to restrict things like odb_parse_commit() over repo_parse_commit().
> 
> With these thoughts in mind, I have these big-picture thoughts:
> 
> 1. Patches 1-5 are great. Nice cleanups.
> 
> 2. Some of Patch 6 is great, including having the I/O methods use
>    an odb_source to help focus the specific location of the files
>    being read or written. However, the movement of the struct into
>    the odb_source makes less sense and should still exist at the
>    object_database level.

I (probably unsurprisingly :)) don't quite agree with this.

Let's take a step back: why does the commit-graph exist in the first
place? It basically provides a caching mechanism to efficiently return
information that is otherwise more expensive to obtain:

  - It contains a cached representation of the graph so that we don't
    have to parse each commit from the object database.

  - It encodes generation numbers.

  - It contains bloom filters.

All of which makes sense with the current design of our object storage
format, because obtaining this information can be quite expensive. But
let's consider a different world where we for example store objects in a
proper database:

  - This database may have an efficient way to compute generation
    numbers on the fly, either when reading an object or when writing it
    to disk. We cannot currently store that information in the packfile
    right now, so it needs to be stored out-of-band. But with a database
    there is no reason why we couldn't immediately compute and store the
    generation number on each insert.

  - This database may have an efficient way to store bloom filters next
    to a specific commit directly, without requiring a separate file.

  - This database may be distributed. So why should every client now
    have to recompute a commit graph if we can instead store the data in
    the database and thus have it accessible to all clients thereof?

  - It may be _less_ efficient to use the commit graph to access data
    compared to what that database can provide.

So I would claim that the commit graph is specifically tied to the
actual storage format of objects, and it's not at all obvious that it
would need to exist if we had a different storage format.

The goal of this patch series is thus explicitly _not_ to allow loading
one commit graph per object source. In fact, the refactorings I did
ensure that we still only ever load a single commit graph.

Instead, the goal is to allow each object source to decide for itself
how this additional information is to be stored and retrieved. This
_may_ be a commit graph if that makes sense for a particular storage
format. But it may just as well _not_ be a commit graph, as other
storage formats may have way better solutions for making the commit
graph information accessible.

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-08 11:17     ` Patrick Steinhardt
@ 2025-09-08 14:46       ` Derrick Stolee
  2025-09-10 11:38         ` Patrick Steinhardt
  2025-09-11 23:08       ` Taylor Blau
  1 sibling, 1 reply; 26+ messages in thread
From: Derrick Stolee @ 2025-09-08 14:46 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Junio C Hamano, git

On 9/8/2025 7:17 AM, Patrick Steinhardt wrote:
> I (probably unsurprisingly :)) don't quite agree with this.

I think I can summarize the main point you seem to be making with
this quote:

> So I would claim that the commit graph is specifically tied to the
> actual storage format of objects, and it's not at all obvious that it
> would need to exist if we had a different storage format.

I think I agree in principle, if you are saying "different storage
format" means "different commit object information" which then means
we are talking about a completely different object type that is not
at all compatible with current Git.

You could store commit and object data in SQLite, in the cloud, or
via plaintext files on disk. As long as the data is still representing
commit objects as we format them today, the commit-graph is still a
cache that can be used as a faster way to fill 'struct commit' objects
in memory without navigating to that object database.

And you also mention that the commit-graph format itself could be
more efficient. You're right! I think the way we did it within Azure
DevOps is more efficient, because most of the commit walking algorithms
are built working directly on the integer labels within the in-memory
data structure instead of operating on commit structs. This allows for
less overhead when loading the graph (it's already cached in memory)
and when walking thousands of commits (we only translate to object IDs
if they are important for the output). But this is all the more reason
to keep the commit-graph structures outside of "the object store" since
a "commit-graph database" can be implemented without being tied to an
object store.

If you are saying "but our existing commit-graph format puts it in
'objects/info/commit-graph[s]'" then yes the storage of a commit-graph
is tied to our storage of objects. But the way we interact with it in
code is in some way a layer above that.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-08 14:46       ` Derrick Stolee
@ 2025-09-10 11:38         ` Patrick Steinhardt
  2025-09-25 19:17           ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-10 11:38 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Junio C Hamano, git

On Mon, Sep 08, 2025 at 10:46:40AM -0400, Derrick Stolee wrote:
> On 9/8/2025 7:17 AM, Patrick Steinhardt wrote:
> > I (probably unsurprisingly :)) don't quite agree with this.
> 
> I think I can summarize the main point you seem to be making with
> this quote:
> 
> > So I would claim that the commit graph is specifically tied to the
> > actual storage format of objects, and it's not at all obvious that it
> > would need to exist if we had a different storage format.
> 
> I think I agree in principle, if you are saying "different storage
> format" means "different commit object information" which then means
> we are talking about a completely different object type that is not
> at all compatible with current Git.

I don't plan to introduce a new type as in commit, blob or tag, but
really only the possibility to have a different format for how exactly
the objects are stored. So it's not really "different commit object
information" that I want to store. It's rather "additional commit object
information" that we currently have to store out-of-band, which includes
items like generation numbers or bloom filters.

As an example, if the object format stores data in the cloud it would be
sensible to also store generation numbers there, as it would ensure that
we don't have to recompute them on every client. In other words, the
data that is currently stored locally in the commit-graph also becomes
fully distributed.

> You could store commit and object data in SQLite, in the cloud, or
> via plaintext files on disk. As long as the data is still representing
> commit objects as we format them today, the commit-graph is still a
> cache that can be used as a faster way to fill 'struct commit' objects
> in memory without navigating to that object database.
> 
> And you also mention that the commit-graph format itself could be
> more efficient. You're right! I think the way we did it within Azure
> DevOps is more efficient, because most of the commit walking algorithms
> are built working directly on the integer labels within the in-memory
> data structure instead of operating on commit structs. This allows for
> less overhead when loading the graph (it's already cached in memory)
> and when walking thousands of commits (we only translate to object IDs
> if they are important for the output). But this is all the more reason
> to keep the commit-graph structures outside of "the object store" since
> a "commit-graph database" can be implemented without being tied to an
> object store.
> 
> If you are saying "but our existing commit-graph format puts it in
> 'objects/info/commit-graph[s]'" then yes the storage of a commit-graph
> is tied to our storage of objects. But the way we interact with it in
> code is in some way a layer above that.

There is no inherent reason why a new backend would not be able to use
the existing commit-graph infrastructure indeed. But there are reasons
that specific backends may not want to do so. If objects are already
stored in a database table, then it may make way more sense to store
additional metadata that is currently stored in the commit-graph in a
secondary database table instead of in the commit graph.

So that raises the question who is going to decide what caching format
to use. That is, given a repository, do we want to use a commit-graph or
do we rather want to store that data in the database?

I think the most sensible way to decide this is by going via the backend
of a specific object source. The backend knows how objects are stored,
so it'll also know whether there is a better way to store metadata than
via commit graphs. If it is the "files" backend it will decide to
consult the commit-graph. If it is a SQlite backend it _may_ make sense
to use a commit-graph, as well. But if it's a remote database for
example it may make more sense to store the information in that database
directly.

By moving this logic into the object source we can tie this decision to
the object source and also abstract it away. In the pluggable object
database world we can make this data available via a couple of function
pointers that are per object source:

  - A function to check whether a cached representation of the graph
    exists in a specific source.

  - A function to load bloom filters and generation numbers,
    respectively via that source.

  - A function to load a commit via the cached representation of its
    source.

Thus, the details around the actual data format will be hidden away and
most of the code never has to care whether it's a commit graph that
contains the data or whether it's stored in a distributed database.

Now it's not impossible to make this work in your proposed world where
the commit graph continues to sit at the database level. But we'd still
have to abstract away logic so that we can have different ways to store
cached data. We could for example implement logic to ask every backend
whether it has a cached representation of the commit graph, and if so,
to provide an opaque data structure that contains the above function
pointers. We could then continue to store that structure on the database
level and use it whenever available. But I'm not sure whether that
design is in any way more obvious -- quite on the contrary, I expect
that it may be more complex.

This is roughly what I have in my head right now. And I realize that
this information really should be sitting in a design document. I'm
working on that, but still need to land two more patch series before I
want to send such a patch series to the list.

Thanks for the discussion by the way, really appreciate it!

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] blame: drop explicit check for commit graph
  2025-09-04 12:49 ` [PATCH 1/6] blame: drop explicit check for commit graph Patrick Steinhardt
@ 2025-09-11 22:09   ` Taylor Blau
  0 siblings, 0 replies; 26+ messages in thread
From: Taylor Blau @ 2025-09-11 22:09 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Thu, Sep 04, 2025 at 02:49:55PM +0200, Patrick Steinhardt wrote:
> diff --git a/blame.c b/blame.c
> index f1c0670144..cb0b083423 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -2909,9 +2909,6 @@ void setup_blame_bloom_data(struct blame_scoreboard *sb)
>  	struct blame_bloom_data *bd;
>  	struct bloom_filter_settings *bs;
>
> -	if (!sb->repo->objects->commit_graph)
> -		return;
> -
>  	bs = get_bloom_filter_settings(sb->repo);
>  	if (!bs)
>  		return;

Makes sense; get_bloom_filter_settings() (as you note) will return NULL
if r->objects->commit_graph is NULL, which will cause this function to
terminate early anyway.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] revision: drop explicit check for commit graph
  2025-09-04 12:49 ` [PATCH 2/6] revision: " Patrick Steinhardt
@ 2025-09-11 22:16   ` Taylor Blau
  0 siblings, 0 replies; 26+ messages in thread
From: Taylor Blau @ 2025-09-11 22:16 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Thu, Sep 04, 2025 at 02:49:56PM +0200, Patrick Steinhardt wrote:
> When filtering down revisions by paths we know to use bloom filters from
> the commit graph, if we have any. The entry point for this is in
> `check_maybe_different_in_bloom_filter()`, where we first verify that:
>
>   - We do have a commit graph.
>
>   - That the commit is contained therein by checking that we have a
>     proper generation number.
>
>   - And that the graph contains a bloom filter.
>
> The first check is somewhat redundant though: if we don't have a commit
> graph, then the second check would already tell us that we don't have a
> generation number for the specific commit.

Makes sense; and we already inspect the commit_graph_data_slab
regardless of whether or not we know we have a commit-graph, so this
clean-up makes sense to me, too.

> In theory this could be seen as a performance optimization to
> short-circuit for scenarios where there is no commit graph. But in
> practice this shouldn't matter: if there is no commit graph, then the
> commit graph data slab would also be unpopulated and thus a lookup of
> the commit should happen in constant time.

Yeah, this should be a few extra cycles at most. I think it should be
fine.

Taylor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] commit-graph: return the prepared commit graph from `prepare_commit_graph()`
  2025-09-04 12:49 ` [PATCH 3/6] commit-graph: return the prepared commit graph from `prepare_commit_graph()` Patrick Steinhardt
@ 2025-09-11 22:25   ` Taylor Blau
  0 siblings, 0 replies; 26+ messages in thread
From: Taylor Blau @ 2025-09-11 22:25 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Thu, Sep 04, 2025 at 02:49:57PM +0200, Patrick Steinhardt wrote:
> When making use of commit graphs, one needs to first prepare them by
> calling `prepare_commit_graph()`. Once that function was called and the
> commit graph was prepared successfully, the caller is now expected to
> access the graph directly via `struct object_database::commit_graph`.
>
> In a subsequent change, we're going to move the commit graph pointer
> from `struct object_database` into `struct odb_source`. With this
> change, semantics will change so that we use the commit graph of the
> first source that has one. Consequently, all callers that currently
> deference the `commit_graph` pointer would now have to loop around the
> list of sources to find the commit graph.

> This would become quite unwieldy. So instead of shifting the burden onto
> such callers, adapt `prepare_commit_graph()` to return the prepared
> commit graph, if any. Like this, callers are expected to call that
> function and then use the returned commit graph.

Hmmph. I see what you're saying, though I'm not sure I agree with the
implication here. Presumably "r->objects->commit_graph" could be
rewritten as "repo_commit_graph(r->objects->sources)" or similar. But
I'm OK with this approach, too.

>  int generation_numbers_enabled(struct repository *r)
>  {
>  	uint32_t first_generation;
>  	struct commit_graph *g;
> -	if (!prepare_commit_graph(r))
> -	       return 0;
>
> -	g = r->objects->commit_graph;
> -
> -	if (!g->num_commits)
> -		return 0;
> +	g = prepare_commit_graph(r);
> +	if (!g || !g->num_commits)

Makes sense; this isn't an exact translation, since the conditional now
also checks for the NULL-ness of "g" first. But that's necessary, since
if the (now-removed) earlier call to prepare_commit_graph() succeeded,
we know that "g" is non-NULL here.

Since that function is now responsible for handing us the commit_graph
itself, checking for success means that we have to see if "g" is
non-NULL first before doing something with it.

> @@ -799,12 +796,9 @@ int generation_numbers_enabled(struct repository *r)
>  int corrected_commit_dates_enabled(struct repository *r)
>  {
>  	struct commit_graph *g;
> -	if (!prepare_commit_graph(r))
> -		return 0;
>
> -	g = r->objects->commit_graph;
> -
> -	if (!g->num_commits)
> +	g = prepare_commit_graph(r);
> +	if (!g || !g->num_commits)

Same here.

> @@ -1012,23 +1006,26 @@ static int find_commit_pos_in_graph(struct commit *item, struct commit_graph *g,
>  int repo_find_commit_pos_in_graph(struct repository *r, struct commit *c,
>  				  uint32_t *pos)
>  {
> -	if (!prepare_commit_graph(r))
> +	struct commit_graph *g = prepare_commit_graph(r);
> +	if (!g)
>  		return 0;
> -	return find_commit_pos_in_graph(c, r->objects->commit_graph, pos);
> +	return find_commit_pos_in_graph(c, g, pos);

These and other changes may have read a little bit cleaner if there were
a preparatory commit which introduced "g" as a variable on the stack,
since that would change:

    struct commit_graph *g;

    if (!prepare_commit_graph(r))
        return 0;

    g = the_repository->objects->commit_graph;

into:

    struct commit_graph *g = prepare_commit_graph(r);
    if (!g)
        return 0;

, without affecting the rest of the function, keeping the diff at least
easier to read (or smaller) by eliminating the "r->objects->commit_graph"
to "g" change.

Not a big deal at all, just a thought that I had while reviewing.

> @@ -2519,6 +2518,7 @@ int write_commit_graph(struct odb_source *source,
>  	int replace = 0;
>  	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  	struct topo_level_slab topo_levels;
> +	struct commit_graph *g;
>
>  	prepare_repo_settings(r);
>  	if (!r->settings.core_commit_graph) {
> @@ -2547,23 +2547,13 @@ int write_commit_graph(struct odb_source *source,
>  	init_topo_level_slab(&topo_levels);
>  	ctx.topo_levels = &topo_levels;
>
> -	prepare_commit_graph(ctx.r);
> -	if (ctx.r->objects->commit_graph) {
> -		struct commit_graph *g = ctx.r->objects->commit_graph;
> -
> -		while (g) {
> -			g->topo_levels = &topo_levels;
> -			g = g->base_graph;
> -		}
> -	}
> +	g = prepare_commit_graph(ctx.r);
> +	for (struct commit_graph *chain = g; chain; chain = chain->base_graph)
> +		g->topo_levels = &topo_levels;

Makes sense.

The rest looks all good.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] commit-graph: return commit graph from `repo_find_commit_pos_in_graph()`
  2025-09-04 12:49 ` [PATCH 4/6] commit-graph: return commit graph from `repo_find_commit_pos_in_graph()` Patrick Steinhardt
@ 2025-09-11 22:54   ` Taylor Blau
  0 siblings, 0 replies; 26+ messages in thread
From: Taylor Blau @ 2025-09-11 22:54 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Thu, Sep 04, 2025 at 02:49:58PM +0200, Patrick Steinhardt wrote:
> ---
>  bloom.c        |  8 +++++---
>  commit-graph.c | 18 ++++++++++++------
>  commit-graph.h | 12 ++++++------
>  3 files changed, 23 insertions(+), 15 deletions(-)

All makes sense and looks good to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] odb: move commit-graph into the object sources
  2025-09-04 12:50 ` [PATCH 6/6] odb: move commit-graph into the object sources Patrick Steinhardt
@ 2025-09-11 23:00   ` Taylor Blau
  0 siblings, 0 replies; 26+ messages in thread
From: Taylor Blau @ 2025-09-11 23:00 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Thu, Sep 04, 2025 at 02:50:00PM +0200, Patrick Steinhardt wrote:
> Commit graphs are inherently tied to one specific object source.
> Furthermore, with the upcoming pluggable object sources, it is not even
> guaranteed that an object source may even have a commit graph as these
> are specific to the actual on-disk data format.
>
> Prepare for this future by moving the commit-graph pointer from `struct
> object_database` to `struct odb_source`. Eventually, this will allow us
> to make commit graphs an implementation detail of an object source's
> backend.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  commit-graph.c | 65 +++++++++++++++++++++++++++++++++++++++++-----------------
>  commit-graph.h |  2 +-
>  odb.c          |  9 ++++----
>  odb.h          |  6 +++---
>  packfile.c     |  3 +--
>  5 files changed, 56 insertions(+), 29 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 0e25b14076..9929c1ed87 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -721,11 +721,15 @@ static struct commit_graph *load_commit_graph_chain(struct odb_source *source)
>
>  struct commit_graph *read_commit_graph_one(struct odb_source *source)
>  {
> -	struct commit_graph *g = load_commit_graph_v1(source);
> +	struct commit_graph *g;
> +
> +	if (source->commit_graph_attempted)
> +		return NULL;

I wonder if some callers will find this surprising. The current analog
of this function is commit-graph.c::prepare_commit_graph(), which does:

    if (r->objects->commit_graph_attempted)
        return r->objects->commit_graph;

, but this function returns NULL on the second (and subsequent) calls,
even if the object source in question has a commit graph that was loaded
in an earlier step.

> +	source->commit_graph_attempted = true;
>
> +	g = load_commit_graph_v1(source);
>  	if (!g)
>  		g = load_commit_graph_chain(source);
> -
>  	return g;
>  }
>
> @@ -737,6 +741,7 @@ struct commit_graph *read_commit_graph_one(struct odb_source *source)
>   */
>  static struct commit_graph *prepare_commit_graph(struct repository *r)
>  {
> +	bool all_attempted = true;
>  	struct odb_source *source;
>
>  	/*
> @@ -749,9 +754,19 @@ static struct commit_graph *prepare_commit_graph(struct repository *r)
>  	if (!r->gitdir || r->commit_graph_disabled)
>  		return NULL;
>
> -	if (r->objects->commit_graph_attempted)
> -		return r->objects->commit_graph;
> -	r->objects->commit_graph_attempted = 1;
> +	odb_prepare_alternates(r->objects);
> +	for (source = r->objects->sources; source; source = source->next) {
> +		all_attempted &= source->commit_graph_attempted;
> +		if (source->commit_graph)
> +			return source->commit_graph;
> +	}
> +
> +	/*
> +	 * There is no point in re-trying to load commit graphs if we already
> +	 * tried loading all of them beforehand.
> +	 */
> +	if (all_attempted)
> +		return NULL;
>
>  	prepare_repo_settings(r);
>
> @@ -768,14 +783,16 @@ static struct commit_graph *prepare_commit_graph(struct repository *r)
>  	if (!commit_graph_compatible(r))
>  		return NULL;
>
> -	odb_prepare_alternates(r->objects);
>  	for (source = r->objects->sources; source; source = source->next) {
> -		r->objects->commit_graph = read_commit_graph_one(source);
> -		if (r->objects->commit_graph)
> -			break;
> +		if (source->commit_graph_attempted)
> +			continue;
> +
> +		source->commit_graph = read_commit_graph_one(source);
> +		if (source->commit_graph)
> +			return source->commit_graph;

Is this what callers expect? The pre-image here returned the
commit-graph belonging to the current repository's object database, not
any of its alternate(s).

I'm not opposed to the idea that perhaps loading commit-graphs from
alternates would be useful, but I am uncomfortable making that change in
this series.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-08 11:17     ` Patrick Steinhardt
  2025-09-08 14:46       ` Derrick Stolee
@ 2025-09-11 23:08       ` Taylor Blau
  1 sibling, 0 replies; 26+ messages in thread
From: Taylor Blau @ 2025-09-11 23:08 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Derrick Stolee, Junio C Hamano, git

On Mon, Sep 08, 2025 at 01:17:39PM +0200, Patrick Steinhardt wrote:
> On Fri, Sep 05, 2025 at 02:29:50PM -0400, Derrick Stolee wrote:
> > On 9/4/2025 7:12 PM, Junio C Hamano wrote:
> > > Patrick Steinhardt <ps@pks.im> writes:
> > >
> > >> commit graphs are currently stored on the object database level. This
> > >> doesn't really make much sense conceptually, given that commit graphs
> > >> are specific to one object source. Furthermore, with the upcoming
> > >> pluggable object database effort, an object source's backend may not
> > >> evene have a commit graph in the first place but store that information
> > >> in a different format altogether.
> > >>
> > >> This patch series prepares for that by moving the commit graph from
> > >> `struct object_database` into `struct odb_source`.
> > >
> > > Hmph, I am finding the above hard to agree with at the conceptual
> > > level.  In some future, we may use multiple object stores in a
> > > single repository.  Perhaps we would be storing older parts of
> > > history in semi-online storage while newer parts are stored in
> > > readily available storage.  But the side data structure that allows
> > > us to quickly learn who are parents of one commit is without having
> > > to go to the object store in order to parse the actualy commit
> > > object can be stored for the entire history if we wanted to, or more
> > > recent part of the history but not limited to the "readily available
> > > storage" part.  IOW, where the boundary between the older and the
> > > newer parts of the history lies and which commits the commit graph
> > > covers should be pretty much independent.
> > >
> > > So moving from object_database (i.e. the whole world) to individual
> > > odb_source (i.e. where one particular subset of the history is
> > > stored) feels like totally backwards to me.  Surely, a commit graph
> > > file may be defined over a set of packfiles and remaining loose
> > > object files, but it is not like an instance of the commit-graph
> > > file is tied to packfiles in the sense that it uses the index into
> > > some packfile instead of the actual object names to refer to
> > > commits, or anything like that (this is quite different from other
> > > files that are very specific to a single object store, like midx
> > > that is tied to the packfiles it describes).
> >
> > This is an interesting aspect to things, where the commit-graph file
> > is a "structured cache" of certain commit information. It happens to
> > be located within the object stores (either local or in an alternate)
> > but is conceptually different in a few ways.
> >
> > The biggest difference is that you can only open one commit-graph
> > (or chain of commit-graphs). Having multiple files across different
> > object stores will not accumulate additional context. Instead, we
> > have a "first one wins" approach.
> >
> > This does seem to be something that you are attempting to change
> > by including the ability to load a commit-graph for each odb (and
> > closing them in sequence as we close a repo).
> >
> > So in this sense, the commit-graph lives at the repository level,
> > not an object store level. When doing I/O to write or read a graph,
> > we use a specific object store at a time.
> >
> > The other direction to consider is what context we have when we
> > interact with a commit-graph. We generally are parsing commits from
> > a repository or loading Bloom filter data during file history walks.
> > Each of these do not have a predictable nature of which object store
> > will "own" the commit we are inspecting, so it wouldn't make sense
> > to restrict things like odb_parse_commit() over repo_parse_commit().
> >
> > With these thoughts in mind, I have these big-picture thoughts:
> >
> > 1. Patches 1-5 are great. Nice cleanups.
> >
> > 2. Some of Patch 6 is great, including having the I/O methods use
> >    an odb_source to help focus the specific location of the files
> >    being read or written. However, the movement of the struct into
> >    the odb_source makes less sense and should still exist at the
> >    object_database level.
>
> I (probably unsurprisingly :)) don't quite agree with this.
>
> Let's take a step back: why does the commit-graph exist in the first
> place? It basically provides a caching mechanism to efficiently return
> information that is otherwise more expensive to obtain:
>
>   - It contains a cached representation of the graph so that we don't
>     have to parse each commit from the object database.
>
>   - It encodes generation numbers.
>
>   - It contains bloom filters.

I don't think these three things are all on the same conceptual footing.

In the current object storage format, yes, part of the commit-graph's
responsibility is to provide fast access to the repository's commits and
their ancestors. That is a pure optimization that is inherently tied to
the object storage format, which necessitates such an optimization for
certain operations to be efficient.

Generation numbers and changed-path Bloom filters are different. Yes,
they *can* be used to optimize certain operations, but they also provide
*new* information that cannot be otherwise stored in the existing object
format.

Suppose that we have some future object storage backend that stores
commit objects in such a way that querying them natively through
whatever mechanism that store provides is already as efficient as using
a commit-graph file on-disk today. That eliminates the "pure
optimization" aspect of the commit-graph.

But that object store implementation will still want to implement
additional features like generation numbers and changed-path Bloom
filters, because those provide new information that is not purely about
optimizing reads.

> All of which makes sense with the current design of our object storage
> format, because obtaining this information can be quite expensive. But
> let's consider a different world where we for example store objects in a
> proper database:
>
>   - This database may have an efficient way to compute generation
>     numbers on the fly, either when reading an object or when writing it
>     to disk. We cannot currently store that information in the packfile
>     right now, so it needs to be stored out-of-band. But with a database
>     there is no reason why we couldn't immediately compute and store the
>     generation number on each insert.

Yes, but that does not mean that other parts of the code are not
themselves interested in what generation number a commit has for their
own purposes.

>   - This database may have an efficient way to store bloom filters next
>     to a specific commit directly, without requiring a separate file.

Same thought as above.

> So I would claim that the commit graph is specifically tied to the
> actual storage format of objects, and it's not at all obvious that it
> would need to exist if we had a different storage format.

I am not sure I agree here. What I'm trying to say is that though it's
easy to imagine an object storage format that stores objects in such a
way that the commit-graph file's representation of commits is
unnecessary, there is additional information in what we currently call
the commit-graph that is useful regardless of how fast the actual object
store is.

> The goal of this patch series is thus explicitly _not_ to allow loading
> one commit graph per object source. In fact, the refactorings I did
> ensure that we still only ever load a single commit graph.

Right, though if I understand correctly I think there *is* a behavior
change here in that we will happily fall back to an alternate ODB's
commit-graph if the main object store does not have one. That is not how
it works today.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-10 11:38         ` Patrick Steinhardt
@ 2025-09-25 19:17           ` Junio C Hamano
  2025-09-26  5:18             ` Patrick Steinhardt
  2025-10-02 11:21             ` Patrick Steinhardt
  0 siblings, 2 replies; 26+ messages in thread
From: Junio C Hamano @ 2025-09-25 19:17 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Derrick Stolee, git, Taylor Blau

Patrick Steinhardt <ps@pks.im> writes:

> There is no inherent reason why a new backend would not be able to use
> the existing commit-graph infrastructure indeed. But there are reasons
> that specific backends may not want to do so. If objects are already
> stored in a database table, then it may make way more sense to store
> additional metadata that is currently stored in the commit-graph in a
> secondary database table instead of in the commit graph.
> ...
> This is roughly what I have in my head right now. And I realize that
> this information really should be sitting in a design document. I'm
> working on that, but still need to land two more patch series before I
> want to send such a patch series to the list.

So is everybody happy with this line of thought that makes it
mandatory for each backend to decide and implement the commit-graph
support if they want to?

My reading of the later part of Taylor's message[*] tells me that at
least Taylor does not agree with that position, and I am not sure
about this design choice, either.  Surely, each backend can have its
own optimization, but looking at the way data from the commit-graph
and other auxiliary data files are used to optimize real operations
(like populating the essential fields of the commit object first
from the graph, only to read other things lazily from the object
database, or switching to completely different traversal machinery
when reachability bitmap is available), we cannot say that each
backend can store whatever side data they please and leave it at
that.  The code paths that are supposed to be generic need to be
aware of these side data used for optimization to some degree, so
conceptually it is much cleaner (well, at least to my eyes, that is)
to declare that the auxiliary data files like commit-graph and
reachability bitmaps are defined on the objects in the repository,
no matter what backend is used to store them.

[Reference]

 * https://lore.kernel.org/git/aMNWgD4wKa1a5R8g@nand.local/ 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-25 19:17           ` Junio C Hamano
@ 2025-09-26  5:18             ` Patrick Steinhardt
  2025-10-02 11:21             ` Patrick Steinhardt
  1 sibling, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2025-09-26  5:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee, git, Taylor Blau

On Thu, Sep 25, 2025 at 12:17:50PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > There is no inherent reason why a new backend would not be able to use
> > the existing commit-graph infrastructure indeed. But there are reasons
> > that specific backends may not want to do so. If objects are already
> > stored in a database table, then it may make way more sense to store
> > additional metadata that is currently stored in the commit-graph in a
> > secondary database table instead of in the commit graph.
> > ...
> > This is roughly what I have in my head right now. And I realize that
> > this information really should be sitting in a design document. I'm
> > working on that, but still need to land two more patch series before I
> > want to send such a patch series to the list.
> 
> So is everybody happy with this line of thought that makes it
> mandatory for each backend to decide and implement the commit-graph
> support if they want to?
> 
> My reading of the later part of Taylor's message[*] tells me that at
> least Taylor does not agree with that position, and I am not sure
> about this design choice, either.  Surely, each backend can have its
> own optimization, but looking at the way data from the commit-graph
> and other auxiliary data files are used to optimize real operations
> (like populating the essential fields of the commit object first
> from the graph, only to read other things lazily from the object
> database, or switching to completely different traversal machinery
> when reachability bitmap is available), we cannot say that each
> backend can store whatever side data they please and leave it at
> that.  The code paths that are supposed to be generic need to be
> aware of these side data used for optimization to some degree, so
> conceptually it is much cleaner (well, at least to my eyes, that is)
> to declare that the auxiliary data files like commit-graph and
> reachability bitmaps are defined on the objects in the repository,
> no matter what backend is used to store them.

Quick update: I haven't found the time to reply to review feedback yet.
I'll pick this topic up again in one or two weeks.

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-09-25 19:17           ` Junio C Hamano
  2025-09-26  5:18             ` Patrick Steinhardt
@ 2025-10-02 11:21             ` Patrick Steinhardt
  2025-10-02 11:35               ` Patrick Steinhardt
  2025-10-02 16:49               ` Junio C Hamano
  1 sibling, 2 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2025-10-02 11:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee, git, Taylor Blau

On Thu, Sep 25, 2025 at 12:17:50PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
> > There is no inherent reason why a new backend would not be able to use
> > the existing commit-graph infrastructure indeed. But there are reasons
> > that specific backends may not want to do so. If objects are already
> > stored in a database table, then it may make way more sense to store
> > additional metadata that is currently stored in the commit-graph in a
> > secondary database table instead of in the commit graph.
> > ...
> > This is roughly what I have in my head right now. And I realize that
> > this information really should be sitting in a design document. I'm
> > working on that, but still need to land two more patch series before I
> > want to send such a patch series to the list.
> 
> So is everybody happy with this line of thought that makes it
> mandatory for each backend to decide and implement the commit-graph
> support if they want to?
> 
> My reading of the later part of Taylor's message[*] tells me that at
> least Taylor does not agree with that position, and I am not sure
> about this design choice, either.  Surely, each backend can have its
> own optimization, but looking at the way data from the commit-graph
> and other auxiliary data files are used to optimize real operations
> (like populating the essential fields of the commit object first
> from the graph, only to read other things lazily from the object
> database, or switching to completely different traversal machinery
> when reachability bitmap is available), we cannot say that each
> backend can store whatever side data they please and leave it at
> that.  The code paths that are supposed to be generic need to be
> aware of these side data used for optimization to some degree, so
> conceptually it is much cleaner (well, at least to my eyes, that is)
> to declare that the auxiliary data files like commit-graph and
> reachability bitmaps are defined on the objects in the repository,
> no matter what backend is used to store them.

My intent here is mostly to allow us to swap out how exactly the data is
being cached. During the Git Merge I heard from some JJ developer (I
think) that they also have a pluggable cache, but they approach the
issue differently: instead of making the cache a property of the object
backend, they instead make the cache itself pluggable.

I think that's a worthwhile angle to explore. The cache would still sit
on the repository level, and it wouldn't have to care at all whether we
use loose objects/packfiles or any other backend. But in theory, we can
still swap it out for a different representation as desired.

Which overall means that we can defer this to a later point in time, as
we can make it pluggable independent from making the object database
itself pluggable.

So I'd propose to merge the first six patches, as everyone seemed to
agree that they improve the status quo, but drop the last patch that
moves the commit-graph into the ODB sources.

Does that seem reasonable to everyone? If so, I don't really see a
reason to reroll at this point. But please let me know in case I miss
anything that needs addressing.

Thanks!

Patrick

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-10-02 11:21             ` Patrick Steinhardt
@ 2025-10-02 11:35               ` Patrick Steinhardt
  2025-10-02 16:49               ` Junio C Hamano
  1 sibling, 0 replies; 26+ messages in thread
From: Patrick Steinhardt @ 2025-10-02 11:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee, git, Taylor Blau

On Thu, Oct 02, 2025 at 01:21:34PM +0200, Patrick Steinhardt wrote:
> On Thu, Sep 25, 2025 at 12:17:50PM -0700, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> > 
> > > There is no inherent reason why a new backend would not be able to use
> > > the existing commit-graph infrastructure indeed. But there are reasons
> > > that specific backends may not want to do so. If objects are already
> > > stored in a database table, then it may make way more sense to store
> > > additional metadata that is currently stored in the commit-graph in a
> > > secondary database table instead of in the commit graph.
> > > ...
> > > This is roughly what I have in my head right now. And I realize that
> > > this information really should be sitting in a design document. I'm
> > > working on that, but still need to land two more patch series before I
> > > want to send such a patch series to the list.
> > 
> > So is everybody happy with this line of thought that makes it
> > mandatory for each backend to decide and implement the commit-graph
> > support if they want to?
> > 
> > My reading of the later part of Taylor's message[*] tells me that at
> > least Taylor does not agree with that position, and I am not sure
> > about this design choice, either.  Surely, each backend can have its
> > own optimization, but looking at the way data from the commit-graph
> > and other auxiliary data files are used to optimize real operations
> > (like populating the essential fields of the commit object first
> > from the graph, only to read other things lazily from the object
> > database, or switching to completely different traversal machinery
> > when reachability bitmap is available), we cannot say that each
> > backend can store whatever side data they please and leave it at
> > that.  The code paths that are supposed to be generic need to be
> > aware of these side data used for optimization to some degree, so
> > conceptually it is much cleaner (well, at least to my eyes, that is)
> > to declare that the auxiliary data files like commit-graph and
> > reachability bitmaps are defined on the objects in the repository,
> > no matter what backend is used to store them.
> 
> My intent here is mostly to allow us to swap out how exactly the data is
> being cached. During the Git Merge I heard from some JJ developer (I
> think) that they also have a pluggable cache, but they approach the
> issue differently: instead of making the cache a property of the object
> backend, they instead make the cache itself pluggable.
> 
> I think that's a worthwhile angle to explore. The cache would still sit
> on the repository level, and it wouldn't have to care at all whether we
> use loose objects/packfiles or any other backend. But in theory, we can
> still swap it out for a different representation as desired.
> 
> Which overall means that we can defer this to a later point in time, as
> we can make it pluggable independent from making the object database
> itself pluggable.
> 
> So I'd propose to merge the first six patches, as everyone seemed to

Correction: first five patches, of course :)

Patrick

> agree that they improve the status quo, but drop the last patch that
> moves the commit-graph into the ODB sources.
> 
> Does that seem reasonable to everyone? If so, I don't really see a
> reason to reroll at this point. But please let me know in case I miss
> anything that needs addressing.
> 
> Thanks!
> 
> Patrick
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-10-02 11:21             ` Patrick Steinhardt
  2025-10-02 11:35               ` Patrick Steinhardt
@ 2025-10-02 16:49               ` Junio C Hamano
  2025-10-03 16:56                 ` Derrick Stolee
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2025-10-02 16:49 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Derrick Stolee, git, Taylor Blau

Patrick Steinhardt <ps@pks.im> writes:

> My intent here is mostly to allow us to swap out how exactly the data is
> being cached. During the Git Merge I heard from some JJ developer (I
> think) that they also have a pluggable cache, but they approach the
> issue differently: instead of making the cache a property of the object
> backend, they instead make the cache itself pluggable.
>
> I think that's a worthwhile angle to explore. The cache would still sit
> on the repository level, and it wouldn't have to care at all whether we
> use loose objects/packfiles or any other backend. But in theory, we can
> still swap it out for a different representation as desired.

The idea to allow these "caches" being pluggable to the system
independently from object store or reference store backends does
make quite a lot of sense.  If there is only one that is plugged,
that degenerates into the "side data like commit-graph and
reachability bitmaps do not belong to a specific object store
backend, but are defined over the objects known to the repository" I
was talking about in the message you are replying to, I think.

> Which overall means that we can defer this to a later point in time, as
> we can make it pluggable independent from making the object database
> itself pluggable.
>
> So I'd propose to merge the first six patches, as everyone seemed to
> agree that they improve the status quo, but drop the last patch that
> moves the commit-graph into the ODB sources.
>
> Does that seem reasonable to everyone? If so, I don't really see a
> reason to reroll at this point. But please let me know in case I miss
> anything that needs addressing.

I am fine with that, but let's hear from everyone ;-).

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] odb: track commit graphs via object source
  2025-10-02 16:49               ` Junio C Hamano
@ 2025-10-03 16:56                 ` Derrick Stolee
  0 siblings, 0 replies; 26+ messages in thread
From: Derrick Stolee @ 2025-10-03 16:56 UTC (permalink / raw)
  To: Junio C Hamano, Patrick Steinhardt; +Cc: git, Taylor Blau

On 10/2/2025 12:49 PM, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> 
>> My intent here is mostly to allow us to swap out how exactly the data is
>> being cached. During the Git Merge I heard from some JJ developer (I
>> think) that they also have a pluggable cache, but they approach the
>> issue differently: instead of making the cache a property of the object
>> backend, they instead make the cache itself pluggable.
>>
>> I think that's a worthwhile angle to explore. The cache would still sit
>> on the repository level, and it wouldn't have to care at all whether we
>> use loose objects/packfiles or any other backend. But in theory, we can
>> still swap it out for a different representation as desired.
> 
> The idea to allow these "caches" being pluggable to the system
> independently from object store or reference store backends does
> make quite a lot of sense.  If there is only one that is plugged,
> that degenerates into the "side data like commit-graph and
> reachability bitmaps do not belong to a specific object store
> backend, but are defined over the objects known to the repository" I
> was talking about in the message you are replying to, I think.
> 
>> Which overall means that we can defer this to a later point in time, as
>> we can make it pluggable independent from making the object database
>> itself pluggable.
>>
>> So I'd propose to merge the first six patches, as everyone seemed to
>> agree that they improve the status quo, but drop the last patch that
>> moves the commit-graph into the ODB sources.
>>
>> Does that seem reasonable to everyone? If so, I don't really see a
>> reason to reroll at this point. But please let me know in case I miss
>> anything that needs addressing.
> 
> I am fine with that, but let's hear from everyone ;-).

I am also fine with that approach. Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-10-03 16:56 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-04 12:49 [PATCH 0/6] odb: track commit graphs via object source Patrick Steinhardt
2025-09-04 12:49 ` [PATCH 1/6] blame: drop explicit check for commit graph Patrick Steinhardt
2025-09-11 22:09   ` Taylor Blau
2025-09-04 12:49 ` [PATCH 2/6] revision: " Patrick Steinhardt
2025-09-11 22:16   ` Taylor Blau
2025-09-04 12:49 ` [PATCH 3/6] commit-graph: return the prepared commit graph from `prepare_commit_graph()` Patrick Steinhardt
2025-09-11 22:25   ` Taylor Blau
2025-09-04 12:49 ` [PATCH 4/6] commit-graph: return commit graph from `repo_find_commit_pos_in_graph()` Patrick Steinhardt
2025-09-11 22:54   ` Taylor Blau
2025-09-04 12:49 ` [PATCH 5/6] commit-graph: pass graphs that are to be merged as parameter Patrick Steinhardt
2025-09-04 12:50 ` [PATCH 6/6] odb: move commit-graph into the object sources Patrick Steinhardt
2025-09-11 23:00   ` Taylor Blau
2025-09-04 23:12 ` [PATCH 0/6] odb: track commit graphs via object source Junio C Hamano
2025-09-05 18:29   ` Derrick Stolee
2025-09-08 11:17     ` Patrick Steinhardt
2025-09-08 14:46       ` Derrick Stolee
2025-09-10 11:38         ` Patrick Steinhardt
2025-09-25 19:17           ` Junio C Hamano
2025-09-26  5:18             ` Patrick Steinhardt
2025-10-02 11:21             ` Patrick Steinhardt
2025-10-02 11:35               ` Patrick Steinhardt
2025-10-02 16:49               ` Junio C Hamano
2025-10-03 16:56                 ` Derrick Stolee
2025-09-11 23:08       ` Taylor Blau
2025-09-04 23:27 ` Junio C Hamano
2025-09-05  6:18   ` Patrick Steinhardt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).