From: Patrick Steinhardt <ps@pks.im>
To: Derrick Stolee <stolee@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [PATCH 0/6] odb: track commit graphs via object source
Date: Mon, 8 Sep 2025 13:17:39 +0200
Message-ID: <aL67U0-tw7O-y6_X@pks.im>
In-Reply-To: <cf7aeda1-297a-4805-b0ae-e379ce11bbcf@gmail.com>

On Fri, Sep 05, 2025 at 02:29:50PM -0400, Derrick Stolee wrote:
> On 9/4/2025 7:12 PM, Junio C Hamano wrote:
> > Patrick Steinhardt <ps@pks.im> writes:
> >
> >> commit graphs are currently stored on the object database level. This
> >> doesn't really make much sense conceptually, given that commit graphs
> >> are specific to one object source. Furthermore, with the upcoming
> >> pluggable object database effort, an object source's backend may not
> >> even have a commit graph in the first place but store that information
> >> in a different format altogether.
> >>
> >> This patch series prepares for that by moving the commit graph from
> >> `struct object_database` into `struct odb_source`.
> >
> > Hmph, I am finding the above hard to agree with at the conceptual
> > level. In some future, we may use multiple object stores in a
> > single repository. Perhaps we would be storing older parts of
> > history in semi-online storage while newer parts are stored in
> > readily available storage. But the side data structure that allows
> > us to quickly learn who are parents of one commit without having
> > to go to the object store in order to parse the actual commit
> > object can be stored for the entire history if we wanted to, or more
> > recent part of the history but not limited to the "readily available
> > storage" part. IOW, where the boundary between the older and the
> > newer parts of the history lies and which commits the commit graph
> > covers should be pretty much independent.
> >
> > So moving from object_database (i.e. the whole world) to individual
> > odb_source (i.e. where one particular subset of the history is
> > stored) feels like totally backwards to me. Surely, a commit graph
> > file may be defined over a set of packfiles and remaining loose
> > object files, but it is not like an instance of the commit-graph
> > file is tied to packfiles in the sense that it uses the index into
> > some packfile instead of the actual object names to refer to
> > commits, or anything like that (this is quite different from other
> > files that are very specific to a single object store, like midx
> > that is tied to the packfiles it describes).
>
> This is an interesting aspect to things, where the commit-graph file
> is a "structured cache" of certain commit information. It happens to
> be located within the object stores (either local or in an alternate)
> but is conceptually different in a few ways.
>
> The biggest difference is that you can only open one commit-graph
> (or chain of commit-graphs). Having multiple files across different
> object stores will not accumulate additional context. Instead, we
> have a "first one wins" approach.
>
> This does seem to be something that you are attempting to change
> by including the ability to load a commit-graph for each odb (and
> closing them in sequence as we close a repo).
>
> So in this sense, the commit-graph lives at the repository level,
> not an object store level. When doing I/O to write or read a graph,
> we use a specific object store at a time.
>
> The other direction to consider is what context we have when we
> interact with a commit-graph. We generally are parsing commits from
> a repository or loading Bloom filter data during file history walks.
> Each of these do not have a predictable nature of which object store
> will "own" the commit we are inspecting, so it wouldn't make sense
> to restrict things like odb_parse_commit() over repo_parse_commit().
>
> With these thoughts in mind, I have these big-picture thoughts:
>
> 1. Patches 1-5 are great. Nice cleanups.
>
> 2. Some of Patch 6 is great, including having the I/O methods use
> an odb_source to help focus the specific location of the files
> being read or written. However, the movement of the struct into
> the odb_source makes less sense and should still exist at the
> object_database level.

I (probably unsurprisingly :)) don't quite agree with this.

Let's take a step back: why does the commit-graph exist in the first
place? It basically provides a caching mechanism to efficiently return
information that is otherwise more expensive to obtain:

  - It contains a cached representation of the graph so that we don't
    have to parse each commit from the object database.

  - It encodes generation numbers.

  - It contains Bloom filters.

All of which makes sense with the current design of our object storage
format, because obtaining this information can be quite expensive. But
let's consider a different world where we, for example, store objects in
a proper database:

  - This database may have an efficient way to compute generation
    numbers on the fly, either when reading an object or when writing
    it to disk. We currently cannot store that information in the
    packfile, so it needs to be stored out-of-band. But with a database
    there is no reason why we couldn't immediately compute and store
    the generation number on each insert.

  - This database may have an efficient way to store Bloom filters next
    to a specific commit directly, without requiring a separate file.

  - This database may be distributed. So why should every client have
    to recompute a commit graph if we can instead store the data in the
    database and thus make it accessible to all of its clients?

  - It may be _less_ efficient to use the commit graph to access data
    compared to what that database can provide.
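
To illustrate the first point, here is a minimal sketch of what
computing the generation number on insert could look like. It assumes
topological levels (one more than the highest parent generation, 1 for
root commits) and that parents are always inserted before their
children. `struct commit_row` and both functions are made up for
illustration and exist neither in Git nor in any particular database
backend:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row of a commit table; not a Git data structure. */
struct commit_row {
	uint32_t generation;               /* stored at insert time */
	size_t nr_parents;
	const struct commit_row **parents; /* already-inserted parents */
};

/*
 * Topological level of a commit that is about to be inserted: one more
 * than the highest generation of any parent, or 1 for a root commit.
 */
uint32_t generation_on_insert(const struct commit_row **parents,
			      size_t nr_parents)
{
	uint32_t max = 0;

	for (size_t i = 0; i < nr_parents; i++) {
		/* Assumption: every parent was inserted before its child. */
		assert(parents[i]->generation > 0);
		if (parents[i]->generation > max)
			max = parents[i]->generation;
	}
	return max + 1;
}

/* Record the parents and store the generation alongside the commit. */
void insert_commit(struct commit_row *row,
		   const struct commit_row **parents, size_t nr_parents)
{
	row->parents = parents;
	row->nr_parents = nr_parents;
	row->generation = generation_on_insert(parents, nr_parents);
}
```

With something along these lines, a backend would never need a separate
pass over the history to backfill generation numbers, because each
insert only looks at data that is already stored.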

So I would claim that the commit graph is specifically tied to the
actual storage format of objects, and it's not at all obvious that it
would need to exist if we had a different storage format.

The goal of this patch series is thus explicitly _not_ to allow loading
one commit graph per object source. In fact, the refactorings I did
ensure that we still only ever load a single commit graph.

Instead, the goal is to allow each object source to decide for itself
how this additional information is to be stored and retrieved. This
_may_ be a commit graph if that makes sense for a particular storage
format. But it may just as well _not_ be a commit graph, as other
storage formats may have way better solutions for making the commit
graph information accessible.
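
One way to picture this, purely as a sketch and not as what this series
implements: each source exposes a callback that answers "what is the
generation number of this commit?", and the caller does not care
whether the answer comes from a commit-graph-like cache or from the
backend itself. All names below (`struct sketch_source`,
`struct sketch_source_ops`, the two backends) are hypothetical, and
commits are identified by a plain integer instead of an object ID to
keep the example short:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; none of these exist in Git. */
struct commit_meta {
	uint32_t generation;
};

struct sketch_source;

struct sketch_source_ops {
	/* Return 0 and fill `out` if the source knows the commit, -1 if not. */
	int (*lookup_commit_meta)(struct sketch_source *src, uint32_t commit_id,
				  struct commit_meta *out);
};

struct sketch_source {
	const struct sketch_source_ops *ops;
	void *data;                 /* backend-specific state */
	struct sketch_source *next; /* next source, e.g. an alternate */
};

/* "Files" backend: answers from a commit-graph-like cache. */
struct graph_cache {
	size_t nr;
	const uint32_t *generations; /* indexed by commit_id */
};

int files_lookup(struct sketch_source *src, uint32_t commit_id,
		 struct commit_meta *out)
{
	const struct graph_cache *graph = src->data;

	if (!graph || commit_id >= graph->nr)
		return -1; /* no cache, or commit not covered by it */
	out->generation = graph->generations[commit_id];
	return 0;
}

/* "Database" backend: the generation is a column of the commit row. */
struct db_row {
	uint32_t commit_id;
	uint32_t generation;
};

struct db_table {
	size_t nr;
	const struct db_row *rows;
};

int db_lookup(struct sketch_source *src, uint32_t commit_id,
	      struct commit_meta *out)
{
	const struct db_table *table = src->data;

	for (size_t i = 0; i < table->nr; i++) {
		if (table->rows[i].commit_id == commit_id) {
			out->generation = table->rows[i].generation;
			return 0;
		}
	}
	return -1;
}

const struct sketch_source_ops files_ops = { files_lookup };
const struct sketch_source_ops db_ops = { db_lookup };

/* Ask each source in order; the first one that can answer wins. */
int lookup_commit_meta(struct sketch_source *sources, uint32_t commit_id,
		       struct commit_meta *out)
{
	assert(out);
	for (struct sketch_source *src = sources; src; src = src->next)
		if (!src->ops->lookup_commit_meta(src, commit_id, out))
			return 0;
	return -1; /* caller would fall back to parsing the commit object */
}
```

The point of the indirection is that the "files" backend can keep using
a commit graph while a different backend answers the same question from
its own storage, without the caller having to know which one it is
talking to.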

Patrick