From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yb1-f181.google.com (mail-yb1-f181.google.com [209.85.219.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7780C1A08A3 for ; Thu, 11 Sep 2025 23:08:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.181 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757632133; cv=none; b=mesLMr1pASGcnMXyEAq0OUB/HKfE891O7qcgwwdtWxyeq5GJt8XEsHrSqeH86ohoAS30+Zd7Wauk4+dYDYEN39yZUOOqKY2ZLGEMIo2231Pp+AKR08rL7pAVJHVBLTOFjnC3wVyaXzi/rQRQRugwk3xWWJjSjh8VBkRjr57Ugxg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757632133; c=relaxed/simple; bh=lVbxRLhP8ko029w11DTFXzqY55BJZ/M8Fxq3z/ZQfBw=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=bFEvNO5ry0G+1/bmlH5YzNw+tcyyG78GODDx/c+MVv0XBFYMmnT6skY69CGU182Hmhc+5ThsTYZcmFp1mWGIzTJezeaD8Qad2l130gcq3cgbHGD2uLpm7oGTQ54w5zruOnw5JFytoDM+0wCbW6t4v01B7cYykPgnj1xtPx6hUk8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=ttaylorr.com; spf=pass smtp.mailfrom=ttaylorr.com; dkim=pass (2048-bit key) header.d=ttaylorr.com header.i=@ttaylorr.com header.b=bJ38iMGy; arc=none smtp.client-ip=209.85.219.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=ttaylorr.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=ttaylorr.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ttaylorr.com header.i=@ttaylorr.com header.b="bJ38iMGy" Received: by mail-yb1-f181.google.com with SMTP id 3f1490d57ef6-e9baa0310cbso1343610276.2 for ; Thu, 11 Sep 2025 16:08:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ttaylorr.com; s=google; t=1757632130; x=1758236930; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=l+dscmgwI2M86panvaaLIzmE9ZrK5gUP2Q0sYAH9+GM=; b=bJ38iMGywleE+toL4JJxml7MZEb/nrt0vO51N6gE66O5IpSrDeNalfLKtEbNO/fKom ZyEmstQl4sDz+h53H7z4aA8hA9iFS+ozie9OH5KJmEktNAhrjJqYZ+0aKGk+UiEiN/4j 6XumhSa7S4af7Qu+/DLfDe0/9f3Sv2PI6foPz3VUZhOlg8mR1AmU83BTSHHgEPsXQyxF R98TsHyytliD37LzDh3VnnJkgyZfKBvieuWLXkhJ4bsKwooBTOPonOTifdeTDXnPR3d2 24qE+zKOXHhzvW5zBxLa6xbL5Pn3pMhw9sNFOEtJPBOiYT2Df7Nh3URW2WJMrLYkVq2/ qV/g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1757632130; x=1758236930; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=l+dscmgwI2M86panvaaLIzmE9ZrK5gUP2Q0sYAH9+GM=; b=LcCv8/TAClXYN+sSoJMEYVoh4a6xIXkoCLbqV5Z2DrJKrpEoF1Oz6y0v6xvTPtiVgU Jr2DpB2N4z3ea8EfQ4pw9tMfmMmFX66qO54DLrET9tCnI7gr2KBEo/IKCBZ8opsF9M+O 0U6UqtZ5DNZlq5eHuiU18SOeKynTF9LNZdaRBHdKglPQaP0d68gKqy2/7ZPL2j1wsYl6 MB93G3xsOL2WapGx0F1vJegVQjtTuMd5V2zumdeIxUBwprCPEq/dBL04iuLnCwO8P6Pf hz+hPJ+FKhxT1FgFjbgaIFPRSepz6iXkuSIx9+ZoUlt0BcLxX6HxxcfFQk9nO2DRHHVt hzjg== X-Forwarded-Encrypted: i=1; AJvYcCUsVAsanMo27TjGQhVQeuJWDj/nhe7ThWWugvYsr4XBVFEPfpCpNCIkpVmRblX8Qoro0II=@vger.kernel.org X-Gm-Message-State: AOJu0YyDaJxAQqgH6W0AtlnoCv8VUPfNkjkmbnj8QnurKCEpqQjD2ay3 Q+zZM+Fm1yC49GSViLXxMtnCk7N1Pp9OywMHRZK9fjFOfQdRmjuyw0jEcmr+C2BPeFA= X-Gm-Gg: ASbGncvQpFfNkU2ZoTD8NX7OEDhPgWQiNs1gQH1NTYvP96ohOzxP+htUGr+HilnXHGt +uvo6yiyfq94AShq3RCS9dcl9hlSPq0MwRZoVA1XfpoXBI0DS+G7CrX94qSL/dGAiIcibF4RF5d F2oDdSk4bY/fWHRJi+//OdHBmLmThj/a5YI5tJSvc5vlyun4RorLYvQ/bKmX7Uv4+8p9ZJ4w9pU EhYZtXtglcRu0gbbjUvvQJ6reFLPe3XjI5mjc54KEpdjFU4sz8TKsP5S2y8GXwovmF3nwa6pCzp vNcf/beB22KweF7mX9sDpeBlfiKeGOlfsw4Lca5ONfIO6EMAXqlE4CaXy7Cy6BmqFvyx17IkEFY S/fUnHCPmh/FJDndaoTHU4AYSHU6YJGtT2dnTup/6yTE3YwvkWbpC8DZX30+lIDHs+Iv7Wlixdn WupxJpwv0+EcS9XrsvI7xw7+hntw== X-Google-Smtp-Source: AGHT+IFItNrqlQH5dL9sNOwkaFMqchhgQ9OtWs0/AJplLYpeNiYNILxMEXoiDQSNOwNE9nSyQIWcEQ== X-Received: by 2002:a05:6902:5401:b0:e93:46dc:41ce with SMTP id 3f1490d57ef6-ea3d9a7aaeamr808801276.24.1757632130327; Thu, 11 Sep 2025 16:08:50 -0700 (PDT) Received: from localhost (104-178-186-189.lightspeed.milwwi.sbcglobal.net. [104.178.186.189]) by smtp.gmail.com with UTF8SMTPSA id 3f1490d57ef6-ea3cf26020asm871254276.29.2025.09.11.16.08.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Sep 2025 16:08:49 -0700 (PDT) Date: Thu, 11 Sep 2025 19:08:48 -0400 From: Taylor Blau To: Patrick Steinhardt Cc: Derrick Stolee , Junio C Hamano , git@vger.kernel.org Subject: Re: [PATCH 0/6] odb: track commit graphs via object source Message-ID: References: <20250904-b4-pks-commit-graph-via-source-v1-0-d932c2481e1a@pks.im> Precedence: bulk X-Mailing-List: git@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: On Mon, Sep 08, 2025 at 01:17:39PM +0200, Patrick Steinhardt wrote: > On Fri, Sep 05, 2025 at 02:29:50PM -0400, Derrick Stolee wrote: > > On 9/4/2025 7:12 PM, Junio C Hamano wrote: > > > Patrick Steinhardt writes: > > > > > >> commit graphs are currently stored on the object database level. This > > >> doesn't really make much sense conceptually, given that commit graphs > > >> are specific to one object source. Furthermore, with the upcoming > > >> pluggable object database effort, an object source's backend may not > > >> evene have a commit graph in the first place but store that information > > >> in a different format altogether. > > >> > > >> This patch series prepares for that by moving the commit graph from > > >> `struct object_database` into `struct odb_source`. > > > > > > Hmph, I am finding the above hard to agree with at the conceptual > > > level. In some future, we may use multiple object stores in a > > > single repository. Perhaps we would be storing older parts of > > > history in semi-online storage while newer parts are stored in > > > readily available storage. But the side data structure that allows > > > us to quickly learn who are parents of one commit is without having > > > to go to the object store in order to parse the actualy commit > > > object can be stored for the entire history if we wanted to, or more > > > recent part of the history but not limited to the "readily available > > > storage" part. IOW, where the boundary between the older and the > > > newer parts of the history lies and which commits the commit graph > > > covers should be pretty much independent. > > > > > > So moving from object_database (i.e. the whole world) to individual > > > odb_source (i.e. where one particular subset of the history is > > > stored) feels like totally backwards to me. Surely, a commit graph > > > file may be defined over a set of packfiles and remaining loose > > > object files, but it is not like an instance of the commit-graph > > > file is tied to packfiles in the sense that it uses the index into > > > some packfile instead of the actual object names to refer to > > > commits, or anything like that (this is quite different from other > > > files that are very specific to a single object store, like midx > > > that is tied to the packfiles it describes). > > > > This is an interesting aspect to things, where the commit-graph file > > is a "structured cache" of certain commit information. It happens to > > be located within the object stores (either local or in an alternate) > > but is conceptually different in a few ways. > > > > The biggest difference is that you can only open one commit-graph > > (or chain of commit-graphs). Having multiple files across different > > object stores will not accumulate additional context. Instead, we > > have a "first one wins" approach. > > > > This does seem to be something that you are attempting to change > > by including the ability to load a commit-graph for each odb (and > > closing them in sequence as we close a repo). > > > > So in this sense, the commit-graph lives at the repository level, > > not an object store level. When doing I/O to write or read a graph, > > we use a specific object store at a time. > > > > The other direction to consider is what context we have when we > > interact with a commit-graph. We generally are parsing commits from > > a repository or loading Bloom filter data during file history walks. > > Each of these do not have a predictable nature of which object store > > will "own" the commit we are inspecting, so it wouldn't make sense > > to restrict things like odb_parse_commit() over repo_parse_commit(). > > > > With these thoughts in mind, I have these big-picture thoughts: > > > > 1. Patches 1-5 are great. Nice cleanups. > > > > 2. Some of Patch 6 is great, including having the I/O methods use > > an odb_source to help focus the specific location of the files > > being read or written. However, the movement of the struct into > > the odb_source makes less sense and should still exist at the > > object_database level. > > I (probably unsurprisingly :)) don't quite agree with this. > > Let's take a step back: why does the commit-graph exist in the first > place? It basically provides a caching mechanism to efficiently return > information that is otherwise more expensive to obtain: > > - It contains a cached representation of the graph so that we don't > have to parse each commit from the object database. > > - It encodes generation numbers. > > - It contains bloom filters. I don't think these three things are all on the same conceptual footing. In the current object storage format, yes, part of the commit-graph's responsibility is to provide fast access to the repository's commits and their ancestors. That is a pure optimization that is inherently tied to the object storage format, which necessitates such an optimization for certain operations to be efficient. Generation numbers and changed-path Bloom filters are different. Yes, they *can* be used to optimize certain operations, but they also provide *new* information that cannot be otherwise stored in the existing object format. Suppose that we have some future object storage backend that stores commit objects in such a way that querying them natively through whatever mechanism that store provides is already as efficient as using a commit-graph file on-disk today. That eliminates the "pure optimization" aspect of the commit-graph. But that object store implementation will still want to implement additional features like generation numbers and changed-path Bloom filters, because those provide new information that is not purely about optimizing reads. > All of which makes sense with the current design of our object storage > format, because obtaining this information can be quite expensive. But > let's consider a different world where we for example store objects in a > proper database: > > - This database may have an efficient way to compute generation > numbers on the fly, either when reading an object or when writing it > to disk. We cannot currently store that information in the packfile > right now, so it needs to be stored out-of-band. But with a database > there is no reason why we couldn't immediately compute and store the > generation number on each insert. Yes, but that does not mean that other parts of the code are not themselves interested in what generation number a commit has for their own purposes. > - This database may have an efficient way to store bloom filters next > to a specific commit directly, without requiring a separate file. Same thought as above. > So I would claim that the commit graph is specifically tied to the > actual storage format of objects, and it's not at all obvious that it > would need to exist if we had a different storage format. I am not sure I agree here. What I'm trying to say is that though it's easy to imagine an object storage format that stores objects in such a way that the commit-graph file's representation of commits is unnecessary, there is additional information in what we currently call the commit-graph that is useful regardless of how fast the actual object store is. > The goal of this patch series is thus explicitly _not_ to allow loading > one commit graph per object source. In fact, the refactorings I did > ensure that we still only ever load a single commit graph. Right, though if I understand correctly I think there *is* a behavior change here in that we will happily fall back to an alternate ODB's commit-graph if the main object store does not have one. That is not how it works today. Thanks, Taylor