git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org
Subject: Re: [RFC PATCH 1/1] Define an extended tree format
Date: Wed, 1 Oct 2025 18:59:51 -0400	[thread overview]
Message-ID: <20251001225951.GA142037@peff.net> (raw)
In-Reply-To: <20251001211140.GA140550@peff.net>

On Wed, Oct 01, 2025 at 05:11:40PM -0400, Jeff King wrote:

> But there were two unexpected bits. The first is that the above doesn't
> just need a way to store the foreign-hash oid in a tree. It needs to be
> in the index, too.
> 
> We could perhaps use the same hackery there, storing sha1-...-foo in the
> index with the sentinel hash. Or perhaps we could extend the index to
> support holding foreign hashes. I'm not sure which would be less
> annoying. If the index and trees don't match, then tree-to-index
> comparisons get weird. If the index and filesystem don't match, then
> index to filesystem comparisons get weird. My gut feeling is that making
> index-to-filesystem comparisons weird (so putting the hacked name into
> the index) is probably going to be better, just because the filesystem
> is already out of our control. So we can do sorted comparisons between
> the index and trees, but filesystem operations are inherently asking
> about paths one by one. But I'd have to dig deeper to get more confident
> in that opinion.

I know everybody was on the edge of their seats, so here's how far I got
(I will probably be offline for travel for a day or two after this).

This patch generates the hacked-up index entry when you do a "submodule
add" (or even a regular "git add"). I suspect it is missing some corner
cases (in particular, I doubt that it would correctly update with "git
add -u" because the actual filesystem path appears untracked from the
naive perspective of the index).

It is not too ugly, I think:

diff --git a/read-cache.c b/read-cache.c
index 06ad74db22..77072df2f7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -708,6 +708,7 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 			  (intent_only ? ADD_CACHE_NEW_ONLY : 0));
 	unsigned hash_flags = pretend ? 0 : INDEX_WRITE_OBJECT;
 	struct object_id oid;
+	struct strbuf submodule_foreign_path = STRBUF_INIT;
 
 	if (flags & ADD_CACHE_RENORMALIZE)
 		hash_flags |= INDEX_RENORMALIZE;
@@ -719,6 +720,16 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 	if (S_ISDIR(st_mode)) {
 		if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0)
 			return error(_("'%s' does not have a commit checked out"), path);
+		if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo)) {
+			/* XXX maybe need to remove entry for "path"? */
+			strbuf_addf(&submodule_foreign_path, "%s-%s-%s",
+				    hash_algos[oid.algo].name,
+				    oid_to_hex(&oid),
+				    path);
+			path = submodule_foreign_path.buf;
+			namelen = submodule_foreign_path.len;
+		}
+
 		while (namelen && path[namelen-1] == '/')
 			namelen--;
 	}
@@ -768,7 +779,20 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 		}
 	}
 	if (!intent_only) {
-		if (index_path(istate, &ce->oid, path, st, hash_flags)) {
+		/*
+		 * We actually re-resolve submodules a second time in
+		 * index_path(). That's perhaps something we should
+		 * avoid in general (because it's racy), but it's
+		 * particularly important for our submodule name
+		 * hackery, since "path" is no longer something exists
+		 * in the filesystem at all, and we want to override it with
+		 * our fake sentinel OID anyway.
+		 */
+		if (submodule_foreign_path.len) {
+			/* probably a constant would be less horrific */
+			oidclr(&ce->oid, the_repository->hash_algo);
+			ce->oid.hash[the_repository->hash_algo->rawsz - 1] = 1;
+		} else if (index_path(istate, &ce->oid, path, st, hash_flags)) {
 			discard_cache_entry(ce);
 			return error(_("unable to index file '%s'"), path);
 		}


But then I took a look at the reading side, and it gets pretty ugly.
Often we have just a submodule name, and we end up referencing it in the
index. But of course we don't know if we need to use the hacked-up name
to do the lookup or not, because it depends on the sentinel hash, but we
don't always have a hash yet! So you'd have to speculatively look up
both names in lots of places.

So I think I do lean more towards the index having the full, correct
information, and just doing this hack at the tree level (and munging as
we read/write the trees).

Or of course deciding it's all too horrible and abandoning it. All of
this was meant to be an exploration. I do think it makes a lot of
potential problems and hassles go away if we can intermix submodules of
different hash algorithms. I'm undecided on whether the solution is
worth than the problem, though.

-Peff

  parent reply	other threads:[~2025-10-01 22:59 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-01  0:58 [RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts brian m. carlson
2025-10-01  0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson
2025-10-01 16:37   ` Junio C Hamano
2025-10-01 17:41   ` Jeff King
2025-10-01 21:11     ` Jeff King
2025-10-01 21:19       ` Junio C Hamano
2025-10-01 21:45       ` brian m. carlson
2025-10-01 23:00         ` Jeff King
2025-10-01 22:59       ` Jeff King [this message]
2025-10-01 21:21   ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251001225951.GA142037@peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).