All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org
Subject: Re: [RFC PATCH 1/1] Define an extended tree format
Date: Wed, 1 Oct 2025 18:59:51 -0400	[thread overview]
Message-ID: <20251001225951.GA142037@peff.net> (raw)
In-Reply-To: <20251001211140.GA140550@peff.net>

On Wed, Oct 01, 2025 at 05:11:40PM -0400, Jeff King wrote:

> But there were two unexpected bits. The first is that the above doesn't
> just need a way to store the foreign-hash oid in a tree. It needs to be
> in the index, too.
> 
> We could perhaps use the same hackery there, storing sha1-...-foo in the
> index with the sentinel hash. Or perhaps we could extend the index to
> support holding foreign hashes. I'm not sure which would be less
> annoying. If the index and trees don't match, then tree-to-index
> comparisons get weird. If the index and filesystem don't match, then
> index to filesystem comparisons get weird. My gut feeling is that making
> index-to-filesystem comparisons weird (so putting the hacked name into
> the index) is probably going to be better, just because the filesystem
> is already out of our control. So we can do sorted comparisons between
> the index and trees, but filesystem operations are inherently asking
> about paths one by one. But I'd have to dig deeper to get more confident
> in that opinion.

I know everybody was on the edge of their seats, so here's how far I got
(I will probably be offline for travel for a day or two after this).

This patch generates the hacked-up index entry when you do a "submodule
add" (or even a regular "git add"). I suspect it is missing some corner
cases (in particular, I doubt that it would correctly update with "git
add -u" because the actual filesystem path appears untracked from the
naive perspective of the index).

It is not too ugly, I think:

diff --git a/read-cache.c b/read-cache.c
index 06ad74db22..77072df2f7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -708,6 +708,7 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 			  (intent_only ? ADD_CACHE_NEW_ONLY : 0));
 	unsigned hash_flags = pretend ? 0 : INDEX_WRITE_OBJECT;
 	struct object_id oid;
+	struct strbuf submodule_foreign_path = STRBUF_INIT;
 
 	if (flags & ADD_CACHE_RENORMALIZE)
 		hash_flags |= INDEX_RENORMALIZE;
@@ -719,6 +720,16 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 	if (S_ISDIR(st_mode)) {
 		if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0)
 			return error(_("'%s' does not have a commit checked out"), path);
+		if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo)) {
+			/* XXX maybe need to remove entry for "path"? */
+			strbuf_addf(&submodule_foreign_path, "%s-%s-%s",
+				    hash_algos[oid.algo].name,
+				    oid_to_hex(&oid),
+				    path);
+			path = submodule_foreign_path.buf;
+			namelen = submodule_foreign_path.len;
+		}
+
 		while (namelen && path[namelen-1] == '/')
 			namelen--;
 	}
@@ -768,7 +779,20 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 		}
 	}
 	if (!intent_only) {
-		if (index_path(istate, &ce->oid, path, st, hash_flags)) {
+		/*
+		 * We actually re-resolve submodules a second time in
+		 * index_path(). That's perhaps something we should
+		 * avoid in general (because it's racy), but it's
+		 * particularly important for our submodule name
+		 * hackery, since "path" is no longer something exists
+		 * in the filesystem at all, and we want to override it with
+		 * our fake sentinel OID anyway.
+		 */
+		if (submodule_foreign_path.len) {
+			/* probably a constant would be less horrific */
+			oidclr(&ce->oid, the_repository->hash_algo);
+			ce->oid.hash[the_repository->hash_algo->rawsz - 1] = 1;
+		} else if (index_path(istate, &ce->oid, path, st, hash_flags)) {
 			discard_cache_entry(ce);
 			return error(_("unable to index file '%s'"), path);
 		}


But then I took a look at the reading side, and it gets pretty ugly.
Often we have just a submodule name, and we end up referencing it in the
index. But of course we don't know if we need to use the hacked-up name
to do the lookup or not, because it depends on the sentinel hash, but we
don't always have a hash yet! So you'd have to speculatively look up
both names in lots of places.

So I think I do lean more towards the index having the full, correct
information, and just doing this hack at the tree level (and munging as
we read/write the trees).

Or of course deciding it's all too horrible and abandoning it. All of
this was meant to be an exploration. I do think it makes a lot of
potential problems and hassles go away if we can intermix submodules of
different hash algorithms. I'm undecided on whether the solution is
worth than the problem, though.

-Peff

  parent reply	other threads:[~2025-10-01 22:59 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-01  0:58 [RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts brian m. carlson
2025-10-01  0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson
2025-10-01 16:37   ` Junio C Hamano
2025-10-01 17:41   ` Jeff King
2025-10-01 21:11     ` Jeff King
2025-10-01 21:19       ` Junio C Hamano
2025-10-01 21:45       ` brian m. carlson
2025-10-01 23:00         ` Jeff King
2025-10-01 22:59       ` Jeff King [this message]
2025-10-01 21:21   ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251001225951.GA142037@peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=sandals@crustytoothpaste.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.