[RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts
@ 2025-10-01  0:58 brian m. carlson
  2025-10-01  0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson
  0 siblings, 1 reply; 10+ messages in thread
From: brian m. carlson @ 2025-10-01  0:58 UTC (permalink / raw)
  To: git; +Cc: Jeff King

Today at the Contributor Summit we discussed two different proposals
that might benefit from some extensions to trees: submodules of a
different hash algorithm and first-class conflicts.

Peff suggested that we could do something hideous to the tree format to
allow for extensions to submodules of different algorithms.  In the
interests of prototyping things to see how much people like or hate the
idea, I've written up an example spec and included a script below so we
can see how current Git implementations handle this.  (This was part of
my lunchtime activity.)

Git itself handles this data as a submodule, which seems to be the least
bad option for a new mode type.  `git fsck` complains about the mode, as
expected.

I don't intend to actually implement this proposal or do anything
serious with it, but I offer it as a discussion piece about what we
could do if we wanted in a minimally incompatible way.  If you really
hate it, that's okay; I agree it's a little gross.

----
#!/bin/sh -e

# This script requires a printf that supports hex escapes because I'm too lazy
# to convert things into octal.

rm -fr test-repo
git init --object-format=sha1 -b dev test-repo
cd test-repo
tree_oid=$(/usr/bin/printf '130000 \x91\x82\x80s256\x01\xdf\xac\xbf\xee\xdf\xac\xbf\xee\xdf\xac\xbf\xeesubmodule\x00\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef' | git hash-object -t tree -w --literally --stdin)
commit_oid=$(git commit-tree -m + "$tree_oid")
git reset --hard "$commit_oid"
git show HEAD
----

brian m. carlson (1):
  Define an extended tree format

 Documentation/Makefile                     |  1 +
 Documentation/gitformat-extended-tree.adoc | 77 ++++++++++++++++++++++
 Documentation/meson.build                  |  1 +
 3 files changed, 79 insertions(+)
 create mode 100644 Documentation/gitformat-extended-tree.adoc

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/1] Define an extended tree format
  2025-10-01  0:58 [RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts brian m. carlson
@ 2025-10-01  0:58 ` brian m. carlson
  2025-10-01 16:37   ` Junio C Hamano
                     ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: brian m. carlson @ 2025-10-01  0:58 UTC (permalink / raw)
  To: git; +Cc: Jeff King

There are some cases in which we want to encode additional information
in a tree but there is currently no possible way to do so.  Define a
format for extended trees that uses mode 130000 plus some additional
nonzero bytes in the file name to encode additional data in a mostly
backwards compatible way.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/Makefile                     |  1 +
 Documentation/gitformat-extended-tree.adoc | 77 ++++++++++++++++++++++
 Documentation/meson.build                  |  1 +
 3 files changed, 79 insertions(+)
 create mode 100644 Documentation/gitformat-extended-tree.adoc

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 6fb83d0c6e..6d8ad220ed 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -33,6 +33,7 @@ MAN5_TXT += gitattributes.adoc
 MAN5_TXT += gitformat-bundle.adoc
 MAN5_TXT += gitformat-chunk.adoc
 MAN5_TXT += gitformat-commit-graph.adoc
+MAN5_TXT += gitformat-extended-tree.adoc
 MAN5_TXT += gitformat-index.adoc
 MAN5_TXT += gitformat-pack.adoc
 MAN5_TXT += gitformat-signature.adoc
diff --git a/Documentation/gitformat-extended-tree.adoc b/Documentation/gitformat-extended-tree.adoc
new file mode 100644
index 0000000000..fc28bb16c9
--- /dev/null
+++ b/Documentation/gitformat-extended-tree.adoc
@@ -0,0 +1,77 @@
+Extended Tree Format
+====================
+
+Rationale
+---------
+
+Git needs to store some additional types of data in repositories that we had
+previously not considered.  Unfortunately, we lack a good way to extend tree
+formats in a backwards compatible way.
+
+This document proposes an approach that adds an extended tree format that is
+backwards compatible with earlier versions of Git except that it will appear
+that the tree is improperly sorted.  This is done by encoding additional bytes
+that are guaranteed to be nonzero and that earlier versions of Git will
+consider part of the file name.
+
+Format
+------
+:ber: footnote:[This is the format used by the `w` pack format in Perl and Ruby.]
+
+An extended tree format consists of an entry that has mode 130000.
+
+A modified BER-encoded integer is a BER-encoded integer{ber} with the top bits
+of each byte flipped.  That is, values 0x00 through 0x7f are encoded as 0x80
+through 0xff, and larger values are encoded such that the bottom 7 bits of each
+byte include the value and the top bit is 0 if there is a subsequent byte of
+data and 1 if this is the last byte.  The shortest possible encoding must be
+used.  Note that the byte 0x00 is not valid in the encoded value.
+
+The first part of what is traditionally the file name consists of a modified
+BER-encoded integer representing the number of bytes in the extended
+information section not including this length itself; that is, the number of
+bytes which must be skipped to reach the file name.  This allows parsing
+unknown values in a graceful way.
+
+Following that is a modified BER-encoded integer representing the type of
+object and a modified BER-encoded integer consisting of flags.
+
+The object type is one of the following:
+
+0:: invalid
+1:: conflict tree (only valid at top level)
+2:: extended submodule
+
+The following flags are defined:
+
+1 (bit 0):: This entry is critical and failure to handle this object should be fatal.
+2 (bit 1):: Flags in this entry are critical and failure to understand them should be fatal.
+
+A conflict tree then has a modified BER-encoded integer representing the stage.
+
+Conflict trees are used to store a conflicted state.  The top-level tree
+contains only entries called stageN, where N is a decimal representation of the
+stage integer.  Resolved entries are always stored as stage 0.  The contents of
+each tree are those of the root tree at that stage.
+
+An extended submodule then has a 32-bit format ID representing the algorithm in
+question, followed by a nonzero byte called an offset.  If the object ID for
+the selected algorithm is longer than the algorithm for the tree, the remaining
+bytes of the object ID are encoded such that each one is XORed with the offset.
+The offset must be the smallest nonzero byte value such that all of the encoded
+bytes are also nonzero; an entry is malformed if this value is incorrect.
+
+If the object ID for the selected algorithm is shorter than the algorithm for
+the tree, the main object ID field is padded with zeros.
+
+Examples
+--------
+
+An extended submodule of a SHA-256 submodule in a SHA-1 repository might look
+like this if the submodule object name is
+deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef and the name
+is `submodule`:
+
+----
+130000 \x91\x82\x80s256\x01\xdf\xac\xbf\xee\xdf\xac\xbf\xee\xdf\xac\xbf\xeesubmodule\x00\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef
+----
diff --git a/Documentation/meson.build b/Documentation/meson.build
index e34965c5b0..e2257050e8 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -172,6 +172,7 @@ manpages = {
   'gitformat-bundle.adoc' : 5,
   'gitformat-chunk.adoc' : 5,
   'gitformat-commit-graph.adoc' : 5,
+  'gitformat-extended-tree.adoc' : 5,
   'gitformat-index.adoc' : 5,
   'gitformat-pack.adoc' : 5,
   'gitformat-signature.adoc' : 5,

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/1] Define an extended tree format
  2025-10-01  0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson
@ 2025-10-01 16:37   ` Junio C Hamano
  2025-10-01 17:41   ` Jeff King
  2025-10-01 21:21   ` Elijah Newren
  2 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2025-10-01 16:37 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> +The first part of what is traditionally the file name consists of a modified
> +BER-encoded integer representing the number of bytes in the extended
> +information section not including this length itself; that is, the number of
> +bytes which must be skipped to reach the file name.  This allows parsing
> +unknown values in a graceful way.

In short, you prepend extra bytes that contain no NUL in front of
the pathname part of the tree entry, but that extra bytes begins
with the length to allow skipping.  

The tree entries must be in sorted order, but I presume that this
extra information merely happens to be encoded inside pathname
field, but does not contribute to the ordering of the entries?

For debuggability, I am not sure if modified BER is the best format
(it is essentially binary gibberish).

For extensibility, I wonder if we should allow more than (type,
flags) to be given in the future?  I guess a new <type> will allow
extra things after <flags> (like conflict tree takes stage number),
so the format is already extensible (we do not want to see people
willy-nilly add extended stuff to the tree object anyway, since it
would allow the logically same object represented by multiple byte
streams that are not bit-for-bit identical, so extensibility is a
double-edged sword, but still necessary).

When we have more than one stage #1 entries (can happen in a
criss-cross merge) or multiple stage #3 entries (octopus), what sort
order do we apply to them for conflict tree entries?

As you said, it does look ugly ;-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/1] Define an extended tree format
  2025-10-01  0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson
  2025-10-01 16:37   ` Junio C Hamano
@ 2025-10-01 17:41   ` Jeff King
  2025-10-01 21:11     ` Jeff King
  2025-10-01 21:21   ` Elijah Newren
  2 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2025-10-01 17:41 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

On Wed, Oct 01, 2025 at 12:58:14AM +0000, brian m. carlson wrote:

> There are some cases in which we want to encode additional information
> in a tree but there is currently no possible way to do so.  Define a
> format for extended trees that uses mode 130000 plus some additional
> nonzero bytes in the file name to encode additional data in a mostly
> backwards compatible way.

Thanks for writing this up. This was along the lines I was thinking, but
I have a few variant suggestions to consider.

Your proposal here covers a very general extension format which can be
used for solving several possible problems. But if we constrain
ourselves just to the gitlink question, it does open up some more
possibilities. Here you're using a custom mode to mark the entry, and
the downside is that the mode will be somewhat confusing to old versions
of Git. They won't recognize a gitlink as such, and their fsck will
complain about it. It also leaves the question of what goes into the
hash/object-id field of the entry. We can't represent the true hash
(that is the whole problem we are trying to solve). You could put in the
null sha1, but that is another fsck landmine. So probably we have some
dummy hash. But old versions, not knowing it's a submodule, will expect
that hash to be reachable and will complain that it's missing!

What if instead, we leave the mode as S_IFGITLINK, but use a dummy hash
to recognize the extended submodule? Let's say 0000...1. We risk
colliding with a real hash, but the chances are quite unlikely, and it's
a risk we already take with the null oid. With this scheme, the old
versions will still realize it's a submodule, and that we do not need to
have access to that hash. If an old version checks it out without
submodule recursion, we'll never even try to access that hash. And if it
does, it will try to clone the appropriate gitmodule and say "oops, we
don't have that hash". Which is probably the best outcome we can hope
for.

The obvious downside is that this does nothing to help the
stored-conflict case. _If_ we are going to come up with a solution for
that, I agree it might make sense to piggy-back on it for submodules.
But I do think what I outlined above degrades a bit more gracefully. And
the compatibility scenarios for these two use cases may be different.
Alternate-hash submodule entries will be buried in old history and seen
many people. Conflict markers are probably shorter-lived and less likely
to be seen by people who aren't using modern versions of Git to work
with them.

The other way in which your proposal differs from what I was thinking is
in the use of BER in the filename. It is nice that it can unambiguously
encode a chunk of data, but the fact that it contains binary bytes makes
me worried about a few things:

  1. When users see it, does it look like junk? It is nice, IMHO, if the
     name does leak out to the user for it to be a bit more
     self-explanatory.

  2. When a filesystem sees it, can it handle it? I didn't think hard on
     it, but I'd guess BER generally isn't valid UTF-8. What will HFS+
     do with such a filename?

  3. To what degree can BER encoded bytes conflict with meaningful
     path names? In particular I wonder if "/" can appear, which will
     cause confusion (both in git-fsck, but also when we try to work in
     a directory that does not exist). Or worse, if you can sneak
     ".git/" into a name in a way that is interpreted by _some_ versions
     of Git but not others.

So I was thinking instead of something more ASCII-ish. Something like
"<link_type>-<link_data>-<pathname>". So in a sha256 tree, an entry for
a sha1 submodule might look like (as shown by ls-tree):

  160000 00000000000000000000000000000001 sha1-12345abcde12345abcde12345abcde12345abcde-the-real-path

The sentinel sha256 oid "000...1" would be binary in the tree, of
course, but the sha1 in the filename would be hex. It's not as
efficient, but I don't think that submodules are common enough for this
to be a real issue.

I was vague about "link_data" above. But I think if it is interpreted in
a manner specific to link_type, then this scheme could also eventually
enable non-git submodules. E.g., for the true masochist, you could
imagine svn-<url>-foo (with <url> percent-encoded to avoid dashes and
slashes). I'm not sure if anyone would ever find that useful, and it's
not something I'd plan to look into myself, but maybe it's a bonus?

The other issue of concern is sorting. If sha1-...-foo sorts as "foo",
then older versions will complain the tree is wrongly sorted. And
probably will produce slightly wrong diffs. I think modern Git would
Just Work in most spots, though, provided the handling is done in
decode_tree_entry() or similar.

Sorting the other way, as "sha1-", is mostly going to be a nightmare for
modern Git. If we want it to truly show "foo" in a diff, for example, it
would have to scan forward for any instance of "sha1-...-aaa" before
showing "aab". Yuck.

But what if we accepted that these entries were a bit more user-visible?
That is, the diff itself would not do anything special at all, and you
really would get a delete/add for:

  -sha1-12345abcde12345abcde12345abcde12345abcde-foo
  +sha1-67890fedcb67890fedcb67890fedcb67890fedcb-foo

instead of the more correct path-changing diff. And then either we
accept that ugliness, or we fix up the diff for display via diffcore or
similar (dropping those ugly entries and replacing them with a single
diff_filepair of "foo").

And likewise the submodule code would leave these ugly entries until the
moment it is ready to act on them, when it would realize that we want to
touch "foo" in the checkout (and both here and in a diff, obviously it
is an error if there is a "foo" already, but that is true already for
any tree with duplicate entries).

My gut feeling is that would keep the gross-ness confined to a few bits
of code, and you could still operate on these natively with tools like
mktree, ls-tree, and so on. The ugly names will leak out to the user
sometimes, but most porcelain flows would do the right thing.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/1] Define an extended tree format
  2025-10-01 17:41   ` Jeff King
@ 2025-10-01 21:11     ` Jeff King
  2025-10-01 21:19       ` Junio C Hamano
                         ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jeff King @ 2025-10-01 21:11 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

On Wed, Oct 01, 2025 at 01:41:10PM -0400, Jeff King wrote:

> My gut feeling is that would keep the gross-ness confined to a few bits
> of code, and you could still operate on these natively with tools like
> mktree, ls-tree, and so on. The ugly names will leak out to the user
> sometimes, but most porcelain flows would do the right thing.

I was hoping make a proof-of-concept patch to get more familiar with
exactly what it would look like for various commands. So the scenario I
came up with was this:

-- >8 --
#!/bin/sh

# you can have a sha256 module in a sha1 repo or vice versa
sup_algo=$1; shift
sub_algo=$1; shift

rm -rf sub super clone

# a boring sub-project with one commit
git init --object-format=$sub_algo sub &&
(
	cd sub &&
	echo content >file &&
	git add file &&
	git commit -m msg
) &&

# the superproject includes it
git init --object-format=$sup_algo super &&
(
	cd super &&
	git -c protocol.file.allow=always submodule add "$PWD/../sub" foo &&
	git commit -m 'add sub'
) &&

# and then in theory we can clone it and check out the submodule
git -c protocol.file.allow=always clone --recurse-submodules super clone
-- >8 --

But there were two unexpected bits. The first is that the above doesn't
just need a way to store the foreign-hash oid in a tree. It needs to be
in the index, too.

We could perhaps use the same hackery there, storing sha1-...-foo in the
index with the sentinel hash. Or perhaps we could extend the index to
support holding foreign hashes. I'm not sure which would be less
annoying. If the index and trees don't match, then tree-to-index
comparisons get weird. If the index and filesystem don't match, then
index to filesystem comparisons get weird. My gut feeling is that making
index-to-filesystem comparisons weird (so putting the hacked name into
the index) is probably going to be better, just because the filesystem
is already out of our control. So we can do sorted comparisons between
the index and trees, but filesystem operations are inherently asking
about paths one by one. But I'd have to dig deeper to get more confident
in that opinion.

The other unexpected thing is that the sequence above works today! I
think it is mostly accidental, though. When we read the head of the
submodule, we get a "struct object_id" with a different algo field. But
we never pay attention to that field, and just treat it as if it is
using our native algorithm.

So if I use a sha256 submodule in a sha1 repo, the final part of the
output (in the recursive clone) looks like this:

  Submodule path 'foo': checked out 'db8b8f8006f4564e7862b246c4d57100790e2196'

We've truncated the sha256 to 20 bytes (40 hex) and used that instead of
the full oid. But Git is flexible enough that it is happy to find the
object by its abbreviated hash, and checks it out without complaining.

If we go the other way, with a sha1 submodule in a sha256 repo, we get
this final line:

  Submodule path 'foo': checked out '03f08992d562b03e2af5b4256e17b82be8eafa98000000000000000000000000'

I expected to get random garbage at the end, since we only copied in the
first 20 bytes, but since c98d762ed9 (global: ensure that object IDs are
always padded, 2024-06-14) we zero-pad all oids. The intent there was
making internal comparisons fast, but it happens to help us out here. I
am a little puzzled that Git is willing to check out the zero-padded
name, but I wonder if we simply parse it into an object_id struct and
then only look at the first algo->raw_size bytes.

So it all works as expected, but I feel like it's mostly by accident. My
gut feeling is that we probably wanted something like this to protect us
from confusion:

index 06ad74db22..295b0c6318 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -719,6 +719,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 	if (S_ISDIR(st_mode)) {
 		if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0)
 			return error(_("'%s' does not have a commit checked out"), path);
+		if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo))
+			return error(_("object format %s of '%s' is incompatible with this repository (%s)"), hash_algos[oid.algo].name, path, the_repository->hash_algo->name);
 		while (namelen && path[namelen-1] == '/')
 			namelen--;
 	}

Of course that is strictly worse for somebody who is relying on the
current accidental behavior. ;) And in the long run, I think this is the
spot we'd want to hook to do whatever massaging we need (whether
converting to the equivalent in-repo algorithm, or hacking up the name
to store the foreign hash).

I also won't be at all surprised if you've run across this already in
your interop work.

-Peff

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/1] Define an extended tree format
  2025-10-01 21:11     ` Jeff King
@ 2025-10-01 21:19       ` Junio C Hamano
  2025-10-01 21:45       ` brian m. carlson
  2025-10-01 22:59       ` Jeff King
  2 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2025-10-01 21:19 UTC (permalink / raw)
  To: Jeff King; +Cc: brian m. carlson, git

Jeff King <peff@peff.net> writes:

> The other unexpected thing is that the sequence above works today! I
> think it is mostly accidental, though. When we read the head of the
> submodule, we get a "struct object_id" with a different algo field. But
> we never pay attention to that field, and just treat it as if it is
> using our native algorithm.

Yeah, I recall that there was a discussion about this exact "yes, we
get by fine with SHA-1 commit object name 0-padded to the right,
when embedding SHA-1 submodule inside SHA-256 superproject",
countered by brian's "no, it may happen to work but you shouldn't
rely on it" recently.

> So it all works as expected, but I feel like it's mostly by accident. My
> gut feeling is that we probably wanted something like this to protect us
> from confusion:
>
> index 06ad74db22..295b0c6318 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -719,6 +719,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
>  	if (S_ISDIR(st_mode)) {
>  		if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0)
>  			return error(_("'%s' does not have a commit checked out"), path);
> +		if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo))
> +			return error(_("object format %s of '%s' is incompatible with this repository (%s)"), hash_algos[oid.algo].name, path, the_repository->hash_algo->name);
>  		while (namelen && path[namelen-1] == '/')
>  			namelen--;
>  	}
>
> Of course that is strictly worse for somebody who is relying on the
> current accidental behavior. ;) And in the long run, I think this is the
> spot we'd want to hook to do whatever massaging we need (whether
> converting to the equivalent in-repo algorithm, or hacking up the name
> to store the foreign hash).
>
> I also won't be at all surprised if you've run across this already in
> your interop work.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/1] Define an extended tree format
  2025-10-01 21:11     ` Jeff King
  2025-10-01 21:19       ` Junio C Hamano
@ 2025-10-01 21:45       ` brian m. carlson
  2025-10-01 23:00         ` Jeff King
  2025-10-01 22:59       ` Jeff King
  2 siblings, 1 reply; 10+ messages in thread
From: brian m. carlson @ 2025-10-01 21:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 1907 bytes --]

On 2025-10-01 at 21:11:40, Jeff King wrote:
> So it all works as expected, but I feel like it's mostly by accident. My
> gut feeling is that we probably wanted something like this to protect us
> from confusion:
> 
> index 06ad74db22..295b0c6318 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -719,6 +719,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
>  	if (S_ISDIR(st_mode)) {
>  		if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0)
>  			return error(_("'%s' does not have a commit checked out"), path);
> +		if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo))
> +			return error(_("object format %s of '%s' is incompatible with this repository (%s)"), hash_algos[oid.algo].name, path, the_repository->hash_algo->name);
>  		while (namelen && path[namelen-1] == '/')
>  			namelen--;
>  	}
> 
> Of course that is strictly worse for somebody who is relying on the
> current accidental behavior. ;) And in the long run, I think this is the
> spot we'd want to hook to do whatever massaging we need (whether
> converting to the equivalent in-repo algorithm, or hacking up the name
> to store the foreign hash).

I think sending in this patch is a good first step right now.  The
intention of the transition document is that the code is in one and only
one hash algorithm at a time and it would be better to reject this case
until we're ready to wire it up correctly than end up with data that's
corrupt down the line.

> I also won't be at all surprised if you've run across this already in
> your interop work.

I have not, since it's primarily involved working with the testsuite and
some test repositories.  I know that submodules are broken in interop
mode; I didn't need to test that until I was writing code to make them
not-broken.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/1] Define an extended tree format
  2025-10-01 21:45       ` brian m. carlson
@ 2025-10-01 23:00         ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2025-10-01 23:00 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

On Wed, Oct 01, 2025 at 09:45:29PM +0000, brian m. carlson wrote:

> > Of course that is strictly worse for somebody who is relying on the
> > current accidental behavior. ;) And in the long run, I think this is the
> > spot we'd want to hook to do whatever massaging we need (whether
> > converting to the equivalent in-repo algorithm, or hacking up the name
> > to store the foreign hash).
> 
> I think sending in this patch is a good first step right now.  The
> intention of the transition document is that the code is in one and only
> one hash algorithm at a time and it would be better to reject this case
> until we're ready to wire it up correctly than end up with data that's
> corrupt down the line.

OK. I'll try to clean it up and add some tests, but it may not be until
next week.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/1] Define an extended tree format
  2025-10-01 21:11     ` Jeff King
  2025-10-01 21:19       ` Junio C Hamano
  2025-10-01 21:45       ` brian m. carlson
@ 2025-10-01 22:59       ` Jeff King
  2 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2025-10-01 22:59 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git

On Wed, Oct 01, 2025 at 05:11:40PM -0400, Jeff King wrote:

> But there were two unexpected bits. The first is that the above doesn't
> just need a way to store the foreign-hash oid in a tree. It needs to be
> in the index, too.
> 
> We could perhaps use the same hackery there, storing sha1-...-foo in the
> index with the sentinel hash. Or perhaps we could extend the index to
> support holding foreign hashes. I'm not sure which would be less
> annoying. If the index and trees don't match, then tree-to-index
> comparisons get weird. If the index and filesystem don't match, then
> index to filesystem comparisons get weird. My gut feeling is that making
> index-to-filesystem comparisons weird (so putting the hacked name into
> the index) is probably going to be better, just because the filesystem
> is already out of our control. So we can do sorted comparisons between
> the index and trees, but filesystem operations are inherently asking
> about paths one by one. But I'd have to dig deeper to get more confident
> in that opinion.

I know everybody was on the edge of their seats, so here's how far I got
(I will probably be offline for travel for a day or two after this).

This patch generates the hacked-up index entry when you do a "submodule
add" (or even a regular "git add"). I suspect it is missing some corner
cases (in particular, I doubt that it would correctly update with "git
add -u" because the actual filesystem path appears untracked from the
naive perspective of the index).

It is not too ugly, I think:

diff --git a/read-cache.c b/read-cache.c
index 06ad74db22..77072df2f7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -708,6 +708,7 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 			  (intent_only ? ADD_CACHE_NEW_ONLY : 0));
 	unsigned hash_flags = pretend ? 0 : INDEX_WRITE_OBJECT;
 	struct object_id oid;
+	struct strbuf submodule_foreign_path = STRBUF_INIT;
 
 	if (flags & ADD_CACHE_RENORMALIZE)
 		hash_flags |= INDEX_RENORMALIZE;
@@ -719,6 +720,16 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 	if (S_ISDIR(st_mode)) {
 		if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0)
 			return error(_("'%s' does not have a commit checked out"), path);
+		if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo)) {
+			/* XXX maybe need to remove entry for "path"? */
+			strbuf_addf(&submodule_foreign_path, "%s-%s-%s",
+				    hash_algos[oid.algo].name,
+				    oid_to_hex(&oid),
+				    path);
+			path = submodule_foreign_path.buf;
+			namelen = submodule_foreign_path.len;
+		}
+
 		while (namelen && path[namelen-1] == '/')
 			namelen--;
 	}
@@ -768,7 +779,20 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 		}
 	}
 	if (!intent_only) {
-		if (index_path(istate, &ce->oid, path, st, hash_flags)) {
+		/*
+		 * We actually re-resolve submodules a second time in
+		 * index_path(). That's perhaps something we should
+		 * avoid in general (because it's racy), but it's
+		 * particularly important for our submodule name
+		 * hackery, since "path" is no longer something exists
+		 * in the filesystem at all, and we want to override it with
+		 * our fake sentinel OID anyway.
+		 */
+		if (submodule_foreign_path.len) {
+			/* probably a constant would be less horrific */
+			oidclr(&ce->oid, the_repository->hash_algo);
+			ce->oid.hash[the_repository->hash_algo->rawsz - 1] = 1;
+		} else if (index_path(istate, &ce->oid, path, st, hash_flags)) {
 			discard_cache_entry(ce);
 			return error(_("unable to index file '%s'"), path);
 		}


But then I took a look at the reading side, and it gets pretty ugly.
Often we have just a submodule name, and we end up referencing it in the
index. But of course we don't know if we need to use the hacked-up name
to do the lookup or not, because it depends on the sentinel hash, but we
don't always have a hash yet! So you'd have to speculatively look up
both names in lots of places.

So I think I do lean more towards the index having the full, correct
information, and just doing this hack at the tree level (and munging as
we read/write the trees).

Or of course deciding it's all too horrible and abandoning it. All of
this was meant to be an exploration. I do think it makes a lot of
potential problems and hassles go away if we can intermix submodules of
different hash algorithms. I'm undecided on whether the solution is
worth than the problem, though.

-Peff

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/1] Define an extended tree format
  2025-10-01  0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson
  2025-10-01 16:37   ` Junio C Hamano
  2025-10-01 17:41   ` Jeff King
@ 2025-10-01 21:21   ` Elijah Newren
  2 siblings, 0 replies; 10+ messages in thread
From: Elijah Newren @ 2025-10-01 21:21 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Jeff King

On Tue, Sep 30, 2025 at 5:58 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> +Following that is a modified BER-encoded integer representing the type of
> +object and a modified BER-encoded integer consisting of flags.
> +
> +The object type is one of the following:
> +
> +0:: invalid
> +1:: conflict tree (only valid at top level)

I think there may have been some miscommunication.  There was
discussion around "at the toplevel" and "trees," but I believe "at the
toplevel" was meant to refer to the commit itself -- specifically, for
each commit, at the highest level of the objects it references (i.e.,
the commit object).  The idea is that conflict information would be
represented by multiple traditional toplevel trees (or possibly even
commits?) recorded in the commit object, with no conflicts recorded
within any individual tree.  Instead, the existence of multiple
distinct toplevel trees is what would signal a conflict.

Recording conflict information in trees rather than commits, as this
proposal suggests (even if only at the toplevel tree), introduces a
problem similar to what Martin mentioned yesterday regarding blobs.
In particular, doing so would require every `git log` operation --
which wants to determine if a commit is conflicted so it can highlight
it for users -- to walk not only the commits but also additional
objects.  Even if limited to just one extra level for toplevel trees,
it effectively doubles the objects traversed for non-diffing log
operations, impacting performance for everyone, not just those using
conflict headers.  I’d prefer to avoid imposing a performance penalty
on users who don’t use any new features associated with first-class
conflicts.

Additionally, I believe this approach would diverge from how jj and
GitButler record conflicted commits.  I’d rather follow the path they
paved to benefit from their expertise and minimize (if not eliminate)
any possible rework they need for continued interoperability with Git
versions that gain features they first implemented.

We may find reason to revisit this as things progress, but for now, I
don’t conflicted trees needs to be included in your modified tree
proposal.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-10-01 23:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-01  0:58 [RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts brian m. carlson
2025-10-01  0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson
2025-10-01 16:37   ` Junio C Hamano
2025-10-01 17:41   ` Jeff King
2025-10-01 21:11     ` Jeff King
2025-10-01 21:19       ` Junio C Hamano
2025-10-01 21:45       ` brian m. carlson
2025-10-01 23:00         ` Jeff King
2025-10-01 22:59       ` Jeff King
2025-10-01 21:21   ` Elijah Newren

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).