* [RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts @ 2025-10-01 0:58 brian m. carlson 2025-10-01 0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson 0 siblings, 1 reply; 10+ messages in thread From: brian m. carlson @ 2025-10-01 0:58 UTC (permalink / raw) To: git; +Cc: Jeff King Today at the Contributor Summit we discussed two different proposals that might benefit from some extensions to trees: submodules of a different hash algorithm and first-class conflicts. Peff suggested that we could do something hideous to the tree format to allow for extensions to submodules of different algorithms. In the interests of prototyping things to see how much people like or hate the idea, I've written up an example spec and included a script below so we can see how current Git implementations handle this. (This was part of my lunchtime activity.) Git itself handles this data as a submodule, which seems to be the least bad option for a new mode type. `git fsck` complains about the mode, as expected. I don't intend to actually implement this proposal or do anything serious with it, but I offer it as a discussion piece about what we could do if we wanted in a minimally incompatible way. If you really hate it, that's okay; I agree it's a little gross. ---- #!/bin/sh -e # This script requires a printf that supports hex escapes because I'm too lazy # to convert things into octal. rm -fr test-repo git init --object-format=sha1 -b dev test-repo cd test-repo tree_oid=$(/usr/bin/printf '130000 \x91\x82\x80s256\x01\xdf\xac\xbf\xee\xdf\xac\xbf\xee\xdf\xac\xbf\xeesubmodule\x00\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef' | git hash-object -t tree -w --literally --stdin) commit_oid=$(git commit-tree -m + "$tree_oid") git reset --hard "$commit_oid" git show HEAD ---- brian m. carlson (1): Define an extended tree format Documentation/Makefile | 1 + Documentation/gitformat-extended-tree.adoc | 77 ++++++++++++++++++++++ Documentation/meson.build | 1 + 3 files changed, 79 insertions(+) create mode 100644 Documentation/gitformat-extended-tree.adoc ^ permalink raw reply [flat|nested] 10+ messages in thread
* [RFC PATCH 1/1] Define an extended tree format 2025-10-01 0:58 [RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts brian m. carlson @ 2025-10-01 0:58 ` brian m. carlson 2025-10-01 16:37 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 10+ messages in thread From: brian m. carlson @ 2025-10-01 0:58 UTC (permalink / raw) To: git; +Cc: Jeff King There are some cases in which we want to encode additional information in a tree but there is currently no possible way to do so. Define a format for extended trees that uses mode 130000 plus some additional nonzero bytes in the file name to encode additional data in a mostly backwards compatible way. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> --- Documentation/Makefile | 1 + Documentation/gitformat-extended-tree.adoc | 77 ++++++++++++++++++++++ Documentation/meson.build | 1 + 3 files changed, 79 insertions(+) create mode 100644 Documentation/gitformat-extended-tree.adoc diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fb83d0c6e..6d8ad220ed 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -33,6 +33,7 @@ MAN5_TXT += gitattributes.adoc MAN5_TXT += gitformat-bundle.adoc MAN5_TXT += gitformat-chunk.adoc MAN5_TXT += gitformat-commit-graph.adoc +MAN5_TXT += gitformat-extended-tree.adoc MAN5_TXT += gitformat-index.adoc MAN5_TXT += gitformat-pack.adoc MAN5_TXT += gitformat-signature.adoc diff --git a/Documentation/gitformat-extended-tree.adoc b/Documentation/gitformat-extended-tree.adoc new file mode 100644 index 0000000000..fc28bb16c9 --- /dev/null +++ b/Documentation/gitformat-extended-tree.adoc @@ -0,0 +1,77 @@ +Extended Tree Format +==================== + +Rationale +--------- + +Git needs to store some additional types of data in repositories that we had +previously not considered. Unfortunately, we lack a good way to extend tree +formats in a backwards compatible way. + +This document proposes an approach that adds an extended tree format that is +backwards compatible with earlier versions of Git except that it will appear +that the tree is improperly sorted. This is done by encoding additional bytes +that are guaranteed to be nonzero and that earlier versions of Git will +consider part of the file name. + +Format +------ +:ber: footnote:[This is the format used by the `w` pack format in Perl and Ruby.] + +An extended tree format consists of an entry that has mode 130000. + +A modified BER-encoded integer is a BER-encoded integer{ber} with the top bits +of each byte flipped. That is, values 0x00 through 0x7f are encoded as 0x80 +through 0xff, and larger values are encoded such that the bottom 7 bits of each +byte include the value and the top bit is 0 if there is a subsequent byte of +data and 1 if this is the last byte. The shortest possible encoding must be +used. Note that the byte 0x00 is not valid in the encoded value. + +The first part of what is traditionally the file name consists of a modified +BER-encoded integer representing the number of bytes in the extended +information section not including this length itself; that is, the number of +bytes which must be skipped to reach the file name. This allows parsing +unknown values in a graceful way. + +Following that is a modified BER-encoded integer representing the type of +object and a modified BER-encoded integer consisting of flags. + +The object type is one of the following: + +0:: invalid +1:: conflict tree (only valid at top level) +2:: extended submodule + +The following flags are defined: + +1 (bit 0):: This entry is critical and failure to handle this object should be fatal. +2 (bit 1):: Flags in this entry are critical and failure to understand them should be fatal. + +A conflict tree then has a modified BER-encoded integer representing the stage. + +Conflict trees are used to store a conflicted state. The top-level tree +contains only entries called stageN, where N is a decimal representation of the +stage integer. Resolved entries are always stored as stage 0. The contents of +each tree are those of the root tree at that stage. + +An extended submodule then has a 32-bit format ID representing the algorithm in +question, followed by a nonzero byte called an offset. If the object ID for +the selected algorithm is longer than the algorithm for the tree, the remaining +bytes of the object ID are encoded such that each one is XORed with the offset. +The offset must be the smallest nonzero byte value such that all of the encoded +bytes are also nonzero; an entry is malformed if this value is incorrect. + +If the object ID for the selected algorithm is shorter than the algorithm for +the tree, the main object ID field is padded with zeros. + +Examples +-------- + +An extended submodule of a SHA-256 submodule in a SHA-1 repository might look +like this if the submodule object name is +deadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeefdeadbeef and the name +is `submodule`: + +---- +130000 \x91\x82\x80s256\x01\xdf\xac\xbf\xee\xdf\xac\xbf\xee\xdf\xac\xbf\xeesubmodule\x00\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef\xde\xad\xbe\xef +---- diff --git a/Documentation/meson.build b/Documentation/meson.build index e34965c5b0..e2257050e8 100644 --- a/Documentation/meson.build +++ b/Documentation/meson.build @@ -172,6 +172,7 @@ manpages = { 'gitformat-bundle.adoc' : 5, 'gitformat-chunk.adoc' : 5, 'gitformat-commit-graph.adoc' : 5, + 'gitformat-extended-tree.adoc' : 5, 'gitformat-index.adoc' : 5, 'gitformat-pack.adoc' : 5, 'gitformat-signature.adoc' : 5, ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 1/1] Define an extended tree format 2025-10-01 0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson @ 2025-10-01 16:37 ` Junio C Hamano 2025-10-01 17:41 ` Jeff King 2025-10-01 21:21 ` Elijah Newren 2 siblings, 0 replies; 10+ messages in thread From: Junio C Hamano @ 2025-10-01 16:37 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Jeff King "brian m. carlson" <sandals@crustytoothpaste.net> writes: > +The first part of what is traditionally the file name consists of a modified > +BER-encoded integer representing the number of bytes in the extended > +information section not including this length itself; that is, the number of > +bytes which must be skipped to reach the file name. This allows parsing > +unknown values in a graceful way. In short, you prepend extra bytes that contain no NUL in front of the pathname part of the tree entry, but that extra bytes begins with the length to allow skipping. The tree entries must be in sorted order, but I presume that this extra information merely happens to be encoded inside pathname field, but does not contribute to the ordering of the entries? For debuggability, I am not sure if modified BER is the best format (it is essentially binary gibberish). For extensibility, I wonder if we should allow more than (type, flags) to be given in the future? I guess a new <type> will allow extra things after <flags> (like conflict tree takes stage number), so the format is already extensible (we do not want to see people willy-nilly add extended stuff to the tree object anyway, since it would allow the logically same object represented by multiple byte streams that are not bit-for-bit identical, so extensibility is a double-edged sword, but still necessary). When we have more than one stage #1 entries (can happen in a criss-cross merge) or multiple stage #3 entries (octopus), what sort order do we apply to them for conflict tree entries? As you said, it does look ugly ;-) ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 1/1] Define an extended tree format 2025-10-01 0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson 2025-10-01 16:37 ` Junio C Hamano @ 2025-10-01 17:41 ` Jeff King 2025-10-01 21:11 ` Jeff King 2025-10-01 21:21 ` Elijah Newren 2 siblings, 1 reply; 10+ messages in thread From: Jeff King @ 2025-10-01 17:41 UTC (permalink / raw) To: brian m. carlson; +Cc: git On Wed, Oct 01, 2025 at 12:58:14AM +0000, brian m. carlson wrote: > There are some cases in which we want to encode additional information > in a tree but there is currently no possible way to do so. Define a > format for extended trees that uses mode 130000 plus some additional > nonzero bytes in the file name to encode additional data in a mostly > backwards compatible way. Thanks for writing this up. This was along the lines I was thinking, but I have a few variant suggestions to consider. Your proposal here covers a very general extension format which can be used for solving several possible problems. But if we constrain ourselves just to the gitlink question, it does open up some more possibilities. Here you're using a custom mode to mark the entry, and the downside is that the mode will be somewhat confusing to old versions of Git. They won't recognize a gitlink as such, and their fsck will complain about it. It also leaves the question of what goes into the hash/object-id field of the entry. We can't represent the true hash (that is the whole problem we are trying to solve). You could put in the null sha1, but that is another fsck landmine. So probably we have some dummy hash. But old versions, not knowing it's a submodule, will expect that hash to be reachable and will complain that it's missing! What if instead, we leave the mode as S_IFGITLINK, but use a dummy hash to recognize the extended submodule? Let's say 0000...1. We risk colliding with a real hash, but the chances are quite unlikely, and it's a risk we already take with the null oid. With this scheme, the old versions will still realize it's a submodule, and that we do not need to have access to that hash. If an old version checks it out without submodule recursion, we'll never even try to access that hash. And if it does, it will try to clone the appropriate gitmodule and say "oops, we don't have that hash". Which is probably the best outcome we can hope for. The obvious downside is that this does nothing to help the stored-conflict case. _If_ we are going to come up with a solution for that, I agree it might make sense to piggy-back on it for submodules. But I do think what I outlined above degrades a bit more gracefully. And the compatibility scenarios for these two use cases may be different. Alternate-hash submodule entries will be buried in old history and seen many people. Conflict markers are probably shorter-lived and less likely to be seen by people who aren't using modern versions of Git to work with them. The other way in which your proposal differs from what I was thinking is in the use of BER in the filename. It is nice that it can unambiguously encode a chunk of data, but the fact that it contains binary bytes makes me worried about a few things: 1. When users see it, does it look like junk? It is nice, IMHO, if the name does leak out to the user for it to be a bit more self-explanatory. 2. When a filesystem sees it, can it handle it? I didn't think hard on it, but I'd guess BER generally isn't valid UTF-8. What will HFS+ do with such a filename? 3. To what degree can BER encoded bytes conflict with meaningful path names? In particular I wonder if "/" can appear, which will cause confusion (both in git-fsck, but also when we try to work in a directory that does not exist). Or worse, if you can sneak ".git/" into a name in a way that is interpreted by _some_ versions of Git but not others. So I was thinking instead of something more ASCII-ish. Something like "<link_type>-<link_data>-<pathname>". So in a sha256 tree, an entry for a sha1 submodule might look like (as shown by ls-tree): 160000 00000000000000000000000000000001 sha1-12345abcde12345abcde12345abcde12345abcde-the-real-path The sentinel sha256 oid "000...1" would be binary in the tree, of course, but the sha1 in the filename would be hex. It's not as efficient, but I don't think that submodules are common enough for this to be a real issue. I was vague about "link_data" above. But I think if it is interpreted in a manner specific to link_type, then this scheme could also eventually enable non-git submodules. E.g., for the true masochist, you could imagine svn-<url>-foo (with <url> percent-encoded to avoid dashes and slashes). I'm not sure if anyone would ever find that useful, and it's not something I'd plan to look into myself, but maybe it's a bonus? The other issue of concern is sorting. If sha1-...-foo sorts as "foo", then older versions will complain the tree is wrongly sorted. And probably will produce slightly wrong diffs. I think modern Git would Just Work in most spots, though, provided the handling is done in decode_tree_entry() or similar. Sorting the other way, as "sha1-", is mostly going to be a nightmare for modern Git. If we want it to truly show "foo" in a diff, for example, it would have to scan forward for any instance of "sha1-...-aaa" before showing "aab". Yuck. But what if we accepted that these entries were a bit more user-visible? That is, the diff itself would not do anything special at all, and you really would get a delete/add for: -sha1-12345abcde12345abcde12345abcde12345abcde-foo +sha1-67890fedcb67890fedcb67890fedcb67890fedcb-foo instead of the more correct path-changing diff. And then either we accept that ugliness, or we fix up the diff for display via diffcore or similar (dropping those ugly entries and replacing them with a single diff_filepair of "foo"). And likewise the submodule code would leave these ugly entries until the moment it is ready to act on them, when it would realize that we want to touch "foo" in the checkout (and both here and in a diff, obviously it is an error if there is a "foo" already, but that is true already for any tree with duplicate entries). My gut feeling is that would keep the gross-ness confined to a few bits of code, and you could still operate on these natively with tools like mktree, ls-tree, and so on. The ugly names will leak out to the user sometimes, but most porcelain flows would do the right thing. -Peff ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 1/1] Define an extended tree format 2025-10-01 17:41 ` Jeff King @ 2025-10-01 21:11 ` Jeff King 2025-10-01 21:19 ` Junio C Hamano ` (2 more replies) 0 siblings, 3 replies; 10+ messages in thread From: Jeff King @ 2025-10-01 21:11 UTC (permalink / raw) To: brian m. carlson; +Cc: git On Wed, Oct 01, 2025 at 01:41:10PM -0400, Jeff King wrote: > My gut feeling is that would keep the gross-ness confined to a few bits > of code, and you could still operate on these natively with tools like > mktree, ls-tree, and so on. The ugly names will leak out to the user > sometimes, but most porcelain flows would do the right thing. I was hoping make a proof-of-concept patch to get more familiar with exactly what it would look like for various commands. So the scenario I came up with was this: -- >8 -- #!/bin/sh # you can have a sha256 module in a sha1 repo or vice versa sup_algo=$1; shift sub_algo=$1; shift rm -rf sub super clone # a boring sub-project with one commit git init --object-format=$sub_algo sub && ( cd sub && echo content >file && git add file && git commit -m msg ) && # the superproject includes it git init --object-format=$sup_algo super && ( cd super && git -c protocol.file.allow=always submodule add "$PWD/../sub" foo && git commit -m 'add sub' ) && # and then in theory we can clone it and check out the submodule git -c protocol.file.allow=always clone --recurse-submodules super clone -- >8 -- But there were two unexpected bits. The first is that the above doesn't just need a way to store the foreign-hash oid in a tree. It needs to be in the index, too. We could perhaps use the same hackery there, storing sha1-...-foo in the index with the sentinel hash. Or perhaps we could extend the index to support holding foreign hashes. I'm not sure which would be less annoying. If the index and trees don't match, then tree-to-index comparisons get weird. If the index and filesystem don't match, then index to filesystem comparisons get weird. My gut feeling is that making index-to-filesystem comparisons weird (so putting the hacked name into the index) is probably going to be better, just because the filesystem is already out of our control. So we can do sorted comparisons between the index and trees, but filesystem operations are inherently asking about paths one by one. But I'd have to dig deeper to get more confident in that opinion. The other unexpected thing is that the sequence above works today! I think it is mostly accidental, though. When we read the head of the submodule, we get a "struct object_id" with a different algo field. But we never pay attention to that field, and just treat it as if it is using our native algorithm. So if I use a sha256 submodule in a sha1 repo, the final part of the output (in the recursive clone) looks like this: Submodule path 'foo': checked out 'db8b8f8006f4564e7862b246c4d57100790e2196' We've truncated the sha256 to 20 bytes (40 hex) and used that instead of the full oid. But Git is flexible enough that it is happy to find the object by its abbreviated hash, and checks it out without complaining. If we go the other way, with a sha1 submodule in a sha256 repo, we get this final line: Submodule path 'foo': checked out '03f08992d562b03e2af5b4256e17b82be8eafa98000000000000000000000000' I expected to get random garbage at the end, since we only copied in the first 20 bytes, but since c98d762ed9 (global: ensure that object IDs are always padded, 2024-06-14) we zero-pad all oids. The intent there was making internal comparisons fast, but it happens to help us out here. I am a little puzzled that Git is willing to check out the zero-padded name, but I wonder if we simply parse it into an object_id struct and then only look at the first algo->raw_size bytes. So it all works as expected, but I feel like it's mostly by accident. My gut feeling is that we probably wanted something like this to protect us from confusion: index 06ad74db22..295b0c6318 100644 --- a/read-cache.c +++ b/read-cache.c @@ -719,6 +719,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, if (S_ISDIR(st_mode)) { if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0) return error(_("'%s' does not have a commit checked out"), path); + if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo)) + return error(_("object format %s of '%s' is incompatible with this repository (%s)"), hash_algos[oid.algo].name, path, the_repository->hash_algo->name); while (namelen && path[namelen-1] == '/') namelen--; } Of course that is strictly worse for somebody who is relying on the current accidental behavior. ;) And in the long run, I think this is the spot we'd want to hook to do whatever massaging we need (whether converting to the equivalent in-repo algorithm, or hacking up the name to store the foreign hash). I also won't be at all surprised if you've run across this already in your interop work. -Peff ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 1/1] Define an extended tree format 2025-10-01 21:11 ` Jeff King @ 2025-10-01 21:19 ` Junio C Hamano 2025-10-01 21:45 ` brian m. carlson 2025-10-01 22:59 ` Jeff King 2 siblings, 0 replies; 10+ messages in thread From: Junio C Hamano @ 2025-10-01 21:19 UTC (permalink / raw) To: Jeff King; +Cc: brian m. carlson, git Jeff King <peff@peff.net> writes: > The other unexpected thing is that the sequence above works today! I > think it is mostly accidental, though. When we read the head of the > submodule, we get a "struct object_id" with a different algo field. But > we never pay attention to that field, and just treat it as if it is > using our native algorithm. Yeah, I recall that there was a discussion about this exact "yes, we get by fine with SHA-1 commit object name 0-padded to the right, when embedding SHA-1 submodule inside SHA-256 superproject", countered by brian's "no, it may happen to work but you shouldn't rely on it" recently. > So it all works as expected, but I feel like it's mostly by accident. My > gut feeling is that we probably wanted something like this to protect us > from confusion: > > index 06ad74db22..295b0c6318 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -719,6 +719,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, > if (S_ISDIR(st_mode)) { > if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0) > return error(_("'%s' does not have a commit checked out"), path); > + if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo)) > + return error(_("object format %s of '%s' is incompatible with this repository (%s)"), hash_algos[oid.algo].name, path, the_repository->hash_algo->name); > while (namelen && path[namelen-1] == '/') > namelen--; > } > > Of course that is strictly worse for somebody who is relying on the > current accidental behavior. ;) And in the long run, I think this is the > spot we'd want to hook to do whatever massaging we need (whether > converting to the equivalent in-repo algorithm, or hacking up the name > to store the foreign hash). > > I also won't be at all surprised if you've run across this already in > your interop work. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 1/1] Define an extended tree format 2025-10-01 21:11 ` Jeff King 2025-10-01 21:19 ` Junio C Hamano @ 2025-10-01 21:45 ` brian m. carlson 2025-10-01 23:00 ` Jeff King 2025-10-01 22:59 ` Jeff King 2 siblings, 1 reply; 10+ messages in thread From: brian m. carlson @ 2025-10-01 21:45 UTC (permalink / raw) To: Jeff King; +Cc: git [-- Attachment #1: Type: text/plain, Size: 1907 bytes --] On 2025-10-01 at 21:11:40, Jeff King wrote: > So it all works as expected, but I feel like it's mostly by accident. My > gut feeling is that we probably wanted something like this to protect us > from confusion: > > index 06ad74db22..295b0c6318 100644 > --- a/read-cache.c > +++ b/read-cache.c > @@ -719,6 +719,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, > if (S_ISDIR(st_mode)) { > if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0) > return error(_("'%s' does not have a commit checked out"), path); > + if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo)) > + return error(_("object format %s of '%s' is incompatible with this repository (%s)"), hash_algos[oid.algo].name, path, the_repository->hash_algo->name); > while (namelen && path[namelen-1] == '/') > namelen--; > } > > Of course that is strictly worse for somebody who is relying on the > current accidental behavior. ;) And in the long run, I think this is the > spot we'd want to hook to do whatever massaging we need (whether > converting to the equivalent in-repo algorithm, or hacking up the name > to store the foreign hash). I think sending in this patch is a good first step right now. The intention of the transition document is that the code is in one and only one hash algorithm at a time and it would be better to reject this case until we're ready to wire it up correctly than end up with data that's corrupt down the line. > I also won't be at all surprised if you've run across this already in > your interop work. I have not, since it's primarily involved working with the testsuite and some test repositories. I know that submodules are broken in interop mode; I didn't need to test that until I was writing code to make them not-broken. -- brian m. carlson (they/them) Toronto, Ontario, CA [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 262 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 1/1] Define an extended tree format 2025-10-01 21:45 ` brian m. carlson @ 2025-10-01 23:00 ` Jeff King 0 siblings, 0 replies; 10+ messages in thread From: Jeff King @ 2025-10-01 23:00 UTC (permalink / raw) To: brian m. carlson; +Cc: git On Wed, Oct 01, 2025 at 09:45:29PM +0000, brian m. carlson wrote: > > Of course that is strictly worse for somebody who is relying on the > > current accidental behavior. ;) And in the long run, I think this is the > > spot we'd want to hook to do whatever massaging we need (whether > > converting to the equivalent in-repo algorithm, or hacking up the name > > to store the foreign hash). > > I think sending in this patch is a good first step right now. The > intention of the transition document is that the code is in one and only > one hash algorithm at a time and it would be better to reject this case > until we're ready to wire it up correctly than end up with data that's > corrupt down the line. OK. I'll try to clean it up and add some tests, but it may not be until next week. -Peff ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 1/1] Define an extended tree format 2025-10-01 21:11 ` Jeff King 2025-10-01 21:19 ` Junio C Hamano 2025-10-01 21:45 ` brian m. carlson @ 2025-10-01 22:59 ` Jeff King 2 siblings, 0 replies; 10+ messages in thread From: Jeff King @ 2025-10-01 22:59 UTC (permalink / raw) To: brian m. carlson; +Cc: git On Wed, Oct 01, 2025 at 05:11:40PM -0400, Jeff King wrote: > But there were two unexpected bits. The first is that the above doesn't > just need a way to store the foreign-hash oid in a tree. It needs to be > in the index, too. > > We could perhaps use the same hackery there, storing sha1-...-foo in the > index with the sentinel hash. Or perhaps we could extend the index to > support holding foreign hashes. I'm not sure which would be less > annoying. If the index and trees don't match, then tree-to-index > comparisons get weird. If the index and filesystem don't match, then > index to filesystem comparisons get weird. My gut feeling is that making > index-to-filesystem comparisons weird (so putting the hacked name into > the index) is probably going to be better, just because the filesystem > is already out of our control. So we can do sorted comparisons between > the index and trees, but filesystem operations are inherently asking > about paths one by one. But I'd have to dig deeper to get more confident > in that opinion. I know everybody was on the edge of their seats, so here's how far I got (I will probably be offline for travel for a day or two after this). This patch generates the hacked-up index entry when you do a "submodule add" (or even a regular "git add"). I suspect it is missing some corner cases (in particular, I doubt that it would correctly update with "git add -u" because the actual filesystem path appears untracked from the naive perspective of the index). It is not too ugly, I think: diff --git a/read-cache.c b/read-cache.c index 06ad74db22..77072df2f7 100644 --- a/read-cache.c +++ b/read-cache.c @@ -708,6 +708,7 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, (intent_only ? ADD_CACHE_NEW_ONLY : 0)); unsigned hash_flags = pretend ? 0 : INDEX_WRITE_OBJECT; struct object_id oid; + struct strbuf submodule_foreign_path = STRBUF_INIT; if (flags & ADD_CACHE_RENORMALIZE) hash_flags |= INDEX_RENORMALIZE; @@ -719,6 +720,16 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, if (S_ISDIR(st_mode)) { if (repo_resolve_gitlink_ref(the_repository, path, "HEAD", &oid) < 0) return error(_("'%s' does not have a commit checked out"), path); + if (oid.algo != hash_algo_by_ptr(the_repository->hash_algo)) { + /* XXX maybe need to remove entry for "path"? */ + strbuf_addf(&submodule_foreign_path, "%s-%s-%s", + hash_algos[oid.algo].name, + oid_to_hex(&oid), + path); + path = submodule_foreign_path.buf; + namelen = submodule_foreign_path.len; + } + while (namelen && path[namelen-1] == '/') namelen--; } @@ -768,7 +779,20 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st, } } if (!intent_only) { - if (index_path(istate, &ce->oid, path, st, hash_flags)) { + /* + * We actually re-resolve submodules a second time in + * index_path(). That's perhaps something we should + * avoid in general (because it's racy), but it's + * particularly important for our submodule name + * hackery, since "path" is no longer something exists + * in the filesystem at all, and we want to override it with + * our fake sentinel OID anyway. + */ + if (submodule_foreign_path.len) { + /* probably a constant would be less horrific */ + oidclr(&ce->oid, the_repository->hash_algo); + ce->oid.hash[the_repository->hash_algo->rawsz - 1] = 1; + } else if (index_path(istate, &ce->oid, path, st, hash_flags)) { discard_cache_entry(ce); return error(_("unable to index file '%s'"), path); } But then I took a look at the reading side, and it gets pretty ugly. Often we have just a submodule name, and we end up referencing it in the index. But of course we don't know if we need to use the hacked-up name to do the lookup or not, because it depends on the sentinel hash, but we don't always have a hash yet! So you'd have to speculatively look up both names in lots of places. So I think I do lean more towards the index having the full, correct information, and just doing this hack at the tree level (and munging as we read/write the trees). Or of course deciding it's all too horrible and abandoning it. All of this was meant to be an exploration. I do think it makes a lot of potential problems and hassles go away if we can intermix submodules of different hash algorithms. I'm undecided on whether the solution is worth than the problem, though. -Peff ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 1/1] Define an extended tree format 2025-10-01 0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson 2025-10-01 16:37 ` Junio C Hamano 2025-10-01 17:41 ` Jeff King @ 2025-10-01 21:21 ` Elijah Newren 2 siblings, 0 replies; 10+ messages in thread From: Elijah Newren @ 2025-10-01 21:21 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Jeff King On Tue, Sep 30, 2025 at 5:58 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > +Following that is a modified BER-encoded integer representing the type of > +object and a modified BER-encoded integer consisting of flags. > + > +The object type is one of the following: > + > +0:: invalid > +1:: conflict tree (only valid at top level) I think there may have been some miscommunication. There was discussion around "at the toplevel" and "trees," but I believe "at the toplevel" was meant to refer to the commit itself -- specifically, for each commit, at the highest level of the objects it references (i.e., the commit object). The idea is that conflict information would be represented by multiple traditional toplevel trees (or possibly even commits?) recorded in the commit object, with no conflicts recorded within any individual tree. Instead, the existence of multiple distinct toplevel trees is what would signal a conflict. Recording conflict information in trees rather than commits, as this proposal suggests (even if only at the toplevel tree), introduces a problem similar to what Martin mentioned yesterday regarding blobs. In particular, doing so would require every `git log` operation -- which wants to determine if a commit is conflicted so it can highlight it for users -- to walk not only the commits but also additional objects. Even if limited to just one extra level for toplevel trees, it effectively doubles the objects traversed for non-diffing log operations, impacting performance for everyone, not just those using conflict headers. I’d prefer to avoid imposing a performance penalty on users who don’t use any new features associated with first-class conflicts. Additionally, I believe this approach would diverge from how jj and GitButler record conflicted commits. I’d rather follow the path they paved to benefit from their expertise and minimize (if not eliminate) any possible rework they need for continued interoperability with Git versions that gain features they first implemented. We may find reason to revisit this as things progress, but for now, I don’t conflicted trees needs to be included in your modified tree proposal. ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-10-01 23:00 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-10-01 0:58 [RFC PATCH 0/1] Extended tree format for mixed submodules and conflicts brian m. carlson 2025-10-01 0:58 ` [RFC PATCH 1/1] Define an extended tree format brian m. carlson 2025-10-01 16:37 ` Junio C Hamano 2025-10-01 17:41 ` Jeff King 2025-10-01 21:11 ` Jeff King 2025-10-01 21:19 ` Junio C Hamano 2025-10-01 21:45 ` brian m. carlson 2025-10-01 23:00 ` Jeff King 2025-10-01 22:59 ` Jeff King 2025-10-01 21:21 ` Elijah Newren
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).