From: Jeff King <peff@peff.net>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: git@vger.kernel.org
Subject: Re: [RFC PATCH 1/1] Define an extended tree format
Date: Wed, 1 Oct 2025 13:41:10 -0400
Message-ID: <20251001174110.GA137600@peff.net>
In-Reply-To: <20251001005814.846992-2-sandals@crustytoothpaste.net>
On Wed, Oct 01, 2025 at 12:58:14AM +0000, brian m. carlson wrote:
> There are some cases in which we want to encode additional information
> in a tree but there is currently no possible way to do so. Define a
> format for extended trees that uses mode 130000 plus some additional
> nonzero bytes in the file name to encode additional data in a mostly
> backwards compatible way.
Thanks for writing this up. This was along the lines I was thinking, but
I have a few variant suggestions to consider.
Your proposal here covers a very general extension format which can be
used for solving several possible problems. But if we constrain
ourselves just to the gitlink question, it does open up some more
possibilities. Here you're using a custom mode to mark the entry, and
the downside is that the mode will be somewhat confusing to old versions
of Git. They won't recognize a gitlink as such, and their fsck will
complain about it. It also leaves the question of what goes into the
hash/object-id field of the entry. We can't represent the true hash
(that is the whole problem we are trying to solve). You could put in the
null sha1, but that is another fsck landmine. So probably we have some
dummy hash. But old versions, not knowing it's a submodule, will expect
that hash to be reachable and will complain that it's missing!
What if instead, we leave the mode as S_IFGITLINK, but use a dummy hash
to recognize the extended submodule? Let's say 0000...1. We risk
colliding with a real hash, but a collision is extremely unlikely, and
it's a risk we already take with the null oid. With this scheme, the old
versions will still realize it's a submodule, and that we do not need to
have access to that hash. If an old version checks it out without
submodule recursion, we'll never even try to access that hash. And if it
does, it will try to clone the appropriate gitmodule and say "oops, we
don't have that hash". Which is probably the best outcome we can hope
for.
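
As a rough illustration of the check described above (a sketch, not
Git's actual code): the function names and the exact sentinel layout
(all zero bytes followed by a single 0x01) are assumptions made for this
example. Only the S_IFGITLINK mode value, 0160000, is Git's own.

```python
S_IFGITLINK = 0o160000  # mode Git uses for submodule (gitlink) entries


def sentinel_oid(rawsz: int) -> bytes:
    """Hypothetical sentinel oid: rawsz - 1 zero bytes followed by 0x01."""
    return b"\x00" * (rawsz - 1) + b"\x01"


def is_extended_gitlink(mode: int, oid: bytes) -> bool:
    """True if the entry is a gitlink whose oid is the sentinel value."""
    return mode == S_IFGITLINK and oid == sentinel_oid(len(oid))


# A sha256 tree stores 32 raw bytes per oid.
print(is_extended_gitlink(0o160000, sentinel_oid(32)))
```

An old reader that knows nothing about the sentinel still sees mode
160000 and treats the entry as an ordinary submodule.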
The obvious downside is that this does nothing to help the
stored-conflict case. _If_ we are going to come up with a solution for
that, I agree it might make sense to piggy-back on it for submodules.
But I do think what I outlined above degrades a bit more gracefully. And
the compatibility scenarios for these two use cases may be different.
Alternate-hash submodule entries will be buried in old history and seen
by many people. Conflict markers are probably shorter-lived and less likely
to be seen by people who aren't using modern versions of Git to work
with them.
The other way in which your proposal differs from what I was thinking is
in the use of BER in the filename. It is nice that it can unambiguously
encode a chunk of data, but the fact that it contains binary bytes makes
me worried about a few things:
  1. When users see it, does it look like junk? It is nice, IMHO, if the
     name does leak out to the user for it to be a bit more
     self-explanatory.
  2. When a filesystem sees it, can it handle it? I didn't think hard on
     it, but I'd guess BER generally isn't valid UTF-8. What will HFS+
     do with such a filename?
  3. To what degree can BER encoded bytes conflict with meaningful
     path names? In particular I wonder if "/" can appear, which will
     cause confusion (both in git-fsck, but also when we try to work in
     a directory that does not exist). Or worse, if you can sneak
     ".git/" into a name in a way that is interpreted by _some_ versions
     of Git but not others.
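
Concerns 2 and 3 can be made concrete with a small sketch. It assumes a
plain big-endian base-128 variable-length integer of the kind BER uses;
the encoding in the actual proposal may differ (it restricts itself to
nonzero bytes, for one), so take this only as an illustration of the
general risk. The helper names are made up for the example.

```python
def base128(n: int) -> bytes:
    """Big-endian base-128; every byte but the last has the high bit set."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))


def is_utf8(data: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(base128(47))            # the single byte 0x2f, i.e. "/"
print(is_utf8(base128(200)))  # 0x81 0x48 starts with a stray continuation byte
```

Under that assumption, the value 47 encodes to a literal "/" and the
value 200 encodes to a byte sequence that is not valid UTF-8.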
So I was thinking instead of something more ASCII-ish. Something like
"<link_type>-<link_data>-<pathname>". So in a sha256 tree, an entry for
a sha1 submodule might look like (as shown by ls-tree):

  160000 00000000000000000000000000000001 sha1-12345abcde12345abcde12345abcde12345abcde-the-real-path

The sentinel sha256 oid "000...1" would be binary in the tree, of
course, but the sha1 in the filename would be hex. It's not as
efficient, but I don't think that submodules are common enough for this
to be a real issue.
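
A minimal sketch of how such a name could be taken apart, assuming
link_type and link_data never contain a dash (true for hex, and for
link_data that has been percent-encoded). The type and function names
are hypothetical.

```python
from typing import NamedTuple


class ExtendedName(NamedTuple):
    link_type: str
    link_data: str
    pathname: str


def parse_extended_name(name: str) -> ExtendedName:
    """Split "<link_type>-<link_data>-<pathname>" on the first two dashes."""
    parts = name.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not an extended entry name: {name!r}")
    return ExtendedName(*parts)


entry = parse_extended_name(
    "sha1-12345abcde12345abcde12345abcde12345abcde-the-real-path"
)
print(entry.link_type, entry.pathname)
```

Splitting only twice means dashes in the real path survive untouched.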
I was vague about "link_data" above. But I think if it is interpreted in
a manner specific to link_type, then this scheme could also eventually
enable non-git submodules. E.g., for the true masochist, you could
imagine svn-<url>-foo (with <url> percent-encoded to avoid dashes and
slashes). I'm not sure if anyone would ever find that useful, and it's
not something I'd plan to look into myself, but maybe it's a bonus?
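
For the percent-encoding part, here is one possible sketch. Python's
urllib.parse.quote() never escapes "-", so the dash has to be handled
explicitly. The function names and the example URL are made up.

```python
from urllib.parse import quote, unquote


def encode_link_data(url: str) -> str:
    """Percent-encode a URL so it contains neither "/" nor "-"."""
    # safe="" makes quote() escape "/" as well; "-" is replaced by hand.
    return quote(url, safe="").replace("-", "%2D")


def decode_link_data(data: str) -> str:
    """Reverse encode_link_data()."""
    return unquote(data)


url = "https://example.com/my-repo/trunk"
encoded = encode_link_data(url)
print("svn-" + encoded + "-foo")
print(decode_link_data(encoded) == url)
```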
The other issue of concern is sorting. If sha1-...-foo sorts as "foo",
then older versions will complain the tree is wrongly sorted. And
probably will produce slightly wrong diffs. I think modern Git would
Just Work in most spots, though, provided the handling is done in
decode_tree_entry() or similar.
Sorting the other way, as "sha1-", is mostly going to be a nightmare for
modern Git. If we want it to truly show "foo" in a diff, for example, it
would have to scan forward for any instance of "sha1-...-aaa" before
showing "aab". Yuck.
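
To make the two orderings concrete, here is a small sketch. It compares
names bytewise and ignores Git's special handling of directories, and
it recognizes extended entries purely by a "sha1-" prefix, which is a
simplification for the example (a real implementation would look at the
mode and oid). The helper name is hypothetical.

```python
def real_path(name: str) -> str:
    """Return the trailing pathname of an extended entry, else the name."""
    parts = name.split("-", 2)
    if len(parts) == 3 and parts[0] == "sha1":
        return parts[2]
    return name


names = ["zzz", "sha1-" + "12345abcde" * 4 + "-aaa", "aab"]

# Order an old Git expects: bytewise on the stored name.
raw_order = sorted(names, key=lambda n: n.encode())
# Order a diff would want: bytewise on the path the user sees.
path_order = sorted(names, key=lambda n: real_path(n).encode())

print(raw_order)
print(path_order)
```

In raw order the "aaa" submodule comes after "aab", which is exactly the
scan-forward problem described above.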
But what if we accepted that these entries were a bit more user-visible?
That is, the diff itself would not do anything special at all, and you
really would get a delete/add for:

  -sha1-12345abcde12345abcde12345abcde12345abcde-foo
  +sha1-67890fedcb67890fedcb67890fedcb67890fedcb-foo

instead of the more correct path-changing diff. And then either we
accept that ugliness, or we fix up the diff for display via diffcore or
similar (dropping those ugly entries and replacing them with a single
diff_filepair of "foo").
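
The display fix-up could look roughly like the sketch below. It is not
diffcore's API: changes are modeled as plain (status, name) tuples, and
extended entries are again recognized by a "sha1-" prefix purely for
illustration.

```python
def split_extended(name: str):
    """Return (link_data, pathname) for an extended entry, else None."""
    parts = name.split("-", 2)
    if len(parts) == 3 and parts[0] == "sha1":
        return parts[1], parts[2]
    return None


def collapse(changes):
    """Fold delete/add pairs of extended entries into one change per path."""
    deleted, added, result = {}, {}, []
    for status, name in changes:
        ext = split_extended(name)
        if ext and status == "D":
            deleted[ext[1]] = ext[0]
        elif ext and status == "A":
            added[ext[1]] = ext[0]
        else:
            result.append((status, name))
    for path in deleted.keys() | added.keys():
        if path in deleted and path in added:
            result.append(("M", path))  # same path, different link data
        elif path in deleted:
            result.append(("D", path))
        else:
            result.append(("A", path))
    return sorted(result, key=lambda change: change[1])


print(collapse([
    ("D", "sha1-" + "12345abcde" * 4 + "-foo"),
    ("A", "sha1-" + "67890fedcb" * 4 + "-foo"),
    ("M", "bar"),
]))
```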
And likewise the submodule code would leave these ugly entries until the
moment it is ready to act on them, when it would realize that we want to
touch "foo" in the checkout (and both here and in a diff, obviously it
is an error if there is a "foo" already, but that is true already for
any tree with duplicate entries).
My gut feeling is that would keep the gross-ness confined to a few bits
of code, and you could still operate on these natively with tools like
mktree, ls-tree, and so on. The ugly names will leak out to the user
sometimes, but most porcelain flows would do the right thing.
-Peff