From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: SHA-1/SHA-256 interoperability work is functional
Date: Tue, 16 Jun 2026 21:31:27 +0000
Message-ID: <ajHAr5XL3Tzery7J@fruit.crustytoothpaste.net>
In-Reply-To: <xmqqldce2r1a.fsf@gitster.g>
On 2026-06-16 at 20:01:37, Junio C Hamano wrote:
> Not that I specifically care about packfile URI, but this one is
> curious. How would regular "fetch" and "push" traffic work under
> the new world order? Presumably we will keep one characteristic of
> the protocol, that the packdata stream is the only thing that is
> given to the other side and no object names are given, because the
> receiving end would not want to blindly trust the object name the
> sending end _claims_ to have sent and instead recomputes the object
> name out of the packed objects in the data stream ("if we rehash
> and recompute the object names from the datastream, the other side
> cannot lie to us" IIRC was a security measure).
>
> For a regular "fetch" and "push" to work, we would need to recompute
> the native object names and also somehow compute the compatibility
> object names if we are in interoperability mode, no?
>
> If we download *.pack files from a packfile URI, wouldn't it be the
> same story?
Let me explain how conversion works. Say you have an empty local
repository with SHA-256 as the main algorithm and SHA-1 as the
compatibility algorithm, plus a SHA-1 remote. When you do `git fetch`,
you get a SHA-1 pack. You cannot write this into the repository because
your repository doesn't use SHA-1 as the main algorithm, so `git
index-pack` takes the pack and maps any objects. If the objects are in
the new pack, they get rewritten based on the dependencies; otherwise,
Git uses the existing maps in the repository to rewrite the objects.
`git index-pack` then writes a completely new SHA-256 pack with an index
containing the SHA-1 mapping, using the corresponding deltas[0].
However, `git index-pack` can only index and map objects for one pack at
a time. We therefore need any pack that we get to be connected to our
existing history so that we can rely on our existing maps to remap
objects that are not in the pack. For instance, if we get a commit
without its parents, then we'll simply die because those objects cannot
be mapped and we can't write the mapping in the index.
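
As a rough illustration only (this is a toy model, not Git's actual code or object format), the remapping described above can be sketched as follows. The names `object_id` and `convert_pack`, and the representation of an object as a payload plus a list of referenced IDs, are assumptions made for the example.

```python
import hashlib

def object_id(algo, payload, refs):
    """Name a toy object: hash of its payload plus the IDs it refers to."""
    h = hashlib.new(algo)
    h.update(payload)
    for ref in refs:
        h.update(b"\0" + ref.encode())
    return h.hexdigest()

def convert_pack(pack, existing_map):
    """Rewrite a toy SHA-1 pack into SHA-256 objects.

    pack:         {sha1_id: (payload, [sha1 IDs of referenced objects])}
    existing_map: {sha1_id: sha256_id} for objects already in the repository
    Returns (new_objects, new_map). Raises KeyError if a referenced object
    is neither in the pack nor in existing_map.
    """
    new_map = {}
    new_objects = {}

    def map_id(sha1_id):
        if sha1_id in new_map:
            return new_map[sha1_id]
        if sha1_id not in pack:
            # Not in this pack: fall back to the repository's existing map.
            return existing_map[sha1_id]
        payload, refs = pack[sha1_id]
        # Dependencies are mapped first so the rewritten object can refer
        # to them by their SHA-256 names.
        new_refs = [map_id(ref) for ref in refs]
        sha256_id = object_id("sha256", payload, new_refs)
        new_map[sha1_id] = sha256_id
        new_objects[sha256_id] = (payload, new_refs)
        return sha256_id

    for sha1_id in pack:
        map_id(sha1_id)
    return new_objects, new_map

# A two-object history: a "commit" that refers to a "blob".
blob = (b"hello", [])
blob_sha1 = object_id("sha1", *blob)
commit = (b"commit", [blob_sha1])
commit_sha1 = object_id("sha1", *commit)

objects, mapping = convert_pack({commit_sha1: commit, blob_sha1: blob}, {})
```

Calling `convert_pack({commit_sha1: commit}, {})`, that is, with the commit but neither the blob nor a map entry for it, raises `KeyError`. That is the toy equivalent of the "commit without its parents" failure described above.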
The problem is that packfile URIs result in multiple packs (one of
which is the dynamically generated protocol pack) that, _in total_,
provide a complete history with what we have, but are not necessarily
individually connected to our existing history. Moreover, the
dynamically generated protocol pack is sent _first_, so if we have
packfile URIs, that pack is almost certainly guaranteed _not_ to connect
to our existing history. We would therefore have to pause index-pack,
download all the packfile URIs, index those packfiles (which would have
to be connected to our history), and then unpause index-pack to rewrite
the history. This is not impossible, but it's tricky, and it has yet to
be implemented. Someone may decide that this is a valuable feature and
implement it, but it's not on my to-do list.
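
The ordering problem can be shown with a small sketch. It is a simplified model under stated assumptions, not how index-pack is implemented: a pack is a dictionary from object ID to the IDs it references, and `is_connected`, `index_in_order`, and the pack contents are hypothetical names and data chosen for the example.

```python
def is_connected(pack, known):
    """True if every reference in `pack` points into the pack or into `known`."""
    return all(ref in pack or ref in known
               for refs in pack.values()
               for ref in refs)

def index_in_order(packs, known):
    """Index packs one at a time, failing on the first unconnected pack."""
    known = set(known)
    for name, pack in packs:
        if not is_connected(pack, known):
            raise RuntimeError(f"{name}: references objects that cannot be mapped")
        # Once indexed, the pack's objects can be used to map later packs.
        known.update(pack)
    return known

# Hypothetical data: the packfile URI pack holds the older history, and the
# dynamically generated protocol pack builds on top of it.
uri_pack = {"c1": [], "c2": ["c1"]}
protocol_pack = {"c3": ["c2"]}

# Deferring the protocol pack until the URI pack is indexed succeeds.
all_objects = index_in_order(
    [("uri", uri_pack), ("protocol", protocol_pack)], known=set())
```

Passing the packs in arrival order, protocol pack first, raises `RuntimeError` in this model because `c3` refers to `c2`, which has not been indexed yet. Reordering the work is the "pause and unpause" step described above.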
This doesn't pose a problem with single-algorithm repositories because
if you have unreferenced and unconnected objects, no big deal. They
just don't get used and will eventually get GC'd. But since we can't
map those objects in a multi-algorithm world, that's fatal and those
packs can't be indexed.
> > * Large object promisors cannot be used if the server does not actually have
> > the entire history, since the server must have a complete history in order to
> > provide object mappings.
>
> Again, this one worries me a bit, but perhaps I am not reading it
> correctly. Does this mean that the server side says "this is the
> data for object whose name is X in the SHA-1 world, which translates
> to X256 in the SHA-256 world", the receiving end blindly trusts
> without having a way to verify?
The server provides algorithm mappings for submodules, shallow
clones, and partial clones. For shallow clones and partial clones, you
have to trust the server anyway because you're already getting a
truncated history. If you complete the history by fetching the missing
objects and run `git fsck`, then it will detect if the mapping is
invalid because the server was dishonest and complain. You will have a
corrupt set of mappings to the compatibility algorithm, but those could
theoretically be repaired.
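
To illustrate why a dishonest mapping becomes detectable once the missing objects are present, here is a minimal sketch restricted to blobs. Blobs are the simple case because the same bytes are hashed under both algorithms; trees and commits embed other object IDs and would need rewriting before hashing. The function `bad_mappings` and the sample data are assumptions for the example and do not reflect how `git fsck` is implemented.

```python
import hashlib

def blob_id(algo, data):
    """Git-style blob name: hash of the header 'blob <size>' + NUL + data."""
    return hashlib.new(algo, b"blob %d\0" % len(data) + data).hexdigest()

def bad_mappings(claimed, fetched_blobs):
    """Return the claimed SHA-256 -> SHA-1 pairs contradicted by fetched data."""
    bad = {}
    for data in fetched_blobs:
        sha256_id = blob_id("sha256", data)
        sha1_id = blob_id("sha1", data)
        # With the content in hand, both names can be recomputed locally,
        # so the server's claim no longer has to be trusted.
        if sha256_id in claimed and claimed[sha256_id] != sha1_id:
            bad[sha256_id] = claimed[sha256_id]
    return bad

data = b"fixed\n"
honest = {blob_id("sha256", data): blob_id("sha1", data)}
# A dishonest server pairs the SHA-256 name with the SHA-1 of other content.
dishonest = {blob_id("sha256", data): blob_id("sha1", b"vulnerable\n")}

honest_result = bad_mappings(honest, [data])
dishonest_result = bad_mappings(dishonest, [data])
```

Here `honest_result` is empty, while `dishonest_result` contains the one contradicted entry.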
However, in order for the server to produce those mappings, it has to
know the entire history. If there are objects that are outside the
repository in a secondary location, the server will not have mappings
for those objects and so it will abort the protocol.
The server does not normally provide mappings for non-submodules if
you're doing a regular fetch or clone, since the client has a
self-contained history and does not need those objects to compute the
mapping. That means that regular clones and fetches work just fine
against existing servers as long as no submodules are involved[1].
The tricky part is submodules. Because the data is in a separate
repository, we cannot be certain of the mapping. The documentation says
this:
There is a potential security problem with providing mappings of
submodules over the protocol. Namely, there is no way to guarantee
that the SHA-1 object ID and the SHA-256 object ID correspond to the
same commit. This means that, for example, a malicious server could
provide a SHA-256 object ID for a submodule that was up to date with
all security fixes, but map that to a SHA-1 object ID for an older
commit with security problems.
We therefore reject submodule mappings if fsck verification for
transferred objects is enabled unless the user has explicitly enabled
submodule mappings.
[0] If A deltas against B in SHA-1, then when those are rewritten into
SHA-256, we delta the SHA-256 A against the SHA-256 B. This does not
guarantee the best possible delta, but it is much cheaper than
redeltifying and because we expect remapped objects to have the same
shape, it should delta well enough in most cases.
[1] For instance, if you build my branch, you can do
`git clone --object-format=sha256:sha1 https://github.com/bk2204/lawn.git`
and it just works since there are no submodules. My dotfiles, on the
other hand, have submodules and will not work without protocol support.
--
brian m. carlson (they/them)
Toronto, Ontario, CA