SHA-1/SHA-256 interoperability work is functional

From: brian m. carlson @ 2026-06-16  0:17 UTC
  To: git

I'm pleased to announce that I have Git fully passing the testsuite and
CI in interoperability mode, both with SHA-256 and SHA-1 as the main
algorithm.  While this is very exciting, the work is not ready to send
to the list and is effectively a draft, since there is still cleanup
and efficiency work to be done.

What is fully functional:

* The testsuite.
* The protocol, including extensions for mapping objects to support
  shallow clones, submodules, and partial clones and interoperability
  with remotes of different algorithms.
* A full complement of functionality for everyday use cases.

Features which are currently unsupported (and which may or may not be
supported in the future):

* Filtered bundles are unsupported because there is currently no way to
  provide a mapping.
* Multi-pack index cannot be used as the sole pack index format because
  it does not yet provide mappings.
* Pack index v1 and v2 cannot be used because they do not provide object
  mappings.  Git automatically uses pack index v3 instead when
  necessary, which does handle mappings.
* Packfile URIs are not supported because the protocol-provided packfile
  is not complete and its objects cannot be mapped.
* Large object promisors cannot be used if the server does not actually
  have the entire history, since the server must have a complete history
  in order to provide object mappings.
* `git fast-import` does not accept submodules in compatibility mode
  because there is no provision for mappings.
* Remote helpers do not emit signatures in the compatibility algorithm
  for signed tags.
* The WebDAV-based HTTP protocol doesn't support interoperability due to
  the lack of a way to distribute mappings.

Some additional things that may need to be improved:

* We have some recursive delta resolution code in `git index-pack` that
  will need to be made iterative to avoid stack overflows.
* We need to batch object maps whenever we write them, since having too
  many causes `git gc` to kick off frequently (which can be seen in some
  tests).  This will require substantial refactoring of code like `git
  add`, since any time we write an object we must be sure to always
  write the mapping (even if we `die`).
* We will probably want to move the object map repacking code out of
  `git gc` into a separate command that we can call manually.
* We may want some debugging tools for pack index v3 and other data
  formats so that we can show the mapping of objects.
* We will probably want to be able to sign in only one algorithm instead
  of always in both.  Users may not want to sign the SHA-1 format for
  security reasons.
* There is some extremely basic code for `^{sha1}` and `^{sha256}` to
  help `git rev-list` perform connectivity checks for remotes using the
  compatibility algorithm, but it is very much incomplete and will need
  to be completed or fenced off.
* Operations can be rather slow in some cases and we'll want to see what
  speedups we can perform.
* Probably other things I have forgotten.
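
As an illustration of the first item above (recursive delta resolution
made iterative), here is a minimal sketch in Python.  It is not the
`git index-pack` code: the object model, `resolve_recursive`, and
`resolve_iterative` are all hypothetical, and the "delta" is a plain
append rather than Git's copy/insert instructions.  The point is only
that an explicit list can replace the call stack, so chain depth no
longer matters.

```python
from typing import Optional

# Toy model (an assumption for illustration): every object is a pair
# (base_name or None, payload), and an object's content is its base's
# content followed by its own payload.
Objects = dict[str, tuple[Optional[str], bytes]]


def resolve_recursive(name: str, objects: Objects) -> bytes:
    """One stack frame per link in the chain; deep chains exhaust the stack."""
    base, payload = objects[name]
    if base is None:
        return payload
    return resolve_recursive(base, objects) + payload


def resolve_iterative(name: str, objects: Objects) -> bytes:
    """Same result, but the chain is kept in a list instead of the call stack."""
    chain = []
    current: Optional[str] = name
    while current is not None:
        base, payload = objects[current]
        chain.append(payload)
        current = base
    # Payloads were collected from the tip down to the base; apply them
    # starting from the base.
    return b"".join(reversed(chain))


# A delta chain 5000 objects deep, well past CPython's default recursion limit.
objects: Objects = {"o0": (None, b"x")}
for i in range(1, 5000):
    objects[f"o{i}"] = (f"o{i - 1}", b"x")

content = resolve_iterative("o4999", objects)
```

On a default CPython build, `resolve_recursive("o4999", objects)` would
typically fail with `RecursionError`, which is the Python analogue of
the stack overflow mentioned above.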

There is new documentation in `Documentation/gitformat-hash.adoc` that
outlines the requirements for using the protocol.  The protocol
restrictions described there are hard technical limitations that cannot
be avoided; I've intentionally made things as featureful as they can be.
This imposes real restrictions on using protocol interoperability with many
projects, including Git and Linux[0].  Interested parties may wish to look
at t1017 to see what's tested vis-à-vis the protocol and
interoperability.

At the end of the series is a small amount of new Rust code that moves
us in the direction of object file conversion in Rust.  All the pointer
arithmetic makes me very nervous from a security perspective, especially
in network-facing code, so my hope is to eventually port that code over.
Note that Rust is already required for the interoperability work, so
this doesn't add any new dependencies.

I also have some unpublished code that could start on in-place migration
functionality, much like what `git refs migrate` does, once the Rust
prerequisites above are ready.  This could also make it possible at some
point to migrate any relevant submodules as part of the migration of the
main repository.  I may or may not complete this work, but perhaps
someone else will want to pick it up if I do not.  I'm certain that such
code would be valuable to many people, including forges.

I intend to rebase and tidy this work for at least a little while, but
don't actually intend to send it upstream, since there are some
technical limitations that prevent me from doing so.  I have, however,
been in contact with someone (who may identify themselves if they
choose) who is interested in getting some of the polishing and
upstreaming work done, which I deeply appreciate.

If you're interested in testing or perusing the work, you may get it
from the `sha256-interop` branch of https://github.com/bk2204/git.git.
Please note that it may be rebased, rewound, or otherwise folded,
spindled, or mutilated at any time.

Even though the testsuite is passing earlier than I expected, I don't
expect it to make Git 3.0, nor do I think we should delay Git 3.0 for
this work.  There are approximately 200 patches currently (and more to
come if we add in-place migration tooling), so it seems very unlikely
that we could get the entire series upstream in any reasonable amount of
time.  We will also very much want to give this time in an experimental
state so people can try it out and report back on things that should be
improved, which is further evidence that it's not right for Git 3.0.

I'm happy to answer any further questions if folks have any.

[0] Git uses submodules, which would need to be rewritten first in order
to migrate, and Linux uses mergetags for which the tagged object is not
in the history (which makes mapping the commit containing it fail).
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA


* Re: SHA-1/SHA-256 interoperability work is functional
From: Junio C Hamano @ 2026-06-16 20:01 UTC
  To: brian m. carlson; +Cc: git

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I'm pleased to announce that I have Git fully passing the testsuite and
> CI in interoperability mode, both with SHA-256 and SHA-1 as the main
> algorithm.  While this is very exciting, the work is not ready to send
> to the list and is effectively a draft, since there is still cleanup
> and efficiency work to be done.

Great to hear about a great milestone.

> Features which are currently unsupported (and which may or may not be
> supported in the future):
>
> * Filtered bundles are unsupported because there is currently no way to provide
> 	a mapping.
> * Multi-pack index cannot be used as the sole pack index format because it does
> 	not yet provide mappings.
> * Pack index v1 and v2 cannot be used because they do not provide object
> 	mappings.  Git automatically uses pack index v3 instead when necessary, which
> 	does handle mappings.

> * Packfile URIs are not supported because the protocol-provided packfile is not
> 	complete and its objects cannot be mapped.

Not that I specifically care about packfile URI, but this one is
curious.  How would regular "fetch" and "push" traffic work under
the new world order?  Presumably we will keep one characteristic of
the protocol, that the packdata stream is the only thing that is
given to the other side and no object names are given, because the
receiving end would not want to blindly trust the object name the
sending end _claims_ to have sent and instead recomputes the object
name out of the packed objects in the data stream ("if we rehash
and recompute the object names from the datastream, the other side
cannot lie to us" IIRC was a security measure).

For a regular "fetch" and "push" to work, we would need to recompute
the native object names and also somehow compute the compatibility
object names if we are in interoperability mode, no?

If we download *.pack files from a packfile URI, wouldn't it be the
same story?

> * Large object promisors cannot be used if the server does not actually have
> 	the entire history, since the server must have a complete history in order to
> 	provide object mappings.

Again, this one worries me a bit, but perhaps I am not reading it
correctly.  Does this mean that the server side says "this is the
data for object whose name is X in the SHA-1 world, which translates
to X256 in the SHA-256 world", the receiving end blindly trusts
without having a way to verify?

> There is new documentation in `Documentation/gitformat-hash.adoc` that
> outlines the requirements for using the protocol.  The protocol
> restrictions described there are hard technical limitations that cannot
> be avoided; I've intentionally made things as featureful as they can be.
> This imposes real restrictions on using protocol interoperability with many
> projects, including Git and Linux[0].  Interested parties may wish to look
> at t1017 to see what's tested vis-à-vis the protocol and
> interoperability.
> ...
> If you're interested in testing or perusing the work, you may get it
> from the `sha256-interop` branch of https://github.com/bk2204/git.git.
> Please note that it may be rebased, rewound, or otherwise folded,
> spindled, or mutilated at any time.

Sounds exciting.

Thanks.

* Re: SHA-1/SHA-256 interoperability work is functional
From: brian m. carlson @ 2026-06-16 21:31 UTC
  To: Junio C Hamano; +Cc: git

On 2026-06-16 at 20:01:37, Junio C Hamano wrote:
> Not that I specifically care about packfile URI, but this one is
> curious.  How would regular "fetch" and "push" traffic work under
> the new world order?  Presumably we will keep one characteristic of
> the protocol, that the packdata stream is the only thing that is
> given to the other side and no object names are given, because the
> receiving end would not want to blindly trust the object name the
> sending end _claims_ to have sent and instead recomputes the object
> name out of the packed objects in the data stream ("if we rehash
> and recompute the object names from the datastream, the other side
> cannot lie to us" IIRC was a security measure).
> 
> For a regular "fetch" and "push" to work, we would need to recompute
> the native object names and also somehow compute the compatibility
> object names if we are in interoperability mode, no?
> 
> If we download *.pack files from a packfile URI, wouldn't it be the
> same story?

Let me explain how conversion works.  Say you have an empty local
repository with SHA-256 as the main algorithm and SHA-1 as the
compatibility algorithm, plus a SHA-1 remote.  When you do `git fetch`,
you get a SHA-1 pack.  You cannot write this into the repository because
your repository doesn't use SHA-1 as the main algorithm, so `git
index-pack` takes the pack and maps any objects.  If the objects are in
the new pack, they get rewritten based on the dependencies; otherwise,
Git uses the existing maps in the repository to rewrite the objects.
`git index-pack` then writes a completely new SHA-256 pack with an index
containing the SHA-1 mapping, using the corresponding deltas[0].
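
The rewriting step described above can be sketched in Python for the
two simplest object types.  This is only an illustration under stated
assumptions, not the `git index-pack` implementation: `object_id` and
`convert_tree` are hypothetical names, and the map is a plain dict.
What it relies on is the standard object format: an object ID is the
hash of "<type> <size>\0" plus the body, and a tree entry is
"<mode> <name>\0" followed by the raw object ID.

```python
import hashlib


def object_id(algo: str, obj_type: str, body: bytes) -> bytes:
    """Hash "<type> <size>\\0<body>" with the given algorithm."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.new(algo, header + body).digest()


def convert_tree(sha1_tree: bytes, sha1_to_sha256: dict[bytes, bytes]) -> bytes:
    """Rewrite a SHA-1 tree body so each entry carries the SHA-256 ID."""
    out = bytearray()
    pos = 0
    while pos < len(sha1_tree):
        nul = sha1_tree.index(b"\0", pos)
        mode_and_name = sha1_tree[pos:nul + 1]    # "mode name\0"
        old_id = sha1_tree[nul + 1:nul + 21]      # 20 raw SHA-1 bytes
        # A KeyError here means the entry cannot be mapped.
        out += mode_and_name + sha1_to_sha256[old_id]
        pos = nul + 21
    return bytes(out)


# Blobs embed no object IDs, so the body is identical in both formats
# and only the hash differs.
blob = b"hello\n"
blob_sha1 = object_id("sha1", "blob", blob)
blob_sha256 = object_id("sha256", "blob", blob)
mapping = {blob_sha1: blob_sha256}

# A tree embeds the IDs of its entries, so it is rewritten using the
# mappings of the objects it depends on, then hashed again.
tree_sha1_body = b"100644 hello.txt\0" + blob_sha1
tree_sha256_body = convert_tree(tree_sha1_body, mapping)
mapping[object_id("sha1", "tree", tree_sha1_body)] = object_id(
    "sha256", "tree", tree_sha256_body
)
```

The order matters: the blob has to be mapped before the tree that
refers to it can be rewritten, which is what "rewritten based on the
dependencies" amounts to in this toy version.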

However, `git index-pack` can only index and map objects for one pack at
a time.  We therefore need any pack that we get to be connected to our
existing history so that we can rely on our existing maps to remap
objects that are not in the pack.  For instance, if we get a commit
without its parents, then we'll simply die because those objects cannot
be mapped and we can't write the mapping in the index.
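
A rough Python sketch of that failure mode, for a commit whose parent
is neither in the pack nor in an existing map.  All names here
(`convert_commit`, `UnmappableObject`, the `known` dict) are
hypothetical, and only `tree` and `parent` headers are handled;
signatures, mergetags, and anything else that embeds object data are
ignored for brevity.

```python
import re


class UnmappableObject(Exception):
    """A referenced object is in neither the pack nor the existing maps."""


# Matches only "tree <id>" and "parent <id>" header lines with SHA-1 hex IDs.
REF_HEADER = re.compile(rb"(tree|parent) ([0-9a-f]{40})")


def convert_commit(sha1_commit: bytes, known: dict[bytes, bytes]) -> bytes:
    """Rewrite the tree and parent IDs of a SHA-1 commit body into SHA-256."""
    headers, sep, message = sha1_commit.partition(b"\n\n")
    converted = []
    for line in headers.split(b"\n"):
        match = REF_HEADER.fullmatch(line)
        if match:
            kind, old_id = match.groups()
            if old_id not in known:
                raise UnmappableObject(
                    f"cannot map {kind.decode()} {old_id.decode()}"
                )
            line = kind + b" " + known[old_id]
        converted.append(line)
    return b"\n".join(converted) + sep + message


# Placeholder IDs, not real hashes.
tree_old, tree_new = b"a" * 40, b"a" * 64
parent_old, parent_new = b"b" * 40, b"b" * 64
commit = (
    b"tree " + tree_old + b"\nparent " + parent_old +
    b"\nauthor A U Thor <author@example.com> 0 +0000"
    b"\ncommitter A U Thor <author@example.com> 0 +0000\n\nmessage\n"
)

known = {tree_old: tree_new}        # the parent commit was never mapped
try:
    convert_commit(commit, known)
    outcome = "converted"
except UnmappableObject as err:
    outcome = f"die: {err}"

known[parent_old] = parent_new      # parent now available from an existing map
converted = convert_commit(commit, known)
```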

The problem is that packfile URIs result in multiple packs (one of
which is the dynamically generated protocol pack) that, _in total_,
provide a complete history with what we have, but are not necessarily
individually connected to our existing history.  Moreover, the
dynamically generated protocol pack is sent _first_, so if we have
packfile URIs, that pack is almost certainly guaranteed _not_ to connect
to our existing history.  We would therefore have to pause index-pack,
download all the packfile URIs, index those packfiles (which would have
to be connected to our history), and then unpause index-pack to rewrite
the history.  This is not impossible, but it's tricky, and it has yet to
be implemented.  Someone may decide that this is a valuable feature and
implement it, but it's not on my to-do list.
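
The ordering problem can be modelled in a few lines of Python.  This is
a toy model, not Git code: a "pack" is a hypothetical dict from object
name to the names it depends on, and `index_pack` stands in for the
one-pack-at-a-time constraint described above.

```python
def index_pack(pack: dict[str, list[str]], known: set[str]) -> None:
    """Map one pack; every dependency must be in this pack or already known."""
    missing = {
        dep
        for deps in pack.values()
        for dep in deps
        if dep not in pack and dep not in known
    }
    if missing:
        raise RuntimeError(f"cannot map objects, missing: {sorted(missing)}")
    known.update(pack)   # adds the pack's object names to the known set


def try_order(packs: list[dict[str, list[str]]]) -> str:
    """Index the packs in the given order, starting from an empty repository."""
    known: set[str] = set()
    try:
        for pack in packs:
            index_pack(pack, known)
    except RuntimeError as err:
        return f"die: {err}"
    return f"ok: {sorted(known)}"


# The pack behind the packfile URI holds the older history; the
# dynamically generated protocol pack builds on top of it.
uri_pack = {"commit1": [], "commit2": ["commit1"]}
protocol_pack = {"commit3": ["commit2"]}

protocol_first = try_order([protocol_pack, uri_pack])
uri_first = try_order([uri_pack, protocol_pack])
```

Here `protocol_first` is a failure and `uri_first` succeeds, even though
both orders deliver the same objects in total.  Pausing index-pack until
the URI packs are indexed corresponds to switching to the second order.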

This doesn't pose a problem with single-algorithm repositories because
if you have unreferenced and unconnected objects, no big deal.  They
just don't get used and will eventually get GC'd.  But since we can't
map those objects in a multi-algorithm world, that's fatal and those
packs can't be indexed.

> > * Large object promisors cannot be used if the server does not actually have
> > 	the entire history, since the server must have a complete history in order to
> > 	provide object mappings.
> 
> Again, this one worries me a bit, but perhaps I am not reading it
> correctly.  Does this mean that the server side says "this is the
> data for object whose name is X in the SHA-1 world, which translates
> to X256 in the SHA-256 world", the receiving end blindly trusts
> without having a way to verify?

The server provides algorithm mappings for submodules, shallow
clones, and partial clones.  For shallow clones and partial clones, you
have to trust the server anyway because you're already getting a
truncated history.  If you complete the history by fetching the missing
objects and run `git fsck`, then it will detect if the mapping is
invalid because the server was dishonest and complain.  You will have a
corrupt set of mappings to the compatibility algorithm, but those could
theoretically be repaired.
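
The kind of check that becomes possible once the objects are actually
present can be sketched as follows.  This is not the branch's `git fsck`
code; `object_id` and `mapping_is_honest` are hypothetical names, and
the example only covers a blob, whose body is the same under both
algorithms.  For trees and commits the body would first have to be
converted before hashing.

```python
import hashlib


def object_id(algo: str, obj_type: str, body: bytes) -> str:
    """Hex object ID: hash of "<type> <size>\\0<body>"."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.new(algo, header + body).hexdigest()


def mapping_is_honest(
    obj_type: str,
    sha256_body: bytes,
    sha1_body: bytes,
    claimed_sha256: str,
    claimed_sha1: str,
) -> bool:
    """Recompute both IDs from the data and compare with the claimed pair."""
    return (
        object_id("sha256", obj_type, sha256_body) == claimed_sha256
        and object_id("sha1", obj_type, sha1_body) == claimed_sha1
    )


fixed = b"patched version\n"
vulnerable = b"vulnerable version\n"

# An honest server maps the SHA-256 ID to the SHA-1 ID of the same content.
honest = mapping_is_honest(
    "blob", fixed, fixed,
    object_id("sha256", "blob", fixed),
    object_id("sha1", "blob", fixed),
)

# A dishonest server pairs the SHA-256 ID with the SHA-1 ID of other content.
dishonest = mapping_is_honest(
    "blob", fixed, fixed,
    object_id("sha256", "blob", fixed),
    object_id("sha1", "blob", vulnerable),
)
```

Before the missing objects are fetched there is nothing to hash, which
is why the mapping has to be taken on trust until the history is
complete.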

However, in order for the server to produce those mappings, it has to
know the entire history.  If there are objects that are outside the
repository in a secondary location, the server will not have mappings
for those objects and so it will abort the protocol.

The server does not normally provide mappings for non-submodules if
you're doing a regular fetch or clone, since the client has a
self-contained history and does not need those objects to compute the
mapping.  That means that regular clones and fetches work just fine
against existing servers as long as no submodules are involved[1].

The tricky part is submodules.  Because the data is in a separate
repository, we cannot be certain of the mapping.  The documentation says
this:

  There is a potential security problem with providing mappings of
  submodules over the protocol.  Namely, there is no way to guarantee
  that the SHA-1 object ID and the SHA-256 object ID correspond to the
  same commit.  This means that, for example, a malicious server could
  provide a SHA-256 object ID for a submodule that was up to date with
  all security fixes, but map that to a SHA-1 object ID for an older
  commit with security problems.

We therefore reject submodule mappings if fsck verification for
transferred objects is enabled unless the user has explicitly enabled
submodule mappings.

[0] If A deltas against B in SHA-1, then when those are rewritten into
SHA-256, we delta the SHA-256 A against the SHA-256 B.  This does not
guarantee the best possible delta, but it is much cheaper than
redeltifying and because we expect remapped objects to have the same
shape, it should delta well enough in most cases.
[1] For instance, if you build my branch, you can do
`git clone --object-format=sha256:sha1 https://github.com/bk2204/lawn.git`
and it just works since there are no submodules.  My dotfiles, on the
other hand, have submodules and will not work without protocol support.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA
