* Making bit-by-bit reproducible Git Bundles?
@ 2025-03-12 11:40 Simon Josefsson
2025-03-12 16:02 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Simon Josefsson @ 2025-03-12 11:40 UTC (permalink / raw)
To: git
Hi.
Thank you for the "git-archive" and "git-bundle" features, making it
easier to do source-based builds in a no-Internet environment.
I have published a Git bundle of Gnulib:
https://www.gnu.org/software/gnulib/manual/html_node/Gnulib-Git-Bundle.html
As you can see at the end, I struggle to come up with a recipe to allow
others to reproduce the git bundle that I created.
If I run the recipe above twice (including the clone), I get different
checksums. This happens even if nothing was committed to the remote
repository in the meantime.
Is it possible to create a bit-by-bit reproducible git bundle using some
other set of commands? If so, how? I'm using git 2.48.1 from Guix.
Can anyone explain what is causing the irreproducibility? Running
diffoscope is not helpful, since the bundle is compressed and diffoscope
doesn't seem to know how to untangle it.
If this is not possible today, what do you think about changes to make
this work?
Thanks,
/Simon
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-12 11:40 Making bit-by-bit reproducible Git Bundles? Simon Josefsson
@ 2025-03-12 16:02 ` Junio C Hamano
2025-03-13 3:09 ` Kyle Lippincott
2025-03-13 5:15 ` Jeff King
2 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2025-03-12 16:02 UTC (permalink / raw)
To: Simon Josefsson; +Cc: git
Simon Josefsson <simon@josefsson.org> writes:
> Can anyone explain what is causing the irreproducibility?
Multithreading?
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-12 11:40 Making bit-by-bit reproducible Git Bundles? Simon Josefsson
2025-03-12 16:02 ` Junio C Hamano
@ 2025-03-13 3:09 ` Kyle Lippincott
2025-03-13 7:59 ` Simon Josefsson
2025-03-13 5:15 ` Jeff King
2 siblings, 1 reply; 11+ messages in thread
From: Kyle Lippincott @ 2025-03-13 3:09 UTC (permalink / raw)
To: Simon Josefsson; +Cc: git
On Wed, Mar 12, 2025 at 4:59 AM Simon Josefsson <simon@josefsson.org> wrote:
>
> Hi.
>
> Thank you for the "git-archive" and "git-bundle" features, making it
> easier to do source-based builds in a no-Internet environment.
>
> I have published a Git bundle of Gnulib:
>
> https://www.gnu.org/software/gnulib/manual/html_node/Gnulib-Git-Bundle.html
>
> As you can see at the end, I struggle to come up with a recipe to allow
> others to reproduce the git bundle that I created.
>
> If I run the recipe above twice (including the clone), I get different
> checksums. This even if nothing was committed in the remote repository
> meanwhile.
>
> Is it possible to create a bit-by-bit reproducible git bundle using some
> other set of commands? If so, how? I'm using git 2.48.1 from Guix.
>
> Can anyone explain what is causing the irreproducibility? Running
> diffoscope is not helpful, since the bundle is compressed and diffoscope
> doesn't seem to know how to untangle it.
I spent some time on this. When I followed the instructions, the
diffs were in the pack file portion of the bundle file: different
"tree" objects were produced at different points in the pack file. But
it produces identical bundles if I run `git bundle create` multiple
times in the same clone. My guess is that the non-determinism comes
from the order in which things are created in the filesystem during
the clone, presumably due to multithreading during the clone process,
or maybe during gc? The contents of .git/objects/pack have different
hashes across my two clones, and I haven't investigated why.
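One way to check that two clones differ only in how their objects are packed, and not in which objects they hold, is to compare their object lists. A minimal sketch, assuming a throwaway repository stands in for the real upstream (`src`, `clone-a.git` and `clone-b.git` are placeholder names, not part of the published recipe):

```shell
set -eu
work=$(mktemp -d) && cd "$work"

# Throwaway stand-in for the upstream repository (made-up content).
git init -q src
echo hello >src/file.txt
git -C src add file.txt
git -C src -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "initial commit"

# Two independent clones that go through the pack-based transport.
git clone -q --bare --no-local src clone-a.git
git clone -q --bare --no-local src clone-b.git

# List every object (id, type, size) in each clone; git emits them
# sorted by object id, so the two lists are directly comparable.
git -C clone-a.git cat-file --batch-all-objects --batch-check >objects-a
git -C clone-b.git cat-file --batch-all-objects --batch-check >objects-b

if cmp -s objects-a objects-b; then
    same_objects=yes
else
    same_objects=no
fi
echo "same logical objects: $same_objects"
```

If the lists match while the files under objects/pack differ, the difference lies in delta selection and pack layout rather than in content.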
>
> If this is not possible today, what do you think about changes to make
> this work?
What is your end goal with being able to reproduce the bundles?
Bundles are just a list of refs and a pack file, I think. Reproducing
the bundle doesn't provide any more security than git provides when it
writes the pack file to disk - if you end up with commits with the
same hashes, the bundle has to be *effectively* the same as a git
clone of the repository.
Producing an identical bit-for-bit bundle might be doable by doing
some form of sorting of the objects in the pack file, but this would
only get us closer to bit-for-bit reproducibility *on the same machine
and versions of everything*. There could be some changes to git, zlib,
machine architecture, etc. that cause deterministic but different
values to be produced. As an example, maybe future versions of zlib
compress better, producing an equal result when decompressed, but a
different compressed result.
>
> Thanks,
> /Simon
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-12 11:40 Making bit-by-bit reproducible Git Bundles? Simon Josefsson
2025-03-12 16:02 ` Junio C Hamano
2025-03-13 3:09 ` Kyle Lippincott
@ 2025-03-13 5:15 ` Jeff King
2025-03-13 13:36 ` Junio C Hamano
2025-03-13 20:16 ` Simon Josefsson
2 siblings, 2 replies; 11+ messages in thread
From: Jeff King @ 2025-03-13 5:15 UTC (permalink / raw)
To: Simon Josefsson; +Cc: git
On Wed, Mar 12, 2025 at 12:40:05PM +0100, Simon Josefsson wrote:
> If I run the recipe above twice (including the clone), I get different
> checksums. This even if nothing was committed in the remote repository
> meanwhile.
>
> Is it possible to create a bit-by-bit reproducible git bundle using some
> other set of commands? If so, how? I'm using git 2.48.1 from Guix.
As Junio noted, multithreading is the first problem. E.g., here are some
commands on git.git, using my 8-core machine:
[try once...]
$ git bundle create --no-progress - HEAD | sha1sum
686da850200da487032c9d91bdc544b605a3e426 -
[and again; oops, it's different]
$ git bundle create --no-progress - HEAD | sha1sum
70b018c16d244f32b36e55deb931e29ae15506e3 -
[now without threading]
$ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
c897caf9c68d2c37d997d3973196886af3b0b46e -
[and we can do it again. yay!]
$ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
c897caf9c68d2c37d997d3973196886af3b0b46e -
What's happening here is that the bundle mostly consists of a packfile,
where many objects will be stored as deltas against others. The search
for deltas is multi-threaded, so it will find slightly different ones
each time (there surely is an "optimal" answer, but finding it is much
too expensive, so we bound the search with some heuristics).
So disabling threading gives you a deterministic answer. But that's not
the end of the story! We only search for deltas of objects that are not
already stored as deltas in on-disk packfiles. We try to reuse any
deltas we have already on disk (assuming that both the delta and its
base are going to be in the output).
There are options to ask pack-objects (the command which git-bundle uses
under the hood to generate the pack) not to reuse deltas. So
pack-objects running on a single thread without any delta reuse should
generate a deterministic pack. But there are some gotchas:
1. It's stable only for a given Git version, and with a particular set
of delta window/depth options. I wouldn't expect behavior to change
much between versions, but it's not something that we try to
guarantee.
2. There is no way to pass pack-objects options down through
git-bundle. So you'd have to either assemble the bundle yourself,
or perhaps generate a stable on-disk pack state, and then generate
the bundle. Perhaps something like:
# make one single pack, with no reuse, using the default options
git -c pack.threads=1 repack -adf
# now we can make a bundle from that. We probably do not even
# need to disable threads here, since we'd just be picking the
# deltas from the on-disk file (assuming that you're including
# all objects in the bundle)
git bundle create - --all | sha1sum
3. It will be really slow. We're throwing out all of the deltas and
searching from scratch. And doing it single-threaded. I didn't time
it, but I'd guess from past experience we're talking about hours to
generate the bundle for something like linux.git.
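The approach from gotcha 2 can be exercised end to end on two separate clones. This is only a sketch under assumptions: the tiny `src` repository is a made-up stand-in for a real upstream, and identical output is only to be expected when both sides use the same Git and zlib.

```shell
set -eu
work=$(mktemp -d) && cd "$work"

# Made-up upstream with two commits.
git init -q src
for n in 1 2; do
    echo "revision $n" >src/file.txt
    git -C src add file.txt
    git -C src -c user.name=Example -c user.email=example@example.invalid \
        commit -q -m "revision $n"
done

for name in clone-1 clone-2; do
    git clone -q --no-local src "$name"
    # One single pack, deltas recomputed from scratch on one thread.
    git -C "$name" -c pack.threads=1 repack -adf -q
    # Bundle everything from that stable on-disk state.
    git -C "$name" -c pack.threads=1 bundle create "$work/$name.bundle" --all
done

if cmp -s clone-1.bundle clone-2.bundle; then
    bundles_match=yes
else
    bundles_match=no
fi
echo "bundles identical: $bundles_match"
```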
So I think it's possible, but I doubt it's very ergonomic. You're
probably better off using some checksum over Git's logical model, rather
than the stored bytes. The obvious one is that a single Git commit hash
unambiguously represents the whole tree and all of history leading up to
it, because of the chains of hashes.
But that implies you trust Git's object hash algorithm. If you don't
trust sha1 (and don't want to try out the sha256 support), then you'd
have to design something else. Perhaps something like:
# print all commits in topological order, with ties broken by
# committer date, which should be stable. And then follow up with the
# trees and blobs for each.
git rev-list --topo-order --objects HEAD >objects
# now print the contents of each object (preceded by its name, type,
# and length, so there's no chance of weird prepending or appending
# attacks). We cut off the path information from rev-list here, since
# the ordered set of objects is all we care about.
cut -d' ' -f1 objects |
git cat-file --batch >content
# and then take a hash over that content; this will be unambiguous.
sha256sum <content
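The same idea can be wrapped in a shell function and tried against two differently packed copies of one history. This is a sketch, not a vetted tool: the function name `logical_dump` and the throwaway `src` repository are illustrative assumptions, and the final digest would be taken with sha256sum over each dump exactly as above.

```shell
set -eu
work=$(mktemp -d) && cd "$work"

# logical_dump REPO: print every object reachable from HEAD in rev-list
# order, each preceded by its "<id> <type> <size>" header line.
logical_dump() {
    git -C "$1" rev-list --topo-order --objects HEAD |
    cut -d' ' -f1 |
    git -C "$1" cat-file --batch
}

# Made-up repository; its objects start out loose on disk.
git init -q src
echo hello >src/file.txt
git -C src add file.txt
git -C src -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "initial commit"

# The clone receives the same objects as a single pack instead.
git clone -q --bare --no-local src copy.git

logical_dump src >content-src
logical_dump copy.git >content-copy

if cmp -s content-src content-copy; then
    dumps_match=yes
else
    dumps_match=no
fi
echo "logical content identical: $dumps_match"
```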
-Peff
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 3:09 ` Kyle Lippincott
@ 2025-03-13 7:59 ` Simon Josefsson
0 siblings, 0 replies; 11+ messages in thread
From: Simon Josefsson @ 2025-03-13 7:59 UTC (permalink / raw)
To: Kyle Lippincott; +Cc: git
Kyle Lippincott <spectral@google.com> writes:
>> Can anyone explain what is causing the irreproducibility? Running
>> diffoscope is not helpful, since the bundle is compressed and diffoscope
>> doesn't seem to know how to untangle it.
>
> Spent some time on this, and when I followed the instructions, the
> diffs were in the pack file portion of the bundle file, different
> "tree" objects were produced at different points in the pack file. But
> it produces identical bundles if I run `git bundle create` multiple
> times in the same clone. My guess is that the non-determinism is
> coming from the clone process being multi-threaded, meaning that the
> order things are created in the filesystem during the clone,
> presumably due to multithreading happening during the clone process,
> or maybe during gc? The contents of .git/objects/pack have different
> hashes across my two clones, and I haven't investigated why.
Yes, my perception is also that the reproducibility problem happens
during 'git clone'. Within the same git clone, it is no problem to
create a bit-by-bit reproducible git bundle. But if you work in two
different clones, I haven't been able to find any set of commands that
leads to identical results.
FWIW, some other ways to do the clone that I have tried but didn't get
to work (of course I may have made some mistake in my attempts):
# dumb protocol doesn't repack the objects
GIT_SMART_HTTP=0 git clone https://git.savannah.gnu.org/git/gnulib.git
# using rsync fetches .git identical as upstream
rsync -av git.savannah.gnu.org::git/gnulib.git/ gnulib
>> If this is not possible today, what do you think about changes to make
>> this work?
>
> What is your end goal with being able to reproduce the bundles?
Good question - I should have made that clear.
The end goal is for someone other than me, as uploader of the gnulib git
bundle, to be able to re-create it bit-by-bit identical. This pursuit is in
the name of improved software supply-chain security. Compare
efforts to make gzip and tarball files reproducible by others:
https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html
https://www.gnu.org/software/gzip/manual/html_node/Environment.html
> Producing an identical bit-for-bit bundle might be doable by doing
> some form of sorting of the objects in the pack file, but this would
> only get us closer to bit-for-bit reproducibility *on the same machine
> and versions of everything*. There could be some changes to git, zlib,
> machine architecture, etc. that causes deterministic but different
> values to be produced. As an example, maybe future versions of zlib
> compress better, producing an equal result when decompressed, but a
> different compressed result.
That is an improvement compared to today's situation where nobody can
reproduce the git bundle at all. Being able to reproduce it using the
same environment (toolchain) is better. This is similar for
reproducible builds of binaries: typically you need to reproduce a
similar environment to get reproducible results.
/Simon
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 5:15 ` Jeff King
@ 2025-03-13 13:36 ` Junio C Hamano
2025-03-13 20:16 ` Simon Josefsson
1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2025-03-13 13:36 UTC (permalink / raw)
To: Jeff King; +Cc: Simon Josefsson, git
Jeff King <peff@peff.net> writes:
> .... But there are some gotchas:
>
> 1. It's stable only for a given Git version, and with a particular set
> ...
> 2. There is no way to pass pack-objects options down through
> ...
> 3. It will be really slow. We're throwing out all of the deltas and
> ...
There also is 4.
4. We do not control zlib, so even with the same Git binary, the
zlib implementation that is dynamically linked to us is free
to produce a better-compressed base object (or compressed
delta).
3. is not a downside if the priority of the requestor is about
bit-for-bit reproducibility (iow, "no matter what the cost").
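Because the result depends on the toolchain and on the packing options, it may help to publish a record of both next to the bundle. A minimal sketch, assuming a file name `bundle-toolchain.txt` picked here purely for illustration; whether the linked zlib version shows up in this output depends on the Git build, so it may have to be recorded separately:

```shell
set -eu
work=$(mktemp -d) && cd "$work"

# Record the Git version and build details used to create the bundle.
git version --build-options >bundle-toolchain.txt

# Record the settings that influence delta selection and compression;
# "unset" means Git's built-in default applies.
for key in pack.window pack.depth pack.compression core.compression; do
    value=$(git config --get "$key" || echo unset)
    echo "$key=$value" >>bundle-toolchain.txt
done

cat bundle-toolchain.txt
```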
> # print all commits in topological order, with ties broken by
> # committer date, which should be stable. And then follow up with the
> # trees and blobs for each.
> git rev-list --topo-order --objects HEAD >objects
>
> # now print the contents of each object (preceded by its name, type,
> # and length, so there's no chance of weird prepending or appending
> # attacks). We cut off the path information from rev-list here, since
> # the ordered set of objects is all we care about.
> cut -d' ' -f1 objects |
> git cat-file --batch >content
>
> # and then take a hash over that content; this will be unambiguous.
> sha256sum <content
Gross but probably stable ;-)
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 5:15 ` Jeff King
2025-03-13 13:36 ` Junio C Hamano
@ 2025-03-13 20:16 ` Simon Josefsson
2025-03-13 21:07 ` Kyle Lippincott
2025-03-14 2:42 ` Jeff King
1 sibling, 2 replies; 11+ messages in thread
From: Simon Josefsson @ 2025-03-13 20:16 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> [now without threading]
> $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
> c897caf9c68d2c37d997d3973196886af3b0b46e -
>
> [and we can do it again. yay!]
> $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
> c897caf9c68d2c37d997d3973196886af3b0b46e -
Those are the commands I use -- they don't lead to the same hash in two
different 'git clone's. I tried running 'git clone' with the same '-c
pack.threads=1' but it made no difference.
> 2. There is no way to pass pack-objects options down through
> git-bundle. So you'd have to either assemble the bundle yourself,
> or perhaps generate a stable on-disk pack state, and then generate
> the bundle. Perhaps something like:
>
> # make one single pack, with no reuse, using the default options
> git -c pack.threads=1 repack -adf
Yay! You may have solved this for me. I have to verify this a bit
more, but this looks promising (these are two different git clones):
jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 gnulib.bundle
jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 gnulib.bundle
jas@kaka:~/t/gnulib-2$
> So I think it's possible, but I doubt it's very ergonomic. You're
> probably better off using some checksum over Git's logical model, rather
> than the stored bytes. The obvious one is that a single Git commit hash
> unambiguously represents the whole tree and all of history leading up to
> it, because of the chains of hashes.
>
> But that implies you trust Git's object hash algorithm.
Right -- I think anything but bit-by-bit identical files is going to be
too complex to verify.
> # print all commits in topological order, with ties broken by
> # committer date, which should be stable. And then follow up with the
> # trees and blobs for each.
> git rev-list --topo-order --objects HEAD >objects
>
> # now print the contents of each object (preceded by its name, type,
> # and length, so there's no chance of weird prepending or appending
> # attacks). We cut off the path information from rev-list here, since
> # the ordered set of objects is all we care about.
> cut -d' ' -f1 objects |
> git cat-file --batch >content
>
> # and then take a hash over that content; this will be unambiguous.
> sha256sum <content
How do I read this output? Could this be made git bundle compatible?
But if the above solves it, this part isn't necessary.
/Simon
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 20:16 ` Simon Josefsson
@ 2025-03-13 21:07 ` Kyle Lippincott
2025-03-13 22:09 ` Junio C Hamano
2025-03-14 2:42 ` Jeff King
1 sibling, 1 reply; 11+ messages in thread
From: Kyle Lippincott @ 2025-03-13 21:07 UTC (permalink / raw)
To: Simon Josefsson; +Cc: Jeff King, git
On Thu, Mar 13, 2025 at 1:18 PM Simon Josefsson <simon@josefsson.org> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> > [now without threading]
> > $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
> > c897caf9c68d2c37d997d3973196886af3b0b46e -
> >
> > [and we can do it again. yay!]
> > $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
> > c897caf9c68d2c37d997d3973196886af3b0b46e -
>
> That's the commands I use -- it doesn't lead to the same hash in two
> different 'git clone's. I tried running 'git clone' with the same '-c
> pack.threads=1' but it made no difference.
>
> > 2. There is no way to pass pack-objects options down through
> > git-bundle. So you'd have to either assemble the bundle yourself,
> > or perhaps generate a stable on-disk pack state, and then generate
> > the bundle. Perhaps something like:
> >
> > # make one single pack, with no reuse, using the default options
> > git -c pack.threads=1 repack -adf
>
> Yay! You may have solved this for me. I have to verify this a bit
> more, but this looks promising (these are two different git clones):
>
> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 gnulib.bundle
> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 gnulib.bundle
> jas@kaka:~/t/gnulib-2$
>
> > So I think it's possible, but I doubt it's very ergonomic. You're
> > probably better off using some checksum over Git's logical model, rather
> > than the stored bytes. The obvious one is that a single Git commit hash
> > unambiguously represents the whole tree and all of history leading up to
> > it, because of the chains of hashes.
> >
> > But that implies you trust Git's object hash algorithm.
>
> Right -- I think anything but bit-by-bit identical files is going to be
> too complex to verify.
I'm curious what specific attacks you're trying to catch here. Because
to get into a situation where you unbundle the bundle and have the
same commit hash but different contents, you would need to have a
collision in the SHA-1 hash for some object (or SHA-256 hash if the
repo is using that). If you're also providing the instructions (or
even just the commit hash and server to clone from, and linking to
instructions maintained elsewhere) to validate the bundle is
legitimate, it seems MUCH easier to just replace those validation
instructions to point to a commit/server that has already been
backdoored than it would be to generate a SHA-1 collision that would
go undetected.
>
> > # print all commits in topological order, with ties broken by
> > # committer date, which should be stable. And then follow up with the
> > # trees and blobs for each.
> > git rev-list --topo-order --objects HEAD >objects
> >
> > # now print the contents of each object (preceded by its name, type,
> > # and length, so there's no chance of weird prepending or appending
> > # attacks). We cut off the path information from rev-list here, since
> > # the ordered set of objects is all we care about.
> > cut -d' ' -f1 objects |
> > git cat-file --batch >content
> >
> > # and then take a hash over that content; this will be unambiguous.
> > sha256sum <content
>
> How to read this output? Could this be made git bundle compatible?
>
> But if the above is solves it, this part isn't necessary.
>
> /Simon
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 21:07 ` Kyle Lippincott
@ 2025-03-13 22:09 ` Junio C Hamano
0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2025-03-13 22:09 UTC (permalink / raw)
To: Kyle Lippincott; +Cc: Simon Josefsson, Jeff King, git
Kyle Lippincott <spectral@google.com> writes:
>> > But that implies you trust Git's object hash algorithm.
>>
>> Right -- I think anything but bit-by-bit identical files is going to be
>> too complex to verify.
>
> I'm curious what specific attacks you're trying to catch here. Because
> to get into a situation where you unbundle the bundle and have the
> same commit hash but different contents, you would need to have a
> collision in the SHA-1 hash for some object (or SHA-256 hash if the
> repo is using that). If you're also providing the instructions (or
> even just the commit hash and server to clone from, and linking to
> instructions maintained elsewhere) to validate the bundle is
> legitimate, it seems MUCH easier to just replace those validation
> instructions to point to a commit/server that has already been
> backdoored than it would be to generate a SHA-1 collision that would
> go undetected.
I think there are two levels of "verify" involved in this discussion.
There are those who want to trust bundles and place enough trust
on whoever created that bundle. They are happy as long as the
bitstream they received the first time does not change when they ask
for it for the second time, because they at least know that the same
input would result in the same output. To them, "attack" is what
changes the bitstream while they are looking the other way. They do
not like the fact that there can be more than one representation of
the same thing for this reason.
Then there are those who know Git enough to know that they do not
need to trust the middleman who creates bundle files, and they do not
need to trust exact bitstream that is contained within these bundle
files. They can extract the bundle to verify the tip commits of the
history (by comparing their object names with published hashes, by
verifying the embedded signatures, etc.), which is what ensures
integrity in Merkle tree based systems like history stored in Git.
The latter folks may worry about the "attacks" you mention here, but
the former may not necessarily do so.
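The second kind of verification can be sketched as follows: read the tip advertised by the bundle and compare it with a hash obtained through a trusted channel. This is an illustration under assumptions; the repository, the bundle name `example.bundle`, and the variable `published` (standing in for a hash published out of band) are all made up here.

```shell
set -eu
work=$(mktemp -d) && cd "$work"

# Publisher side: a made-up repository and a bundle of its HEAD.
git init -q src
echo hello >src/file.txt
git -C src add file.txt
git -C src -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "initial commit"
git -C src bundle create "$work/example.bundle" HEAD
published=$(git -C src rev-parse HEAD)   # stand-in for the published hash

# Receiver side: read the tip recorded in the bundle header.
git init -q --bare check.git
tip=$(git -C check.git bundle list-heads "$work/example.bundle" HEAD |
      cut -d' ' -f1)

# Unbundling re-hashes every object, so the fetched tip names exactly
# the history that the published hash commits to.
git -C check.git fetch -q "$work/example.bundle" HEAD
fetched=$(git -C check.git rev-parse FETCH_HEAD)

if [ "$tip" = "$published" ] && [ "$fetched" = "$published" ]; then
    verdict=ok
else
    verdict=mismatch
fi
echo "bundle tip check: $verdict"
```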
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 20:16 ` Simon Josefsson
2025-03-13 21:07 ` Kyle Lippincott
@ 2025-03-14 2:42 ` Jeff King
2025-03-14 22:24 ` rsbecker
1 sibling, 1 reply; 11+ messages in thread
From: Jeff King @ 2025-03-14 2:42 UTC (permalink / raw)
To: Simon Josefsson; +Cc: git
On Thu, Mar 13, 2025 at 09:16:34PM +0100, Simon Josefsson wrote:
> > 2. There is no way to pass pack-objects options down through
> > git-bundle. So you'd have to either assemble the bundle yourself,
> > or perhaps generate a stable on-disk pack state, and then generate
> > the bundle. Perhaps something like:
> >
> > # make one single pack, with no reuse, using the default options
> > git -c pack.threads=1 repack -adf
>
> Yay! You may have solved this for me. I have to verify this a bit
> more, but this looks promising (these are two different git clones):
>
> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 gnulib.bundle
> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 gnulib.bundle
> jas@kaka:~/t/gnulib-2$
One thing to watch out for here: that repack is going to look at _all_
objects in the repository. So you will get different output if you make
a bundle of a tag "v1.0" today than you would get later, when "v1.1"
also exists. Ditto for any other activity in the repository, like writes
to unrelated branches, or even reflog entries.
So you'd probably want to make an absolute minimal repository with the
reachable objects, perhaps like:
git clone --bare --no-local --single-branch -b v1.0 . just-v1.0.git
cd just-v1.0.git
git -c pack.threads=1 repack -adf
It doesn't have to be just one ref, of course; you might want to
snapshot the whole set of refs at the time you make the bundle. E.g., by
fetching into the empty repo using a refspec.
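Fetching into an empty repository with a refspec might look like the sketch below. The `src` repository, the tag name, the helper name `snapshot`, and the chosen refspecs are illustrative assumptions; pick whichever refs should be part of the snapshot.

```shell
set -eu
work=$(mktemp -d) && cd "$work"

# Made-up source repository with one tagged and one later commit.
git init -q src
echo one >src/file.txt
git -C src add file.txt
git -C src -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "first"
git -C src tag v1.0
echo two >src/file.txt
git -C src -c user.name=Example -c user.email=example@example.invalid \
    commit -q -a -m "second"

# snapshot NAME: build a bare repository holding only branches and tags,
# repack it from scratch on one thread, and bundle the result.
snapshot() {
    git init -q --bare "$1.git"
    git -C "$1.git" fetch -q "$work/src" \
        'refs/heads/*:refs/heads/*' 'refs/tags/*:refs/tags/*'
    git -C "$1.git" -c pack.threads=1 repack -adf -q
    git -C "$1.git" -c pack.threads=1 bundle create "$work/$1.bundle" --all
}

snapshot snap-1
snapshot snap-2

if cmp -s snap-1.bundle snap-2.bundle; then
    snapshots_match=yes
else
    snapshots_match=no
fi
echo "snapshot bundles identical: $snapshots_match"
```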
This would all be a non-issue if you could ask git-bundle to directly
pass the equivalent of "-f" to pack-objects (at that layer it is called
"--no-reuse-delta"), since then it would be computing the full set of
objects itself. But without a patch to Git, I don't think there's a way
to do that.
The bundle format is pretty simple, so you _could_ hack around it
yourself, like:
# list refs we care about; you can pick whatever subset you want
# here.
git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs
{
# bundle header plus list of refs, plus blank line terminator
echo "# v2 git bundle"
cat refs
echo
# and now the pack. We just need to feed it the object ids for
# all of the refs. It will handle sorting and de-duping for us.
cut -d' ' -f1 <refs |
git -c pack.threads=1 pack-objects \
--stdout --revs --delta-base-offset --no-reuse-delta
} >foo.bundle
I dunno if that is more or less gross than teaching git-bundle to pass
--no-reuse-delta itself. It's certainly more intimate with the details,
but OTOH it is less likely to change in other versions of Git (e.g., if
we started making "v3" bundles by default).
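A hand-assembled bundle can be sanity-checked with Git's own tooling before it is published. The sketch below rebuilds a bundle the same way on a made-up repository (`src`, `manual.bundle` and `check.git` are placeholder names), asks `git bundle verify` to parse it, and then fetches from it into an empty repository to confirm the refs come out unchanged.

```shell
set -eu
work=$(mktemp -d) && cd "$work"

git init -q src
echo hello >src/file.txt
git -C src add file.txt
git -C src -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "initial commit"

# Assemble the bundle by hand: header, ref list, blank line, pack.
git -C src for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs
{
    echo "# v2 git bundle"
    cat refs
    echo
    cut -d' ' -f1 <refs |
    git -C src -c pack.threads=1 pack-objects -q \
        --stdout --revs --delta-base-offset --no-reuse-delta
} >manual.bundle

# Let Git parse the result and fetch from it into an empty repository.
git init -q --bare check.git
git -C check.git bundle verify "$work/manual.bundle"
git -C check.git fetch -q "$work/manual.bundle" 'refs/heads/*:refs/heads/*'
git -C check.git for-each-ref --format='%(objectname) %(refname)' \
    refs/heads/ >refs-after

if cmp -s refs refs-after; then
    refs_match=yes
else
    refs_match=no
fi
echo "refs survive the round trip: $refs_match"
```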
> > # print all commits in topological order, with ties broken by
> > # committer date, which should be stable. And then follow up with the
> > # trees and blobs for each.
> > git rev-list --topo-order --objects HEAD >objects
> >
> > # now print the contents of each object (preceded by its name, type,
> > # and length, so there's no chance of weird prepending or appending
> > # attacks). We cut off the path information from rev-list here, since
> > # the ordered set of objects is all we care about.
> > cut -d' ' -f1 objects |
> > git cat-file --batch >content
> >
> > # and then take a hash over that content; this will be unambiguous.
> > sha256sum <content
>
> How to read this output? Could this be made git bundle compatible?
You'd have to compare the result of doing that after fetching from the
bundle into an empty repo. I don't think there's a great way to operate
directly on the bundle packfile (it has to be indexed first to see
what's in it).
The closest I could get is:
input=foo.bundle
# split the bundle into header and packfile sections on the first
# blank line
sed '/^$/q' <$input >header
size=$(stat --format=%s header)
tail -c +$((size+1)) <$input >bundle.pack
# we can first do a byte-level comparison of the header; if this isn't
# the same, the bundles do not match.
sha256sum <header
# now index the pack, so we know what's in it; this makes bundle.idx
git index-pack -v bundle.pack
# and now we want to dump the full logical contents (not the
# delta-compressed versions) of each object. First we need a list of
# the objects. This will come out in lexical order of object id, which
# is good for us since it will be stable.
git show-index <bundle.idx | awk '{print $2}' >objects
# unfortunately here things break down. There is no command to read
# the data directly out of the pack/idx pair without a repository
# (even though it could be done technically). So we hack around it
# with a temp repo.
git init --bare tmp.git
mv bundle.idx bundle.pack tmp.git/objects/pack/
git -C tmp.git cat-file --batch <objects | sha256sum
So...also kind of gross. And not really all that different than what:
git init --bare tmp.git
cd tmp.git
git fetch ../foo.bundle 'refs/*:refs/*'
would do (you end up with the same pack/idx pair). So I dunno. I guess
it depends how many and which Git commands you're willing to trust. ;)
-Peff
* RE: Making bit-by-bit reproducible Git Bundles?
2025-03-14 2:42 ` Jeff King
@ 2025-03-14 22:24 ` rsbecker
0 siblings, 0 replies; 11+ messages in thread
From: rsbecker @ 2025-03-14 22:24 UTC (permalink / raw)
To: 'Jeff King', 'Simon Josefsson'; +Cc: git
On March 13, 2025 10:42 PM, Jeff King wrote:
>On Thu, Mar 13, 2025 at 09:16:34PM +0100, Simon Josefsson wrote:
>
>> > 2. There is no way to pass pack-objects options down through
>> > git-bundle. So you'd have to either assemble the bundle yourself,
>> > or perhaps generate a stable on-disk pack state, and then generate
>> > the bundle. Perhaps something like:
>> >
>> > # make one single pack, with no reuse, using the default options
>> > git -c pack.threads=1 repack -adf
>>
>> Yay! You may have solved this for me. I have to verify this a bit
>> more, but this looks promising (these are two different git clones):
>>
>> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
>> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
>> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
>> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 gnulib.bundle
>> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
>> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
>> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
>> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
>> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 gnulib.bundle
>> jas@kaka:~/t/gnulib-2$
>
>One thing to watch out for here: that repack is going to look at _all_ objects in the
>repository. So you will get different output if you make a bundle of a tag "v1.0"
>today than you would get later, when "v1.1"
>also exists. Ditto for any other activity in the repository, like writes to unrelated
>branches, or even reflog entries.
>
>So you'd probably want to make an absolute minimal repository with the reachable
>objects, perhaps like:
>
> git clone --bare --no-local --single-branch -b v1.0 . just-v1.0.git
> cd just-v1.0.git
> git -c pack.threads=1 repack -adf
>
>It doesn't have to be just one ref, of course; you might want to snapshot the whole
>set of refs at the time you make the bundle. E.g., by fetching into the empty repo
>using a refspec.
>
>This would all be a non-issue if you could ask git-bundle to directly pass the
>equivalent of "-f" to pack-objects (at that layer it is called "--no-reuse-delta"). Since
>then it would be computing the full set of objects itself. But without a patch to Git, I
>don't think there's a way to do that.
>
>The bundle format is pretty simple, so you _could_ hack around it yourself, like:
>
> # list refs we care about; you can pick whatever subset you want
> # here.
> git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs
>
> {
>     # bundle header plus list of refs, plus blank line terminator
>     echo "# v2 git bundle"
>     cat refs
>     echo
>
>     # and now the pack. We just need to feed it the object ids for
>     # all of the refs. It will handle sorting and de-duping for us.
>     cut -d' ' -f1 <refs |
>     git -c pack.threads=1 pack-objects \
>         --stdout --revs --delta-base-offset --no-reuse-delta
> } >foo.bundle
>
>I dunno if that is more or less gross than teaching git-bundle to pass
>--no-reuse-delta itself. It's certainly more intimate with the details, but OTOH it
>is less likely to change in other versions of Git (e.g., if we started making "v3"
>bundles by default).
>
>> > # print all commits in topological order, with ties broken by
>> > # committer date, which should be stable. And then follow up with the
>> > # trees and blobs for each.
>> > git rev-list --topo-order --objects HEAD >objects
>> >
>> > # now print the contents of each object (preceded by its name, type,
>> > # and length, so there's no chance of weird prepending or appending
>> > # attacks). We cut off the path information from rev-list here, since
>> > # the ordered set of objects is all we care about.
>> > cut -d' ' -f1 objects |
>> > git cat-file --batch >content
>> >
>> > # and then take a hash over that content; this will be unambiguous.
>> > sha256sum <content
>>
>> How to read this output? Could this be made git bundle compatible?
>
>You'd have to compare the result of doing that after fetching from the bundle into
>an empty repo. I don't think there's a great way to operate directly on the bundle
>packfile (it has to be indexed first to see what's in it).
>
>The closest I could get is:
>
> input=foo.bundle
>
> # split the bundle into header and packfile sections on the first
> # blank line
> sed '/^$/q' <$input >header
> size=$(stat --format=%s header)
> tail -c +$((size+1)) <$input >bundle.pack
>
> # we can first do a byte-level comparison of the header; if this isn't
> # the same, the bundles do not match.
> sha256sum <header
>
> # now index the pack, so we know what's in it; this makes bundle.idx
> git index-pack -v bundle.pack
>
> # and now we want to dump the full logical contents (not the
> # delta-compressed versions) of each object. First we need a list of
> # the objects. This will come out in lexical order of object id, which
> # is good for us since it will be stable.
> git show-index <bundle.idx | awk '{print $2}' >objects
>
> # unfortunately here things break down. There is no command to read
> # the data directly out of the pack/idx pair without a repository
> # (even though it could be done technically). So we hack around it
> # with a temp repo.
> git init --bare tmp.git
> mv bundle.idx bundle.pack tmp.git/objects/pack/
> git -C tmp.git cat-file --batch <objects | sha256sum
>
>So...also kind of gross. And not really all that different than what:
>
> git init --bare tmp.git
> cd tmp.git
> git fetch ../foo.bundle 'refs/*:refs/*'
>
>would do (you end up with the same pack/idx pair). So I dunno. I guess it depends
>how many and which Git commands you're willing to trust. ;)

I would go one step further on this: use --depth=1, and potentially a --sparse
checkout, with only what you specifically need to verify.

However, Junio's point about checking the end-point commits and tags is useful and
significant. Verifying through signing that the Merkle tree itself is intact and
unmodified is usually sufficient, and more reliable than a bit-for-bit comparison,
which may depend on the underlying operating system, particularly if the originating
directory inode contents differ from the destination. An example is using a Windows
server for the upstream and a NonStop server for the clone (not so much with Linux
vs. NonStop). It is pretty much guaranteed that the inodes will be different.
--Randall
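
The end-point check mentioned above can be sketched in shell as follows. This is a rough illustration, not an established recipe: `bundle_tips_match` is a hypothetical helper name, and the reference repository is assumed to hold exactly the refs that were bundled with `--all`. It compares the ref names and object IDs advertised in the bundle header (via `git bundle list-heads`) with those in the reference repository. The IDs in the header only become meaningful once Git has unpacked and checked the pack, for example by fetching from the bundle, and signed tags could additionally be checked with `git verify-tag` after such a fetch.

```shell
#!/bin/sh
# Hypothetical helper (not a Git command): succeed if the bundle
# advertises exactly the refs (name and object ID) that exist in the
# reference repository. The HEAD entry that "--all" adds is ignored.
bundle_tips_match() {
    repo=$1
    bundle=$2
    want=$(git -C "$repo" for-each-ref \
               --format='%(objectname) %(refname)' | LC_ALL=C sort)
    have=$(git bundle list-heads "$bundle" |
               grep -v ' HEAD$' | LC_ALL=C sort)
    [ "$want" = "$have" ]
}
```

For example, `bundle_tips_match ~/t/gnulib-1 gnulib.bundle && echo same tips` would report whether the bundle still describes the current state of that clone, regardless of how the pack inside it was compressed.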
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-03-14 22:26 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-12 11:40 Making bit-by-bit reproducible Git Bundles? Simon Josefsson
2025-03-12 16:02 ` Junio C Hamano
2025-03-13 3:09 ` Kyle Lippincott
2025-03-13 7:59 ` Simon Josefsson
2025-03-13 5:15 ` Jeff King
2025-03-13 13:36 ` Junio C Hamano
2025-03-13 20:16 ` Simon Josefsson
2025-03-13 21:07 ` Kyle Lippincott
2025-03-13 22:09 ` Junio C Hamano
2025-03-14 2:42 ` Jeff King
2025-03-14 22:24 ` rsbecker