* Making bit-by-bit reproducible Git Bundles?
@ 2025-03-12 11:40 Simon Josefsson
2025-03-12 16:02 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Simon Josefsson @ 2025-03-12 11:40 UTC (permalink / raw)
To: git
[-- Attachment #1: Type: text/plain, Size: 959 bytes --]
Hi.
Thank you for the "git-archive" and "git-bundle" features, making it
easier to do source-based builds in a no-Internet environment.
I have published a Git bundle of Gnulib:
https://www.gnu.org/software/gnulib/manual/html_node/Gnulib-Git-Bundle.html
As you can see at the end, I struggle to come up with a recipe to allow
others to reproduce the git bundle that I created.
If I run the recipe above twice (including the clone), I get different
checksums. This even if nothing was committed in the remote repository
meanwhile.
Is it possible to create a bit-by-bit reproducible git bundle using some
other set of commands? If so, how? I'm using git 2.48.1 from Guix.
Can anyone explain what is causing the irreproducibility? Running
diffoscope is not helpful, since the bundle is compressed and diffoscope
doesn't seem to know how to untangle it.
If this is not possible today, what do you think about changes to make
this work?
Thanks,
/Simon
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 1251 bytes --]
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-12 11:40 Making bit-by-bit reproducible Git Bundles? Simon Josefsson
@ 2025-03-12 16:02 ` Junio C Hamano
2025-03-13 3:09 ` Kyle Lippincott
2025-03-13 5:15 ` Jeff King
2 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2025-03-12 16:02 UTC (permalink / raw)
To: Simon Josefsson; +Cc: git

Simon Josefsson <simon@josefsson.org> writes:

> Can anyone explain what is causing the irreproducibility?

Multithreading?

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-12 11:40 Making bit-by-bit reproducible Git Bundles? Simon Josefsson
2025-03-12 16:02 ` Junio C Hamano
@ 2025-03-13 3:09 ` Kyle Lippincott
2025-03-13 7:59 ` Simon Josefsson
2025-03-13 5:15 ` Jeff King
2 siblings, 1 reply; 11+ messages in thread
From: Kyle Lippincott @ 2025-03-13 3:09 UTC (permalink / raw)
To: Simon Josefsson; +Cc: git

On Wed, Mar 12, 2025 at 4:59 AM Simon Josefsson <simon@josefsson.org> wrote:
>
> Hi.
>
> Thank you for the "git-archive" and "git-bundle" features, making it
> easier to do source-based builds in a no-Internet environment.
>
> I have published a Git bundle of Gnulib:
>
> https://www.gnu.org/software/gnulib/manual/html_node/Gnulib-Git-Bundle.html
>
> As you can see at the end, I struggle to come up with a recipe to allow
> others to reproduce the git bundle that I created.
>
> If I run the recipe above twice (including the clone), I get different
> checksums. This even if nothing was committed in the remote repository
> meanwhile.
>
> Is it possible to create a bit-by-bit reproducible git bundle using some
> other set of commands? If so, how? I'm using git 2.48.1 from Guix.
>
> Can anyone explain what is causing the irreproducibility? Running
> diffoscope is not helpful, since the bundle is compressed and diffoscope
> doesn't seem to know how to untangle it.

Spent some time on this, and when I followed the instructions, the
diffs were in the pack file portion of the bundle file: different
"tree" objects were produced at different points in the pack file. But
it produces identical bundles if I run `git bundle create` multiple
times in the same clone. My guess is that the non-determinism comes
from the order in which things are created in the filesystem during
the clone, presumably due to multithreading during the clone process,
or maybe during gc? The contents of .git/objects/pack have different
hashes across my two clones, and I haven't investigated why.

>
> If this is not possible today, what do you think about changes to make
> this work?

What is your end goal with being able to reproduce the bundles?
Bundles are just a list of refs and a pack file, I think. Reproducing
the bundle doesn't provide any more security than git provides when it
writes the pack file to disk - if you end up with commits with the
same hashes, the bundle has to be *effectively* the same as a git
clone of the repository.

Producing an identical bit-for-bit bundle might be doable by doing
some form of sorting of the objects in the pack file, but this would
only get us closer to bit-for-bit reproducibility *on the same machine
and versions of everything*. There could be some changes to git, zlib,
machine architecture, etc. that cause deterministic but different
values to be produced. As an example, maybe future versions of zlib
compress better, producing an equal result when decompressed, but a
different compressed result.

>
> Thanks,
> /Simon

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 3:09 ` Kyle Lippincott
@ 2025-03-13 7:59 ` Simon Josefsson
0 siblings, 0 replies; 11+ messages in thread
From: Simon Josefsson @ 2025-03-13 7:59 UTC (permalink / raw)
To: Kyle Lippincott; +Cc: git
[-- Attachment #1: Type: text/plain, Size: 3066 bytes --]

Kyle Lippincott <spectral@google.com> writes:

>> Can anyone explain what is causing the irreproducibility? Running
>> diffoscope is not helpful, since the bundle is compressed and diffoscope
>> doesn't seem to know how to untangle it.
>
> Spent some time on this, and when I followed the instructions, the
> diffs were in the pack file portion of the bundle file, different
> "tree" objects were produced at different points in the pack file. But
> it produces identical bundles if I run `git bundle create` multiple
> times in the same clone. My guess is that the non-determinism is
> coming from the clone process being multi-threaded, meaning that the
> order things are created in the filesystem during the clone,
> presumably due to multithreading happening during the clone process,
> or maybe during gc? The contents of .git/objects/pack have different
> hashes across my two clones, and I haven't investigated why.

Yes, my perception is also that the reproducibility problems happen
during 'git clone'. Within the same git clone, it is no problem to
create a bit-by-bit reproducible git bundle. But if you work in two
different clones, I haven't been able to find any set of commands that
leads to identical results.

FWIW, some other ways to do the clone that I have tried but didn't get
to work (of course I may have made some mistake in my attempts):

  # dumb protocol doesn't repack the objects
  GIT_SMART_HTTP=0 git clone https://git.savannah.gnu.org/git/gnulib.git

  # using rsync fetches .git identical as upstream
  rsync -av git.savannah.gnu.org::git/gnulib.git/ gnulib

>> If this is not possible today, what do you think about changes to make
>> this work?
>
> What is your end goal with being able to reproduce the bundles?

Good question - I should have made that clear. The end goal is for
someone other than me, as uploader of the gnulib git bundle, to be able
to re-create it bit-by-bit identical. This pursuit is in the name of
improved software supply-chain security. Compare efforts to make gzip
and tarball files reproducible by others:

https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html
https://www.gnu.org/software/gzip/manual/html_node/Environment.html

> Producing an identical bit-for-bit bundle might be doable by doing
> some form of sorting of the objects in the pack file, but this would
> only get us closer to bit-for-bit reproducibility *on the same machine
> and versions of everything*. There could be some changes to git, zlib,
> machine architecture, etc. that causes deterministic but different
> values to be produced. As an example, maybe future versions of zlib
> compress better, producing an equal result when decompressed, but a
> different compressed result.

That is an improvement compared to today's situation, where nobody can
reproduce the git bundle at all. Being able to reproduce it using the
same environment (toolchain) is better. This is similar for
reproducible builds of binaries: typically you need to reproduce a
similar environment to get reproducible results.

/Simon
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 1251 bytes --]

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-12 11:40 Making bit-by-bit reproducible Git Bundles? Simon Josefsson
2025-03-12 16:02 ` Junio C Hamano
2025-03-13 3:09 ` Kyle Lippincott
@ 2025-03-13 5:15 ` Jeff King
2025-03-13 13:36 ` Junio C Hamano
2025-03-13 20:16 ` Simon Josefsson
2 siblings, 2 replies; 11+ messages in thread
From: Jeff King @ 2025-03-13 5:15 UTC (permalink / raw)
To: Simon Josefsson; +Cc: git

On Wed, Mar 12, 2025 at 12:40:05PM +0100, Simon Josefsson wrote:

> If I run the recipe above twice (including the clone), I get different
> checksums. This even if nothing was committed in the remote repository
> meanwhile.
>
> Is it possible to create a bit-by-bit reproducible git bundle using some
> other set of commands? If so, how? I'm using git 2.48.1 from Guix.

As Junio noted, multithreading is the first problem. E.g., here are some
commands on git.git, using my 8-core machine:

  [try once...]
  $ git bundle create --no-progress - HEAD | sha1sum
  686da850200da487032c9d91bdc544b605a3e426  -

  [and again; oops, it's different]
  $ git bundle create --no-progress - HEAD | sha1sum
  70b018c16d244f32b36e55deb931e29ae15506e3  -

  [now without threading]
  $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
  c897caf9c68d2c37d997d3973196886af3b0b46e  -

  [and we can do it again. yay!]
  $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
  c897caf9c68d2c37d997d3973196886af3b0b46e  -

What's happening here is that the bundle mostly consists of a packfile,
where many objects will be stored as deltas against others. The search
for deltas is multi-threaded, so it will find slightly different ones
each time (there surely is an "optimal" answer, but finding it is much
too expensive, so we bound the search with some heuristics).

So disabling threading gives you a deterministic answer. But that's not
the end of the story! We only search for deltas of objects that are not
already stored as deltas in on-disk packfiles. We try to reuse any
deltas we have already on disk (assuming that both the delta and its
base are going to be in the output).

There are options to ask pack-objects (the command which git-bundle uses
under the hood to generate the pack) not to reuse deltas. So
pack-objects running on a single thread without any delta reuse should
generate a deterministic pack. But there are some gotchas:

  1. It's stable only for a given Git version, and with a particular set
     of delta window/depth options. I wouldn't expect behavior to change
     much between versions, but it's not something that we try to
     guarantee.

  2. There is no way to pass pack-objects options down through
     git-bundle. So you'd have to either assemble the bundle yourself,
     or perhaps generate a stable on-disk pack state, and then generate
     the bundle. Perhaps something like:

       # make one single pack, with no reuse, using the default options
       git -c pack.threads=1 repack -adf

       # now we can make a bundle from that. We probably do not even
       # need to disable threads here, since we'd just be picking the
       # deltas from the on-disk file (assuming that you're including
       # all objects in the bundle)
       git bundle create - --all | sha1sum

  3. It will be really slow. We're throwing out all of the deltas and
     searching from scratch. And doing it single-threaded. I didn't time
     it, but I'd guess from past experience we're talking about hours to
     generate the bundle for something like linux.git.

So I think it's possible, but I doubt it's very ergonomic. You're
probably better off using some checksum over Git's logical model, rather
than the stored bytes. The obvious one is that a single Git commit hash
unambiguously represents the whole tree and all of history leading up to
it, because of the chains of hashes.

But that implies you trust Git's object hash algorithm. If you don't
trust sha1 (and don't want to try out the sha256 support), then you'd
have to design something else. Perhaps something like:

  # print all commits in topological order, with ties broken by
  # committer date, which should be stable. And then follow up with the
  # trees and blobs for each.
  git rev-list --topo-order --objects HEAD >objects

  # now print the contents of each object (preceded by its name, type,
  # and length, so there's no chance of weird prepending or appending
  # attacks). We cut off the path information from rev-list here, since
  # the ordered set of objects is all we care about.
  cut -d' ' -f1 objects |
  git cat-file --batch >content

  # and then take a hash over that content; this will be unambiguous.
  sha256sum <content

-Peff

^ permalink raw reply [flat|nested] 11+ messages in thread
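[Editorial sketch] The logical-content checksum described in the message above can be tried on a throwaway repository. The script below is only an illustration under stated assumptions: the temporary repository, the committer identity, and the helper name `logical_hash` are hypothetical and not part of the thread. It checks that the checksum stays the same when the on-disk packing is rewritten.

```shell
#!/bin/sh
# Sketch: the logical-content hash should not depend on how objects are
# stored on disk. The repository built here is purely illustrative.
set -eu

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"

# Two small commits so there is some history to walk.
for n in 1 2; do
    echo "revision $n" >file.txt
    git add file.txt
    git -c user.name=Example -c user.email=example@example.invalid \
        commit -q -m "commit $n"
done

# logical_hash is a hypothetical helper name wrapping the quoted pipeline:
# list every reachable object, drop the path column, dump each object's
# name, type, size and content, and hash the resulting stream.
logical_hash() {
    git rev-list --topo-order --objects HEAD |
        cut -d' ' -f1 |
        git cat-file --batch |
        sha256sum |
        cut -d' ' -f1
}

before=$(logical_hash)

# Rewrite the storage: one single pack, all deltas recomputed.
git -c pack.threads=1 repack -a -d -f -q

after=$(logical_hash)

echo "before repack: $before"
echo "after repack:  $after"
```

The two printed values are expected to match, because the pipeline reads object contents rather than pack bytes.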
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 5:15 ` Jeff King
@ 2025-03-13 13:36 ` Junio C Hamano
2025-03-13 20:16 ` Simon Josefsson
1 sibling, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2025-03-13 13:36 UTC (permalink / raw)
To: Jeff King; +Cc: Simon Josefsson, git

Jeff King <peff@peff.net> writes:

> .... But there are some gotchas:
>
>   1. It's stable only for a given Git version, and with a particular set
>   ...
>   2. There is no way to pass pack-objects options down through
>   ...
>   3. It will be really slow. We're throwing out all of the deltas and
>   ...

There also is 4.

  4. We do not control zlib, so even with the same Git binary, the
     zlib implementation that is dynamically linked to us is free to
     produce better compressed base object (or compressed delta).

3. is not a downside if the priority of the requestor is about
bit-for-bit reproducibility (iow, "no matter what the cost").

>   # print all commits in topological order, with ties broken by
>   # committer date, which should be stable. And then follow up with the
>   # trees and blobs for each.
>   git rev-list --topo-order --objects HEAD >objects
>
>   # now print the contents of each object (preceded by its name, type,
>   # and length, so there's no chance of weird prepending or appending
>   # attacks). We cut off the path information from rev-list here, since
>   # the ordered set of objects is all we care about.
>   cut -d' ' -f1 objects |
>   git cat-file --batch >content
>
>   # and then take a hash over that content; this will be unambiguous.
>   sha256sum <content

Gross but probably stable ;-)

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 5:15 ` Jeff King
2025-03-13 13:36 ` Junio C Hamano
@ 2025-03-13 20:16 ` Simon Josefsson
2025-03-13 21:07 ` Kyle Lippincott
2025-03-14 2:42 ` Jeff King
1 sibling, 2 replies; 11+ messages in thread
From: Simon Josefsson @ 2025-03-13 20:16 UTC (permalink / raw)
To: Jeff King; +Cc: git
[-- Attachment #1: Type: text/plain, Size: 2987 bytes --]

Jeff King <peff@peff.net> writes:

>   [now without threading]
>   $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
>   c897caf9c68d2c37d997d3973196886af3b0b46e  -
>
>   [and we can do it again. yay!]
>   $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
>   c897caf9c68d2c37d997d3973196886af3b0b46e  -

Those are the commands I use -- they don't lead to the same hash in two
different 'git clone's. I tried running 'git clone' with the same '-c
pack.threads=1' but it made no difference.

>   2. There is no way to pass pack-objects options down through
>      git-bundle. So you'd have to either assemble the bundle yourself,
>      or perhaps generate a stable on-disk pack state, and then generate
>      the bundle. Perhaps something like:
>
>        # make one single pack, with no reuse, using the default options
>        git -c pack.threads=1 repack -adf

Yay! You may have solved this for me. I have to verify this a bit
more, but this looks promising (these are two different git clones):

jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
jas@kaka:~/t/gnulib-2$

> So I think it's possible, but I doubt it's very ergonomic. You're
> probably better off using some checksum over Git's logical model, rather
> than the stored bytes. The obvious one is that a single Git commit hash
> unambiguously represents the whole tree and all of history leading up to
> it, because of the chains of hashes.
>
> But that implies you trust Git's object hash algorithm.

Right -- I think anything but bit-by-bit identical files is going to be
too complex to verify.

>   # print all commits in topological order, with ties broken by
>   # committer date, which should be stable. And then follow up with the
>   # trees and blobs for each.
>   git rev-list --topo-order --objects HEAD >objects
>
>   # now print the contents of each object (preceded by its name, type,
>   # and length, so there's no chance of weird prepending or appending
>   # attacks). We cut off the path information from rev-list here, since
>   # the ordered set of objects is all we care about.
>   cut -d' ' -f1 objects |
>   git cat-file --batch >content
>
>   # and then take a hash over that content; this will be unambiguous.
>   sha256sum <content

How to read this output? Could this be made git bundle compatible?

But if the above solves it, this part isn't necessary.

/Simon
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 1251 bytes --]

^ permalink raw reply [flat|nested] 11+ messages in thread
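[Editorial sketch] The two-clone check from the message above can be scripted end to end. This is a minimal sketch, assuming a small local stand-in for the upstream repository (the thread itself uses gnulib) and a hypothetical helper named `make_bundle`; it does not show what happens across different Git or zlib versions.

```shell
#!/bin/sh
# Sketch: clone twice, normalise each clone with a single-threaded full
# repack, bundle all refs, and compare the bundle checksums.
set -eu

work=$(mktemp -d)
cd "$work"

# Stand-in for the upstream repository (hypothetical content).
git init -q upstream
(
    cd upstream
    for n in 1 2 3; do
        echo "revision $n" >file.txt
        git add file.txt
        git -c user.name=Example -c user.email=example@example.invalid \
            commit -q -m "commit $n"
    done
)

# make_bundle is a hypothetical helper: fresh clone into directory $1,
# rewrite its storage as one pack with no delta reuse, bundle every ref,
# and print the SHA-256 of the resulting bundle file.
make_bundle() {
    git clone -q --no-local upstream "$1"
    git -C "$1" -c pack.threads=1 repack -a -d -f -q
    git -C "$1" -c pack.threads=1 bundle create -q "$work/$1.bundle" --all
    sha256sum <"$work/$1.bundle" | cut -d' ' -f1
}

first=$(make_bundle clone-1)
second=$(make_bundle clone-2)

echo "clone-1: $first"
echo "clone-2: $second"
```

If the two printed checksums differ, the recipe is not reproducible in that environment; matching values only show agreement for the Git build used to run the script.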
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 20:16 ` Simon Josefsson
@ 2025-03-13 21:07 ` Kyle Lippincott
2025-03-13 22:09 ` Junio C Hamano
2025-03-14 2:42 ` Jeff King
1 sibling, 1 reply; 11+ messages in thread
From: Kyle Lippincott @ 2025-03-13 21:07 UTC (permalink / raw)
To: Simon Josefsson; +Cc: Jeff King, git

On Thu, Mar 13, 2025 at 1:18 PM Simon Josefsson <simon@josefsson.org> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> >   [now without threading]
> >   $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
> >   c897caf9c68d2c37d997d3973196886af3b0b46e  -
> >
> >   [and we can do it again. yay!]
> >   $ git -c pack.threads=1 bundle create --no-progress - HEAD | sha1sum
> >   c897caf9c68d2c37d997d3973196886af3b0b46e  -
>
> That's the commands I use -- it doesn't lead to the same hash in two
> different 'git clone's. I tried running 'git clone' with the same '-c
> pack.threads=1' but it made no difference.
>
> >   2. There is no way to pass pack-objects options down through
> >      git-bundle. So you'd have to either assemble the bundle yourself,
> >      or perhaps generate a stable on-disk pack state, and then generate
> >      the bundle. Perhaps something like:
> >
> >        # make one single pack, with no reuse, using the default options
> >        git -c pack.threads=1 repack -adf
>
> Yay! You may have solved this for me. I have to verify this a bit
> more, but this looks promising (these are two different git clones):
>
> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
> jas@kaka:~/t/gnulib-2$
>
> > So I think it's possible, but I doubt it's very ergonomic. You're
> > probably better off using some checksum over Git's logical model, rather
> > than the stored bytes. The obvious one is that a single Git commit hash
> > unambiguously represents the whole tree and all of history leading up to
> > it, because of the chains of hashes.
> >
> > But that implies you trust Git's object hash algorithm.
>
> Right -- I think anything but bit-by-bit identical files is going to be
> too complex to verify.

I'm curious what specific attacks you're trying to catch here. Because
to get into a situation where you unbundle the bundle and have the
same commit hash but different contents, you would need to have a
collision in the SHA-1 hash for some object (or SHA-256 hash if the
repo is using that). If you're also providing the instructions (or
even just the commit hash and server to clone from, and linking to
instructions maintained elsewhere) to validate the bundle is
legitimate, it seems MUCH easier to just replace those validation
instructions to point to a commit/server that has already been
backdoored than it would be to generate a SHA-1 collision that would
go undetected.

>
> >   # print all commits in topological order, with ties broken by
> >   # committer date, which should be stable. And then follow up with the
> >   # trees and blobs for each.
> >   git rev-list --topo-order --objects HEAD >objects
> >
> >   # now print the contents of each object (preceded by its name, type,
> >   # and length, so there's no chance of weird prepending or appending
> >   # attacks). We cut off the path information from rev-list here, since
> >   # the ordered set of objects is all we care about.
> >   cut -d' ' -f1 objects |
> >   git cat-file --batch >content
> >
> >   # and then take a hash over that content; this will be unambiguous.
> >   sha256sum <content
>
> How to read this output? Could this be made git bundle compatible?
>
> But if the above is solves it, this part isn't necessary.
>
> /Simon

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 21:07 ` Kyle Lippincott
@ 2025-03-13 22:09 ` Junio C Hamano
0 siblings, 0 replies; 11+ messages in thread
From: Junio C Hamano @ 2025-03-13 22:09 UTC (permalink / raw)
To: Kyle Lippincott; +Cc: Simon Josefsson, Jeff King, git

Kyle Lippincott <spectral@google.com> writes:

>> > But that implies you trust Git's object hash algorithm.
>>
>> Right -- I think anything but bit-by-bit identical files is going to be
>> too complex to verify.
>
> I'm curious what specific attacks you're trying to catch here. Because
> to get into a situation where you unbundle the bundle and have the
> same commit hash but different contents, you would need to have a
> collision in the SHA-1 hash for some object (or SHA-256 hash if the
> repo is using that). If you're also providing the instructions (or
> even just the commit hash and server to clone from, and linking to
> instructions maintained elsewhere) to validate the bundle is
> legitimate, it seems MUCH easier to just replace those validation
> instructions to point to a commit/server that has already been
> backdoored than it would be to generate a SHA-1 collision that would
> go undetected.

I think there are two levels of "verify" involved in this
discussion.

There are those who want to trust bundles and place enough trust in
whoever created that bundle. They are happy as long as the bitstream
they received the first time does not change when they ask for it
the second time, because they at least know that the same input
would result in the same output. To them, an "attack" is what changes
the bitstream while they are looking the other way. They do not like
the fact that there can be more than one representation of the same
thing for this reason.

Then there are those who know Git enough to know that they do not
need to trust the middleman who creates bundle files, and they do not
need to trust the exact bitstream that is contained within these
bundle files. They can extract the bundle to verify the tip commits of
the history (by comparing their object names with published hashes,
by verifying the embedded signatures, etc.), which is what ensures
integrity in Merkle tree based systems like history stored in Git.

The latter folks may worry about the "attacks" you mention here, but
the former may not necessarily do so.

^ permalink raw reply [flat|nested] 11+ messages in thread
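[Editorial sketch] The second kind of verification described above, checking the tip commits rather than the bundle bytes, might look like the following. The repository, the file paths, and the idea that the expected object name was published out of band are assumptions made for illustration; signature verification is left out.

```shell
#!/bin/sh
# Sketch: verify a bundle by the object name of its tip commit instead of
# by a checksum over the bundle file. Everything here is illustrative.
set -eu

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
echo hello >file.txt
git add file.txt
git -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m initial

# Assumption: the publisher announces this object name out of band.
expected=$(git rev-parse HEAD)

bundle="$work/repo.bundle"
git bundle create -q "$bundle" HEAD

# Recipient side, step 1: read the tip the bundle header advertises.
advertised=$(git bundle list-heads "$bundle" HEAD | cut -d' ' -f1)

# Recipient side, step 2: fetch into an empty repository. Fetching
# indexes the pack, so object names are recomputed from the content.
git init -q --bare "$work/check.git"
git -C "$work/check.git" fetch -q "$bundle" HEAD
fetched=$(git -C "$work/check.git" rev-parse FETCH_HEAD)

echo "expected:   $expected"
echo "advertised: $advertised"
echo "fetched:    $fetched"
```

All three values are expected to agree; this check relies on trusting Git's object hash, which is the trade-off discussed in the thread.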
* Re: Making bit-by-bit reproducible Git Bundles?
2025-03-13 20:16 ` Simon Josefsson
2025-03-13 21:07 ` Kyle Lippincott
@ 2025-03-14 2:42 ` Jeff King
2025-03-14 22:24 ` rsbecker
1 sibling, 1 reply; 11+ messages in thread
From: Jeff King @ 2025-03-14 2:42 UTC (permalink / raw)
To: Simon Josefsson; +Cc: git

On Thu, Mar 13, 2025 at 09:16:34PM +0100, Simon Josefsson wrote:

> >   2. There is no way to pass pack-objects options down through
> >      git-bundle. So you'd have to either assemble the bundle yourself,
> >      or perhaps generate a stable on-disk pack state, and then generate
> >      the bundle. Perhaps something like:
> >
> >        # make one single pack, with no reuse, using the default options
> >        git -c pack.threads=1 repack -adf
>
> Yay! You may have solved this for me. I have to verify this a bit
> more, but this looks promising (these are two different git clones):
>
> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
> jas@kaka:~/t/gnulib-2$

One thing to watch out for here: that repack is going to look at _all_
objects in the repository. So you will get different output if you make
a bundle of a tag "v1.0" today than you would get later, when "v1.1"
also exists. Ditto for any other activity in the repository, like writes
to unrelated branches, or even reflog entries.

So you'd probably want to make an absolute minimal repository with the
reachable objects, perhaps like:

  git clone --bare --no-local --single-branch -b v1.0 . just-v1.0.git
  cd just-v1.0.git
  git -c pack.threads=1 repack -adf

It doesn't have to be just one ref, of course; you might want to
snapshot the whole set of refs at the time you make the bundle. E.g., by
fetching into the empty repo using a refspec.

This would all be a non-issue if you could ask git-bundle to directly
pass the equivalent of "-f" to pack-objects (at that layer it is called
"--no-reuse-delta"). Since then it would be computing the full set of
objects itself. But without a patch to Git, I don't think there's a way
to do that.

The bundle format is pretty simple, so you _could_ hack around it
yourself, like:

  # list refs we care about; you can pick whatever subset you want
  # here.
  git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs

  {
    # bundle header plus list of refs, plus blank line terminator
    echo "# v2 git bundle"
    cat refs
    echo

    # and now the pack. We just need to feed it the object ids for
    # all of the refs. It will handle sorting and de-duping for us.
    cut -d' ' -f1 <refs |
    git -c pack.threads=1 pack-objects \
      --stdout --revs --delta-base-offset --no-reuse-delta
  } >foo.bundle

I dunno if that is more or less gross than teaching git-bundle to pass
--no-reuse-delta itself. It's certainly more intimate with the details,
but OTOH it is less likely to change in other versions of Git (e.g., if
we started making "v3" bundles by default).

> >   # print all commits in topological order, with ties broken by
> >   # committer date, which should be stable. And then follow up with the
> >   # trees and blobs for each.
> >   git rev-list --topo-order --objects HEAD >objects
> >
> >   # now print the contents of each object (preceded by its name, type,
> >   # and length, so there's no chance of weird prepending or appending
> >   # attacks). We cut off the path information from rev-list here, since
> >   # the ordered set of objects is all we care about.
> >   cut -d' ' -f1 objects |
> >   git cat-file --batch >content
> >
> >   # and then take a hash over that content; this will be unambiguous.
> >   sha256sum <content
>
> How to read this output? Could this be made git bundle compatible?

You'd have to compare the result of doing that after fetching from the
bundle into an empty repo. I don't think there's a great way to operate
directly on the bundle packfile (it has to be indexed first to see
what's in it).

The closest I could get is:

  input=foo.bundle

  # split the bundle into header and packfile sections on the first
  # blank line
  sed '/^$/q' <$input >header
  size=$(stat --format=%s header)
  tail -c +$((size+1)) <$input >bundle.pack

  # we can first do a byte-level comparison of the header; if this isn't
  # the same, the bundles do not match.
  sha256sum <header

  # now index the pack, so we know what's in it; this makes bundle.idx
  git index-pack -v bundle.pack

  # and now we want to dump the full logical contents (not the
  # delta-compressed versions) of each object. First we need a list of
  # the objects. This will come out in lexical order of object id, which
  # is good for us since it will be stable.
  git show-index <bundle.idx | awk '{print $2}' >objects

  # unfortunately here things break down. There is no command to read
  # the data directly out of the pack/idx pair without a repository
  # (even though it could be done technically). So we hack around it
  # with a temp repo.
  git init --bare tmp.git
  mv bundle.idx bundle.pack tmp.git/objects/pack/
  git -C tmp.git cat-file --batch <objects | sha256sum

So...also kind of gross. And not really all that different than what:

  git init --bare tmp.git
  cd tmp.git
  git fetch ../foo.bundle refs/*:refs/*

would do (you end up with the same pack/idx pair).

So I dunno. I guess it depends how many and which Git commands you're
willing to trust. ;)

-Peff

^ permalink raw reply [flat|nested] 11+ messages in thread
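[Editorial sketch] The hand-assembled bundle from the message above can be exercised on a small repository to see whether two builds come out byte-identical and whether Git accepts the result. This is a sketch under assumptions: the throwaway repository, the file names under the temporary directory, and the helper name `build_bundle` are hypothetical, and the outcome is only meaningful for the Git and zlib versions that run it.

```shell
#!/bin/sh
# Sketch: assemble a v2 bundle by hand with pack-objects --no-reuse-delta,
# build it twice, and confirm that Git can fetch from it.
set -eu

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
for n in 1 2; do
    echo "revision $n" >file.txt
    git add file.txt
    git -c user.name=Example -c user.email=example@example.invalid \
        commit -q -m "commit $n"
done

# build_bundle is a hypothetical helper wrapping the quoted steps:
# header line, ref list, blank line, then a freshly computed pack.
build_bundle() {
    git for-each-ref --format='%(objectname) %(refname)' refs/heads/ \
        >"$work/refs"
    {
        echo "# v2 git bundle"
        cat "$work/refs"
        echo
        cut -d' ' -f1 <"$work/refs" |
            git -c pack.threads=1 pack-objects -q \
                --stdout --revs --delta-base-offset --no-reuse-delta
    } >"$1"
}

build_bundle "$work/a.bundle"
build_bundle "$work/b.bundle"
sum_a=$(sha256sum <"$work/a.bundle" | cut -d' ' -f1)
sum_b=$(sha256sum <"$work/b.bundle" | cut -d' ' -f1)

# Check that Git accepts the hand-made file by fetching its branches
# into an empty bare repository and comparing the ref lists.
git init -q --bare "$work/check.git"
git -C "$work/check.git" fetch -q "$work/a.bundle" 'refs/heads/*:refs/heads/*'
src_refs=$(cat "$work/refs")
dst_refs=$(git -C "$work/check.git" for-each-ref \
    --format='%(objectname) %(refname)' refs/heads/)

echo "a.bundle: $sum_a"
echo "b.bundle: $sum_b"
```

Because the pack is computed directly from the listed refs, unrelated objects or reflog entries in the repository should not influence the output, which is the property the message is after.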
* RE: Making bit-by-bit reproducible Git Bundles? 2025-03-14 2:42 ` Jeff King @ 2025-03-14 22:24 ` rsbecker 0 siblings, 0 replies; 11+ messages in thread From: rsbecker @ 2025-03-14 22:24 UTC (permalink / raw) To: 'Jeff King', 'Simon Josefsson'; +Cc: git On March 13, 2025 10:42 PM, Jeff King wrote: >On Thu, Mar 13, 2025 at 09:16:34PM +0100, Simon Josefsson wrote: > >> > 2. There is no way to pass pack-objects options down through >> > git-bundle. So you'd have to either assemble the bundle yourself, >> > or perhaps generate a stable on-disk pack state, and then generate >> > the bundle. Perhaps something like: >> > >> > # make one single pack, with no reuse, using the default options >> > git -c pack.threads=1 repack -adf >> >> Yay! You may have solved this for me. I have to verify this a bit >> more, but this looks promising (these are two different git clones): >> >> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf >> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create >> gnulib.bundle --all jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle >> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 >> gnulib.bundle jas@kaka:~/t/gnulib-1$ cd ../gnulib-2 >> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf >> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create >> gnulib.bundle --all jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle >> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890 >> gnulib.bundle jas@kaka:~/t/gnulib-2$ > >One thing to watch out for here: that repack is going to look at _all_ objects in the >repository. So you will get different output if you make a bundle of a tag "v1.0" >today than you would get later, when "v1.1" >also exists. Ditto for any other activity in the repository, like writes to unrelated >branches, or even reflog entries. 
>
>So you'd probably want to make an absolute minimal repository with the
>reachable objects, perhaps like:
>
>  git clone --bare --no-local --single-branch -b v1.0 . just-v1.0.git
>  cd just-v1.0.git
>  git -c pack.threads=1 repack -adf
>
>It doesn't have to be just one ref, of course; you might want to
>snapshot the whole set of refs at the time you make the bundle. E.g.,
>by fetching into the empty repo using a refspec.
>
>This would all be a non-issue if you could ask git-bundle to directly
>pass the equivalent of "-f" to pack-objects (at that layer it is called
>"--no-reuse-delta"). Since then it would be computing the full set of
>objects itself. But without a patch to Git, I don't think there's a way
>to do that.
>
>The bundle format is pretty simple, so you _could_ hack around it
>yourself, like:
>
>  # list refs we care about; you can pick whatever subset you want
>  # here.
>  git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs
>
>  {
>    # bundle header plus list of refs, plus blank line terminator
>    echo "# v2 git bundle"
>    cat refs
>    echo
>
>    # and now the pack. We just need to feed it the object ids for
>    # all of the refs. It will handle sorting and de-duping for us.
>    cut -d' ' -f1 <refs |
>    git -c pack.threads=1 pack-objects \
>      --stdout --revs --delta-base-offset --no-reuse-delta
>  } >foo.bundle
>
>I dunno if that is more or less gross than teaching git-bundle to pass
>--no-reuse-delta itself. It's certainly more intimate with the details,
>but OTOH it is less likely to change in other versions of Git (e.g., if
>we started making "v3" bundles by default).
>
>> >     # print all commits in topological order, with ties broken by
>> >     # committer date, which should be stable. And then follow up with the
>> >     # trees and blobs for each.
>> >     git rev-list --topo-order --objects HEAD >objects
>> >
>> >     # now print the contents of each object (preceded by its name, type,
>> >     # and length, so there's no chance of weird prepending or appending
>> >     # attacks). We cut off the path information from rev-list here, since
>> >     # the ordered set of objects is all we care about.
>> >     cut -d' ' -f1 objects |
>> >     git cat-file --batch >content
>> >
>> >     # and then take a hash over that content; this will be unambiguous.
>> >     sha256sum <content
>>
>> How to read this output?  Could this be made git bundle compatible?
>
>You'd have to compare the result of doing that after fetching from the
>bundle into an empty repo. I don't think there's a great way to operate
>directly on the bundle packfile (it has to be indexed first to see
>what's in it).
>
>The closest I could get is:
>
>  input=foo.bundle
>
>  # split the bundle into header and packfile sections on the first
>  # blank line
>  sed '/^$/q' <$input >header
>  size=$(stat --format=%s header)
>  tail -c +$((size+1)) <$input >bundle.pack
>
>  # we can first do a byte-level comparison of the header; if this isn't
>  # the same, the bundles do not match.
>  sha256sum <header
>
>  # now index the pack, so we know what's in it; this makes bundle.idx
>  git index-pack -v bundle.pack
>
>  # and now we want to dump the full logical contents (not the
>  # delta-compressed versions) of each object. First we need a list of
>  # the objects. This will come out in lexical order of object id, which
>  # is good for us since it will be stable.
>  git show-index <bundle.idx | awk '{print $2}' >objects
>
>  # unfortunately here things break down. There is no command to read
>  # the data directly out of the pack/idx pair without a repository
>  # (even though it could be done technically). So we hack around it
>  # with a temp repo.
>  git init --bare tmp.git
>  mv bundle.idx bundle.pack tmp.git/objects/pack/
>  git -C tmp.git cat-file --batch <objects | sha256sum
>
>So...also kind of gross.
>And not really all that different than what:
>
>  git init --bare tmp.git
>  cd tmp.git
>  git fetch ../foo.bundle refs/*:refs/*
>
>would do (you end up with the same pack/idx pair). So I dunno. I guess
>it depends how many and which Git commands you're willing to trust. ;)

I would go one step further on this: use --depth=1 and potentially a
--sparse checkout with only what you specifically need to verify.

However, Junio's point about checking the end-point commits and tags is
useful and significant. Verifying through signing that the Merkle tree
itself is intact and unmodified is usually sufficient, and more reliable
than a bit-for-bit comparison, which may depend on the underlying
operating system, particularly if the originating directory inode
contents differ from the destination. An example is using a Windows
server for the upstream and a NonStop server for the clone (not so much
with Linux vs. NonStop). It is pretty much guaranteed that the inodes
will be different.

--Randall

^ permalink raw reply [flat|nested] 11+ messages in thread
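The end-point check suggested above could look something like the sketch below. None of this was posted in the thread: bundle_tips, check_bundle_tips and verify_signed_tags are hypothetical helper names, the expected-tips file is assumed to come from a source you already trust (for example for-each-ref output with the format '%(objectname) %(refname)' from a trusted clone), and verify_signed_tags assumes the bundle has already been fetched into a repository and that the signers' keys are in your keyring.

```shell
#!/bin/sh
# Sketch only: bundle_tips, check_bundle_tips and verify_signed_tags are
# hypothetical helper names, not Git commands.

# bundle_tips BUNDLE
# Print the "<oid> <refname>" lines from the bundle header, skipping the
# "# v2 git bundle" signature, "@capability" lines (v3 bundles) and
# "-<oid>" prerequisite lines.
bundle_tips() {
    sed '/^$/q' <"$1" | awk '/^[0-9a-f]+ / { print }'
}

# check_bundle_tips BUNDLE EXPECTED
# Succeed only if the bundle advertises exactly the tips listed in
# EXPECTED (same "<oid> <refname>" format, in any order).
check_bundle_tips() {
    actual=$(bundle_tips "$1" | sort) || return 1
    expected=$(sort <"$2") || return 1
    [ "$actual" = "$expected" ]
}

# verify_signed_tags REPO
# Run "git verify-tag" on every annotated tag in REPO and report the
# ones whose signature does not verify.
verify_signed_tags() {
    status=0
    for ref in $(git -C "$1" for-each-ref \
                     --format='%(objecttype) %(refname)' refs/tags |
                 awk '$1 == "tag" { print $2 }'); do
        if ! git -C "$1" verify-tag "$ref"; then
            echo "not verified: $ref" >&2
            status=1
        fi
    done
    return "$status"
}
```

Since every object id is a hash over the object's content and the ids it references, matching tips plus a successful fetch (which checks the pack) cover the whole history behind them without requiring the bundle bytes to be identical.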
end of thread, other threads:[~2025-03-14 22:26 UTC]

Thread overview: 11+ messages
2025-03-12 11:40 Making bit-by-bit reproducible Git Bundles? Simon Josefsson
2025-03-12 16:02 ` Junio C Hamano
2025-03-13 3:09 ` Kyle Lippincott
2025-03-13 7:59 ` Simon Josefsson
2025-03-13 5:15 ` Jeff King
2025-03-13 13:36 ` Junio C Hamano
2025-03-13 20:16 ` Simon Josefsson
2025-03-13 21:07 ` Kyle Lippincott
2025-03-13 22:09 ` Junio C Hamano
2025-03-14 2:42 ` Jeff King
2025-03-14 22:24 ` rsbecker