From: <rsbecker@nexbridge.com>
To: "'Jeff King'" <peff@peff.net>, "'Simon Josefsson'" <simon@josefsson.org>
Cc: <git@vger.kernel.org>
Subject: RE: Making bit-by-bit reproducible Git Bundles?
Date: Fri, 14 Mar 2025 18:24:53 -0400
Message-ID: <011101db952f$ebcffe80$c36ffb80$@nexbridge.com>
In-Reply-To: <20250314024218.GA114103@coredump.intra.peff.net>

On March 13, 2025 10:42 PM, Jeff King wrote:
>On Thu, Mar 13, 2025 at 09:16:34PM +0100, Simon Josefsson wrote:
>
>> > 2. There is no way to pass pack-objects options down through
>> > git-bundle. So you'd have to either assemble the bundle yourself,
>> > or perhaps generate a stable on-disk pack state, and then generate
>> > the bundle. Perhaps something like:
>> >
>> > # make one single pack, with no reuse, using the default options
>> > git -c pack.threads=1 repack -adf
>>
>> Yay! You may have solved this for me. I have to verify this a bit
>> more, but this looks promising (these are two different git clones):
>>
>> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
>> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
>> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
>> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
>> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
>> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
>> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
>> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
>> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
>> jas@kaka:~/t/gnulib-2$
>
>One thing to watch out for here: that repack is going to look at _all_ objects in the
>repository. So you will get different output if you make a bundle of a tag "v1.0"
>today than you would get later, when "v1.1"
>also exists. Ditto for any other activity in the repository, like writes to unrelated
>branches, or even reflog entries.
>
>So you'd probably want to make an absolute minimal repository with the reachable
>objects, perhaps like:
>
> git clone --bare --no-local --single-branch -b v1.0 . just-v1.0.git
> cd just-v1.0.git
> git -c pack.threads=1 repack -adf
>
>It doesn't have to be just one ref, of course; you might want to snapshot the whole
>set of refs at the time you make the bundle. E.g., by fetching into the empty repo
>using a refspec.
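
A rough sketch of the refspec approach, in case it is useful. The function
name snapshot_refs and both paths are made up for illustration, and the
source path has to be absolute because the fetch runs from inside the new
repository. The refspecs shown (all branches and tags) are only one possible
choice.

```shell
# Sketch only: snapshot_refs and its arguments are hypothetical names.
# Copies every branch and tag from an existing repository into a fresh
# bare repository, then repacks it into one single-threaded pack.
snapshot_refs() {
    src=$1    # absolute path to the existing repository
    dest=$2   # path of the new bare repository to create

    git init --quiet --bare "$dest" &&
    git -C "$dest" fetch --quiet "$src" \
        'refs/heads/*:refs/heads/*' 'refs/tags/*:refs/tags/*' &&
    git -C "$dest" -c pack.threads=1 repack -adf -q
}

# Example call (paths are placeholders):
#   snapshot_refs /path/to/gnulib /path/to/snapshot.git
```

Only objects reachable from the fetched refs end up in the new repository,
so unrelated branches and reflog entries in the source do not affect the pack.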
>
>This would all be a non-issue if you could ask git-bundle to directly pass the
>equivalent of "-f" to pack-objects (at that layer it is called "--no-reuse-delta"). Since
>then it would be computing the full set of objects itself. But without a patch to Git, I
>don't think there's a way to do that.
>
>The bundle format is pretty simple, so you _could_ hack around it yourself, like:
>
> # list refs we care about; you can pick whatever subset you want
> # here.
> git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs
>
> {
> # bundle header plus list of refs, plus blank line terminator
> echo "# v2 git bundle"
> cat refs
> echo
>
> # and now the pack. We just need to feed it the object ids for
> # all of the refs. It will handle sorting and de-duping for us.
> cut -d' ' -f1 <refs |
> git -c pack.threads=1 pack-objects \
> --stdout --revs --delta-base-offset --no-reuse-delta
> } >foo.bundle
>
>I dunno if that is more or less gross than teaching git-bundle to pass --no-reuse-
>delta itself. It's certainly more intimate with the details, but OTOH it is less likely to
>change in other versions of Git (e.g., if we started making "v3" bundles by default).
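
For anyone wanting to try this, the recipe above can be wrapped in a function
so that two runs are easy to compare. This is only a sketch: make_bundle is a
name I made up, it assumes a SHA-1 repository (a "v2" header), and it only
bundles refs/heads/ as in the quoted example.

```shell
# Sketch only: make_bundle is a hypothetical wrapper around the quoted
# recipe. It writes a v2 bundle of all branches in "repo" to "out".
make_bundle() {
    repo=$1
    out=$2

    # "<object id> <refname>" for every branch we want in the bundle
    refs=$(git -C "$repo" for-each-ref \
        --format='%(objectname) %(refname)' refs/heads/) || return

    {
        # bundle header, list of refs, then the blank line terminator
        printf '# v2 git bundle\n%s\n\n' "$refs"

        # the pack itself, built single-threaded with no delta reuse
        printf '%s\n' "$refs" | cut -d' ' -f1 |
        git -C "$repo" -c pack.threads=1 pack-objects -q \
            --stdout --revs --delta-base-offset --no-reuse-delta
    } >"$out"
}

# Example check (paths are placeholders):
#   make_bundle /path/to/repo /tmp/a.bundle
#   make_bundle /path/to/repo /tmp/b.bundle
#   cmp /tmp/a.bundle /tmp/b.bundle && echo identical
```

Running "git bundle verify" on the result from inside a repository is a
cheap sanity check that the header was assembled correctly.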
>
>> > # print all commits in topological order, with ties broken by
>> > # committer date, which should be stable. And then follow up with the
>> > # trees and blobs for each.
>> > git rev-list --topo-order --objects HEAD >objects
>> >
>> > # now print the contents of each object (preceded by its name, type,
>> > # and length, so there's no chance of weird prepending or appending
>> > # attacks). We cut off the path information from rev-list here, since
>> > # the ordered set of objects is all we care about.
>> > cut -d' ' -f1 objects |
>> > git cat-file --batch >content
>> >
>> > # and then take a hash over that content; this will be unambiguous.
>> > sha256sum <content
>>
>> How to read this output? Could this be made git bundle compatible?
>
>You'd have to compare the result of doing that after fetching from the bundle into
>an empty repo. I don't think there's a great way to operate directly on the bundle
>packfile (it has to be indexed first to see what's in it).
>
>The closest I could get is:
>
> input=foo.bundle
>
> # split the bundle into header and packfile sections on the first
> # blank line
> sed '/^$/q' <$input >header
> size=$(stat --format=%s header)
> tail -c +$((size+1)) <$input >bundle.pack
>
> # we can first do a byte-level comparison of the header; if this isn't
> # the same, the bundles do not match.
> sha256sum <header
>
> # now index the pack, so we know what's in it; this makes bundle.idx
> git index-pack -v bundle.pack
>
> # and now we want to dump the full logical contents (not the
> # delta-compressed versions) of each object. First we need a list of
> # the objects. This will come out in lexical order of object id, which
> # is good for us since it will be stable.
> git show-index <bundle.idx | awk '{print $2}' >objects
>
> # unfortunately here things break down. There is no command to read
> # the data directly out of the pack/idx pair without a repository
> # (even though it could be done technically). So we hack around it
> # with a temp repo.
> git init --bare tmp.git
> mv bundle.idx bundle.pack tmp.git/objects/pack/
> git -C tmp.git cat-file --batch <objects | sha256sum
>
>So...also kind of gross. And not really all that different than what:
>
> git init --bare tmp.git
> cd tmp.git
> git fetch ../foo.bundle refs/*:refs/*
>
>would do (you end up with the same pack/idx pair). So I dunno. I guess it depends
>how many and which Git commands you're willing to trust. ;)
I would go one step further on this: use --depth=1, and potentially a --sparse
checkout, so that you bring in only what you specifically need to verify.

However, Junio's point about checking the end-point commits and tags is useful and
significant. Verifying through signing that the Merkle tree itself is intact and
unmodified is usually sufficient, and it is more reliable than a bit-for-bit comparison,
which may depend on the underlying operating system, particularly if the originating
directory inode contents differ from the destination. An example is using a Windows
server for the upstream and a NonStop server for the clone (not so much with Linux vs.
NonStop). It is pretty much guaranteed that the inodes will be different.
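
As a sketch of what checking the end-point commits and tags might look like,
assuming they are signed and the signing keys are already in your keyring:
verify_endpoints is a name I made up, and the ref names in the example call
are placeholders.

```shell
# Sketch only: verify_endpoints is a hypothetical helper. It checks the
# signature on each named ref, using verify-tag for annotated tags and
# verify-commit for commits, and fails on the first one that does not pass.
verify_endpoints() {
    repo=$1
    shift

    for ref in "$@"; do
        type=$(git -C "$repo" cat-file -t "$ref") || return 1
        case $type in
            tag)    git -C "$repo" verify-tag "$ref" ;;
            commit) git -C "$repo" verify-commit "$ref" ;;
            *)      echo "cannot verify $ref: $type object" >&2; false ;;
        esac || return 1
    done
}

# Example call (repository path and ref names are placeholders):
#   verify_endpoints /path/to/clone v1.0 refs/heads/main
```

Running "git fsck" in the same repository re-hashes the objects, so between
that and the signature checks the end points vouch for everything reachable
from them, regardless of how the pack bytes were laid out.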
--Randall