git.vger.kernel.org archive mirror
From: <rsbecker@nexbridge.com>
To: "'Jeff King'" <peff@peff.net>, "'Simon Josefsson'" <simon@josefsson.org>
Cc: <git@vger.kernel.org>
Subject: RE: Making bit-by-bit reproducible Git Bundles?
Date: Fri, 14 Mar 2025 18:24:53 -0400
Message-ID: <011101db952f$ebcffe80$c36ffb80$@nexbridge.com>
In-Reply-To: <20250314024218.GA114103@coredump.intra.peff.net>

On March 13, 2025 10:42 PM, Jeff King wrote:
>On Thu, Mar 13, 2025 at 09:16:34PM +0100, Simon Josefsson wrote:
>
>> >   2. There is no way to pass pack-objects options down through
>> >      git-bundle. So you'd have to either assemble the bundle yourself,
>> >      or perhaps generate a stable on-disk pack state, and then generate
>> >      the bundle. Perhaps something like:
>> >
>> >        # make one single pack, with no reuse, using the default options
>> >        git -c pack.threads=1 repack -adf
>>
>> Yay!  You may have solved this for me.  I have to verify this a bit
>> more, but this looks promising (these are two different git clones):
>>
>> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
>> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
>> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
>> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
>> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
>> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
>> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
>> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
>> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
>> jas@kaka:~/t/gnulib-2$
>
>One thing to watch out for here: that repack is going to look at _all_ objects in the
>repository. So you will get different output if you make a bundle of a tag "v1.0"
>today than you would get later, when "v1.1" also exists. Ditto for any other activity
>in the repository, like writes to unrelated branches, or even reflog entries.
>
>So you'd probably want to make an absolute minimal repository with the reachable
>objects, perhaps like:
>
>  git clone --bare --no-local --single-branch -b v1.0 . just-v1.0.git
>  cd just-v1.0.git
>  git -c pack.threads=1 repack -adf
>
>It doesn't have to be just one ref, of course; you might want to snapshot the whole
>set of refs at the time you make the bundle. E.g., by fetching into the empty repo
>using a refspec.
>
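
The refspec variant mentioned above might look roughly like the following. This is only a sketch: snapshot_refs is a hypothetical helper name, and the choice to copy all branches and tags (rather than some other subset) is an assumption.

```shell
# Hypothetical helper: copy all branches and tags of SRC into a fresh bare
# repository DST, then repack it into one single-threaded, no-reuse pack.
# The body runs in a subshell so its variables stay local.
snapshot_refs() (
    src=$(cd "$1" && pwd) || exit    # absolute path, since we use -C below
    dst=$2
    git init --quiet --bare "$dst" &&
    git -C "$dst" fetch --quiet --no-tags "$src" \
        'refs/heads/*:refs/heads/*' 'refs/tags/*:refs/tags/*' &&
    git -C "$dst" -c pack.threads=1 repack -adf -q
)

# usage: snapshot_refs . ../snapshot.git
```

Because the fetch only transfers objects reachable from the named refs, unrelated objects and reflog entries in the source repository stay behind.
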
>This would all be a non-issue if you could ask git-bundle to directly pass the
>equivalent of "-f" to pack-objects (at that layer it is called "--no-reuse-delta"), since
>then it would be computing the full set of objects itself. But without a patch to Git, I
>don't think there's a way to do that.
>
>The bundle format is pretty simple, so you _could_ hack around it yourself, like:
>
>  # list refs we care about; you can pick whatever subset you want
>  # here.
>  git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs
>
>  {
>	# bundle header plus list of refs, plus blank line terminator
>	echo "# v2 git bundle"
>	cat refs
>	echo
>
>	# and now the pack. We just need to feed it the object ids for
>	# all of the refs. It will handle sorting and de-duping for us.
>	cut -d' ' -f1 <refs |
>	git -c pack.threads=1 pack-objects \
>		--stdout --revs --delta-base-offset --no-reuse-delta
>  } >foo.bundle
>
>I dunno if that is more or less gross than teaching git-bundle to pass
>--no-reuse-delta itself. It's certainly more intimate with the details, but OTOH it is
>less likely to change in other versions of Git (e.g., if we started making "v3" bundles
>by default).
>
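
A hand-assembled bundle like the one above is worth a sanity check before it is hashed or published. A minimal sketch, where check_bundle is a hypothetical helper name and the bundle is assumed to have no prerequisites beyond what the given repository already holds:

```shell
# Hypothetical helper: confirm that Git accepts BUNDLE, then print the refs
# it advertises. A relative BUNDLE path is resolved inside REPO, because
# both commands run with -C.
check_bundle() (
    repo=$1 bundle=$2
    git -C "$repo" bundle verify "$bundle" >/dev/null 2>&1 &&
    git -C "$repo" bundle list-heads "$bundle"
)

# usage: check_bundle . "$PWD/foo.bundle"
```

The list-heads output should match the ref list that was written into the header.
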
>> >   # print all commits in topological order, with ties broken by
>> >   # committer date, which should be stable. And then follow up with the
>> >   # trees and blobs for each.
>> >   git rev-list --topo-order --objects HEAD >objects
>> >
>> >   # now print the contents of each object (preceded by its name, type,
>> >   # and length, so there's no chance of weird prepending or appending
>> >   # attacks). We cut off the path information from rev-list here, since
>> >   # the ordered set of objects is all we care about.
>> >   cut -d' ' -f1 objects |
>> >   git cat-file --batch >content
>> >
>> >   # and then take a hash over that content; this will be unambiguous.
>> >   sha256sum <content
>>
>> How to read this output?  Could this be made git bundle compatible?
>
>You'd have to compare the result of doing that after fetching from the bundle into
>an empty repo. I don't think there's a great way to operate directly on the bundle
>packfile (it has to be indexed first to see what's in it).
>
>The closest I could get is:
>
>  input=foo.bundle
>
>  # split the bundle into header and packfile sections on the first
>  # blank line
>  sed '/^$/q' <$input >header
>  size=$(stat --format=%s header)
>  tail -c +$((size+1)) <$input >bundle.pack
>
>  # we can first do a byte-level comparison of the header; if this isn't
>  # the same, the bundles do not match.
>  sha256sum <header
>
>  # now index the pack, so we know what's in it; this makes bundle.idx
>  git index-pack -v bundle.pack
>
>  # and now we want to dump the full logical contents (not the
>  # delta-compressed versions) of each object. First we need a list of
>  # the objects. This will come out in lexical order of object id, which
>  # is good for us since it will be stable.
>  git show-index <bundle.idx  | awk '{print $2}' >objects
>
>  # unfortunately here things break down. There is no command to read
>  # the data directly out of the pack/idx pair without a repository
>  # (even though it could be done technically). So we hack around it
>  # with a temp repo.
>  git init --bare tmp.git
>  mv bundle.idx bundle.pack tmp.git/objects/pack/
>  git -C tmp.git cat-file --batch <objects | sha256sum
>
>So...also kind of gross. And not really all that different than what:
>
>  git init --bare tmp.git
>  cd tmp.git
>  git fetch ../foo.bundle 'refs/*:refs/*'
>
>would do (you end up with the same pack/idx pair). So I dunno. I guess it depends
>how many and which Git commands you're willing to trust. ;)
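
Putting the two halves of the above together (fetch the bundle into an empty repository, then hash the logical object contents) could be sketched as follows. bundle_content_hash is a hypothetical name, sha256sum is assumed to be available as in the examples above, and the ordering relies on rev-list --topo-order being stable as described earlier in the thread.

```shell
# Hypothetical helper: print a SHA-256 over the full (undeltified) contents
# of every object reachable from the refs in BUNDLE.
bundle_content_hash() (
    bundle=$(cd "$(dirname "$1")" && pwd)/$(basename "$1") || exit
    tmp=$(mktemp -d) || exit
    trap 'rm -rf "$tmp"' EXIT    # the subshell cleans up its temp repo
    git init --quiet --bare "$tmp/verify.git" &&
    git -C "$tmp/verify.git" fetch --quiet "$bundle" 'refs/*:refs/*' &&
    git -C "$tmp/verify.git" rev-list --topo-order --objects --all |
    cut -d' ' -f1 |
    git -C "$tmp/verify.git" cat-file --batch |
    sha256sum | cut -d' ' -f1
)

# usage: bundle_content_hash foo.bundle
```

Two bundles that differ only in delta or compression choices should produce the same value here, even when their raw bytes differ.
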

I would go one step further on this: use --depth=1, and potentially a --sparse checkout,
with only what you specifically need to verify.
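
As a rough sketch of the --depth=1 idea (shallow_snapshot_hash is a hypothetical name; sha256sum is assumed as elsewhere in this thread): clone only the tip of one ref and hash the objects that arrive. The sketch leaves out --sparse, since a bare clone has no working tree to restrict.

```shell
# Hypothetical helper: shallow-clone only the tip of REF from SRC and print
# a SHA-256 over the contents of the objects that were transferred.
shallow_snapshot_hash() (
    src=$(cd "$1" && pwd) || exit
    ref=$2
    tmp=$(mktemp -d) || exit
    trap 'rm -rf "$tmp"' EXIT
    # --no-local forces the normal transport, so --depth is honored
    git clone --quiet --bare --no-local --depth=1 --single-branch \
        -b "$ref" "$src" "$tmp/shallow.git" &&
    git -C "$tmp/shallow.git" rev-list --objects HEAD |
    cut -d' ' -f1 |
    git -C "$tmp/shallow.git" cat-file --batch |
    sha256sum | cut -d' ' -f1
)

# usage: shallow_snapshot_hash . v1.0
```

The tip commit still names its parents by object id, so the value remains tied to the history even though only one commit is transferred.
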

However, Junio's point about checking end-point commits and tags is useful and
significant. Verifying through signing that the Merkle tree itself is intact and
unmodified is usually sufficient, and more reliable than a bit-for-bit comparison, which
may depend on the underlying operating system, particularly if the originating
directory inode contents differ from the destination. An example is using a Windows
server for the upstream and a NonStop server for the clone (not so much with Linux vs.
NonStop). It is pretty much guaranteed that the inodes will be different.
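
For the signature route, something along these lines could be run in the repository that received the bundle. It is a sketch only: unverified_tips is a hypothetical name, and it assumes the signers' public keys are already in the local keyring.

```shell
# Hypothetical helper: print every ref in REPO whose tip does not carry a
# signature that verifies (unsigned tips are reported as well).
unverified_tips() (
    repo=$1
    git -C "$repo" for-each-ref --format='%(objecttype) %(refname)' |
    while read -r type ref
    do
        case $type in
        tag)    git -C "$repo" verify-tag "$ref" ;;
        commit) git -C "$repo" verify-commit "$ref" ;;
        *)      false ;;
        esac >/dev/null 2>&1 </dev/null || echo "$ref"
    done
)

# usage: unverified_tips tmp.git
```

Empty output means every end-point verified; since each commit names its tree and parents by hash, that covers everything reachable from those tips.
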

--Randall


