From: <rsbecker@nexbridge.com>
To: "'Jeff King'" <peff@peff.net>, "'Simon Josefsson'" <simon@josefsson.org>
Cc: <git@vger.kernel.org>
Subject: RE: Making bit-by-bit reproducible Git Bundles?
Date: Fri, 14 Mar 2025 18:24:53 -0400
Message-ID: <011101db952f$ebcffe80$c36ffb80$@nexbridge.com>
In-Reply-To: <20250314024218.GA114103@coredump.intra.peff.net>

On March 13, 2025 10:42 PM, Jeff King wrote:
>On Thu, Mar 13, 2025 at 09:16:34PM +0100, Simon Josefsson wrote:
>
>> >   2. There is no way to pass pack-objects options down through
>> >      git-bundle. So you'd have to either assemble the bundle yourself,
>> >      or perhaps generate a stable on-disk pack state, and then generate
>> >      the bundle. Perhaps something like:
>> >
>> >        # make one single pack, with no reuse, using the default options
>> >        git -c pack.threads=1 repack -adf
>>
>> Yay!  You may have solved this for me.  I have to verify this a bit
>> more, but this looks promising (these are two different git clones):
>>
>> jas@kaka:~/t/gnulib-1$ git -c pack.threads=1 repack -adf
>> jas@kaka:~/t/gnulib-1$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
>> jas@kaka:~/t/gnulib-1$ sha256sum gnulib.bundle
>> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
>> jas@kaka:~/t/gnulib-1$ cd ../gnulib-2
>> jas@kaka:~/t/gnulib-2$ git -c pack.threads=1 repack -adf
>> jas@kaka:~/t/gnulib-2$ git -c 'pack.threads=1' bundle create gnulib.bundle --all
>> jas@kaka:~/t/gnulib-2$ sha256sum gnulib.bundle
>> c780bb07501cf016e702fbe3f52704b4f64edd6882c13c9be0f3f114c894e890  gnulib.bundle
>> jas@kaka:~/t/gnulib-2$
>
>One thing to watch out for here: that repack is going to look at _all_ objects in the
>repository. So you will get different output if you make a bundle of a tag "v1.0"
>today than you would get later, when "v1.1" also exists. Ditto for any other activity
>in the repository, like writes to unrelated branches, or even reflog entries.
>
>So you'd probably want to make an absolute minimal repository with the reachable
>objects, perhaps like:
>
>  git clone --bare --no-local --single-branch -b v1.0 . just-v1.0.git
>  cd just-v1.0.git
>  git -c pack.threads=1 repack -adf
>
>It doesn't have to be just one ref, of course; you might want to snapshot the whole
>set of refs at the time you make the bundle. E.g., by fetching into the empty repo
>using a refspec.
>
>This would all be a non-issue if you could ask git-bundle to directly pass the
>equivalent of "-f" to pack-objects (at that layer it is called "--no-reuse-delta"),
>since then it would be computing the full set of objects itself. But without a patch
>to Git, I don't think there's a way to do that.
>
>The bundle format is pretty simple, so you _could_ hack around it yourself, like:
>
>  # list refs we care about; you can pick whatever subset you want
>  # here.
>  git for-each-ref --format='%(objectname) %(refname)' refs/heads/ >refs
>
>  {
>	# bundle header plus list of refs, plus blank line terminator
>	echo "# v2 git bundle"
>	cat refs
>	echo
>
>	# and now the pack. We just need to feed it the object ids for
>	# all of the refs. It will handle sorting and de-duping for us.
>	cut -d' ' -f1 <refs |
>	git -c pack.threads=1 pack-objects \
>		--stdout --revs --delta-base-offset --no-reuse-delta
>  } >foo.bundle
>
>I dunno if that is more or less gross than teaching git-bundle to pass
>--no-reuse-delta itself. It's certainly more intimate with the details, but OTOH it is
>less likely to change in other versions of Git (e.g., if we started making "v3"
>bundles by default).
>
>> >   # print all commits in topological order, with ties broken by
>> >   # committer date, which should be stable. And then follow up with the
>> >   # trees and blobs for each.
>> >   git rev-list --topo-order --objects HEAD >objects
>> >
>> >   # now print the contents of each object (preceded by its name, type,
>> >   # and length, so there's no chance of weird prepending or appending
>> >   # attacks). We cut off the path information from rev-list here, since
>> >   # the ordered set of objects is all we care about.
>> >   cut -d' ' -f1 objects |
>> >   git cat-file --batch >content
>> >
>> >   # and then take a hash over that content; this will be unambiguous.
>> >   sha256sum <content
>>
>> How to read this output?  Could this be made git bundle compatible?
>
>You'd have to compare the result of doing that after fetching from the bundle into
>an empty repo. I don't think there's a great way to operate directly on the bundle
>packfile (it has to be indexed first to see what's in it).
>
>The closest I could get is:
>
>  input=foo.bundle
>
>  # split the bundle into header and packfile sections on the first
>  # blank line
>  sed '/^$/q' <"$input" >header
>  size=$(wc -c <header)
>  tail -c +$((size+1)) <"$input" >bundle.pack
>
>  # we can first do a byte-level comparison of the header; if this isn't
>  # the same, the bundles do not match.
>  sha256sum <header
>
>  # now index the pack, so we know what's in it; this makes bundle.idx
>  git index-pack -v bundle.pack
>
>  # and now we want to dump the full logical contents (not the
>  # delta-compressed versions) of each object. First we need a list of
>  # the objects. This will come out in lexical order of object id, which
>  # is good for us since it will be stable.
>  git show-index <bundle.idx  | awk '{print $2}' >objects
>
>  # unfortunately here things break down. There is no command to read
>  # the data directly out of the pack/idx pair without a repository
>  # (even though it could be done technically). So we hack around it
>  # with a temp repo.
>  git init --bare tmp.git
>  mv bundle.idx bundle.pack tmp.git/objects/pack/
>  git -C tmp.git cat-file --batch <objects | sha256sum
>
>So...also kind of gross. And not really all that different than what:
>
>  git init --bare tmp.git
>  cd tmp.git
>  git fetch ../foo.bundle 'refs/*:refs/*'
>
>would do (you end up with the same pack/idx pair). So I dunno. I guess it depends
>how many and which Git commands you're willing to trust. ;)

I would go one step further on this: use --depth=1, and potentially a --sparse checkout,
with only what you specifically need to verify.
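
As a rough sketch of that idea, a shallow, sparse clone could look something like the
following. The throwaway source repository, the directory names (lib, doc), and the
file names are made up so the script runs on its own; in real use the file:// URL
would be the actual upstream.

```shell
#!/bin/sh
set -eu

# Throwaway source repository so the sketch is runnable as-is.
# All paths and names below are hypothetical.
work=$(mktemp -d)
src="$work/src"
git init -q "$src"
mkdir "$src/lib" "$src/doc"
echo code >"$src/lib/a.c"
echo text >"$src/doc/readme"
git -C "$src" add .
git -C "$src" -c user.name=demo -c user.email=demo@example.com commit -q -m one
echo more >>"$src/lib/a.c"
git -C "$src" -c user.name=demo -c user.email=demo@example.com commit -q -am two

# Shallow (one commit of history) and sparse (top-level files only) clone.
# A file:// URL is used because --depth is ignored for plain local paths.
git clone -q --depth=1 --sparse "file://$src" "$work/check"

# Populate only the directory that actually needs to be inspected.
git -C "$work/check" sparse-checkout set lib

commits=$(git -C "$work/check" rev-list --count HEAD)
echo "commits in clone: $commits"
ls "$work/check"
```

This limits both the history and the working tree to what is being checked, at the cost
of not exercising the rest of the repository.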

However, Junio's point about checking the end-point commits and tags is useful and
significant. Verifying through signing that the Merkle tree itself is intact and
unmodified is usually sufficient, and it is more reliable than a bit-for-bit comparison,
which may depend on the underlying operating system, particularly if the originating
directory inode contents differ from the destination. An example is using a Windows
server for the upstream and a NonStop server for the clone (not so much with Linux vs.
NonStop). It is pretty much guaranteed that the inodes will be different.
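
To illustrate the end-point check, here is a minimal sketch that compares the tips a
bundle advertises with the refs of the repository it was made from. The demo
repository, the tag name v1.0, and the file names are assumptions for illustration
only, and the tag is left unsigned so the script runs without a key.

```shell
#!/bin/sh
set -eu

# Throwaway repository with one commit and one annotated tag.
# All paths and names below are hypothetical.
work=$(mktemp -d)
repo="$work/repo"
git init -q "$repo"
echo hello >"$repo/file"
git -C "$repo" add file
git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m initial
# Unsigned here; a real release tag would be created with "git tag -s".
git -C "$repo" -c user.name=demo -c user.email=demo@example.com tag -a -m v1.0 v1.0

git -C "$repo" bundle create --quiet "$work/repo.bundle" --branches --tags

# Tips recorded in the bundle header (HEAD dropped, if present) ...
git -C "$repo" bundle list-heads "$work/repo.bundle" |
	grep -v ' HEAD$' | sort >"$work/bundle-tips"
# ... versus the refs in the originating repository.
git -C "$repo" for-each-ref --format='%(objectname) %(refname)' |
	sort >"$work/repo-tips"

if cmp -s "$work/bundle-tips" "$work/repo-tips"; then
	result=match
else
	result=mismatch
fi
echo "end points: $result"

# With a signed tag, running "git verify-tag v1.0" in a repository fetched
# from the bundle checks the signature on the tag object, whose hash chain
# covers the tagged history.
```

Because the object IDs are content hashes, matching end points plus a good signature
says the history is the same even when the pack bytes differ between systems.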

--Randall

