Git development
* Git generated tarballs and Debian
@ 2026-04-28  8:40 Simon Richter
  2026-04-28 10:25 ` brian m. carlson
  0 siblings, 1 reply; 6+ messages in thread
From: Simon Richter @ 2026-04-28  8:40 UTC (permalink / raw)
  To: git; +Cc: Ian Jackson

Hi,

in Debian, we're shipping "original" tarballs for each software package, 
and the Debian specific changes in a separate file.

Historically, this meant users could do a bitwise comparison of the 
original tarball and the one in Debian to verify that these were unchanged.

With git, some authors have stopped releasing official tarballs, so 
we're using git-archive a lot -- but this is reproducible only by 
accident. GitHub also prepares some release tarballs that may or may not 
be bitwise identical to what git archive produces.

I've written a small tool that generates the tree checksum for a given 
tarball (running inside a SECCOMP environment, not writing anything to 
disk), that already goes a long way to make tarballs verifiable: one can 
check whether that ID is the same as the one mentioned in a commit (and 
the comment inside a git-archive generated tarball is helpful in finding 
which commit).
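
To make the idea concrete for readers of the archive, here is a minimal Python sketch of recomputing a tree ID from a tar stream in memory. It is not the tool described above. It assumes SHA-1 object names, no submodules, and a tarball that holds only regular files, symlinks and directories; the function names are made up for illustration.

```python
import hashlib
import tarfile


def git_object_id(kind: str, payload: bytes) -> bytes:
    """Raw SHA-1 of a Git object: b'<kind> <size>\\0' followed by the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def _write_tree(node: dict) -> bytes:
    """Hash a nested dict of entries bottom-up and return the raw tree ID."""
    entries = []
    for name, value in node.items():
        if isinstance(value, dict):      # subdirectory: hash it first
            entries.append((name + b"/", b"40000", name, _write_tree(value)))
        else:                            # (mode, blob ID) of a file or symlink
            mode, blob_id = value
            entries.append((name, mode, name, blob_id))
    # Git sorts entries bytewise, comparing directories as if named "name/".
    entries.sort(key=lambda e: e[0])
    payload = b"".join(mode + b" " + name + b"\0" + oid
                       for _, mode, name, oid in entries)
    return git_object_id("tree", payload)


def tree_id_from_tar(stream, prefix: str = ""):
    """Return (commit_hint, tree_id_hex) for a tar stream, without writing to disk."""
    root: dict = {}
    with tarfile.open(fileobj=stream, mode="r|*") as tar:
        for member in tar:
            name = member.name
            if prefix and name.startswith(prefix):
                name = name[len(prefix):]
            parts = [p.encode("utf-8", "surrogateescape")
                     for p in name.split("/") if p]
            if not parts or member.isdir():
                continue                 # directories exist only via their contents
            if member.issym():
                mode = b"120000"
                data = member.linkname.encode("utf-8", "surrogateescape")
            elif member.isreg():
                mode = b"100755" if member.mode & 0o100 else b"100644"
                data = tar.extractfile(member).read()
            else:
                raise ValueError(f"unsupported tar entry: {member.name}")
            node = root
            for part in parts[:-1]:
                node = node.setdefault(part, {})
            node[parts[-1]] = (mode, git_object_id("blob", data))
        # git archive stores the commit ID as "comment" in the pax global header.
        commit_hint = tar.pax_headers.get("comment")
    return commit_hint, _write_tree(root).hex()
```

Fed with `sys.stdin.buffer`, this could sit at the end of a `git archive --format=tar HEAD |` pipeline. The result only matches the commit's tree if nothing such as export-subst or a filter rewrote file contents on the way out.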

The downsides of that are:

1. that you still need a copy of the commit to verify it, as it's not 
included in the tarball.

We could add an ancillary file that contains the commit object (its 
checksum being reproducible, and containing the tree checksum) and 
possibly a signed tag object as well, so that is solvable inside Debian.

Another option would be to extend the git-archive format to include them 
as a (longer) comment in the global pax header.

2. that it doesn't work for submodules

What we do currently is generate multiple archives with different 
prefixes, and concatenate them using tar. That loses all the pax global 
headers though, so commit information is lost. In addition, the tarball 
then has the submodule's actual contents in a subdirectory where the 
superproject tree has a commit reference, so the tree object generated 
from the tarball contents has a different checksum.

What we could do is generate multiple archives, and keep them separate, 
but the Debian toolchain can only unpack additional archives into a 
direct subdirectory of the main archive (e.g. "orig.tar.gz" gets 
unpacked to "foo-1.0", then "orig-addon.tar.gz" gets unpacked into 
"foo-1.0/addon"). We can fix _that_ with symlinks, but it gets more and 
more hacky.

One thing we could do inside git here is add a method to create archives 
that include submodules (that gets rid of the concatenation), but in 
order for this to be easily verifiable, I still need to know where 
submodules are and what their commit objects are (so I know the commit 
checksum and can verify the tree checksum).
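
A small Python sketch of why the submodule's commit ID is needed: the superproject tree records a submodule as a mode-160000 entry holding a commit ID, while a tree rebuilt from unpacked contents holds a mode-40000 entry with a tree ID. All file names and IDs below are hypothetical placeholders, and SHA-1 object names are assumed.

```python
import hashlib


def git_object_id(kind: str, payload: bytes) -> bytes:
    """Raw SHA-1 of a Git object: b'<kind> <size>\\0' followed by the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def tree_entry(mode: bytes, name: bytes, oid: bytes) -> bytes:
    """Serialize one tree entry: b'<mode> <name>\\0' plus the raw 20-byte ID."""
    return mode + b" " + name + b"\0" + oid


def root_tree_id(addon_mode: bytes, addon_oid: bytes) -> str:
    """Tree ID of a toy superproject with one file and one 'addon' entry."""
    readme = git_object_id("blob", b"hello\n")
    payload = (tree_entry(b"100644", b"README", readme)
               + tree_entry(addon_mode, b"addon", addon_oid))
    return git_object_id("tree", payload).hex()


# Placeholder commit ID of the submodule (not a real commit).
addon_commit = bytes.fromhex("11" * 20)
# Tree of the submodule's contents, as a verifier would rebuild it from a tarball.
addon_tree = git_object_id(
    "tree", tree_entry(b"100644", b"lib.c", git_object_id("blob", b"int x;\n")))

recorded = root_tree_id(b"160000", addon_commit)   # what the superproject commit has
rebuilt = root_tree_id(b"40000", addon_tree)       # what the unpacked files give
print(recorded != rebuilt)
```

The two IDs differ, so a verifier has to be told the submodule's path and commit ID, and can then check that commit's tree against the subdirectory separately.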

The goal is to extend what I can already do inside the Linux kernel:

$ git rev-parse HEAD
94dfcc4a99b0cece77e73dc3011284050f95da89
$ git rev-parse HEAD^{tree}
2d14d43ce9f062160262f4e4f162f5ff0ed91a5e
$ git archive --format=tar HEAD | git-treeof
Commit-Hint: 94dfcc4a99b0cece77e73dc3011284050f95da89
Tree-SHA1: 2d14d43ce9f062160262f4e4f162f5ff0ed91a5e

so the "Commit-Hint" can become a stronger statement "I have seen a 
commit object with this checksum that actually refers to the correct 
tree", and to allow this to work for repositories with submodules.
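
A minimal sketch of that stronger check in Python, assuming the raw commit object (as printed by `git cat-file commit <id>`) is shipped next to the tarball and that object names are SHA-1. The function name is hypothetical, and signature checking is left out.

```python
import hashlib


def verify_commit(raw_commit: bytes, commit_hint: str, tree_id: str) -> bool:
    """Check a raw commit object against a commit ID and a computed tree ID."""
    header = f"commit {len(raw_commit)}".encode() + b"\0"
    if hashlib.sha1(header + raw_commit).hexdigest() != commit_hint:
        return False                     # not the commit the tarball names
    # The first line of a commit object is always "tree <hex id>".
    first_line = raw_commit.split(b"\n", 1)[0]
    return first_line == b"tree " + tree_id.encode()
```

If this returns True, the commit ID from the pax header is bound to the tree rebuilt from the tarball, and a signed tag pointing at that commit can be checked with the usual tools.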

Does it make sense to extend git here to allow this, or should I try to 
solve this entirely within Debian?

    Simon


* Re: Git generated tarballs and Debian
  2026-04-28  8:40 Git generated tarballs and Debian Simon Richter
@ 2026-04-28 10:25 ` brian m. carlson
  2026-04-28 11:32   ` Simon Richter
  2026-04-28 11:50   ` Theodore Tso
  0 siblings, 2 replies; 6+ messages in thread
From: brian m. carlson @ 2026-04-28 10:25 UTC (permalink / raw)
  To: Simon Richter; +Cc: git, Ian Jackson


On 2026-04-28 at 08:40:05, Simon Richter wrote:
> Hi,
> 
> in Debian, we're shipping "original" tarballs for each software package, and
> the Debian specific changes in a separate file.
> 
> Historically, this meant users could do a bitwise comparison of the original
> tarball and the one in Debian to verify that these were unchanged.
> 
> With git, some authors have stopped releasing official tarballs, so we're
> using git-archive a lot -- but this is reproducible only by accident. GitHub
> also prepares some release tarballs that may or may not be bitwise identical to
> what git archive produces.

I'll just note that we don't make any guarantees that `git archive`
produces identical output across versions.  Incorrectly making that
assumption broke kernel.org when we changed the format in the past.

Also, if you use `export-subst`, then it's possible to emit short object
IDs, which can differ in length depending on how many objects are in the
repository.  It's also possible to use zlib or pigz instead of gzip to
produce tarballs, in which case the compressed data will also differ.

I had intended to create and emit a standard, reproducible format for
`git archive`, but never got around to finishing that.  Perhaps I'll try
to pick it up at some point; I expect it will be easier to implement now
that we have Rust support in the tree.

When I was one of the maintainers of Git LFS, we intentionally produced
source tarballs specifically to emit bit-for-bit identical artifacts.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA



* Re: Git generated tarballs and Debian
  2026-04-28 10:25 ` brian m. carlson
@ 2026-04-28 11:32   ` Simon Richter
  2026-04-28 11:50   ` Theodore Tso
  1 sibling, 0 replies; 6+ messages in thread
From: Simon Richter @ 2026-04-28 11:32 UTC (permalink / raw)
  To: brian m. carlson, git, Ian Jackson

Hi,

On 4/28/26 7:25 PM, brian m. carlson wrote:

> I'll just note that we don't make any guarantees that `git archive`
> produces identical output across versions.  Incorrectly making that
> assumption broke kernel.org when we changed the format in the past.

Exactly -- that's why I read the tarball and calculate the checksum of 
the corresponding tree object, but we have a few cases where we need 
extra information that isn't in the archive, and I'm wondering where to 
put that extra information: inside the archive itself, or into an extra 
file.

> Also, if you use `export-subst`, then it's possible to emit short object
> IDs, which can differ in length depending on how many objects are in the
> repository.  It's also possible to use zlib or pigz instead of gzip to
> produce tarballs, in which case the compressed data will also differ.

export-subst breaks verification completely as soon as a blob changes.

Compression isn't an issue, because we're comparing tree checksums.

    Simon


* Re: Git generated tarballs and Debian
  2026-04-28 10:25 ` brian m. carlson
  2026-04-28 11:32   ` Simon Richter
@ 2026-04-28 11:50   ` Theodore Tso
  2026-04-28 21:20     ` brian m. carlson
  2026-04-29  7:30     ` Jeff King
  1 sibling, 2 replies; 6+ messages in thread
From: Theodore Tso @ 2026-04-28 11:50 UTC (permalink / raw)
  To: brian m. carlson, Simon Richter, git, Ian Jackson

On Tue, Apr 28, 2026 at 10:25:24AM +0000, brian m. carlson wrote:
> 
> I'll just note that we don't make any guarantees that `git archive`
> produces identical output across versions.  Incorrectly making that
> assumption broke kernel.org when we changed the format in the past.
> 
> Also, if you use `export-subst`, then it's possible to emit short object
> IDs, which can differ in length depending on how many objects are in the
> repository.  It's also possible to use zlib or pigz instead of gzip to
> produce tarballs, in which case the compressed data will also differ.

This is what I've been using to try to get reproducible tarballs for
e2fsprogs:

git archive --prefix=e2fsprogs-${ver}/ ${commit} | gzip -9n > $fn

... where $commit is a signed git tag.

I know that in the past, using --format=tgz has broken based on
different compression parameters used by git (and whether it used an
external or internal compressor).  I also know that if $commit is a
tree-id, this can result in the timestamps not being reproducible.  I
also don't use export-subst.
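
For readers who want to see what the `-n` in the pipeline above buys: it keeps gzip from storing a file name and timestamp in its header. A rough Python equivalent, assuming the standard library `gzip` module (the helper name is made up). The compressed bytes still depend on the compressor implementation and version.

```python
import gzip


def gzip_without_timestamp(data: bytes) -> bytes:
    """Compress at level 9 with the header mtime fixed to 0, similar to gzip -9n."""
    return gzip.compress(data, compresslevel=9, mtime=0)


tarball = b"example tar bytes\n" * 100      # stand-in for git archive output
first = gzip_without_timestamp(tarball)
second = gzip_without_timestamp(tarball)
stamped = gzip.compress(tarball, compresslevel=9, mtime=1)

print(first == second)       # same input, same bytes on the same compressor
print(first[4:8].hex())      # MTIME field of the gzip header
print(stamped[4:8].hex())    # differs once a timestamp is stored
```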

There is also the difference in the prefix used by github and gitlab,
but that's arguably not git's fault.

What other gotchas are there?  How is this likely to be inconsistent
in the future?  How much work is there to provide that guarantee in
the future?

                                        - Ted

P.S.  Although I use pristine-tar in Debian because I didn't want to
count on git-archive being reproducible.  But it would be lovely if I
could make that guarantee starting on a particular git version.


* Re: Git generated tarballs and Debian
  2026-04-28 11:50   ` Theodore Tso
@ 2026-04-28 21:20     ` brian m. carlson
  2026-04-29  7:30     ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: brian m. carlson @ 2026-04-28 21:20 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Simon Richter, git, Ian Jackson


On 2026-04-28 at 11:50:17, Theodore Tso wrote:
> On Tue, Apr 28, 2026 at 10:25:24AM +0000, brian m. carlson wrote:
> > 
> > I'll just note that we don't make any guarantees that `git archive`
> > produces identical output across versions.  Incorrectly making that
> > assumption broke kernel.org when we changed the format in the past.
> > 
> > Also, if you use `export-subst`, then it's possible to emit short object
> > IDs, which can differ in length depending on how many objects are in the
> > repository.  It's also possible to use zlib or pigz instead of gzip to
> > produce tarballs, in which case the compressed data will also differ.
> 
> This is what I've been using to try to get reproducible tarballs for
> e2fsprogs:
> 
> git archive --prefix=e2fsprogs-${ver}/ ${commit} | gzip -9n > $fn
> 
> ... where $commit is a signed git tag.
> 
> I know that in the past, using --format=tgz has broken based on
> different compression parameters used by git (and whether it used an
> external or internal compressor).  I also know that if $commit is a
> tree-id, this can result in the timestamps not being reproducible.  I
> also don't use export-subst.
> 
> There is also the difference in the prefix used by github and gitlab,
> but that's arguably not git's fault.
> 
> What other gotchas are there?  How is this likely to be inconsistent
> in the future?  How much work is there to provide that guarantee in
> the future?

We could in theory provide reproducible tar output, but again, nobody
has committed to doing that yet.  If we did that, we would add a special
option that produces, say, reproducible format v1, and if we needed to
make a change, then we would provide reproducible format v2, and so on.
That would also necessarily disable `export-subst`, since that
introduces non-reproducibility.

Of course, if you're using filters, then those can also be a source of
issues.  Git LFS doesn't have that problem because it identifies objects
by SHA-256 hash, but there are many people who _do_ have unreproducible
filters (for instance, inserting the current date and time), so we might
need to disable those as well.

My approach would be to document a format and then implement it and
thoroughly test it.  I was hoping, in fact, to define a format that
other tarball-generating implementations could _also_ implement, since
reproducible tarballs are also an issue for other tools like Cargo.  I
have some code somewhere in some branch to do part of that, but it ran
into complexities due to handling `--add-file` and `--add-virtual-file`,
which are always appended, when we'd actually want them inserted in
sorted order.  This will almost certainly be easier to write in Rust
because of better data structures and easier unit testing, so I may pick
it up at some point.
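
As an illustration of the general technique only (this is not the format described above, which has not been specified), a Python sketch of a tar writer whose output depends only on names and contents: entries are sorted before writing and all metadata is fixed. The function name, the 0644 mode and the code-point sort order are assumptions.

```python
import io
import tarfile


def canonical_tar(files: dict) -> bytes:
    """Write {name: bytes} into a tar with sorted names and fixed metadata."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in sorted(files):       # order is independent of insertion order
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0               # no wall-clock time in the archive
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[name]))
    return out.getvalue()


# An "added" file supplied last still ends up in sorted position.
a = canonical_tar({"src/main.c": b"int main(void){}\n", "VERSION": b"1.0\n"})
b = canonical_tar({"VERSION": b"1.0\n", "src/main.c": b"int main(void){}\n"})
print(a == b)
```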

We cannot guarantee providing reproducible tar.gz output because we
don't control the compressor.  If gzip tomorrow decides to release a
version that produces a different bitstream for some output, we're not
going to ship our own gzip.  Same goes for zlib, especially since
different distros use different libraries to implement that interface.

Our zip files also have the undesirable attribute that they contain both
a local and a UTC timestamp, so the timezone is a problem.  This bit us
at $DAYJOB when we started generating archives inside a container, since
the local timezone changed to UTC[0] and thus the archives were no
longer bit-for-bit identical.  Of course, zip files also have
compression, which adds additional potential for reproducibility
problems.
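
A tiny Python sketch of the timezone effect, assuming a local-time field that is filled in by converting the commit time to the machine's zone. The zones and the helper name are hypothetical, and this does not reproduce Git's zip writer.

```python
from datetime import datetime, timedelta, timezone


def dos_time_fields(instant: datetime, local_zone: timezone) -> tuple:
    """Return the (Y, M, D, h, m, s) values a local-time field would hold."""
    return instant.astimezone(local_zone).timetuple()[:6]


commit_time = datetime(2026, 4, 28, 21, 20, tzinfo=timezone.utc)
on_host = dos_time_fields(commit_time, timezone(timedelta(hours=-4)))
in_container = dos_time_fields(commit_time, timezone.utc)

print(on_host)        # (2026, 4, 28, 17, 20, 0)
print(in_container)   # (2026, 4, 28, 21, 20, 0)
```

The instant is the same, but the stored local fields differ, so the archive bytes differ between the two machines.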

[0] Yes, I know all servers should be using UTC and I agree, but that
decision got made well before my time.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA



* Re: Git generated tarballs and Debian
  2026-04-28 11:50   ` Theodore Tso
  2026-04-28 21:20     ` brian m. carlson
@ 2026-04-29  7:30     ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2026-04-29  7:30 UTC (permalink / raw)
  To: Theodore Tso; +Cc: brian m. carlson, Simon Richter, git, Ian Jackson

On Tue, Apr 28, 2026 at 07:50:17AM -0400, Theodore Tso wrote:

> I know that in the past, using --format=tgz has broken based on
> different compression parameters used by git (and whether it used an
> external or internal compressor).  I also know that if $commit is a
> tree-id, this can result in the timestamps not being reproducible.  I
> also don't use export-subst.
> 
> There is also the difference in the prefix used by github and gitlab,
> but that's arguably not git's fault.
> 
> What other gotchas are there?  How is this likely to be inconsistent
> in the future?  How much work is there to provide that guarantee in
> the future?

The biggest unexpected change I recall was caused by a bug/compatibility
fix. 22f0dcd963 (archive-tar: split long paths more carefully,
2013-01-05) changed how some long paths were represented to be more
compatible between GNU tar and NetBSD. Lots of Homebrew recipes, etc,
were broken when GitHub deployed a version of Git with that commit.

I think there was a more recent one in 2023-ish caused by some
gzip-related changes (but it was after my time and I don't know the
details).

I feel like there was one in the middle, too, but I'm having trouble
digging it up (I think GitHub reverted 22f0dcd963 at the time and
finally reinstated it in 2017 after a warning period, so that might be
what I'm thinking of).

But I'm not sure how often we'd do fixes like that. Not a lot, as the
tar code is pretty stable. But is 82a46af13e (archive-tar: fix pax
extended header length calculation, 2019-08-17), for example, likely to
have changed hashes for some repos? Probably.

So I think if you really want byte-for-byte compatibility of git-archive
you have to cement the behavior, bugs and all, behind some kind of
version flag, and every possible behavior change has to be analyzed for
a potential version bump.

Though breaking some obscure cases once every 5-10 years is maybe not
_so_ bad, and we can live with it. ;)

-Peff

