On 2026-04-28 at 11:50:17, Theodore Tso wrote:
> On Tue, Apr 28, 2026 at 10:25:24AM +0000, brian m. carlson wrote:
> >
> > I'll just note that we don't make any guarantees that `git archive`
> > produces identical output across versions. Incorrectly making that
> > assumption broke kernel.org when we changed the format in the past.
> >
> > Also, if you use `export-subst`, then it's possible to emit short object
> > IDs, which can differ in length depending on how many objects are in the
> > repository. It's also possible to use zlib or pigz instead of gzip to
> > produce tarballs, in which case the compressed data will also differ.
>
> This is what I've been using to try to get reproducible tarballs for
> e2fsprogs:
>
>    git archive --prefix=e2fsprogs-${ver}/ ${commit} | gzip -9n > $fn
>
> ... where $commit is a signed git tag.
>
> I know that in the past, using --format=tgz has broken based on
> different compression parameters used by git (and whether it used an
> external or internal compressor). I also know that if $commit is a
> tree-id, this can result in the timestamps being not reproducible. I
> also don't use export-subst.
>
> There is also the difference in the prefix used by github and gitlab,
> but that's arguably not git's fault.
>
> What other gotchas are there? How is this likely to be inconsistent
> in the future? How much work is there to provide that guarantee in
> the future?

We could in theory provide reproducible tar output, but again, nobody
has committed to doing that yet. If we did that, we would add a special
option that produces, say, reproducible format v1, and if we needed to
make a change, then we would provide reproducible format v2, and so on.
That would also necessarily disable `export-subst`, since that
introduces non-reproducibility.

Of course, if you're using filters, then those can also be a source of
issues. Git LFS doesn't have that problem because it identifies objects
by SHA-256 hash, but there are many people who _do_ have unreproducible
filters (for instance, inserting the current date and time), so we might
need to disable those as well.

My approach would be to document a format and then implement it and
thoroughly test it. I was hoping, in fact, to define a format that
other tarball-generating implementations could _also_ implement, since
reproducible tarballs are also an issue for other tools like Cargo.

I have some code somewhere in some branch to do part of that, but it ran
into complexities due to handling `--add-file` and `--add-virtual-file`,
which are always appended, when we'd actually want them inserted in
sorted order. This will almost certainly be easier to write in Rust
because of better data structures and easier unit testing, so I may pick
it up at some point.

We cannot guarantee providing reproducible tar.gz output because we
don't control the compressor. If gzip tomorrow decides to release a
version that produces a different bitstream for some output, we're not
going to ship our own gzip. Same goes for zlib, especially since
different distros use different libraries to implement that interface.

Our zip files also have the undesirable attribute that they contain both
a local and a UTC timestamp, so the timezone is a problem. This bit us
at $DAYJOB when we started generating archives inside a container, since
the local timezone changed to UTC[0] and thus the archives were no
longer bit-for-bit identical. Of course, zip files also have
compression, which adds additional potential for reproducibility
problems.

[0] Yes, I know all servers should be using UTC and I agree, but that
decision got made well before my time.
-- 
brian m. carlson (they/them)
Toronto, Ontario, CA
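
A minimal sketch of the "inserted in sorted order" idea discussed above,
written in Python purely for illustration. It is not git's archive
format and not the code from the branch mentioned in the message. The
function name `reproducible_tar`, the fixed mtime of 0, the ustar
format, and the code-point sort order are all assumptions made for this
example. It only covers the uncompressed tar stream, since the
compressor is outside the archiver's control.

```python
import hashlib
import io
import tarfile


def reproducible_tar(files, extra_files=None, mtime=0):
    """Return tar bytes that depend only on names, contents, and mtime.

    `files` and `extra_files` map path -> bytes. The extra files stand in
    for `--add-file`-style additions (hypothetical mapping, not git's API).
    """
    merged = dict(files)
    merged.update(extra_files or {})  # merge extras instead of appending them
    buf = io.BytesIO()
    # ustar keeps the header layout simple; an assumption for this sketch.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(merged):  # fixed member order (code-point sort)
            data = merged[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime  # fixed timestamp, never the wall clock
            info.mode = 0o644
            info.uid = info.gid = 0  # no dependence on the local user
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# The same set of files, supplied in a different order and split
# differently between regular and extra files, gives identical bytes.
a = reproducible_tar({"b.txt": b"two\n", "a.txt": b"one\n"}, {"VERSION": b"1.0\n"})
b = reproducible_tar({"a.txt": b"one\n", "VERSION": b"1.0\n"}, {"b.txt": b"two\n"})
assert a == b
print(hashlib.sha256(a).hexdigest())
```

Comparing a hash of the uncompressed stream like this sidesteps the
gzip and zlib bitstream concerns, at the cost of having to decompress
before verifying.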