From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: rsbecker@nexbridge.com
Cc: "'brian m. carlson'" <sandals@crustytoothpaste.net>,
	'Junio C Hamano' <gitster@pobox.com>,
	'Konstantin Ryabitsev' <konstantin@linuxfoundation.org>,
	'Eli Schwartz' <eschwartz93@gmail.com>,
	'Git List' <git@vger.kernel.org>
Subject: Re: Stability of git-archive, breaking (?) the Github universe, and a possible solution
Date: Fri, 03 Feb 2023 14:18:58 +0100
Message-ID: <230203.86sffmc1tz.gmgdl@evledraar.gmail.com>
In-Reply-To: <01a901d93760$c690d970$53b28c50$@nexbridge.com>


On Thu, Feb 02 2023, rsbecker@nexbridge.com wrote:

> On February 2, 2023 6:02 PM, brian m. carlson wrote:
>>On 2023-02-01 at 23:37:19, Junio C Hamano wrote:
>>> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>>>
>>> > I don't think a blurb is necessary, but you're basically
>>> > underscoring the problem, which is that nobody is willing to promise
>>> > that compression is consistent, but yet people want to rely on that
>>> > fact.  I'm willing to write and implement a consistent tar spec and
>>> > to guarantee compatibility with that, but the tension here is that
>>> > people also want gzip to never change its byte format ever, which
>>> > frankly seems unrealistic without explicit guarantees.  Maybe the
>>> > authors will agree to promise that, but it seems unlikely.
>>>
>>> Just to step back a bit, where does the distinction between
>>> guaranteeing the tar format stability and gzip compressed bitstream
>>> stability come from?  At both levels, the same thing can be expressed
>>> in multiple different ways, I think, but spelling out how exactly the
>>> compressor compresses is more involved than spelling out how entries
>>> in a tar archive is ordered and each entry is expressed, or something?
>>
>>Yes, at least with my understanding about how gzip and compression in general
>>work.
>>
>>The tar format (and the pax format which builds on it) can mostly be restricted by
>>explaining what data is to be included in the pax and tar headers and how it is to be
>>formatted.  If we say, we will always write such and such information in the pax
>>header and sort the keys, and we write such and such information in the tar header,
>>then the format is completely deterministic, and we can make nice guarantees.
>>
>>My understanding about how Lempel-Ziv-based compression algorithms work is that
>>there's a lot more freedom to decide how best to compress things and that there
>>isn't always a logical obvious choice, but I will admit my understanding is relatively
>>limited.  If someone thinks we can effectively succeed in supporting compression
>>more than just relying on gzip, I would be delighted to be shown to be wrong.
>
> The nice part about gzip is that it is generally available on
> virtually all platforms (or can be easily obtained). Other compression
> forms, like bz2, which sometimes produces more dense compression, are
> not necessarily available. Availability is something I would be
> worried about...

I agree with all of that; gzip is in such wide use for a reason.

>... (clone and checkout failures).

But how would a hypothetical obscure format for "git archive" contribute
to clone or checkout failures? Are you thinking of our use of zlib for
e.g. loose objects? That's unrelated to this discussion (and I don't
think anyone relies on the checksum of their compressed form).

> Tar formats are also to be used carefully. Not all platform
> implementations of tar support all variants. "ustar" is fairly common
> but there are others that are not. Interoperability needs to be the
> biggest factor in this decision, IMHO, rather than compression rates.

For "git archive", whether you care about interoperability depends on the
target audience of your archive. In any case I don't see why we need
to worry about it, except perhaps to note that some formats are more
portable than others, e.g. if we had a built-in "tar.bz2" helper method.
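
As a rough illustration of the tar-level determinism brian describes
above: if every header field is pinned and the entries are sorted, the
archive bytes depend only on the content. This is a sketch using
Python's standard tarfile module, not anything taken from git's
archive-tar.c; the build_tar() helper and the file names are made up
for the example, and the pinned values (mtime 0, mode 0644, empty owner
names) are assumptions, not what "git archive" writes.

```python
import io
import tarfile


def build_tar(files, fmt=tarfile.PAX_FORMAT):
    """Return tar bytes for a {name: bytes} mapping with pinned metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        # Sort the entries so insertion order cannot leak into the output.
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            # Pin every field that would otherwise vary between runs/hosts.
            info.mtime = 0
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


files = {"b.txt": b"second\n", "a.txt": b"first\n"}
reordered = {"a.txt": b"first\n", "b.txt": b"second\n"}

first = build_tar(files)
second = build_tar(reordered)
print("identical archives:", first == second)
```

Passing tarfile.USTAR_FORMAT as fmt selects the more widely supported
header variant, which is the interoperability trade-off mentioned above.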

> The alternative is having git supply its own implementation, but that
> is a longer term migration problem, resembling the SHA-256 migration.

I've noted elsewhere in this thread that I don't see the point of
shipping a fallback "gzip" beyond the "git archive gzip" we already
have, but even if we did, the scope of that seems pretty small, and the
work *much* easier than the SHA-256 migration.
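
To make the compression-stability concern from upthread concrete, here
is a small sketch using Python's standard gzip module (Python 3.8 or
later for the mtime argument). It is not how "git archive" or gzip(1)
is implemented; the payload is invented for the example. It shows that
the same input yields different compressed bytes depending on the
compression level and on the timestamp stored in the gzip header (the
field that "gzip -n" leaves out), while the decompressed content is
identical in every case.

```python
import gzip

# Invented payload: many similar but not identical lines.
payload = b"".join(b"line %d of a tracked file\n" % i for i in range(2000))

# Same input, different compressor settings.
fast = gzip.compress(payload, compresslevel=1, mtime=0)
best = gzip.compress(payload, compresslevel=9, mtime=0)
# Same settings as "best", but with a non-zero header timestamp.
stamped = gzip.compress(payload, compresslevel=9, mtime=1)

same_payload = (
    gzip.decompress(fast) == payload
    and gzip.decompress(best) == payload
    and gzip.decompress(stamped) == payload
)

print("level 1 vs level 9 bytes differ:", fast != best)
print("timestamp changes the bytes:", best != stamped)
print("decompressed content identical:", same_payload)
```

A checksum taken over the compressed file therefore only stays stable
for as long as the compressor, its settings, and its header fields all
stay the same; a checksum over the decompressed tar does not have that
dependency.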
