Re: git archive should use vendor extension in pax header

From: "René Scharfe" <l.s.r@web.de>
To: fuz@fuz.su
Cc: git@vger.kernel.org
Subject: Re: git archive should use vendor extension in pax header
Date: Mon, 15 Feb 2016 21:25:40 +0100
Message-ID: <56C23444.5000405@web.de>
In-Reply-To: <20160206145726.GA27001@fuz.su>

Am 06.02.2016 um 15:57 schrieb fuz@fuz.su:
> On Sat, Feb 06, 2016 at 02:23:11PM +0100, René Scharfe wrote:
>> Am 28.01.2016 um 00:45 schrieb fuz@fuz.su:
>>>> There is git get-tar-commit-id, which prints the commit ID if it
>>>> finds a comment entry which looks like a hexadecimal SHA-1 hash.
>>>> It's better than a hex editor at least. :)
>>>
>>> This is incredibly fuzzy and can go wrong for a plethora of reasons.
>>> I hope you agree though that the situation is suboptimal, git is doing
>>> the equivalent of using a custom file format without an easily
>>> recognizable magic number.
>>
>> It is fuzzy in theory. But which other programs allow writing a
>> comment header?  I'm not aware of any, but I have to admit that I
>> didn't look too hard.
>
> Well, what happens if the Mercurial folks were to implement the same
> thing? Suddenly there is a conflict. Yes, of course, right now there
> might be no program that uses the comment field for its own purpose,
> but such design decisions tend not to be future-proof. There is
> a very good reason why file formats typically have magic numbers and
> don't just rely on people knowing that the file has a certain type, and
> that is the same reason why git should mark its metadata in a unique
> fashion.

Chances are good that Mercurial would do it in a way that doesn't 
conflict with git's tar comments.  I get your point, though, and agree 
that it's not ideal.  However, so far it's just a potential problem.

>>>> But I'm still interested how you got a collection of tar files with
>>>> unknown origin.  Just curious.
>>>
>>> Easy: Just download the (source) distribution archives of a distribution
>>> of choice and try to verify that the tarballs they use to compile their
>>> packages actually come from the project's public git repositories.
>>
>> OK, that's easier than calculating checksums and comparing them with
>> those published by the respective projects, but also less
>> trustworthy.
>
> If you have a known trusted archive, you could use it directly, no need
> for cross-verification. The intent is to be able to check if archives
> generated by someone from some sources could have plausibly been
> generated from these sources.

It's probably not too important, but I think I still don't fully 
understand.  So you have a tar file of unknown origin.  You hand it to 
git get-tar-commit-id or a similar tool and get back 
a08595f76159b09d57553e37a5123f1091bb13e7.  You can google this string 
and find out it's the commit ID for git v2.7.1.
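
For illustration, the lookup described above can be sketched in Python. This is only a rough approximation under stated assumptions, not how git get-tar-commit-id is implemented: it relies on the standard tarfile module exposing the pax global header as `pax_headers`, and the helper names `tar_commit_id` and `make_demo_archive` are made up for this example.

```python
import io
import re
import tarfile

# A commit ID is assumed to be a 40-digit lowercase hexadecimal SHA-1.
SHA1_RE = re.compile(r"^[0-9a-f]{40}$")

def tar_commit_id(fileobj):
    """Return the commit ID stored in the pax global "comment" header, or None."""
    # Opening for reading parses everything up to the first member,
    # including a leading global extended header.
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        comment = tar.pax_headers.get("comment", "")
    return comment if SHA1_RE.match(comment) else None

def make_demo_archive(pax_headers):
    """Build a small in-memory pax archive (hypothetical test data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT,
                      pax_headers=pax_headers) as tar:
        data = b"hello\n"
        info = tarfile.TarInfo("demo/README")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf
```

Calling `tar_commit_id(open("some.tar", "rb"))` on an uncompressed archive written by git archive from a commit should give back the ID; anything else in the comment field yields None, which is exactly the fuzziness discussed above.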

Your tar file could have been modified in various ways, though, e.g. 
with tar u or tar --delete.  So you try to find a download site for the 
software that includes file hashes for archives of this release, like in
https://www.kernel.org/pub/software/scm/git/sha256sums.asc.

If the published hash and a hash of your file match then you can be 
reasonably sure the files are the same.  If they don't then it could be 
due to variations added by the compressor.  You can download the 
authoritative archive and compare it with yours.
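
A minimal sketch of that comparison, assuming SHA-256 sums as in the sha256sums.asc file mentioned above and gzip as the compressor. The function names are hypothetical. Hashing the decompressed stream sidesteps differences that come only from the compressor.

```python
import gzip
import hashlib
import io

def sha256_hex(stream, chunk_size=65536):
    """Hash a binary stream in chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def matches_published(archive_stream, published_hex):
    """Compare the archive's hash with a published hex digest."""
    return sha256_hex(archive_stream) == published_hex.strip().lower()

def same_tar_payload(gz_stream_a, gz_stream_b):
    """Compare two gzip-compressed archives by their decompressed content."""
    with gzip.open(gz_stream_a, "rb") as a, gzip.open(gz_stream_b, "rb") as b:
        return sha256_hex(a) == sha256_hex(b)
```

If `matches_published` fails for the compressed file, `same_tar_payload` against the authoritative download tells you whether the tar data itself differs.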

Is that how it goes?

>> I'm very interested in hearing about any git specific bugs.
>
> I don't know any. Bugs tend to become known only after thousands of buggy
> archives have been published (just as with some GNU tar bugs). It's
> great to have a way to detect that the archive might be affected by
> a bug so you know that you need to work around it.

That requires a field containing the git version which was used to 
create the archive, no?

> After thinking about the problem a bit more and discussing it with the
> aforementioned Jörg Schilling, we came to the conclusion that the best
> way to deal with a “file omitted” attribute is to attach it to the
> directory that would normally contain the omitted file.

Sounds sensible, but the ordering can be a bit tricky.  If d/a is 
included and d/b is not then it would be easy to write d/, d/a and the 
extended header that says that d/b is excluded, in that order.  Writing 
the extended header first is a bit harder and I'm not sure if it's 
needed.  And it gets tricky if more than one entry is excluded per 
directory. (Just thinking out loud here.)

>> Letting archivers extract meta data as regular files is annoying to
>> those that are not interested in it.  Extended headers themselves
>> (type g) are bad enough already in this regard for those stuck with
>> old tar versions.
>
> I think we can safely assume that systems support pax headers 15 years
> after they were standardized. I was actually unable to find a
> non-historical version of a serious archiver that claims to support tar
> archives but doesn't support pax headers.

Well, that depends on your definition of "serious".  Plan 9's tar 
perhaps doesn't fit it, but what about 7-Zip (http://www.7-zip.org/)?

And there is no way (or did I overlook it?) to modify or display the 
comment extended header using GNU tar.  That's actually surprising to 
me: I'd think the ability to add a human-readable description to a 
backup on tape is quite important.  (But I didn't touch an actual tape 
for quite a while, and I never used tar directly with them.)

>>>>> The GIT.path option holds the paths that are being archived. It is a bit
>>>>> tricky to get right.  The intent of POSIX pax headers is that each key
>>>>> is an attribute that applies to a series of files.  In the case of a
>>>>> global header, each key applies until it is overridden with a new
>>>>> header or with a local header.  A GIT.path key should only apply to the
>>>>> files that correspond to this path operand to git archive.  Thus, a new
>>>>> GIT.path should be written frequently.  There should always be at least
>>>>> one GIT.path.
>>>>
>>>> That's for the optional path parameters of git archive, right?  A
>>>> list of included paths (GIT.include) would be simpler and should
>>>> suffice, no?
>>>
>>> No.  Again: An attribute in a pax header pertains to a file.  It's
>>> metadata attached to a file, not metadata attached to the whole
>>> archive, even when part of a global header.  Thus each file should
>>> carry the path operand it came from.  Which other path operands git
>>> received is not an attribute of a file; only the path operand that
>>> caused the inclusion of that one file is.
>>
>> Not an issue; we can make our own rules for our own keywords.
>
> Well, yes, but they should still stick to the semantic concept POSIX
> imposes for extended headers: headers pertain to files and the only
> difference between a g header and an x header is that the former applies
> until it is revoked by a new g header or overridden by an x header.
> Not sticking to this concept can lead to weird problems with programs
> that modify tar archives (like GNU tar) and is not future-proof. Better
> stick to the standard.

It's easy enough, I think: For each archive entry check if it is 
explicitly mentioned in the list of paths to archive and write an 
extended header with GIT.path before proceeding as usual, no?  And for 
the common case without path specification (meaning all files are 
included) no such header would be needed.
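
To make that concrete, here is a rough sketch using Python's tarfile module rather than git's own archiver. GIT.path is only the keyword proposed in this thread, not something git writes today, and `matching_operand` and `write_archive` are hypothetical helpers. Treating a directory operand as matching everything below it is an assumption of the sketch.

```python
import io
import tarfile

def matching_operand(name, path_operands):
    """Return the path operand that caused `name` to be included, if any."""
    for operand in path_operands:
        prefix = operand.rstrip("/")
        # Assumption: an operand matches itself and everything below it.
        if name == prefix or name.startswith(prefix + "/"):
            return operand
    return None

def write_archive(entries, path_operands):
    """Write (name, data) entries, tagging each with a per-entry GIT.path."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            operand = matching_operand(name, path_operands)
            if operand is not None:
                # Emitted as an extended (type x) header before this entry.
                info.pax_headers = {"GIT.path": operand}
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf
```

With an empty operand list no extended header is written at all, which covers the common case where all files are included.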

René

Thread overview: 9+ messages
2016-01-24 15:59 git archive should use vendor extension in pax header fuz
2016-01-26 22:06 ` René Scharfe
2016-01-27 23:25   ` fuz
     [not found]   ` <20160127114634.GA1976@fuz.su>
     [not found]     ` <56A92913.3030909@web.de>
2016-01-27 23:45       ` fuz
2016-01-28  8:13         ` Johannes Schindelin
2016-01-28  9:14           ` fuz
2016-02-06 13:23         ` René Scharfe
2016-02-06 14:57           ` fuz
2016-02-15 20:25             ` René Scharfe [this message]
