From: "René Scharfe" <l.s.r@web.de>
To: fuz@fuz.su
Cc: git@vger.kernel.org
Subject: Re: git archive should use vendor extension in pax header
Date: Sat, 6 Feb 2016 14:23:11 +0100
Message-ID: <56B5F3BF.7080601@web.de>
In-Reply-To: <20160127234512.GA5453@fuz.su>

On 28.01.2016 at 00:45, fuz@fuz.su wrote:
>> There is git get-tar-commit-id, which prints the commit ID if it
>> finds a comment entry which looks like a hexadecimal SHA-1 hash.
>> It's better than a hex editor at least. :)
>
> This is incredibly fuzzy and can go wrong for a plethora of reasons.
> I hope you agree though that the situation is suboptimal, git is doing
> the equivalent of using a custom file format without an easily
> recognizable magic number.

It is fuzzy in theory. But which other programs allow writing a comment 
header?  I'm not aware of any, but I have to admit that I didn't look 
too hard.

>> But I'm still interested how you got a collection of tar files with
>> unknown origin.  Just curious.
>
> Easy: Just download the (source) distribution archives of a distribution
> of choice and try to verify that the tarballs they use to compile their
> packages actually come from the project's public git repositories.

OK, that's easier than calculating checksums and comparing them with 
those published by the respective projects, but also less trustworthy.

> There are other reasons why an easily detectable git hash might be
> useful.  For example, file(1) could show that the archive comes from
> git.  Other utilities could use this to work around git-specific bugs.
> An unpacker could add corresponding meta-data when unpacking the file.

file(1) could use the same heuristic as git get-tar-commit-id. 
Something like this would work (the first line is already shipped with 
file):

	257	string	ustar\0 POSIX tar archive
	>156	string	g
	>>512	string	52\ comment=
	>>>523	regex	[0-9a-f]{40}	\b, git commit %s

NB: With Ian Darwin's file you need to use -e tar in order to turn off 
its internal tar test.
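
For illustration, the same checks as the magic rules above can be sketched
in Python.  This is only a sketch of the heuristic as described here, not
git's actual implementation; the function name git_commit_from_tar is
made up, and it assumes the archive is uncompressed and that the comment
is the first record of the global header.

```python
import re

BLOCK = 512  # tar block size


def git_commit_from_tar(data: bytes):
    """Return the commit ID from a leading pax comment record, or None."""
    if len(data) < 2 * BLOCK:
        return None
    # Offset 257 holds the ustar magic; offset 156 holds the typeflag,
    # where 'g' marks a pax global extended header.
    if data[257:262] != b"ustar" or data[156:157] != b"g":
        return None
    # The header payload starts in the second block.  "52" is the record
    # length: len("52 comment=") + 40 hex digits + newline = 11 + 40 + 1.
    record = data[BLOCK:BLOCK + 52]
    match = re.fullmatch(rb"52 comment=([0-9a-f]{40})\n", record)
    return match.group(1).decode("ascii") if match else None
```

It would be called on the first two blocks of the archive, e.g.
git_commit_from_tar(open(path, "rb").read(1024)).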

I'm very interested in hearing about any git-specific bugs.

>>>>> It would be much more useful if git created a
>>>>> custom key. As per POSIX suggestions, something like this would be
>>>>> appropriate:
>>>>>
>>>>>      GIT.commit=57ca140635bf157354124e4e4b3c8e1bde2832f1
>>>>
>>>> This would be included in addition to the comment in order to avoid
>>>> breaking existing users, I guess.
>>>
>>> Good point.  I'm not sure how many users use the comment header at all.
>>
>> Apart from git get-tar-commit-id I don't know any program for
>> extracting pax comments.  And I don't know how widely used that is,
>> but I assume there is *someone* out there, extracting commit IDs
>> with it.
>
> Neither do I.  But remember, POSIX explicitly specifies that programs
> that parse pax files must ignore pax comments, so an unpacker that
> interprets the content of such a comment in any way is in violation of
> the pax specification.

Almost right: The spec says that *pax* shall ignore comments.  Which is 
good -- we can use this field to transport anything without pax complaining.
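
For reference, a pax extended header record has the form
"<length> <keyword>=<value>\n", where the decimal length counts the whole
record including its own digits.  A minimal sketch of how the existing
comment and a vendor key such as the proposed GIT.commit would be encoded
(the helper name pax_record is made up for this example, and GIT.commit is
only the proposal from this thread, not something git writes):

```python
def pax_record(keyword: str, value: str) -> bytes:
    """Encode one pax extended header record."""
    body = f" {keyword}={value}\n".encode("utf-8")
    # The length prefix counts its own digits, so iterate until the
    # candidate length is consistent with the number of digits it needs.
    length = len(body)
    while True:
        total = len(body) + len(str(length))
        if total == length:
            return str(length).encode("ascii") + body
        length = total


commit = "57ca140635bf157354124e4e4b3c8e1bde2832f1"
header = pax_record("comment", commit) + pax_record("GIT.commit", commit)
```

The comment record comes out with the length prefix 52, which is where
the "52\ comment=" string in the magic rule above comes from.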

>>>> If you have a random archive and want to know if it was generated by
>>>> git then your next question might be which options and substitutions
>>>> were used.  That reminds me of this thread regarding verifiable
>>>> archives:
>>>>
>>>>      http://article.gmane.org/gmane.comp.version-control.git/240244
>>>
>>> Good point.  Something like this should be enough to get
>>> reproducible archives if archives with a tree ID were to have a time
>>> stamp of 0 (1970-01-01) instead of the current date:
>>>
>>>      comment=...    (for compatibility)
>>>      GIT.commit=... (like comment)
>>>      GIT.umask=...  (tar.umask)
>>>      GIT.prefix=... (--prefix=)
>>>      GIT.path=...   (see below)
>>>      GIT.export-subst=1 (in extended header instead of global header)
>>>
>>> A different key such as GIT.treeish might be appropriate.  The
>>> GIT.export-subst key should be set only for those files where a
>>> substitution has taken place.
>>
>> What would GIT.export-subst contain? There can be multiple
>> replacements in a file.
>
> GIT.export-subst would only contain a 1 if substitution is turned on.
> The goal is to have reproducible archives, not the ability to turn an
> archive back into a git repository.

OK.

>>> Maybe there should also be a
>>> GIT.original-name key.
>>
>> What would it be used for?
>
> In case an export substitution changes the file name so the implementation
> can verify that the original file could plausibly have been substituted
> into the current name.  Also for the case where multiple files
> substitute into the same name to tell which file git should check
> equivalency with.

Stupid question: Could you please provide an example?  The only 
possibility for name changes that I'm aware of is using --prefix.

>>> An option GIT.export-ignore is not required.  Instead it would be more
>>> useful to have a special file type G (for git) with the convention that
>>> the file name .gitattributes means “attributes that apply to this git
>>> archive.”
>>
>> That would be a non-standard extension.  Archivers would extract
>> these as regular files.  Storing a list of excluded paths (in
>> GIT.exclude or so) might be a better idea.
>
> No, that's not a good idea as pax headers are interpreted as “attributes
> pertaining to a file.”  A file doesn't have the attribute that other
> files have been omitted.  Making this a special file type is useful as
> it allows archivers that don't implement git extensions to recover this
> information in a useful way (after all, the .gitattributes file took
> part in creating the archive) and, more importantly, reserves a file
> type for future git extensions.

We can interpret our own keywords as we see fit.  Other programs will 
ignore them (or at most print a warning).  There are precedents for 
global headers pertaining to the whole archive, e.g. SCHILY.archtype of 
star by Jörg Schilling.

Letting archivers extract meta data as regular files is annoying to 
those that are not interested in it.  Extended headers themselves (type 
g) are bad enough already in this regard for those stuck with old tar 
versions.

>>> The GIT.path option holds the paths that are being archived. It is a bit
>>> tricky to get right.  The intent of POSIX pax headers is that each key
>>> is an attribute that applies to a series of files.  In the case of a
>>> global header, each key applies until it is overridden with a new
>>> header or with a local header.  A GIT.path key should only apply to the
>>> files that correspond to this path operand to git archive.  Thus, a new
>>> GIT.path should be written frequently.  There should always be at least
>>> one GIT.path.
>>
>> That's for the optional path parameters of git archive, right?  A
>> list of included paths (GIT.include) would be simpler and should
>> suffice, no?
>
> No.  Again: An attribute in a pax header pertains to a file.  It's metadata
> attached to a file, not metadata attached to the whole archive, even when
> part of a global header.  Thus each file should have attached what path
> operand it came from.  A file doesn't have the attribute of what other path
> operands git received; only the path operand that caused the inclusion of
> that one file is an attribute of the file.

Not an issue; we can make our own rules for our own keywords.
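
To make the override rule from the quoted text concrete (keys from a
global header apply until a later global header or a per-file extended
header overrides them), here is a small sketch.  The list-of-tuples
archive model is a simplification invented for this example, and
GIT.path and GIT.export-subst are only the keys proposed in this thread.

```python
def effective_attrs(entries):
    """Resolve the pax attributes that apply to each file.

    entries is a simplified, hypothetical archive model: a list of
    ("g", dict), ("x", dict) or ("file", name) tuples in archive order.
    """
    global_attrs = {}  # from 'g' headers, persist across files
    local_attrs = {}   # from an 'x' header, apply to the next file only
    resolved = []
    for kind, payload in entries:
        if kind == "g":
            global_attrs.update(payload)
        elif kind == "x":
            local_attrs = dict(payload)
        else:
            # Per-file keys take precedence over global ones.
            resolved.append((payload, {**global_attrs, **local_attrs}))
            local_attrs = {}
    return resolved


archive = [
    ("g", {"GIT.path": "src"}),
    ("file", "src/a.c"),
    ("x", {"GIT.export-subst": "1"}),
    ("file", "src/version.h"),
    ("g", {"GIT.path": "doc"}),
    ("file", "doc/README"),
]
result = effective_attrs(archive)
```

In this model a per-file GIT.path needs a new global header whenever the
path operand changes, whereas a single archive-wide list would need only
one record.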

>>> It might be a good idea to be able to control the kind of metadata git
>>> adds to the archive as to be able to not leak any confidential
>>> information with git archive.  If you are interested I can try to make a
>>> specification for these headers.
>>
>> Which of the fields might be sensitive?
>
> The existence of a git-specific pax header is sensitive as it proves
> that a git archive of the source code exists.  This can be a problem if
> you want to plausibly deny the possession of other versions of the
> source code you distribute.  The existence of export-ignore meta data
> leaks information about what other files are in the repository the
> archive was created from and can be critical.  The existence of
> path-operand meta data can show what path structure the repository has
> which can be sensitive.  Basically the existence of any information
> besides the information you want to add itself is sensitive.

OK.

>> Users can always go back to the original format.  At least I don't
>> expect this new format becoming the default too quickly.
>
> Sure thing.  If this is going to be implemented, I would add options
> to choose what / what style of metadata to include.

Alright.  (An environment requiring these options sounds scary, though.)

René
