git archive should use vendor extension in pax header

* git archive should use vendor extension in pax header
@ 2016-01-24 15:59 fuz
  2016-01-26 22:06 ` René Scharfe
  0 siblings, 1 reply; 9+ messages in thread
From: fuz @ 2016-01-24 15:59 UTC (permalink / raw)
  To: git

Right now, git archive creates a pax global header of the form

    comment=57ca140635bf157354124e4e4b3c8e1bde2832f1

in tar archives it creates. This is suboptimal, as comments are
specified to be ignored by extraction software. It is impossible to
find out in an automatic way (short of guessing) that this is supposed
to be a commit hash. It would be much more useful if git created a
custom key. As per POSIX suggestions, something like this would be
appropriate:

    GIT.commit=57ca140635bf157354124e4e4b3c8e1bde2832f1
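
For illustration: a pax extended header record has the form
"<length> <key>=<value>\n", where the decimal length counts the whole
record, including the length digits themselves.  The Python sketch below
builds the record git archive writes today and the one proposed here.
The helper name pax_record is made up for this example and is not part
of git.

```python
def pax_record(key: str, value: str) -> bytes:
    """Build one pax extended header record: b"<len> <key>=<value>\\n"."""
    body = f" {key}={value}\n".encode("utf-8")
    # The length field counts itself, so iterate until it is stable.
    total = len(body) + len(str(len(body)))
    while len(str(total)) + len(body) != total:
        total = len(str(total)) + len(body)
    return str(total).encode("ascii") + body


commit = "57ca140635bf157354124e4e4b3c8e1bde2832f1"
print(pax_record("comment", commit))     # record written today
print(pax_record("GIT.commit", commit))  # record proposed in this mail
```

Only the keyword differs between the two records; the value and the
record layout stay the same.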

Please consider this suggestion.

Yours sincerely,
Robert Clausecker

-- 
()  ascii ribbon campaign - for an 8-bit clean world 
/\  - against html email  - against proprietary attachments

* Re: git archive should use vendor extension in pax header
  2016-01-24 15:59 git archive should use vendor extension in pax header fuz
@ 2016-01-26 22:06 ` René Scharfe
  2016-01-27 23:25   ` fuz
       [not found]   ` <20160127114634.GA1976@fuz.su>
  0 siblings, 2 replies; 9+ messages in thread
From: René Scharfe @ 2016-01-26 22:06 UTC (permalink / raw)
  To: fuz, git

On 24.01.2016 at 16:59, fuz@fuz.su wrote:
> Right now, git archive creates a pax global header of the form
>
>      comment=57ca140635bf157354124e4e4b3c8e1bde2832f1
>
> in tar archives it creates. This is suboptimal as as comments are
> specified to be ignored by extraction software. It is impossible to
> find out in an automatic way (short of guessing) that this is supposed
> to be a commit hash.

This is only a problem if you don't know how a given tar file was
created (or modified later).  How did you get into this situation?  Or 
in other words: Please tell me more about your use case.

> It would be much more useful if git created a
> custom key. As per POSIX suggestions, something like this would be
> appropriate:
>
>      GIT.commit=57ca140635bf157354124e4e4b3c8e1bde2832f1

This would be included in addition to the comment in order to avoid 
breaking existing users, I guess.

If you have a random archive and want to know if it was generated by git 
then your next question might be which options and substitutions were 
used.  That reminds me of this thread regarding verifiable archives:

     http://article.gmane.org/gmane.comp.version-control.git/240244

Thanks,
René

* Re: git archive should use vendor extension in pax header
  2016-01-26 22:06 ` René Scharfe
@ 2016-01-27 23:25   ` fuz
       [not found]   ` <20160127114634.GA1976@fuz.su>
  1 sibling, 0 replies; 9+ messages in thread
From: fuz @ 2016-01-27 23:25 UTC (permalink / raw)
  To: git

On Tue, Jan 26, 2016 at 11:06:25PM +0100, René Scharfe wrote:
> On 24.01.2016 at 16:59, fuz@fuz.su wrote:
> >Right now, git archive creates a pax global header of the form
> >
> >     comment=57ca140635bf157354124e4e4b3c8e1bde2832f1
> >
> >in tar archives it creates. This is suboptimal as as comments are
> >specified to be ignored by extraction software. It is impossible to
> >find out in an automatic way (short of guessing) that this is supposed
> >to be a commit hash.
> 
> This is only a problem if you don't know how a given tar files was
> created (or modified later).  How did you get into this situation?
> Or in other words: Please tell me more about your use case.

My situation is that I'm interested in knowing if an archive was created
by git so I can find out where the corresponding repository is and find
out which commit this archive was created from.  Right now the only way
is to open a hex editor, as archiving software is instructed to ignore
the content of comment headers.  This is clearly a suboptimal situation.

> >It would be much more useful if git created a
> >custom key. As per POSIX suggestions, something like this would be
> >appropriate:
> >
> >     GIT.commit=57ca140635bf157354124e4e4b3c8e1bde2832f1
> 
> This would be included in addition to the comment in order to avoid
> breaking existing users, I guess.

Good point.  I'm not sure how many users use the comment header at all.

> If you have a random archive and want to know if it was generated by
> git then your next question might be which options and substitutions
> were used.  That reminds me of this thread regarding verifiable
> archives:
> 
>     http://article.gmane.org/gmane.comp.version-control.git/240244

Good point.  Something like this should be enough to have reproducible
archives if archives with a tree ID were to have a time stamp of 0
(1970-01-01) instead of the current date:

    comment=...    (for compatibility)
    GIT.commit=... (like comment)
    GIT.umask=...  (tar.umask)
    GIT.prefix=... (--prefix=)
    GIT.path=...   (see below)
    GIT.export-subst=1 (in extended header instead of global header)

A different key such as GIT.treeish might be appropriate.  The
GIT.export-subst key should be set only for those files where a
substitution has taken place. Maybe there should also be a
GIT.original-name key.

An option GIT.export-ignore is not required.  Instead it would be more
useful to have a special file type G (for git) with the convention that
the file name .gitattributes means “attributes that apply to this git
archive.”

The GIT.path option holds the paths that are being archived. It is a bit
tricky to get right.  The intent of POSIX pax headers is that each key
is an attribute that applies to a series of files.  In the case of a
global header, each key applies until it is overridden with a new
header or with a local header.  A GIT.path key should only apply to the
files that correspond to this path operand to git archive.  Thus, a new
GIT.path should be written frequently.  There should always be at least
one GIT.path.
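
The scoping rule described above can be sketched in Python.  This is
only a model of the pax semantics as described in this mail, not code
from git or from any archiver.  GIT.path and GIT.export-subst are the
keys proposed here, and the (kind, payload) tuples are a made-up
representation of an archive.

```python
def effective_headers(entries):
    """Yield (file name, attributes in effect) for each file entry.

    entries is a list of (kind, payload) tuples, where kind is
    "g" (global header), "x" (per-file extended header) or "file".
    """
    global_attrs = {}
    local_attrs = {}
    for kind, payload in entries:
        if kind == "g":
            # A global header stays in effect until a later one overrides it.
            global_attrs.update(payload)
        elif kind == "x":
            # A local header applies to the next file only.
            local_attrs = dict(payload)
        else:
            yield payload, {**global_attrs, **local_attrs}
            local_attrs = {}


archive = [
    ("g", {"GIT.commit": "57ca140635bf157354124e4e4b3c8e1bde2832f1",
           "GIT.path": "src"}),
    ("file", "src/main.c"),
    ("g", {"GIT.path": "doc"}),
    ("x", {"GIT.export-subst": "1"}),
    ("file", "doc/README"),
    ("file", "doc/INSTALL"),
]
for name, attrs in effective_headers(archive):
    print(name, attrs.get("GIT.path"), attrs.get("GIT.export-subst"))
```

In this model a second global header is what switches GIT.path from one
path operand to the next, while GIT.commit carries over unchanged.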

It might be a good idea to be able to control the kind of metadata git
adds to the archive, so as not to leak any confidential information
with git archive.  If you are interested I can try to make a
specification for these headers.

Yours sincerely,
Robert Clausecker

[parent not found: <20160127114634.GA1976@fuz.su>]

[parent not found: <56A92913.3030909@web.de>]

* Re: git archive should use vendor extension in pax header
       [not found]     ` <56A92913.3030909@web.de>
@ 2016-01-27 23:45       ` fuz
  2016-01-28  8:13         ` Johannes Schindelin
  2016-02-06 13:23         ` René Scharfe
  0 siblings, 2 replies; 9+ messages in thread
From: fuz @ 2016-01-27 23:45 UTC (permalink / raw)
  To: git

Hello René,

On Wed, Jan 27, 2016 at 09:31:15PM +0100, René Scharfe wrote:
> Hello Robert,
> 
> it's customary to discuss in the open by copying the list.  Unless
> there are secrets involved, but I don't see any below.  I kept it
> private anyway in case I missed any, but please cc:
> git@vger.kernel.org on your reply if possible.

I'm sorry for mis-sending the last mail; I have sent it to the list again.

> >>This is only a problem if you don't know how a given tar files was
> >>created (or modified later).  How did you get into this situation?
> >>Or in other words: Please tell me more about your use case.
> >
> >My situation is that I'm interested in knowing if an archive was created
> >by git so I can find out where the corresponding repository is and find
> >out which commit this archive was created from.  Right now the only way
> >is to open a hex editor or as archiving software is instructed to ignore
> >the content of comment headers.  This is clearly a suboptimal situation.
> 
> There is git get-tar-commit-id, which prints the commit ID if it
> finds a comment entry which looks like a hexadecimal SHA-1 hash.
> It's better than a hex editor at least. :)

This is incredibly fuzzy and can go wrong for a plethora of reasons.
I hope you agree though that the situation is suboptimal: git is doing
the equivalent of using a custom file format without an easily
recognizable magic number.

> But I'm still interested how you got a collection of tar files with
> unknown origin.  Just curious.

Easy: Just download the (source) distribution archives of a distribution
of choice and try to verify that the tarballs they use to compile their
packages actually come from the project's public git repositories.

There are other reasons why an easily detectable git hash might be
useful.  For example, file(1) could show that the archive comes from
git.  Other utilities could use this to work around git-specific bugs.
An unpacker could add corresponding meta-data when unpacking the file.

> >>>It would be much more useful if git created a
> >>>custom key. As per POSIX suggestions, something like this would be
> >>>appropriate:
> >>>
> >>>     GIT.commit=57ca140635bf157354124e4e4b3c8e1bde2832f1
> >>
> >>This would be included in addition to the comment in order to avoid
> >>breaking existing users, I guess.
> >
> >Good point.  I'm not sure how many user use the comment header at all.
> 
> Apart from git get-tar-commit-id I don't know any program for
> extracting pax comments.  And I don't know how widely used that is,
> but I assume there is *someone* out there, extracting commit IDs
> with it.

Neither do I.  But remember, POSIX explicitly specifies that programs
that parse pax files must ignore pax comments, so an unpacker that
interprets the content of such a comment in any way is in violation of
the pax specification.

> >>If you have a random archive and want to know if it was generated by
> >>git then your next question might be which options and substitutions
> >>were used.  That reminds me of this thread regarding verifiable
> >>archives:
> >>
> >>     http://article.gmane.org/gmane.comp.version-control.git/240244
> >
> >Good point.  Something like this should be enough to be enough to have
> >reproducable archives if archives with a tree ID were to have a time
> >stamp of 0 (1970-01-01) instead of the current date:
> >
> >     comment=...    (for compatibility)
> >     GIT.commit=... (like comment)
> >     GIT.umask=...  (tar.umask)
> >     GIT.prefix=... (--prefix=)
> >     GIT.path=...   (see below
> >     GIT.export-subst=1 (in extended header instead of global header)
> >
> >A different key such as GIT.treeish might be appropriate.  The
> >GIT.export-subst key should be set only for those files where a
> >substitution has taken place.
> 
> What would GIT.export-subst contain? There can be multiple
> replacements in a file.

GIT.export-subst would only contain a 1 if substitution is turned on.
The goal is to have reproducible archives, not the ability to turn an
archive back into a git repository.

> >Maybe there should also be an
> >GIT.original-name key.
> 
> What would it be used for?

In case an export substitution changes the file name, so the
implementation can verify that the original file could plausibly have
been substituted into the current name.  Also for the case where
multiple files substitute into the same name, to tell which file git
should check equivalency with.

> >An option GIT.export-ignore is not required.  Instead it would be more
> >useful to have a special file type G (for git) with the convention that
> >the file name .gitattributes means “attributes that apply to this git
> >archive.”
> 
> That would be a non-standard extension.  Archivers would extract
> these as regular files.  Storing a list of excluded paths (in
> GIT.exclude or so) might be a better idea.

No, that's not a good idea as pax headers are interpreted as “attributes
pertaining to a file.”  A file doesn't have the attribute that other
files have been omitted.  Making this a special file type is useful as
it allows archivers that don't implement git extensions to recover this
information in a useful way (after all, the .gitattributes file took
part in creating the archive) and, more importantly, reserves a file
type for future git extensions.

> >The GIT.path option holds the paths that are being archived. It is a bit
> >tricky to get right.  The intent of POSIX pax headers is that each key
> >is an attribute that applies to a series of files.  In the case of a
> >global header, each key applies until it is overridden with a new
> >header or with a local header.  A GIT.path key should only apply to the
> >files that correspond to this path operant to git archive.  Thus, a new
> >GIT.path should be written frequently.  There should always be at least
> >one GIT.path.
> 
> That's for the optional path parameters of git archive, right?  A
> list of included paths (GIT.include) would be simpler and should
> suffice, no?

No.  Again: An attribute in a pax header pertains to a file.  It's
metadata attached to a file, not metadata attached to the whole archive,
even when part of a global header.  Thus each file should carry the path
operand it came from.  Which other path operands git received is not an
attribute of a file; only the path operand that caused the inclusion of
that one file is.

> >It might be a good idea to be able to control the kind of metadata git
> >adds to the archive as to be able to not leak any confidential
> >information with git archive.  If you are interested I can try to make a
> >specification for these headers.
> 
> Which of the field might be sensitive?

The existence of a git-specific pax header is sensitive as it proves
that a git archive of the source code exists.  This can be a problem if
you want to plausibly deny the possession of other versions of the
source code you distribute.  The existence of export-ignore meta data
leaks information about what other files are in the repository the
archive was created from and can be critical.  The existence of
path-operand meta data can show what path structure the repository has
which can be sensitive.  Basically the existence of any information
besides the information you want to add itself is sensitive.

> Users can always go back to the original format.  At least I don't
> expect this new format becoming the default too quickly.

Sure thing.  If this is going to be implemented, I would add options
to choose what / what style of metadata to include.

> An extractor is needed -- unlike the comment field (which is at
> least mentioned in the spec) I can't see any generic archiver to
> add support for the git specific fields.

For most archivers, the support comprises ignoring them (and not warning
about the unrecognized fields).  But custom software could use them in
useful ways, e.g. to verify the validity of an archive.  The comment
field has the reliability problems outlined above.  It's like a file
format without a magic number.

> René

Best regards,
Robert Clausecker

* Re: git archive should use vendor extension in pax header
  2016-01-27 23:45       ` fuz
@ 2016-01-28  8:13         ` Johannes Schindelin
  2016-01-28  9:14           ` fuz
  2016-02-06 13:23         ` René Scharfe
  1 sibling, 1 reply; 9+ messages in thread
From: Johannes Schindelin @ 2016-01-28  8:13 UTC (permalink / raw)
  To: fuz; +Cc: git

Hi Robert,

[I am not going to re-Cc: the dropped email addresses; please note that it
is pretty much frowned upon on this mailing list if you do not
reply-to-all and might affect your conversation.]

On Thu, 28 Jan 2016, fuz@fuz.su wrote:

> On Wed, Jan 27, 2016 at 09:31:15PM +0100, René Scharfe wrote:
>
> > Users can always go back to the original format.  At least I don't
> > expect this new format becoming the default too quickly.

This is the most crucial issue here, as far as I am concerned: there are
already tons of .tar files out there that were created by git archive, and
there will inevitably be loads of tons more *having the current pax header
format*.

So tools wanting to deal with Git archives will have to handle those as
well, i.e. do *precisely* as René suggested and use get-tar-commit-id. As
such, the value of changing the format *now* is a bit like closing the
barn's door after pretty much all of the horses left (except the old one
that has a few troubles getting up in the morning but that is too nice to
the kids to shoot).

> Sure thing.  If this is going to be implemented, I would add options
> to choose what / what style of metadata to include.

Why not put your money where your mouth is? I.e. get your head down into
the code and come up with a patch (because otherwise it is unlikely that
your idea will go anywhere)?

Ciao,
Johannes

* Re: git archive should use vendor extension in pax header
  2016-01-28  8:13         ` Johannes Schindelin
@ 2016-01-28  9:14           ` fuz
  0 siblings, 0 replies; 9+ messages in thread
From: fuz @ 2016-01-28  9:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git

Hello,

> > > Users can always go back to the original format.  At least I don't
> > > expect this new format becoming the default too quickly.
>
> This is the most crucial issue here, as far as I am concerned: there are
> already tons of .zip files out there that were created by git archive, and
> there will inevitably be loads of tons more *having the current pax header
> format*.
> 
> So tools wanting to deal with Git archives will have to handle those as
> well, i.e. do *precisely* as René suggested and use get-tar-commit-id. As
> such, the value of changing the format *now* is a bit like closing the
> barn's door after pretty much all of the horses left (except the old one
> that has a few troubles getting up in the morning but that is too nice to
> the kids to shoot).

That's not really an argument.  The situation you describe applies to
all file formats and it always ends in the same way:  A new file format
is designed and then slowly adopted by the rest of the users; in the case
of git I imagine this to be a quick process taking maybe a year or two.
Newly created files use the new file format and old files still hang
around but their importance is dwindling until you can safely support
only the new format.  But to get there, a new file format has to be
adopted in the first place.

> > Sure thing.  If this is going to be implemented, I would add options
> > to choose what / what style of metadata to include.
> 
> Why not put your money where your mouth is? I.e. get your head down into
> the code and come up with a patch (because otherwise it is unlikely that
> your idea will go anywhere)?

I'd love to but I prefer to ask if there is interest in such a change in
the first place.  I'm not going to waste my time implementing this and
then being told that the git project is not interested in this kind of
functionality.  So can someone give me a clear signal?

> Ciao,
> Johannes

Yours sincerely,
Robert Clausecker

* Re: git archive should use vendor extension in pax header
  2016-01-27 23:45       ` fuz
  2016-01-28  8:13         ` Johannes Schindelin
@ 2016-02-06 13:23         ` René Scharfe
  2016-02-06 14:57           ` fuz
  1 sibling, 1 reply; 9+ messages in thread
From: René Scharfe @ 2016-02-06 13:23 UTC (permalink / raw)
  To: fuz; +Cc: git

On 28.01.2016 at 00:45, fuz@fuz.su wrote:
>> There is git get-tar-commit-id, which prints the commit ID if it
>> finds a comment entry which looks like a hexadecimal SHA-1 hash.
>> It's better than a hex editor at least. :)
>
> This is incredibly fuzzy and can get wrong for a pleothora of reasons.
> I hope you agree though that the situation is suboptimal, git is doing
> the equivalent of using a custom file format without an easily
> recognizable magic number.

It is fuzzy in theory. But which other programs allow writing a comment 
header?  I'm not aware of any, but I have to admit that I didn't look 
too hard.

>> But I'm still interested how you got a collection of tar files with
>> unknown origin.  Just curious.
>
> Easy: Just download the (source) distribution archives of a distribution
> of choice and try to verify that the tarballs they use to compile their
> packages actually come from the project's public git repositories.

OK, that's easier than calculating checksums and comparing them with 
those published by the respective projects, but also less trustworthy.

> There are other reasons why an easily detectable git hash might be
> useful.  For example, file(1) could show that the archive comes from
> git.  Other utilities could use this to work around git-specific bugs.
> An unpacker could add corresponding meta-data when unpacking the file.

file(1) could use the same heuristic as git get-tar-commit-id. 
Something like this would work (the first line is already shipped with 
file):

	257	string	ustar\0 POSIX tar archive
	>156	string	g
	>>512	string	52\ comment=
	>>>523	regex	[0-9a-f]{40}	\b, git commit %s

NB: With Ian Darwin's file you need to use -e tar in order to turn off 
its internal tar test.
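
The same heuristic can be sketched in Python.  This only illustrates the
offsets used in the rule above; it is not the code of git
get-tar-commit-id or of file, and the function name tar_commit_id is
made up for this example.

```python
import re

BLOCK = 512  # tar block size in bytes


def tar_commit_id(data: bytes):
    """Return the commit ID from a leading pax global header, else None."""
    if len(data) < 2 * BLOCK:
        return None
    if data[257:263] != b"ustar\0":    # POSIX tar magic
        return None
    if data[156:157] != b"g":          # typeflag: global extended header
        return None
    if data[512:523] != b"52 comment=":
        return None
    match = re.match(rb"[0-9a-f]{40}", data[523:563])
    return match.group(0).decode("ascii") if match else None


# Hand-built two-block sample; a real archive has more header fields set.
commit = b"57ca140635bf157354124e4e4b3c8e1bde2832f1"
header = bytearray(BLOCK)
header[156:157] = b"g"
header[257:263] = b"ustar\0"
content = (b"52 comment=" + commit + b"\n").ljust(BLOCK, b"\0")
print(tar_commit_id(bytes(header) + content))
```

For a real file it is enough to pass the first 1024 bytes, for example
open(path, "rb").read(1024).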

I'm very interested in hearing about any git specific bugs.

>>>>> It would be much more useful if git created a
>>>>> custom key. As per POSIX suggestions, something like this would be
>>>>> appropriate:
>>>>>
>>>>>      GIT.commit=57ca140635bf157354124e4e4b3c8e1bde2832f1
>>>>
>>>> This would be included in addition to the comment in order to avoid
>>>> breaking existing users, I guess.
>>>
>>> Good point.  I'm not sure how many user use the comment header at all.
>>
>> Apart from git get-tar-commit-id I don't know any program for
>> extracting pax comments.  And I don't know how widely used that is,
>> but I assume there is *someone* out there, extracting commit IDs
>> with it.
>
> Neither do I.  But remember, POSIX explicitly specifies that programs
> that parse pax file must ignore pax comments so an unpacker that
> interpretes the content of such a comment in any way is in violation of
> the pax specification.

Almost right: The spec says that *pax* shall ignore comments.  Which is 
good -- we can use this field to transport anything without pax complaining.

>>>> If you have a random archive and want to know if it was generated by
>>>> git then your next question might be which options and substitutions
>>>> were used.  That reminds me of this thread regarding verifiable
>>>> archives:
>>>>
>>>>      http://article.gmane.org/gmane.comp.version-control.git/240244
>>>
>>> Good point.  Something like this should be enough to be enough to have
>>> reproducable archives if archives with a tree ID were to have a time
>>> stamp of 0 (1970-01-01) instead of the current date:
>>>
>>>      comment=...    (for compatibility)
>>>      GIT.commit=... (like comment)
>>>      GIT.umask=...  (tar.umask)
>>>      GIT.prefix=... (--prefix=)
>>>      GIT.path=...   (see below
>>>      GIT.export-subst=1 (in extended header instead of global header)
>>>
>>> A different key such as GIT.treeish might be appropriate.  The
>>> GIT.export-subst key should be set only for those files where a
>>> substitution has taken place.
>>
>> What would GIT.export-subst contain? There can be multiple
>> replacements in a file.
>
> GIT.export-subst would only contain a 1 if substitution is turned on.
> The goal is to have reproduceable archives, not the ability to turn an
> archive back into a git repository.

OK.

>>> Maybe there should also be an
>>> GIT.original-name key.
>>
>> What would it be used for?
>
> In case an export substition changes the file name so the implementation
> can verify that the original file could plausibly have been substituted
> into the current name.  Also for the case where multiple files
> substitute into the same name to tell which file git should check
> equivalency with.

Stupid question: Could you please provide an example?  The only 
possibility for name changes that I'm aware of is using --prefix.

>>> An option GIT.export-ignore is not required.  Instead it would be more
>>> useful to have a special file type G (for git) with the convention that
>>> the file name .gitattributes means “attributes that apply to this git
>>> archive.”
>>
>> That would be a non-standard extension.  Archivers would extract
>> these as regular files.  Storing a list of excluded paths (in
>> GIT.exclude or so) might be a better idea.
>
> No, that's not a good idea as pax headers are interpreted as “attributes
> pertaining to a file.”  A file doesn't have the attribute that other
> files have been omitted.  Making this a special file type is useful as
> it allows archivers that don't implement git extensions to recover this
> information in a useful way (after all, the .gitattributes file took
> part in creating the archive) and, more importantly, reserves a file
> type for future git extensions.

We can interpret our own keywords as we see fit.  Other programs will 
ignore them (or at most print a warning).  There are precedents for 
global headers pertaining to the whole archive, e.g. SCHILY.archtype of 
star by Jörg Schilling.

Letting archivers extract meta data as regular files is annoying to 
those that are not interested in it.  Extended headers themselves (type 
g) are bad enough already in this regard for those stuck with old tar 
versions.

>>> The GIT.path option holds the paths that are being archived. It is a bit
>>> tricky to get right.  The intent of POSIX pax headers is that each key
>>> is an attribute that applies to a series of files.  In the case of a
>>> global header, each key applies until it is overridden with a new
>>> header or with a local header.  A GIT.path key should only apply to the
>>> files that correspond to this path operant to git archive.  Thus, a new
>>> GIT.path should be written frequently.  There should always be at least
>>> one GIT.path.
>>
>> That's for the optional path parameters of git archive, right?  A
>> list of included paths (GIT.include) would be simpler and should
>> suffice, no?
>
> No.  Again: An attribute in a pax header pertains a file.  It's metadata
> attached to a file, not metadata attached to the whole archive, even when
> part of a global header.  Thus each file should have attached what path
> operand it came from.  A file doesn't have the attribute what other path
> operands git received, only the path operand that caused the inclusion of
> that one file is an attribute of the file.

Not an issue; we can make our own rules for our own keywords.

>>> It might be a good idea to be able to control the kind of metadata git
>>> adds to the archive as to be able to not leak any confidential
>>> information with git archive.  If you are interested I can try to make a
>>> specification for these headers.
>>
>> Which of the field might be sensitive?
>
> The existence of a git-specific pax header is sensitive as it proves
> that a git archive of the source code exists.  This can be a problem if
> you want to plausibly deny the possession of other versions of the
> source code you distribute.  The existence of export-ignore meta data
> leaks information about what other files are in the repository the
> archive was created from and can be critical.  The existence of
> path-operand meta data can show what path structure the repository has
> which can be sensitive.  Basically the existence of any information
> besides the information you want to add itself is sensitive.

OK..

>> Users can always go back to the original format.  At least I don't
>> expect this new format becoming the default too quickly.
>
> Sure thing.  If this is going to be implemented, I would add options
> to choose what / what style of metadata to include.

Alright.  (An environment requiring these options sounds scary, though.)

René

* Re: git archive should use vendor extension in pax header
  2016-02-06 13:23         ` René Scharfe
@ 2016-02-06 14:57           ` fuz
  2016-02-15 20:25             ` René Scharfe
  0 siblings, 1 reply; 9+ messages in thread
From: fuz @ 2016-02-06 14:57 UTC (permalink / raw)
  To: René Scharfe, git

On Sat, Feb 06, 2016 at 02:23:11PM +0100, René Scharfe wrote:
> On 28.01.2016 at 00:45, fuz@fuz.su wrote:
> >>There is git get-tar-commit-id, which prints the commit ID if it
> >>finds a comment entry which looks like a hexadecimal SHA-1 hash.
> >>It's better than a hex editor at least. :)
> >
> >This is incredibly fuzzy and can get wrong for a pleothora of reasons.
> >I hope you agree though that the situation is suboptimal, git is doing
> >the equivalent of using a custom file format without an easily
> >recognizable magic number.
> 
> It is fuzzy in theory. But which other programs allow writing a
> comment header?  I'm not aware of any, but I have to admit that I
> didn't look too hard.

Well, what happens if the Mercurial folks were to implement
the same thing? Suddenly there is a conflict. Yes, of course, right now
there might be no program that uses the comment field for its own
purpose, but such design decisions tend not to be future-proof. There is
a very good reason why file formats typically have magic numbers and
don't just rely on people knowing that the file has a certain type and
that is the same reason why git should mark its meta data in a unique
fashion.

> >>But I'm still interested how you got a collection of tar files with
> >>unknown origin.  Just curious.
> >
> >Easy: Just download the (source) distribution archives of a distribution
> >of choice and try to verify that the tarballs they use to compile their
> >packages actually come from the project's public git repositories.
> 
> OK, that's easier than calculating checksums and comparing them with
> those published by the respective projects, but also less
> trustworthy.

If you have a known trusted archive, you could use it directly, no need
for cross-verification. The intent is to be able to check if archives
generated by someone from some sources could have plausibly been
generated from these sources.

> >There are other reasons why an easily detectable git hash might be
> >useful.  For example, file(1) could show that the archive comes from
> >git.  Other utilities could use this to work around git-specific bugs.
> >An unpacker could add corresponding meta-data when unpacking the file.
> 
> file(1) could use the same heuristic as git get-tar-commit-id.
> Something like this would work (the first line is already shipped
> with file):
> 
> 	257	string	ustar\0 POSIX tar archive
> 	>156	string	g
> 	>>512	string	52\ comment=
> 	>>>523	regex	[0-9a-f]{40}	\b, git commit %s
> 
> NB: With Ian Darwin's file you need to use -e tar in order to turn
> off its internal tar test.

Same issues as above.

> I'm very interested in hearing about any git specific bugs.

I don't know any. Bugs tend to be known only after 1000s of buggy
archives have been published (just as with some GNU tar bugs). It's
great to have a way to detect that the archive might be affected by
a bug so you know that you need to work around it.

> >>>>>It would be much more useful if git created a
> >>>>>custom key. As per POSIX suggestions, something like this would be
> >>>>>appropriate:
> >>>>>
> >>>>>     GIT.commit=57ca140635bf157354124e4e4b3c8e1bde2832f1
> >>>>
> >>>>This would be included in addition to the comment in order to avoid
> >>>>breaking existing users, I guess.
> >>>
> >>>Good point.  I'm not sure how many user use the comment header at all.
> >>
> >>Apart from git get-tar-commit-id I don't know any program for
> >>extracting pax comments.  And I don't know how widely used that is,
> >>but I assume there is *someone* out there, extracting commit IDs
> >>with it.
> >
> >Neither do I.  But remember, POSIX explicitly specifies that programs
> >that parse pax files must ignore pax comments, so an unpacker that
> >interprets the content of such a comment in any way is in violation of
> >the pax specification.
> 
> Almost right: The spec says that *pax* shall ignore comments.  Which
> is good -- we can use this field to transport anything without pax
> complaining.

The intent of the committee is that comments shall be ignored by all
software that processes tar files. Of course you can put metadata into
the comments, but that is just a perversion of the file format. POSIX
explicitly states that unknown keys in extended headers are to be
ignored by extractors, so using these should be fine, too. But you are
right, some research needs to be done as to how different archivers deal
with unexpected keys in pax headers.
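
To make the record format concrete, here is a minimal Python sketch of how
a single pax extended header record could be built. Each record has the
form "<length> <key>=<value>\n", where the decimal length counts the whole
record including its own digits. The function name pax_record is made up
for this illustration, and GIT.commit is only the key proposed in this
thread, not something git writes today.

```python
def pax_record(key: str, value: str) -> bytes:
    """Build one pax extended header record: b"<len> <key>=<value>\\n".

    The length field counts every byte of the record, including the
    digits of the length itself, so it is found by iterating until the
    total stops changing.
    """
    body = f" {key}={value}\n".encode("utf-8")
    total = len(body)
    while True:
        candidate = len(body) + len(str(total))
        if candidate == total:
            return str(total).encode("ascii") + body
        total = candidate


commit = "57ca140635bf157354124e4e4b3c8e1bde2832f1"
current_style = pax_record("comment", commit)      # what git archive writes now
proposed_style = pax_record("GIT.commit", commit)  # hypothetical vendor key
```

With a 40-digit hash the comment record is 52 bytes long, which is where
the "52 comment=" prefix in the file(1) pattern quoted earlier comes from.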

> >>>Maybe there should also be an
> >>>GIT.original-name key.
> >>
> >>What would it be used for?
> >
> >In case an export substitution changes the file name so the implementation
> >can verify that the original file could plausibly have been substituted
> >into the current name.  Also for the case where multiple files
> >substitute into the same name to tell which file git should check
> >equivalency with.
> 
> Stupid question: Could you please provide an example?  The only
> possibility for name changes that I'm aware of is using --prefix.

I was under the impression that export substitutions also apply to file
names but it seems like I'm wrong about this.

> >>>An option GIT.export-ignore is not required.  Instead it would be more
> >>>useful to have a special file type G (for git) with the convention that
> >>>the file name .gitattributes means “attributes that apply to this git
> >>>archive.”
> >>
> >>That would be a non-standard extension.  Archivers would extract
> >>these as regular files.  Storing a list of excluded paths (in
> >>GIT.exclude or so) might be a better idea.
> >
> >No, that's not a good idea as pax headers are interpreted as “attributes
> >pertaining to a file.”  A file doesn't have the attribute that other
> >files have been omitted.  Making this a special file type is useful as
> >it allows archivers that don't implement git extensions to recover this
> >information in a useful way (after all, the .gitattributes file took
> >part in creating the archive) and, more importantly, reserves a file
> >type for future git extensions.
> 
> We can interpret our own keywords as we see fit.  Other programs
> will ignore them (or at most print a warning).  There are precedents
> for global headers pertaining to the whole archive, e.g.
> SCHILY.archtype of star by Jörg Schilling.

Since a tar archive is semantically a concatenation of individual file
records, it makes sense that each file has the attribute “this file has
star extensions.”

Having thought about the problem a bit more and discussed it with the
aforementioned Jörg Schilling, we came to the conclusion that the best
way to deal with a “file omitted” attribute is to attach it to the
directory that would normally contain the omitted file.

> Letting archivers extract meta data as regular files is annoying to
> those that are not interested in it.  Extended headers themselves
> (type g) are bad enough already in this regard for those stuck with
> old tar versions.

I think we can safely assume that systems support pax headers 15 years
after they have been standardized. I was actually unable to find a
non-historical version of a serious archiver that claims to support tar
archives but doesn't support pax headers.

> >>>The GIT.path option holds the paths that are being archived. It is a bit
> >>>tricky to get right.  The intent of POSIX pax headers is that each key
> >>>is an attribute that applies to a series of files.  In the case of a
> >>>global header, each key applies until it is overridden with a new
> >>>header or with a local header.  A GIT.path key should only apply to the
> >>>files that correspond to this path operand to git archive.  Thus, a new
> >>>GIT.path should be written frequently.  There should always be at least
> >>>one GIT.path.
> >>
> >>That's for the optional path parameters of git archive, right?  A
> >>list of included paths (GIT.include) would be simpler and should
> >>suffice, no?
> >
> >No.  Again: An attribute in a pax header pertains to a file.  It's metadata
> >attached to a file, not metadata attached to the whole archive, even when
> >part of a global header.  Thus each file should have attached what path
> >operand it came from.  A file doesn't have the attribute what other path
> >operands git received, only the path operand that caused the inclusion of
> >that one file is an attribute of the file.
> 
> Not an issue; we can make our own rules for our own keywords.

Well, yes, but they should still stick to the semantic concept POSIX
imposes for extended headers: headers pertain to files and the only
difference between a g header and an x header is that the former applies
until it is revoked by a new g header or overridden by an x header.
Not sticking to this concept can lead to weird problems with programs
that modify tar archives (like GNU-tar) and is not future proof. Better
stick to the standard.
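
A small Python sketch of that scoping rule may help. It is a simulation
only, not a tar parser: the entry list and the function name
effective_keywords are invented for this example. Keywords from a g header
stay in effect until a later g header gives a new value for the same key,
and an x header overrides them for the one file that follows it.

```python
def effective_keywords(entries):
    """Compute the keywords in effect for each file in an archive.

    entries is a list of ("g", dict), ("x", dict) or ("file", name)
    tuples in archive order. Returns {name: dict_of_keywords}.
    """
    global_kw = {}   # accumulated from g headers
    local_kw = {}    # from the x header preceding the next file, if any
    result = {}
    for kind, value in entries:
        if kind == "g":
            # A new g header replaces earlier values key by key.
            global_kw.update(value)
        elif kind == "x":
            local_kw = dict(value)
        elif kind == "file":
            # Local keywords win over global ones for this file only.
            result[value] = {**global_kw, **local_kw}
            local_kw = {}
        else:
            raise ValueError(f"unknown entry kind: {kind!r}")
    return result


archive = [
    ("g", {"GIT.commit": "abc", "GIT.path": "src"}),
    ("file", "src/a.c"),
    ("x", {"GIT.path": "doc/README"}),
    ("file", "doc/README"),
    ("g", {"GIT.path": "lib"}),
    ("file", "lib/b.c"),
]
keywords = effective_keywords(archive)
```

Under this model a tool that reorders or drops entries has to carry the
right header along with each file, which is the kind of problem alluded to
above.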

> >>Users can always go back to the original format.  At least I don't
> >>expect this new format becoming the default too quickly.
> >
> >Sure thing.  If this is going to be implemented, I would add options
> >to choose what / what style of metadata to include.
> 
> Alright.  (An environment requiring these options sounds scary, though.)

Always remember: https://xkcd.com/1172/

> René

Yours,
Robert Clausecker

-- 
()  ascii ribbon campaign - for an 8-bit clean world 
/\  - against html email  - against proprietary attachments

* Re: git archive should use vendor extension in pax header
  2016-02-06 14:57           ` fuz
@ 2016-02-15 20:25             ` René Scharfe
  0 siblings, 0 replies; 9+ messages in thread
From: René Scharfe @ 2016-02-15 20:25 UTC (permalink / raw)
  To: fuz; +Cc: git

Am 06.02.2016 um 15:57 schrieb fuz@fuz.su:
> On Sat, Feb 06, 2016 at 02:23:11PM +0100, René Scharfe wrote:
>> Am 28.01.2016 um 00:45 schrieb fuz@fuz.su:
>>>> There is git get-tar-commit-id, which prints the commit ID if it
>>>> finds a comment entry which looks like a hexadecimal SHA-1 hash.
>>>> It's better than a hex editor at least. :)
>>>
>>> This is incredibly fuzzy and can go wrong for a plethora of reasons.
>>> I hope you agree though that the situation is suboptimal, git is doing
>>> the equivalent of using a custom file format without an easily
>>> recognizable magic number.
>>
>> It is fuzzy in theory. But which other programs allow writing a
>> comment header?  I'm not aware of any, but I have to admit that I
>> didn't look too hard.
>
> Well, let's say what happens if the Mercurial folks were to implement
> the same thing? Suddenly there is a conflict. Yes, of course, right now
> there might be no program that uses the comment field for its own
> purpose but such design decisions tend to be not future proof. There is
> a very good reason why file formats typically have magic numbers and
> don't just rely on people knowing that the file has a certain type and
> that is the same reason why git should mark its meta data in a unique
> fashion.

Chances are good that Mercurial would do it in a way that doesn't 
conflict with git's tar comments.  I get your point, though, and agree 
that it's not ideal.  However, so far it's just a potential problem.

>>>> But I'm still interested how you got a collection of tar files with
>>>> unknown origin.  Just curious.
>>>
>>> Easy: Just download the (source) distribution archives of a distribution
>>> of choice and try to verify that the tarballs they use to compile their
>>> packages actually come from the project's public git repositories.
>>
>> OK, that's easier than calculating checksums and comparing them with
>> those published by the respective projects, but also less
>> trustworthy.
>
> If you have a known trusted archive, you could use it directly, no need
> for cross-verification. The intent is to be able to check if archives
> generated by someone from some sources could have plausibly been
> generated from these sources.

It's probably not too important, but I think I still don't fully 
understand.  So you have a tar file of unknown origin.  You hand it to 
git get-tar-commit-id or a similar tool and get back 
a08595f76159b09d57553e37a5123f1091bb13e7.  You can google this string 
and find out it's the commit ID for git v2.7.1.
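
For illustration, the heuristic discussed in this thread can be sketched
in Python roughly as follows. This mirrors the file(1) pattern quoted
earlier (ustar magic at offset 257, type flag g at offset 156, and
"52 comment=" followed by 40 hex digits at offset 512). It is a sketch
under those assumptions and not a copy of what git get-tar-commit-id
does internally; the name guess_commit_id is made up.

```python
import re

RECORD_SIZE = 512
COMMENT_PREFIX = b"52 comment="


def guess_commit_id(archive: bytes):
    """Return the 40-digit hex ID if the archive starts with a pax global
    header whose first record looks like a git commit comment, else None."""
    if len(archive) < 2 * RECORD_SIZE:
        return None
    header = archive[:RECORD_SIZE]
    content = archive[RECORD_SIZE:2 * RECORD_SIZE]
    if header[257:262] != b"ustar":   # POSIX tar magic
        return None
    if header[156:157] != b"g":       # type flag: global extended header
        return None
    if not content.startswith(COMMENT_PREFIX):
        return None
    start = len(COMMENT_PREFIX)
    candidate = content[start:start + 40]
    if re.fullmatch(rb"[0-9a-f]{40}", candidate) is None:
        return None
    return candidate.decode("ascii")
```

Anything that happens to write a 52-byte comment record holding 40 hex
digits would be reported as a git commit, which is the fuzziness being
discussed.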

Your tar file could have been modified in various ways, though, e.g. 
with tar u or tar --delete.  So you try to find a download site for the 
software that includes file hashes for archives of this release, like in
https://www.kernel.org/pub/software/scm/git/sha256sums.asc.

If the published hash and a hash of your file match then you can be 
reasonably sure the files are the same.  If they don't then it could be 
due to variations added by the compressor.  You can download the 
authoritative archive and compare it with yours.

Is that how it goes?
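
The comparison described above could be sketched like this in Python,
assuming both archives are gzip-compressed and small enough to hold in
memory. The function names and the three result labels are invented for
this example.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def compare_archives(mine: bytes, reference: bytes) -> str:
    """Classify two .tar.gz blobs.

    "identical"    - the compressed files hash to the same value
    "same-content" - only the compressed form differs (e.g. gzip
                     timestamp or compression level), the tar streams match
    "different"    - the tar streams themselves differ
    """
    if sha256_hex(mine) == sha256_hex(reference):
        return "identical"
    if sha256_hex(gzip.decompress(mine)) == sha256_hex(gzip.decompress(reference)):
        return "same-content"
    return "different"
```

A result of "same-content" corresponds to the case where the published
hash does not match but the difference comes from the compressor only.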

>> I'm very interested in hearing about any git specific bugs.
>
> I don't know any. Bugs tend to become known only after 1000s of buggy
> archives have been published (just as with some GNU tar bugs). It's
> great to have a way to detect that the archive might be affected by
> a bug so you know that you need to work around it.

That requires a field containing the git version which was used to 
create the archive, no?

> Having thought about the problem a bit more and discussed it with the
> aforementioned Jörg Schilling, we came to the conclusion that the best
> way to deal with a “file omitted” attribute is to attach it to the
> directory that would normally contain the omitted file.

Sounds sensible, but the ordering can be a bit tricky.  If d/a is 
included and d/b is not then it would be easy to write d/, d/a and the 
extended header that says that d/b is excluded, in that order.  Writing 
the extended header first is a bit harder and I'm not sure if it's 
needed.  And it gets tricky if more than one entry is excluded per 
directory. (Just thinking out loud here.)

>> Letting archivers extract meta data as regular files is annoying to
>> those that are not interested in it.  Extended headers themselves
>> (type g) are bad enough already in this regard for those stuck with
>> old tar versions.
>
> I think we can safely assume that systems support pax headers 15 years
> after they have been standardized. I was actually unable to find a
> non-historical version of a serious archiver that claims to support tar
> archives but doesn't support pax headers.

Well, that depends on your definition of "serious".  Plan 9's tar 
perhaps doesn't fit it, but what about 7-Zip (http://www.7-zip.org/)?

And there is no way (or did I overlook it?) to modify or display the 
comment extended header using GNU tar.  That's actually surprising to 
me: I'd think the ability to add a human-readable description to a 
backup on tape is quite important.  (But I didn't touch an actual tape 
for quite a while, and I never used tar directly with them.)

>>>>> The GIT.path option holds the paths that are being archived. It is a bit
>>>>> tricky to get right.  The intent of POSIX pax headers is that each key
>>>>> is an attribute that applies to a series of files.  In the case of a
>>>>> global header, each key applies until it is overridden with a new
>>>>> header or with a local header.  A GIT.path key should only apply to the
>>>>> files that correspond to this path operand to git archive.  Thus, a new
>>>>> GIT.path should be written frequently.  There should always be at least
>>>>> one GIT.path.
>>>>
>>>> That's for the optional path parameters of git archive, right?  A
>>>> list of included paths (GIT.include) would be simpler and should
>>>> suffice, no?
>>>
>>> No.  Again: An attribute in a pax header pertains to a file.  It's metadata
>>> attached to a file, not metadata attached to the whole archive, even when
>>> part of a global header.  Thus each file should have attached what path
>>> operand it came from.  A file doesn't have the attribute what other path
>>> operands git received, only the path operand that caused the inclusion of
>>> that one file is an attribute of the file.
>>
>> Not an issue; we can make our own rules for our own keywords.
>
> Well, yes, but they should still stick to the semantic concept POSIX
> imposes for extended headers: headers pertain to files and the only
> difference between a g header and an x header is that the former applies
> until it is revoked by a new g header or overridden by an x header.
> Not sticking to this concept can lead to weird problems with programs
> that modify tar archives (like GNU-tar) and is not future proof. Better
> stick to the standard.

It's easy enough, I think: For each archive entry check if it is 
explicitly mentioned in the list of paths to archive and write an 
extended header with GIT.path before proceeding as usual, no?  And for 
the common case without path specification (meaning all files are 
included) no such header would be needed.
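
A rough sketch of that idea, using Python's tarfile module rather than
git's own archiver: each entry that matches a path operand gets a
per-file extended header carrying a GIT.path keyword. The key name is the
one proposed in this thread; write_archive and matching_operand are
hypothetical helpers, and treating "entry lies below the operand" as a
match is an assumption of this sketch.

```python
import io
import tarfile


def matching_operand(name, operands):
    """Return the first operand that names this entry or a directory above it."""
    for operand in operands:
        base = operand.rstrip("/")
        if name == base or name.startswith(base + "/"):
            return operand
    return None


def write_archive(files, operands):
    """files maps entry names to bytes; operands is the list of path operands.

    Returns the tar archive as bytes. Entries matched by an operand get an
    x header with GIT.path; with no operands, no such header is written.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            operand = matching_operand(name, operands)
            if operand is not None:
                info.pax_headers = {"GIT.path": operand}
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()
```

Reading the result back with tarfile exposes the keyword through the
pax_headers attribute of each member.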

René

end of thread, other threads:[~2016-02-15 20:26 UTC | newest]

Thread overview: 9+ messages
2016-01-24 15:59 git archive should use vendor extension in pax header fuz
2016-01-26 22:06 ` René Scharfe
2016-01-27 23:25   ` fuz
     [not found]   ` <20160127114634.GA1976@fuz.su>
     [not found]     ` <56A92913.3030909@web.de>
2016-01-27 23:45       ` fuz
2016-01-28  8:13         ` Johannes Schindelin
2016-01-28  9:14           ` fuz
2016-02-06 13:23         ` René Scharfe
2016-02-06 14:57           ` fuz
2016-02-15 20:25             ` René Scharfe
