Zero padded file modes...

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Zero padded file modes...
@ 2013-09-05 14:00 John Szakmeister
  2013-09-05 15:36 ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: John Szakmeister @ 2013-09-05 14:00 UTC (permalink / raw)
  To: git

I went to clone a repository from GitHub today and discovered
something interesting:

    :: git clone https://github.com/liebke/incanter.git
    Cloning into 'incanter'...
    remote: Counting objects: 10457, done.
    remote: Compressing objects: 100% (3018/3018), done.
    error: object 4946e1ba09ba5655202a7a5d81ae106b08411061:contains
zero-padded file modes
    fatal: Error in object
    fatal: index-pack failed

At first, it surprised me that no one has seen the issue before,
but then I remembered I have transfer.fsckObjects=true in my
config.  Turning it off, I was able to clone.  Running `git
fsck` I see:

    :: git fsck
    Checking object directories: 100% (256/256), done.
    warning in tree 4946e1ba09ba5655202a7a5d81ae106b08411061: contains
zero-padded file modes
    warning in tree 553c5e006e53a8360126f053c3ade3d1d063c2f5: contains
zero-padded file modes
    warning in tree 0a2e7f55d7f8e1fa5469e6d83ff20365881eed1a: contains
zero-padded file modes
    Checking objects: 100% (10560/10560), done.

So there appears to be several instances of the issue in the
tree.  Looking in the archives, I ran across this thread:

    http://comments.gmane.org/gmane.comp.version-control.git/143288

In there, Nicolas Pitre says:

> This is going to screw up pack v4 (yes, someday I'll have the
> time to make it real).

I don't know if this is still true, but given that patches are
being sent out about it, I thought it relevant.

Also, searching on the issue, you'll find that a number of
repositories have suffered from this problem, and it appears the
only fix will result in different commit ids.  Given all of
that, should Git be updated to cope with zero padded modes?  Or
is there no some way of fixing the issue that doesn't involve
changing commit ids?

Thanks!

-John

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 14:00 Zero padded file modes John Szakmeister
@ 2013-09-05 15:36 ` Jeff King
  2013-09-05 16:18   ` Duy Nguyen
                     ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Jeff King @ 2013-09-05 15:36 UTC (permalink / raw)
  To: John Szakmeister; +Cc: git

On Thu, Sep 05, 2013 at 10:00:39AM -0400, John Szakmeister wrote:

> I went to clone a repository from GitHub today and discovered
> something interesting:
> 
>     :: git clone https://github.com/liebke/incanter.git
>     Cloning into 'incanter'...
>     remote: Counting objects: 10457, done.
>     remote: Compressing objects: 100% (3018/3018), done.
>     error: object 4946e1ba09ba5655202a7a5d81ae106b08411061:contains
> zero-padded file modes
>     fatal: Error in object
>     fatal: index-pack failed

Yep. These were mostly caused by a bug in Grit that is long-fixed.  But
the objects remain in many histories. It would have painful to rewrite
them back then, and it would be even more painful now.

> > This is going to screw up pack v4 (yes, someday I'll have the
> > time to make it real).
> 
> I don't know if this is still true, but given that patches are
> being sent out about it, I thought it relevant.

I haven't looked carefully at the pack v4 patches yet, but I suspect
that yes, it's still a problem. The premise of pack v4 is that we can do
better by not storing the raw git object bytes, but rather storing
specialized representations of the various components. For example, by
using an integer to store the mode rather than the ascii representation.
But that representation does not represent the "oops, I have a 0-padded
mode" quirk. And we have to be able to recover the original object, byte
for byte, from the v4 representation (to verify sha1, or to generate a
loose object or v2 pack).

There are basically two solutions:

  1. Add a single-bit flag for "I am 0-padded in the real data". We
     could probably even squeeze it into the same integer.

  2. Have a "classic" section of the pack that stores the raw object
     bytes. For objects which do not match our expectations, store them
     raw instead of in v4 format. They will not get the benefit of v4
     optimizations, but if they are the minority of objects, that will
     only end up with a slight slow-down.

As I said, I have not looked carefully at the v4 patches, so maybe they
handle this case already. But of the two solutions, I prefer (2). Doing
(1) can solve _this_ problem, but it complicates the format, and does
nothing for any future compatibility issues. Whereas (2) is easy to
implement, since it is basically just pack v2 (and implementations would
need a pack v2 reader anyway).

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 15:36 ` Jeff King
@ 2013-09-05 16:18   ` Duy Nguyen
  2013-09-05 16:33     ` Jeff King
  2013-09-05 16:25   ` A Large Angry SCM
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Duy Nguyen @ 2013-09-05 16:18 UTC (permalink / raw)
  To: Jeff King; +Cc: John Szakmeister, Git Mailing List, Nicolas Pitre

On Thu, Sep 5, 2013 at 10:36 PM, Jeff King <peff@peff.net> wrote:
>> > This is going to screw up pack v4 (yes, someday I'll have the
>> > time to make it real).
>>
>> I don't know if this is still true, but given that patches are
>> being sent out about it, I thought it relevant.
>
> I haven't looked carefully at the pack v4 patches yet, but I suspect
> that yes, it's still a problem. The premise of pack v4 is that we can do
> better by not storing the raw git object bytes, but rather storing
> specialized representations of the various components. For example, by
> using an integer to store the mode rather than the ascii representation.
> But that representation does not represent the "oops, I have a 0-padded
> mode" quirk. And we have to be able to recover the original object, byte
> for byte, from the v4 representation (to verify sha1, or to generate a
> loose object or v2 pack).
>
> There are basically two solutions:
>
>   1. Add a single-bit flag for "I am 0-padded in the real data". We
>      could probably even squeeze it into the same integer.
>
>   2. Have a "classic" section of the pack that stores the raw object
>      bytes. For objects which do not match our expectations, store them
>      raw instead of in v4 format. They will not get the benefit of v4
>      optimizations, but if they are the minority of objects, that will
>      only end up with a slight slow-down.

3. Detect this situation and fall back to v2.

4. Update v4 to allow storing raw tree entries mixing with v4-encoded
tree entries. This is something between (1) and (2)

> As I said, I have not looked carefully at the v4 patches, so maybe they
> handle this case already. But of the two solutions, I prefer (2). Doing
> (1) can solve _this_ problem, but it complicates the format, and does
> nothing for any future compatibility issues. Whereas (2) is easy to
> implement, since it is basically just pack v2 (and implementations would
> need a pack v2 reader anyway).

I think (4) fits better in v4 design and probably not hard to do. Nico
recently added a code to embed a tree entry inline, but the mode must
be encoded (and can't contain leading zeros). We could have another
code to store mode in ascii. This also makes me wonder if we might
have similar problems with timezones, which are also specially encoded
in v4..

(3) is probably easiest. We need to scan through all tree entries
first when creating v4 anyway. If we detect any anomalies, just switch
back to v2 generation. The user will be force to rewrite history in
order to take full advantage of v4 (they can have a pack of weird
trees in v2 and the rest in v4 pack, but that's not optimal).
-- 
Duy

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 15:36 ` Jeff King
  2013-09-05 16:18   ` Duy Nguyen
@ 2013-09-05 16:25   ` A Large Angry SCM
  2013-09-05 17:09   ` Nicolas Pitre
  2013-09-05 17:13   ` John Szakmeister
  3 siblings, 0 replies; 10+ messages in thread
From: A Large Angry SCM @ 2013-09-05 16:25 UTC (permalink / raw)
  To: Jeff King; +Cc: John Szakmeister, git



On 09/05/2013 11:36 AM, Jeff King wrote:
[...]
>
> I haven't looked carefully at the pack v4 patches yet, but I suspect
> that yes, it's still a problem. The premise of pack v4 is that we can do
> better by not storing the raw git object bytes, but rather storing
> specialized representations of the various components. For example, by
> using an integer to store the mode rather than the ascii representation.
> But that representation does not represent the "oops, I have a 0-padded
> mode" quirk. And we have to be able to recover the original object, byte
> for byte, from the v4 representation (to verify sha1, or to generate a
> loose object or v2 pack).
>
> There are basically two solutions:
>
>    1. Add a single-bit flag for "I am 0-padded in the real data". We
>       could probably even squeeze it into the same integer.
>
>    2. Have a "classic" section of the pack that stores the raw object
>       bytes. For objects which do not match our expectations, store them
>       raw instead of in v4 format. They will not get the benefit of v4
>       optimizations, but if they are the minority of objects, that will
>       only end up with a slight slow-down.
>
> As I said, I have not looked carefully at the v4 patches, so maybe they
> handle this case already. But of the two solutions, I prefer (2). Doing
> (1) can solve _this_ problem, but it complicates the format, and does
> nothing for any future compatibility issues. Whereas (2) is easy to
> implement, since it is basically just pack v2 (and implementations would
> need a pack v2 reader anyway).

3. Keep those objects in v2 packs instead of the v4 pack. Transfers 
would have to be v3 or multi-pack transfers would need to be supported.

4. Don't use v4 packs with projects that have "crufty" objects. Projects 
with such objects may choose to pay the "cost" to "upgrade" to v4 
compatibility.

There's nothing that requires the next pack format to support all of the 
broken stuff that's happened over the years.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 16:18   ` Duy Nguyen
@ 2013-09-05 16:33     ` Jeff King
  2013-09-05 16:56       ` Nicolas Pitre
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2013-09-05 16:33 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: John Szakmeister, Git Mailing List, Nicolas Pitre

On Thu, Sep 05, 2013 at 11:18:24PM +0700, Nguyen Thai Ngoc Duy wrote:

> > There are basically two solutions:
> >
> >   1. Add a single-bit flag for "I am 0-padded in the real data". We
> >      could probably even squeeze it into the same integer.
> >
> >   2. Have a "classic" section of the pack that stores the raw object
> >      bytes. For objects which do not match our expectations, store them
> >      raw instead of in v4 format. They will not get the benefit of v4
> >      optimizations, but if they are the minority of objects, that will
> >      only end up with a slight slow-down.
> 
> 3. Detect this situation and fall back to v2.
> 
> 4. Update v4 to allow storing raw tree entries mixing with v4-encoded
> tree entries. This is something between (1) and (2)

I wouldn't want to do (3). At some point pack v4 may become the standard
format, but there will be some repositories which will never be allowed
to adopt it.

For (4), yes, that could work. But like (1), it only solves problems in
tree entries. What happens if we have a quirky commit object that needs
the same treatment (e.g., a timezone that does not fit into the commit
name dictionary properly)?

> I think (4) fits better in v4 design and probably not hard to do. Nico
> recently added a code to embed a tree entry inline, but the mode must
> be encoded (and can't contain leading zeros). We could have another
> code to store mode in ascii. This also makes me wonder if we might
> have similar problems with timezones, which are also specially encoded
> in v4..

Yeah, that might be more elegant.

> (3) is probably easiest. We need to scan through all tree entries
> first when creating v4 anyway. If we detect any anomalies, just switch
> back to v2 generation. The user will be force to rewrite history in
> order to take full advantage of v4 (they can have a pack of weird
> trees in v2 and the rest in v4 pack, but that's not optimal).

Splitting across two packs isn't great, though. What if v4 eventually
becomes the normal on-the-wire format? I'd rather have some method for
just embedding what are essentially v2 objects into the v4 pack, which
would give us future room for handling these sorts of things.

But like I said, I haven't looked closely yet, so maybe there are
complications with that. In the meantime, I'll defer to the judgement of
people who know what they are talking about. :)

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 16:33     ` Jeff King
@ 2013-09-05 16:56       ` Nicolas Pitre
  0 siblings, 0 replies; 10+ messages in thread
From: Nicolas Pitre @ 2013-09-05 16:56 UTC (permalink / raw)
  To: Jeff King; +Cc: Duy Nguyen, John Szakmeister, Git Mailing List

On Thu, 5 Sep 2013, Jeff King wrote:

> On Thu, Sep 05, 2013 at 11:18:24PM +0700, Nguyen Thai Ngoc Duy wrote:
> 
> > > There are basically two solutions:
> > >
> > >   1. Add a single-bit flag for "I am 0-padded in the real data". We
> > >      could probably even squeeze it into the same integer.
> > >
> > >   2. Have a "classic" section of the pack that stores the raw object
> > >      bytes. For objects which do not match our expectations, store them
> > >      raw instead of in v4 format. They will not get the benefit of v4
> > >      optimizations, but if they are the minority of objects, that will
> > >      only end up with a slight slow-down.
> > 
> > 3. Detect this situation and fall back to v2.
> > 
> > 4. Update v4 to allow storing raw tree entries mixing with v4-encoded
> > tree entries. This is something between (1) and (2)
> 
> I wouldn't want to do (3). At some point pack v4 may become the standard
> format, but there will be some repositories which will never be allowed
> to adopt it.
> 
> For (4), yes, that could work. But like (1), it only solves problems in
> tree entries. What happens if we have a quirky commit object that needs
> the same treatment (e.g., a timezone that does not fit into the commit
> name dictionary properly)?
> 
> > I think (4) fits better in v4 design and probably not hard to do. Nico
> > recently added a code to embed a tree entry inline, but the mode must
> > be encoded (and can't contain leading zeros). We could have another
> > code to store mode in ascii. This also makes me wonder if we might
> > have similar problems with timezones, which are also specially encoded
> > in v4..
> 
> Yeah, that might be more elegant.
> 
> > (3) is probably easiest. We need to scan through all tree entries
> > first when creating v4 anyway. If we detect any anomalies, just switch
> > back to v2 generation. The user will be force to rewrite history in
> > order to take full advantage of v4 (they can have a pack of weird
> > trees in v2 and the rest in v4 pack, but that's not optimal).
> 
> Splitting across two packs isn't great, though. What if v4 eventually
> becomes the normal on-the-wire format? I'd rather have some method for
> just embedding what are essentially v2 objects into the v4 pack, which
> would give us future room for handling these sorts of things.
> 
> But like I said, I haven't looked closely yet, so maybe there are
> complications with that. In the meantime, I'll defer to the judgement of
> people who know what they are talking about. :)

None of the above is particularly appealing to me.

Pack v4 has to enforce some standardization in the object encoding to be 
efficient.  Some compromizes have been applied to accommodate the fixing 
of a thin pack, although I was initially tempted to simply dodge the 
issue and allow thin packs in a repository.

On this particular mode issue, I remember making a fuss at the time when 
this was discovered because the github implementation did generate such 
tree objects at the time.

So instead of compromizing the pack v4 object encoding further, I'd 
simply suggest adding a special object type which is in fact simply the 
pack v2 representation i.e. the canonical object version, deflated.  
Right now pack v4 encodes only 5 object types: commit, tree, blob, delta 
and tag.  Only the commit and tree objects have their representation 
transcoded.  So that means we only need to add native_commit and 
native_tree object types.

Then, anything that doesn't fit the strict expectation for transcoding a 
tree or a commit object is simply included as is without transcoding 
just like in pack v2.


Nicolas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 15:36 ` Jeff King
  2013-09-05 16:18   ` Duy Nguyen
  2013-09-05 16:25   ` A Large Angry SCM
@ 2013-09-05 17:09   ` Nicolas Pitre
  2013-09-05 19:10     ` Jeff King
  2013-09-05 17:13   ` John Szakmeister
  3 siblings, 1 reply; 10+ messages in thread
From: Nicolas Pitre @ 2013-09-05 17:09 UTC (permalink / raw)
  To: Jeff King; +Cc: John Szakmeister, git

On Thu, 5 Sep 2013, Jeff King wrote:

> There are basically two solutions:
> 
>   1. Add a single-bit flag for "I am 0-padded in the real data". We
>      could probably even squeeze it into the same integer.
> 
>   2. Have a "classic" section of the pack that stores the raw object
>      bytes. For objects which do not match our expectations, store them
>      raw instead of in v4 format. They will not get the benefit of v4
>      optimizations, but if they are the minority of objects, that will
>      only end up with a slight slow-down.

That is basically what I just suggested.  But instead of a special 
section, simply using a special object type number would do it.

I'm even wondering if that couldn't be used for fixing a thin pack 
instead of the special provision I just added last night.


Nicolas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 15:36 ` Jeff King
                     ` (2 preceding siblings ...)
  2013-09-05 17:09   ` Nicolas Pitre
@ 2013-09-05 17:13   ` John Szakmeister
  2013-09-05 19:35     ` Jeff King
  3 siblings, 1 reply; 10+ messages in thread
From: John Szakmeister @ 2013-09-05 17:13 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Sep 5, 2013 at 11:36 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Sep 05, 2013 at 10:00:39AM -0400, John Szakmeister wrote:
>
>> I went to clone a repository from GitHub today and discovered
>> something interesting:
>>
>>     :: git clone https://github.com/liebke/incanter.git
>>     Cloning into 'incanter'...
>>     remote: Counting objects: 10457, done.
>>     remote: Compressing objects: 100% (3018/3018), done.
>>     error: object 4946e1ba09ba5655202a7a5d81ae106b08411061:contains
>> zero-padded file modes
>>     fatal: Error in object
>>     fatal: index-pack failed
>
> Yep. These were mostly caused by a bug in Grit that is long-fixed.  But
> the objects remain in many histories. It would have painful to rewrite
> them back then, and it would be even more painful now.

I guess there's still the other side of the question though.  Are
these repositories busted in the sense that something no longer works?
 I doesn't appear to be the case, but I've not used it extensively say
I can't say for certain one way or another.  In the sense that the
content is not strictly compliant, transfer.fsckObjects did its job,
but I wonder if fsck needs to be a little more tolerant now (at least
with respect to transfer objects)?

I can certainly cope with the issue--it's not a problem for me to flip
the flag on the command line.  I think it'd be nice to have
transer.fsckObjects be the default at some point, considering how
little people run fsck otherwise and how long these sorts of issues go
undiscovered.  Issues like the above seem to stand in the way of that
happening though.

-John

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 17:09   ` Nicolas Pitre
@ 2013-09-05 19:10     ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2013-09-05 19:10 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: John Szakmeister, git

On Thu, Sep 05, 2013 at 01:09:34PM -0400, Nicolas Pitre wrote:

> On Thu, 5 Sep 2013, Jeff King wrote:
> 
> > There are basically two solutions:
> > 
> >   1. Add a single-bit flag for "I am 0-padded in the real data". We
> >      could probably even squeeze it into the same integer.
> > 
> >   2. Have a "classic" section of the pack that stores the raw object
> >      bytes. For objects which do not match our expectations, store them
> >      raw instead of in v4 format. They will not get the benefit of v4
> >      optimizations, but if they are the minority of objects, that will
> >      only end up with a slight slow-down.
> 
> That is basically what I just suggested.  But instead of a special 
> section, simply using a special object type number would do it.

Yeah, I think we are in agreement. I only suggested a separate section
because I hadn't carefully read the v4 patches yet, and didn't know if
there was room in the normal sequence. A special object number seems
much more elegant.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Zero padded file modes...
  2013-09-05 17:13   ` John Szakmeister
@ 2013-09-05 19:35     ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2013-09-05 19:35 UTC (permalink / raw)
  To: John Szakmeister; +Cc: git

On Thu, Sep 05, 2013 at 01:13:40PM -0400, John Szakmeister wrote:

> > Yep. These were mostly caused by a bug in Grit that is long-fixed.  But
> > the objects remain in many histories. It would have painful to rewrite
> > them back then, and it would be even more painful now.
> 
> I guess there's still the other side of the question though.  Are
> these repositories busted in the sense that something no longer works?

No, as far as I know, everything still works fine. However, some diffs
may be suboptimal, because we may have two different sha1s for the same
subtree (so we may descend into the tree unnecessarily only to find that
they are equivalent). And by the same token, any scripts doing
non-recursive diffs may erroneously mark the trees as differing, even
though they do not contain any differing files.

But neither is a big problem in practice. If you had two clients in
active use which were flip-flopping a sub-tree back and forth between
representations, it would be a problem. But we are talking about a few
isolated incidents far back in history.

> I doesn't appear to be the case, but I've not used it extensively say
> I can't say for certain one way or another.  In the sense that the
> content is not strictly compliant, transfer.fsckObjects did its job,
> but I wonder if fsck needs to be a little more tolerant now (at least
> with respect to transfer objects)?

Fsck actually treats this as a warning, not an error. It is
transfer.fsckObjects (via "index-pack --strict") that actually treats
warnings as errors.

It's possible that this should be loosened to allow through problems
marked as FSCK_WARN (with a message, kind of like...a warning). Though
it may also make sense to revisit some of the classifications in fsck
(e.g., many of the warnings are indicative of seriously broken objects).

GitHub uses transfer.fsckObjects, rejecting all warnings[1]. In practice
it is not usually a big deal, as people are happy to fix up their
objects _before_ they get widely published. The biggest push-back we get
is when somebody tries to re-push history they got from another GitHub
repo, and then says "But why are you complaining? You served this crappy
broken history?" And it's a fair point. If you are forking (but not
joining the existing fork network) of an existing project with
irregularities in the history, it's not really an option to simply
rewrite the history you are basing on.

-Peff

[1] Actually, we do let through 0-padded modes with a warning,
    explicitly because of the problem mentioned above.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-09-05 19:35 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-05 14:00 Zero padded file modes John Szakmeister
2013-09-05 15:36 ` Jeff King
2013-09-05 16:18   ` Duy Nguyen
2013-09-05 16:33     ` Jeff King
2013-09-05 16:56       ` Nicolas Pitre
2013-09-05 16:25   ` A Large Angry SCM
2013-09-05 17:09   ` Nicolas Pitre
2013-09-05 19:10     ` Jeff King
2013-09-05 17:13   ` John Szakmeister
2013-09-05 19:35     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).