git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Filename quoting / parsing problem
@ 2010-01-01 17:44 Andreas Gruenbacher
  2010-01-01 19:50 ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Andreas Gruenbacher @ 2010-01-01 17:44 UTC (permalink / raw)
  To: git

Git quotes file names as documented in the git-diff manual page:

  TAB, LF, double quote and backslash characters in pathnames are represented
  as \t, \n, \" and \\, respectively.  If there is need for such substitution
  then the whole pathname is put in double quotes.

Spaces in file names currently do not trigger quoting.  (And \r triggers 
quoting even though the man page doesn't say so).  When there are no "---" and 
"+++" lines, this can lead to a parsing problem: only the "diff --git" line 
contains the file names, sometimes with insufficient quoting.  The following 
examples show the problem:

Parseable:
    diff --git "a/foo \r" "b/foo \r"
    new file mode 100644
    index 0000000..257cc56
    --- /dev/null
    +++ "b/foo \r"
    @@ -0,0 +1 @@
   +foo

Parseable:
    diff --git a/bar  b/bar 
    new file mode 100644
    index 0000000..5716ca5
    --- /dev/null
    +++ b/bar 
    @@ -0,0 +1 @@
    +bar

Not parseable:
    diff --git a/baz  b/baz 
    new file mode 100644
    index 0000000..e69de29


Could this please be changed so that filenames with spaces are also quoted, at 
least in the "diff --git" line, and possibly also in the "---" and "+++" 
lines?  Alternatively, how about a new extended header with the file name in 
this particular case?

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-01 17:44 Filename quoting / parsing problem Andreas Gruenbacher
@ 2010-01-01 19:50 ` Junio C Hamano
  2010-01-01 20:01   ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2010-01-01 19:50 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: git

Andreas Gruenbacher <agruen@suse.de> writes:

> Git quotes file names as documented in the git-diff manual page:
>
>   TAB, LF, double quote and backslash characters in pathnames are represented
>   as \t, \n, \" and \\, respectively.  If there is need for such substitution
>   then the whole pathname is put in double quotes.
>
> Spaces in file names currently do not trigger quoting.  (And \r triggers 
> quoting even though the man page doesn't say so).

Any control character is quoted; the documentation quoted above lists only
the common ones that have non-octal representation.

        $ git init one
        Initialized empty Git repository in /var/tmp/gomi/one/.git/
        $ cd one
        $ >hello && git add hello && git commit -m 'hello'
        [master (root-commit) 5338884] hello
         0 files changed, 0 insertions(+), 0 deletions(-)
         create mode 100644 hello
        $ >'a^Afile' ; >'b file' ; 'c file '
        $ git add 'a^Afile' 'b file' 'c file '

In the above sequence that attempts to reproduce the issue, "^A" actually
is a byte with value \001 (typically you would type ^V^A to get it from
the terminal on the shell).  "^G" seems to get "\a" even though it is not
in the list (that is why I said "only the common ones" above).

        $ git ls-files -s | cat -e
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	"a\001file"$
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	b file$
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	c file $
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	hello$

I used "cat -e" to make it easier to see that "c file " not only has SP in
it but it has trailing space.  Let's try the result.

        $ git diff --cached | cat -e
        diff --git "a/a\001file" "b/a\001file"$
        new file mode 100644$
        index 0000000..e69de29$
        diff --git a/b file b/b file$
        new file mode 100644$
        index 0000000..e69de29$
        diff --git a/c file  b/c file $
        new file mode 100644$
        index 0000000..e69de29$
        $ git diff --cached >P.diff

And as you described, "b file" and "c file " are not quoted and they do
not have ---/+++ lines.

But observe this:

	$ git reset --hard
        $ ls
        P.diff	hello
	$ git ls-files -s
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	hello

We are now back in the state without any of these files, and P.diff records
a patch to recreate these three files, one with quoting and the other two
without.

	$ git apply --index P.diff
        $ git ls-files -s | cat -e
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	"a\001file"$
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	b file$
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	c file $
        100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	hello$

This demonstrates that The claim below is false, doesn't it?

> Not parseable:
>     diff --git a/baz  b/baz 
>     new file mode 100644
>     index 0000000..e69de29

Both "b file" and "c file " are parsed by "git apply" perfectly fine.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-01 19:50 ` Junio C Hamano
@ 2010-01-01 20:01   ` Junio C Hamano
  2010-01-02 11:36     ` Andreas Gruenbacher
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2010-01-01 20:01 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> I used "cat -e" to make it easier to see that "c file " not only has SP in
> it but it has trailing space.  Let's try the result.
> 
>         $ git diff --cached | cat -e
>         diff --git "a/a\001file" "b/a\001file"$
>         new file mode 100644$
>         index 0000000..e69de29$
>         diff --git a/b file b/b file$
>         new file mode 100644$
>         index 0000000..e69de29$
>         diff --git a/c file  b/c file $
>         new file mode 100644$
>         index 0000000..e69de29$
>         $ git diff --cached >P.diff
> 
> And as you described, "b file" and "c file " are not quoted and they do
> not have ---/+++ lines.
> 
> But observe this:
> ...
> We are now back in the state without any of these files, and P.diff records
> a patch to recreate these three files, one with quoting and the other two
> without.
> 
>         $ git apply --index P.diff
>         $ git ls-files -s | cat -e
>         100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       "a\001file"$
>         100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       b file$
>         100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       c file $
>         100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       hello$
> 
> This demonstrates that The claim below is false, doesn't it?
> 
> > Not parseable:
> >     diff --git a/baz  b/baz 
> >     new file mode 100644
> >     index 0000000..e69de29
> 
> Both "b file" and "c file " are parsed by "git apply" perfectly fine.

Having said all that, I don't think we would mind a change to treat a
pathname with trailing SP a bit specially (iow, quoting "c file " in the
above failed attempt to reproduce the issue).  A pathname with SP in it is
an eyesore but is a fact of life outside of sane world, but a quoted
pathname is an even worse eyesore.  A pathname with trailing SP would be
much less common and is more likely to be corrupted by MUA and cut & paste;
with quoting we can protect them a bit better.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-01 20:01   ` Junio C Hamano
@ 2010-01-02 11:36     ` Andreas Gruenbacher
  2010-01-02 18:37       ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Andreas Gruenbacher @ 2010-01-02 11:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Friday 01 January 2010 09:01:19 pm Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> > I used "cat -e" to make it easier to see that "c file " not only has SP
> > in it but it has trailing space.  Let's try the result.
> >
> >         $ git diff --cached | cat -e
> >         diff --git "a/a\001file" "b/a\001file"$
> >         new file mode 100644$
> >         index 0000000..e69de29$
> >         diff --git a/b file b/b file$
> >         new file mode 100644$
> >         index 0000000..e69de29$
> >         diff --git a/c file  b/c file $
> >         new file mode 100644$
> >         index 0000000..e69de29$
> >         $ git diff --cached >P.diff
> >
> > And as you described, "b file" and "c file " are not quoted and they do
> > not have ---/+++ lines.
> >
> > But observe this:
> > ...
> > We are now back in the state without any of these files, and P.diff
> > records a patch to recreate these three files, one with quoting and the
> > other two without.
> >
> >         $ git apply --index P.diff
> >         $ git ls-files -s | cat -e
> >         100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0      
> > "a\001file"$ 100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       b
> > file$ 100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       c file $
> > 100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0       hello$
> >
> > This demonstrates that The claim below is false, doesn't it?
> >
> > > Not parseable:
> > >     diff --git a/baz  b/baz
> > >     new file mode 100644
> > >     index 0000000..e69de29
> >
> > Both "b file" and "c file " are parsed by "git apply" perfectly fine.

Right, the "diff --git" lines are technically still parseable when the file 
name stays the same.  With renames, lines like "diff --git a/f a/f b/f" or 
"diff --git a/f b/f b/f" are possible, but then there will also be "renamed 
from" and "renamed to" headers which will disambiguate things.  Still, it 
doesn't seem like a good idea to allow such ambiguities in the first place.

> Having said all that, I don't think we would mind a change to treat a
> pathname with trailing SP a bit specially (iow, quoting "c file " in the
> above failed attempt to reproduce the issue).

I would prefer quoting file names which contain spaces anywhere, not only at 
the end.  If quoting helps to disambiguate a program's output, I'm all for it.  
People who can't be bothered with such details can always use a pretty printer 
(side by side view, whatever).

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-02 11:36     ` Andreas Gruenbacher
@ 2010-01-02 18:37       ` Junio C Hamano
  2010-01-02 20:48         ` Andreas Gruenbacher
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2010-01-02 18:37 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: git

Andreas Gruenbacher <agruen@suse.de> writes:

> On Friday 01 January 2010 09:01:19 pm Junio C Hamano wrote:
>> > Both "b file" and "c file " are parsed by "git apply" perfectly fine.
>
> Right, the "diff --git" lines are technically still parseable when the file 
> name stays the same.  With renames, lines like "diff --git a/f a/f b/f" or 
> "diff --git a/f b/f b/f" are possible, but then there will also be "renamed 
> from" and "renamed to" headers which will disambiguate things.  Still, it 
> doesn't seem like a good idea to allow such ambiguities in the first place.

You already realized that there is no ambiguity because "diff --git" lines
are parsable and renames have explicit names.  Why do you still maintain
that we are allowing such "ambiguities" when there is none?

>> Having said all that, I don't think we would mind a change to treat a
>> pathname with trailing SP a bit specially (iow, quoting "c file " in the
>> above failed attempt to reproduce the issue).
>
> I would prefer quoting file names which contain spaces anywhere,...

The only reason I said I don't think we would mind changing the trailing
SP case is because the reduced risk of getting our patches corrupted by
MUA _might_ outweigh the benefit of not quoting to avoid an eyesore [*1*].

But what you said would add to eyesore of quoted names (which you omitted
from your quote) without any justification other than "I would prefer".
The pros-and-cons in such a change is quite different; as we have already
established that there is no ambiguity, "disambuguation" is not a "pro" in
this comparison.

[Footnote]

*1* Strictly speaking, it is not just "an eyesore" that is an issue.  Our
diff output without renames are designed to be grokkable by other people's
patch implementations (e.g. GNU patch), and the quoted pathnames are not
understandable by them.  Even though our final version of quoted path
format came from the GNU diff/patch maintainer (back then, at least):

    http://article.gmane.org/gmane.comp.version-control.git/10103

I don't think it happened in the GNU land yet, and you would be the person
to know about it ;-).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-02 18:37       ` Junio C Hamano
@ 2010-01-02 20:48         ` Andreas Gruenbacher
  2010-01-06  0:04           ` Andreas Gruenbacher
  2010-01-06  1:08           ` Junio C Hamano
  0 siblings, 2 replies; 10+ messages in thread
From: Andreas Gruenbacher @ 2010-01-02 20:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Saturday 02 January 2010 07:37:08 pm Junio C Hamano wrote:
> Andreas Gruenbacher <agruen@suse.de> writes:
> > On Friday 01 January 2010 09:01:19 pm Junio C Hamano wrote:
> >> > Both "b file" and "c file " are parsed by "git apply" perfectly fine.
> >
> > Right, the "diff --git" lines are technically still parseable when the
> > file name stays the same.  With renames, lines like "diff --git a/f a/f
> > b/f" or "diff --git a/f b/f b/f" are possible, but then there will also
> > be "renamed from" and "renamed to" headers which will disambiguate
> > things.  Still, it doesn't seem like a good idea to allow such
> > ambiguities in the first place.
> 
> You already realized that there is no ambiguity because "diff --git" lines
> are parsable and renames have explicit names.  Why do you still maintain
> that we are allowing such "ambiguities" when there is none?

Don't get so aroused ...

Right now, git generates lines like "diff --git a/f a/f b/f b/f" in some 
corner cases, and from such lines alone, it is not possible to tell what the 
two file names are (either "a/f a/f" and "b/f b/f", or "a/f a/f b/f" and 
"b/f").  I can only find that out by looking at the other header lines.

I would prefer a format which I can parse line by line without ambiguities in 
the first place, because this would keep things much simpler and easier to 
debug.  (Think of other implementations of the extended diff format which may 
not produce the exact same output as git.)

So I would be happy with either of this:

  * Also quote spaces in the "diff --git" line so that I can always reliably
    parse it, or

  * Add an additional extended header line with the file name in case there
    are no other header lines giving the file names away already (as for
    renames, copies, or when there are "---" and "+++" lines).

After our discussion so far, option two would probably be easier for everyone: 
you could add it without risking to break anything, and I could avoid parsing 
the "diff --git" line altogether.

> >> Having said all that, I don't think we would mind a change to treat a
> >> pathname with trailing SP a bit specially (iow, quoting "c file " in the
> >> above failed attempt to reproduce the issue).
> >
> > I would prefer quoting file names which contain spaces anywhere,...
> 
> The only reason I said I don't think we would mind changing the trailing
> SP case is because the reduced risk of getting our patches corrupted by
> MUA _might_ outweigh the benefit of not quoting to avoid an eyesore [*1*].
> 
> But what you said would add to eyesore of quoted names (which you omitted
> from your quote) without any justification other than "I would prefer".
> The pros-and-cons in such a change is quite different; as we have already
> established that there is no ambiguity, "disambuguation" is not a "pro" in
> this comparison.
>
> [Footnote]
> 
> *1* Strictly speaking, it is not just "an eyesore" that is an issue.  Our
> diff output without renames are designed to be grokkable by other people's
> patch implementations (e.g. GNU patch), and the quoted pathnames are not
> understandable by them.

GNU patch doesn't look at "diff --git" lines or extended header lines at all 
so far, so there are no compatibility issues yet.  Quoting spaces in "---" and 
"+++" lines would lead to problems with current GNU patch though.  (So does 
the quoting of several other characters like ", of course.)

> Even though our final version of quoted path format came from the GNU
> diff/patch maintainer (back then, at least):
> 
>     http://article.gmane.org/gmane.comp.version-control.git/10103
> 
> I don't think it happened in the GNU land yet, and you would be the person
> to know about it ;-).

I'm working on it ...

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-02 20:48         ` Andreas Gruenbacher
@ 2010-01-06  0:04           ` Andreas Gruenbacher
  2010-01-06  1:32             ` Junio C Hamano
  2010-01-06  1:08           ` Junio C Hamano
  1 sibling, 1 reply; 10+ messages in thread
From: Andreas Gruenbacher @ 2010-01-06  0:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano

On Saturday 02 January 2010 09:48:47 pm Andreas Gruenbacher wrote:
> On Saturday 02 January 2010 07:37:08 pm Junio C Hamano wrote:
> > I don't think it happened in the GNU land yet, and you would be the
> > person to know about it ;-).
> 
> I'm working on it ...

The "experimental" branch [*] now has very lightly tested code for parsing 
most extended headers ("index", "rename from", "rename to", "copy from", "copy 
to", "old mode", "new mode", "deleted file mode", "new file mode").  Most 
things should work, except:

 * Doesn't parse filenames in "diff --git" lines.  (I tried to argue why
   those lines are a problem in this thread.)

 * sha1 checksums are not verified right now.  I'm not sure when that should
   happen: always by default, or only optionally?  (Like verifying file modes,
   this is going to surprise a lot of users.)

 * Similarity and dissimilarity headers are ignored.

 * Binary diffs are no supported.  I think GNU patch should recognize them and
   give a reasonable message for now.


[*] http://git.savannah.gnu.org/cgit/patch.git/log/?h=experimental


Can you guys please do something to make parsing of filenames in "diff --git" 
lines a lot easier, or unnecessary?

Thanks!
Andreas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-02 20:48         ` Andreas Gruenbacher
  2010-01-06  0:04           ` Andreas Gruenbacher
@ 2010-01-06  1:08           ` Junio C Hamano
  2010-01-06 10:06             ` Andreas Schwab
  1 sibling, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2010-01-06  1:08 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: git

Andreas Gruenbacher <agruen@suse.de> writes:

> Right now, git generates lines like "diff --git a/f a/f b/f b/f" in some 
> corner cases, and from such lines alone, it is not possible to tell what the 
> two file names are (either "a/f a/f" and "b/f b/f", or "a/f a/f b/f" and 
> "b/f").  I can only find that out by looking at the other header lines.

I would understand "a/f a/f b/f a/f", which would be a diff of "f a/f"
between two versions.

"a/f a/f b/f b/f" could be rename from "f /a/f" to "f b/f" (or "f a/f b/f"
to "f").  But you will always get "rename from" and "rename to" in that
case.

So you can (and I think git-apply does) follow this simple rule:

    If you see +++/---/rename from/rename to/new file/deleted file, use
    the names you find there.  Otherwise, because there is no rename,
    "diff --git" lines has two identical names that follow a/ and b/, so
    use that name.

The parsing code in builtin-apply.c is not so dense; you should be able to
lift it from there, I think.

Although an output of "a/f a/f b/f b/f" without "rename from/to" is
possible with --no-index of two unrelated files, --no-index output is not
even something that is meant to be appliable, so I wouldn't worry too much
about it.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-06  0:04           ` Andreas Gruenbacher
@ 2010-01-06  1:32             ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2010-01-06  1:32 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: git

Andreas Gruenbacher <agruen@suse.de> writes:

>  * sha1 checksums are not verified right now.  I'm not sure when that
>  should happen: always by default, or only optionally?

preimage and postimage sums are not something you would "verify" with
anyway, and they are not useful unless you are dealing with binary patch,
so I wouldn't worry too much about them.

>  * Similarity and dissimilarity headers are ignored.

I think it is fine to ignore them; even git doesn't use this information
other than parroting them in "apply --summary --stat" output.

>  * Binary diffs are no supported.  I think GNU patch should recognize
>  them and give a reasonable message for now.

Nice.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Filename quoting / parsing problem
  2010-01-06  1:08           ` Junio C Hamano
@ 2010-01-06 10:06             ` Andreas Schwab
  0 siblings, 0 replies; 10+ messages in thread
From: Andreas Schwab @ 2010-01-06 10:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Andreas Gruenbacher, git

Junio C Hamano <gitster@pobox.com> writes:

> So you can (and I think git-apply does) follow this simple rule:
>
>     If you see +++/---/rename from/rename to/new file/deleted file, use
>     the names you find there.  Otherwise, because there is no rename,
>     "diff --git" lines has two identical names that follow a/ and b/, so
>     use that name.

If you use diff.mnemonicprefix=true then a/ and b/ may be different.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-01-09 12:39 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-01-01 17:44 Filename quoting / parsing problem Andreas Gruenbacher
2010-01-01 19:50 ` Junio C Hamano
2010-01-01 20:01   ` Junio C Hamano
2010-01-02 11:36     ` Andreas Gruenbacher
2010-01-02 18:37       ` Junio C Hamano
2010-01-02 20:48         ` Andreas Gruenbacher
2010-01-06  0:04           ` Andreas Gruenbacher
2010-01-06  1:32             ` Junio C Hamano
2010-01-06  1:08           ` Junio C Hamano
2010-01-06 10:06             ` Andreas Schwab

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).