Efficient parsing of `status -z` output

git.vger.kernel.org archive mirror

* Efficient parsing of `status -z` output
@ 2015-03-07 23:00 Matthew Rothenberg
  2015-03-08  8:14 ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Rothenberg @ 2015-03-07 23:00 UTC (permalink / raw)
  To: git

I've been working on a utility that parses the output of `git status
--porcelain` as a fundamental part of its operation.

Since I would like this tool to be as robust as possible (and
cross-platform compatibility is a goal), I am currently trying to
migrate it from parsing the output of `--porcelain` to parsing the
output of `-z`. To quote from the documentation:

    > There is also an alternate -z format recommended for machine parsing. In
    > that format, the status field is the same, but some other things change.
    > First, the -> is omitted from rename entries and the field order is
    > reversed (e.g from -> to becomes to from). Second, a NUL (ASCII 0) follows
    > each filename, replacing space as a field separator and the terminating
    > newline (but a space still separates the status field from the first
    > filename). Third, filenames containing special characters are not
    > specially formatted; no quoting or backslash-escaping is performed.

I am running into a significant issue with this format because of
one detail: NUL is used as *both* the entry terminator and the
filename separator within entries that contain multiple filenames.
Because of this, one cannot know in advance how many NULs to read
from the buffer before an entry can be considered complete and ready
for parsing.

There are two workarounds I've considered:

 1. Reading the *entire* buffer into memory, and then using a regular
expression (yikes) to split the entries. This is something I would
obviously like to avoid for performance reasons.

 2. Read from buffer until the first NUL, parse the entry status
codes, and if the entry status code represents a status that *should*
have multiple filenames, read from buffer until a second NUL is found,
and then reparse that entry with both filenames. The issues I see with
this approach:
   a.) One has to know exactly which status code combinations will end
up with two filenames, and this list has to be exhaustive. As far as I
can tell, there is no canonical documentation for this?
   b.) It seems a bit brittle, because if the logic from the above is
wrong and we miss an extended entry or ask for one when it doesn't
exist, we will leave the buffer in an essentially corrupt state for
future reads.
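
For the first workaround, no regular expression is strictly needed. The
following is a minimal sketch, not a definitive implementation: it
splits the whole buffer on NUL and walks the fields in order. The
function name `parse_status_z` is hypothetical, and the sketch assumes
that only entries whose first status column is R or C carry a second
path (this is confirmed later in the thread).

```python
def parse_status_z(data: bytes):
    """Parse complete `git status -z` output into (xy, path, orig) tuples.

    Paths stay as bytes because filenames are not guaranteed to be UTF-8.
    """
    fields = data.split(b"\0")
    # Every field is NUL-terminated, so the last split element is empty.
    if fields and fields[-1] == b"":
        fields.pop()

    entries = []
    i = 0
    while i < len(fields):
        field = fields[i]
        # Layout of the first field: two status characters, a space, the path.
        xy, path = field[:2].decode("ascii"), field[3:]
        orig = None
        # Assumption: only rename (R) and copy (C) entries have a second
        # field, which holds the original ("from") path.
        if xy[:1] in ("R", "C"):
            i += 1
            if i >= len(fields):
                raise ValueError("truncated rename/copy entry")
            orig = fields[i]
        entries.append((xy, path, orig))
        i += 1
    return entries


sample = b"M  file\0C  other\0file\0?? new name.txt\0"
for entry in parse_status_z(sample):
    print(entry)
```

Note that in the `-z` format the destination path comes first and the
original path second, the reverse of the `from -> to` order shown by
the short format.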

My understanding is that the goal of `-z` is to make machine parsing of
status output from a binary stream *more* reliable, so perhaps
(likely!) I am missing something obvious?

Thanks for any assistance!

* Re: Efficient parsing of `status -z` output
  2015-03-07 23:00 Efficient parsing of `status -z` output Matthew Rothenberg
@ 2015-03-08  8:14 ` Junio C Hamano
  2015-03-09  1:41   ` Matthew Rothenberg
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2015-03-08  8:14 UTC (permalink / raw)
  To: Matthew Rothenberg; +Cc: git

Matthew Rothenberg <mrothenberg@gmail.com> writes:

>  2. Read from buffer until the first NUL, parse the entry status
> codes, and if the entry status code represents a status that *should*
> have multiple filenames, read from buffer until a second NUL is found,
> and then reparse that entry with both filenames. The issues I see with
> this approach:
>    a.) One has to know exactly which status code combinations will end
> up with two filenames, and this list has to be exhaustive. As far as I
> can tell, there is no canonical documentation for this?
>    b.) It seems a bit brittle, because if the logic from the above is
> wrong and we miss an extended entry or ask for one when it doesn't
> exist, we will leave the buffer in an essentially corrupt state for
> future reads.

I think this is how -z was designed to be used, and if that isn't
clear, then the documentation must be updated to clarify.  Rename
and Copy are the only ones that need two pathnames, and I suspect
that whoever did the original description of the short format in the
documentation knew Git so well that he forgot to mention it ;-)
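
The incremental approach described above can be sketched as follows.
This is an illustration under stated assumptions rather than a
reference implementation: the names `nul_fields` and `iter_status` are
hypothetical, and the sketch assumes that a second pathname follows
only when the first status column is R or C. It never holds more than
one chunk plus one partial field in memory.

```python
import io


def nul_fields(stream, chunk_size=4096):
    """Yield NUL-terminated fields from a binary stream, chunk by chunk."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last NUL is complete; the rest is partial.
        *complete, pending = pending.split(b"\0")
        yield from complete
    if pending:
        raise ValueError("stream ended inside a field")


def iter_status(stream, chunk_size=4096):
    """Yield (xy, path, orig) tuples from streamed `git status -z` output."""
    fields = nul_fields(stream, chunk_size)
    for field in fields:
        xy, path = field[:2].decode("ascii"), field[3:]
        orig = None
        # Assumption: only rename (R) and copy (C) entries are followed by
        # one extra field holding the original path.
        if xy[:1] in ("R", "C"):
            orig = next(fields, None)
            if orig is None:
                raise ValueError("missing original path for " + xy)
        yield xy, path, orig


stream = io.BytesIO(b"R  new\0old\0 M a b\0")
for entry in iter_status(stream):
    print(entry)
```

The status code in the first field decides whether one more field is
pulled from the same iterator, so the reader never has to guess how
many NULs belong to an entry. In practice the stream could be the
`stdout` pipe of a `subprocess.Popen(["git", "status", "-z"],
stdout=subprocess.PIPE)` call.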

* Re: Efficient parsing of `status -z` output
  2015-03-08  8:14 ` Junio C Hamano
@ 2015-03-09  1:41   ` Matthew Rothenberg
  2015-03-09  6:19     ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Rothenberg @ 2015-03-09  1:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, Mar 8, 2015 at 4:14 AM, Junio C Hamano <gitster@pobox.com> wrote:
> I think this is how -z was designed to be used, and if that isn't
> clear, then the documentation must be updated to clarify.  Rename
> and Copy are the only ones that need two pathnames, and I suspect
> that whoever did the original description of the short format in the
> documentation knew Git so well that he forgot to mention it ;-)

I see, thank you. But how would one ever get a copy operation to show
up in the output of `git status -z` to begin with? It appears that
copies are only detected in `diff` and `show` (and can be forced with
the --find-copies-harder option), whereas `git status` neither takes
that option nor detects copies in any way that I can reproduce. A test
case that makes it output that status code would be great if you know
one, thanks!

* Re: Efficient parsing of `status -z` output
  2015-03-09  1:41   ` Matthew Rothenberg
@ 2015-03-09  6:19     ` Jeff King
  2015-03-09  6:49       ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2015-03-09  6:19 UTC (permalink / raw)
  To: Matthew Rothenberg; +Cc: Junio C Hamano, git

On Sun, Mar 08, 2015 at 09:41:08PM -0400, Matthew Rothenberg wrote:

> I see, thank you. But how would one ever get a copy operation to show
> up in the output of `git status -z` to begin with? It appears that
> copies are only detected in `diff` and `show` (and can be forced with
> the --find-copies-harder option), whereas `git status` neither takes
> that option nor detects copies in any way that I can reproduce. A test
> case that makes it output that status code would be great if you know
> one, thanks!

We don't turn on copy-detection in "git status" by default (only rename
detection), and I think you are right that there is currently no way to
turn it on manually. However, it would probably be sensible to handle
"C" diffs in your parser, if only to future-proof against a day when
that changes (and because it should be fairly trivial once you build
"R" support).

I don't know if anybody is actively working on such a change, but
somebody expressed interest recently-ish:

  http://thread.gmane.org/gmane.comp.version-control.git/260381

-Peff

* Re: Efficient parsing of `status -z` output
  2015-03-09  6:19     ` Jeff King
@ 2015-03-09  6:49       ` Jeff King
  2015-03-09 23:40         ` Matthew Rothenberg
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2015-03-09  6:49 UTC (permalink / raw)
  To: Matthew Rothenberg; +Cc: Junio C Hamano, git

On Mon, Mar 09, 2015 at 02:19:20AM -0400, Jeff King wrote:

> We don't turn on copy-detection in "git status" by default (only rename
> detection), and I think you are right that there is currently no way to
> turn it on manually.

Actually, I take it back. We do break-detection in git-status, which can
lead to finding a copy:

  $ git init
  $ seq 1 1000 >file && git add file && git commit -m base
  $ mv file other
  $ echo foo >file
  $ git add .
  $ git status --short
  M  file
  C  file -> other

-Peff

* Re: Efficient parsing of `status -z` output
  2015-03-09  6:49       ` Jeff King
@ 2015-03-09 23:40         ` Matthew Rothenberg
  2015-03-10  5:41           ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Rothenberg @ 2015-03-09 23:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Mon, Mar 9, 2015 at 2:49 AM, Jeff King <peff@peff.net> wrote:
>   $ git init
>   $ seq 1 1000 >file && git add file && git commit -m base
>   $ mv file other
>   $ echo foo >file
>   $ git add .
>   $ git status --short
>   M  file
>   C  file -> other

Fantastic, I am able to replicate with these steps and will build
tests around this case.

For future-proofing: from the documentation for git status, it appears
the other two codes I would want to check for in addition to 'C '
(which this test case generates) may be 'CM' and 'CD'? And all of
those should always have the additional PATH2 column present?

Thank you for your help!

* Re: Efficient parsing of `status -z` output
  2015-03-09 23:40         ` Matthew Rothenberg
@ 2015-03-10  5:41           ` Jeff King
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff King @ 2015-03-10  5:41 UTC (permalink / raw)
  To: Matthew Rothenberg; +Cc: git

On Mon, Mar 09, 2015 at 07:40:43PM -0400, Matthew Rothenberg wrote:

> On Mon, Mar 9, 2015 at 2:49 AM, Jeff King <peff@peff.net> wrote:
> >   $ git init
> >   $ seq 1 1000 >file && git add file && git commit -m base
> >   $ mv file other
> >   $ echo foo >file
> >   $ git add .
> >   $ git status --short
> >   M  file
> >   C  file -> other
> 
> Fantastic, I am able to replicate with these steps and will build
> tests around this case.
> 
> For future-proofing: from the documentation for git status, it appears
> the other two codes I would want to check for in addition to 'C '
> (which this test case generates) may be 'CM' and 'CD'? And all of
> those should always have the additional PATH2 column present?

Yes, you can trivially make CM and CD by changing or deleting "other" in
the example above. I don't think you can ever have 'C' or 'R' in the
second column; we don't do renames on working tree changes, since a
"new" file there is simply untracked.
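
The rule above can be written down as a small predicate. This is only a
sketch of the reasoning in this thread, with the hypothetical name
`has_second_path`; it assumes that R and C appear only in the first
(index) column and that the second column of such entries is a space,
M, or D.

```python
def has_second_path(xy: str) -> bool:
    """Return True if a status code implies an additional original path."""
    # Assumption from the thread: only the first (index) column can hold
    # R or C, so the second column is not consulted.
    return xy[0] in "RC"


# Enumerate the two-path codes discussed here: R or C in the first
# column, combined with an unchanged, modified, or deleted working tree.
two_path_codes = [x + y for x in "RC" for y in " MD"]
print(two_path_codes)  # ['R ', 'RM', 'RD', 'C ', 'CM', 'CD']
```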

-Peff
