git.vger.kernel.org archive mirror
From: Matthew Rothenberg <mrothenberg@gmail.com>
To: git@vger.kernel.org
Subject: Efficient parsing of `status -z` output
Date: Sat, 7 Mar 2015 18:00:42 -0500
Message-ID: <CAMJduDuxBDoJ9_ETY8FCRoANf+taAS7-1acf5CFRGXDFyL72Rg@mail.gmail.com>

I've been working on a utility that parses the output of `git status
--porcelain` as a fundamental part of its operation.

Since I would like for this tool to be as robust as possible (and
cross-platform compatibility is a goal), I am currently trying to
migrate it from parsing the output of `--porcelain` to using the
output of `-z`. To quote from the documentation:

    > There is also an alternate -z format recommended for machine parsing. In
    > that format, the status field is the same, but some other things change.
    > First, the -> is omitted from rename entries and the field order is
    > reversed (e.g from -> to becomes to from). Second, a NUL (ASCII 0) follows
    > each filename, replacing space as a field separator and the terminating
    > newline (but a space still separates the status field from the first
    > filename). Third, filenames containing special characters are not
    > specially formatted; no quoting or backslash-escaping is performed.

I am encountering some significant issues with using this because of
one detail: NUL serves as *both* the entry terminator and the filename
separator within entries that contain multiple filenames. Because of
this, one cannot know in advance how many NULs to read from the buffer
before a complete entry is in memory and ready for parsing.
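
To make the ambiguity concrete, here is a small illustration in Python.
The byte string is a made-up sample (the filenames are hypothetical) of
what `-z` would emit for one rename followed by one modified file, going
by the documentation quoted above:

```python
# Hypothetical `git status -z` output: a rename (new.txt <- old.txt)
# followed by a worktree modification of other.txt.
sample = b"R  new.txt\0old.txt\0 M other.txt\0"

# Naive approach: split on every NUL and drop the empty tail that
# follows the final terminator.
fields = sample.split(b"\0")[:-1]

for field in fields:
    print(field)
```

The split produces three fields for two entries. Nothing in the framing
itself says that the second field is the rename source of the first
entry rather than an entry of its own; that can only be inferred from
the status code of the field before it.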

There are two workarounds I've considered:

 1. Reading the *entire* buffer into memory, and then using a regular
expression (yikes) to split the entries. This is something I would
obviously like to avoid for performance reasons.

 2. Reading from the buffer until the first NUL, parsing the entry's
status codes, and, if they represent a status that *should* have
multiple filenames, reading until a second NUL is found and then
reparsing that entry with both filenames. The issues I see with
this approach:
   a.) One has to know exactly which status code combinations will end
up with two filenames, and this list has to be exhaustive. As far as I
can tell, there is no canonical documentation for this?
   b.) It seems a bit brittle, because if the logic from the above is
wrong and we miss an extended entry or ask for one when it doesn't
exist, we will leave the buffer in an essentially corrupt state for
future reads.
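
For reference, workaround 2 could be sketched roughly as follows in
Python. This is only a sketch under a loud assumption: that an entry
carries a second filename exactly when its two-character status field
contains `R` or `C`. That rule is my guess, not something I have found
documented as exhaustive, which is precisely concern (a). The names
`Entry`, `nul_fields` and `parse_status_z` are made up for illustration.

```python
import io
from typing import BinaryIO, Iterator, NamedTuple, Optional


class Entry(NamedTuple):
    status: str                 # two-character status field
    path: bytes                 # first filename (the "to" side)
    orig_path: Optional[bytes]  # second filename (the "from" side), if any


def nul_fields(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield NUL-terminated fields without reading the whole stream first."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last NUL is complete; keep the remainder.
        *complete, pending = pending.split(b"\0")
        yield from complete
    if pending:
        raise ValueError("stream ended in the middle of a field")


def parse_status_z(stream: BinaryIO, chunk_size: int = 4096) -> Iterator[Entry]:
    fields = nul_fields(stream, chunk_size)
    for field in fields:
        # Expect "XY <path>": two status characters, a space, then a path.
        if len(field) < 4 or field[2:3] != b" ":
            raise ValueError(f"malformed entry: {field!r}")
        status = field[:2].decode("ascii")
        orig_path = None
        # ASSUMPTION (unverified): only R and C statuses have a second path.
        if "R" in status or "C" in status:
            orig_path = next(fields, None)
            if orig_path is None:
                raise ValueError("missing second filename for " + status)
        yield Entry(status, field[3:], orig_path)


sample = io.BytesIO(b"R  new.txt\0old.txt\0 M other.txt\0")
entries = list(parse_status_z(sample))
```

If the `R`/`C` assumption is wrong in either direction, every entry
after the first mistake is misread, which is concern (b). The format
check on each entry catches some of those cases, but not a source
filename that happens to look like a status entry.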

My understanding is that the goal of `-z` is to make machine parsing
of status output from a binary stream *more* reliable, so perhaps
(likely!) I am missing something obvious?

Thanks for any assistance!

