From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: git@jeffhostetler.com, git@vger.kernel.org, gitster@pobox.com,
lars.schneider@autodesk.com,
Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH 0/2] routines to generate JSON data
Date: Sat, 17 Mar 2018 00:00:26 +0100
Message-ID: <87tvtfd3sl.fsf@evledraar.gmail.com>
In-Reply-To: <20180316211837.GB12333@sigill.intra.peff.net>
On Fri, Mar 16 2018, Jeff King jotted:
> I really like the idea of being able to send our machine-readable output
> in some "standard" syntax for which people may already have parsers. But
> one big hangup with JSON is that it assumes all strings are UTF-8.
FWIW it's not UTF-8 but "Unicode characters", i.e. any Unicode encoding
is valid. That doesn't change anything you're pointing out, but people
on Win32 could use UTF-16 as-is if their filenames were in that format.
I'm just going to use UTF-8 synonymously with "Unicode encoding" for the
rest of this mail...
> Some possible solutions I can think of:
>
> 1. Ignore the UTF-8 requirement, making a JSON-like output (which I
> think is what your patches do). I'm not sure what problems this
> might cause on the parsing side.
Maybe some JSON parsers are more permissive, but they'll commonly just
die on non-Unicode (usually UTF-8) input, e.g.:
    $ (echo -n '{"str ": "'; head -c 3 /dev/urandom ; echo -n '"}') | perl -0666 -MJSON::XS -wE 'say decode_json(<>)->{str}'
    malformed UTF-8 character in JSON string, at character offset 10 (before "\x{fffd}e\x{fffd}"}") at -e line 1, <> chunk 1.
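For comparison, here is a minimal Python sketch of the same failure. It assumes Python's standard json module as the strict parser; the byte values and the parse_strict() helper are made up for illustration and are not part of the patches under discussion:

```python
import json


def parse_strict(raw: bytes):
    """Parse raw bytes as JSON and return (value, error)."""
    try:
        return json.loads(raw), None
    except ValueError as exc:
        # UnicodeDecodeError and json.JSONDecodeError both derive from ValueError.
        return None, exc


# Hypothetical payload: JSON syntax around three bytes that are not valid UTF-8.
bad = b'{"str": "' + bytes([0xFF, 0xFE, 0xFD]) + b'"}'
# The same document shape with a properly UTF-8 encoded string.
good = '{"str": "Ævar"}'.encode("utf-8")

value, error = parse_strict(bad)
print(type(error).__name__)          # the parser refuses the input outright
print(parse_strict(good)[0]["str"])  # valid UTF-8 parses fine
```

The point is only that a stock parser rejects the whole document, so a consumer never gets to see the fields that were fine.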
> 2. Specially encode non-UTF-8 bits. I'm not familiar enough with JSON
> to know the options here, but my understanding is that numeric
> escapes are just for inserting unicode code points. _Can_ you
> actually transport arbitrary binary data across JSON without
> base64-encoding it (yech)?
There's no way to transfer binary data in JSON without it being shoved
into a UTF-8 encoding, so you'd need to know on the other side that
such-and-such a field has binary in it, i.e. you'll need to invent your
own schema.
E.g.:
    head -c 10 /dev/urandom | perl -MDevel::Peek -MJSON::XS -wE 'my $in = <STDIN>; my $roundtrip = decode_json(encode_json({str => $in}))->{str}; utf8::decode($roundtrip) if $ARGV[0]; say Dump [$in, $roundtrip]' 0
You can tweak that trailing "0" to "1" to toggle the ad-hoc schema,
i.e. after we decode the JSON we go and manually UTF-8 decode the string
to get back the same binary data; otherwise we end up with a UTF-8
escaped version of what we put in.
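A rough Python analogue of such an ad-hoc schema, as a sketch only: it assumes both sides agree that a given field carries raw bytes mapped one-to-one onto code points U+0000..U+00FF (Latin-1). The field name "path" and the helpers encode_field()/decode_field() are hypothetical, and this is not a line-by-line port of the Perl one-liner:

```python
import json


def encode_field(data: bytes) -> str:
    # Assumed convention: byte 0xNN becomes code point U+00NN.
    return data.decode("latin-1")


def decode_field(text: str) -> bytes:
    # Reverse the convention; raises UnicodeEncodeError above U+00FF.
    return text.encode("latin-1")


# Hypothetical "filename" that is not valid UTF-8.
payload = bytes([0x66, 0x6F, 0x6F, 0xFF, 0x00, 0xC3, 0x28])

# The producer emits ordinary JSON; with the default ensure_ascii=True
# the document itself is plain ASCII.
doc = json.dumps({"path": encode_field(payload)})

# A consumer that knows the convention gets the original bytes back.
naive = json.loads(doc)["path"]
restored = decode_field(naive)
print(restored == payload)

# A consumer that does not know it and writes the string out as UTF-8
# ends up with a different, double-encoded byte sequence.
print(naive.encode("utf-8") == payload)
```

Nothing in the JSON itself says that "path" needs this extra step, which is why the receiving side has to know the schema out of band.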
> 3. Some other similar format. YAML comes to mind. Last time I looked
> (quite a while ago), it seemed insanely complex, but I think you
> could implement only a reasonable subset. OTOH, I think the tools
> ecosystem for parsing JSON (e.g., jq) is much better.
The lack of fast schema-less formats that support arrays, hashes
etc. and don't suck when it comes to mixed binary/UTF-8 led us to
implement our own at work: https://github.com/Sereal/Sereal
I think for git's use-case we're probably best off with JSON. It's going
to work almost all of the time, and when it doesn't it's going to be on
someone's weird non-UTF-8 repo, and those people are probably used to
dealing with crap because of that anyway and can just manually decode
their thing after it gets double-encoded.
That sucks, but given that we'll be using this either for just ASCII
(telemetry) or UTF-8 most of the time, and that realistically other
formats either suck more or aren't nearly as ubiquitous...