Re: [PATCH 0/2] Improve documentation on UTF-16

Git development
 help / color / mirror / Atom feed

From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: Johannes Sixt <j6t@kdbg.org>
Cc: git@vger.kernel.org, "Lars Schneider" <larsxschneider@gmail.com>,
	"Torsten Bögershausen" <tboegi@web.de>
Subject: Re: [PATCH 0/2] Improve documentation on UTF-16
Date: Thu, 27 Dec 2018 23:45:35 +0000	[thread overview]
Message-ID: <20181227234535.GD423984@genre.crustytoothpaste.net> (raw)
In-Reply-To: <435b6870-379c-7183-da99-35aec5cf1137@kdbg.org>

[-- Attachment #1: Type: text/plain, Size: 3455 bytes --]

On Thu, Dec 27, 2018 at 08:55:27PM +0100, Johannes Sixt wrote:
> Am 27.12.18 um 17:43 schrieb brian m. carlson:
> > You've got part of this. For UTF-16LE and UTF-16BE, a U+FEFF is part of
> > the text, as would a second one be if we had two at the beginning of a
> > UTF-16 or UTF-8 sequence. If someone produces UTF-16LE and places a
> > U+FEFF at the beginning of it, when we encode to UTF-8, we emit only one
> > U+FEFF, which has the wrong semantics.
> > 
> > To be correct here and accept a U+FEFF, we'd need to check for a U+FEFF
> > at the beginning of a UTF-16LE or UTF-16BE sequence and ensure we encode
> > an extra U+FEFF at the beginning of the UTF-8 data (one for BOM and one
> > for the text) and then strip it off when we decode. That's kind of ugly,
> > and since iconv doesn't do that itself, we'd have to.
> 
> But why do you add another U+FEFF on the way to UTF-8? There is one in the
> incoming UTF-16 data, and only *that* one must be converted. If there is no
> U+FEFF in the UTF-16 data, the should not be one in UTF-8, either.
> Puzzled...

So for UTF-16, there must be a BOM. For UTF-16LE and UTF-16BE, there
must not be a BOM. So if we do this:

  $ printf '\xfe\xff\x00\x0a' | iconv -f UTF-16BE -t UTF-16 | xxd -g1
  00000000: ff fe ff fe 0a 00                                ......

That U+FEFF we have in the input is part of the text as a ZWNBSP; it is
not a BOM. We end up with two U+FEFF values. The first is the BOM that's
required as part of UTF-16. The second is semantically part of the text
and has the semantics of a zero-width non-breaking space.

In UTF-8, if the sequence starts with U+FEFF, it has the semantics of a
BOM just like in UTF-16 (except that it's optional): it's not part of
the text, and should be stripped off. So when we receive a UTF-16LE or
UTF-16BE sequence and it contains a U+FEFF (which is part of the text),
we need to insert a BOM in front of the sequence that's part of the text
to keep the semantics.

Essentially, we have this situation:

Text (in memory):  U+FEFF U+000A
Semantics of text: ZWNBSP NL
UTF-16BE:          FE FF  00 0A
Semantics:         ZWNBSP NL
UTF-16:            FE FF FE FF  00 0A
Semantics:         BOM   ZWNBSP NL
UTF-8:             EF BB BF EF BB BF 0A
Semantics:         BOM      ZWNBSP   NL

If you don't have a U+FEFF, then things can be simpler:

Text (in memory):  U+0041 U+0042 U+0043
Semantics of text: A      B      C
UTF-16BE:          00 41 00 42 00 43
Semantics:         A     B     C
UTF-16:            FE FF 00 41 00 42 00 43
Semantics:         BOM   A     B     C
UTF-8:             41 42 43
Semantics:         A  B  C
UTF-8 (optional):  EF BB BF 41 42 43
Semantics:         BOM      A  B  C

(I have picked big-endian UTF-16 here, but little-endian is fine, too;
this is just easier for me to type.)

This is all a huge edge case involving correctly serializing code
points. By rejecting U=FEFF in UTF-16BE and UTF-16LE, we don't have to
deal with any of it.

As mentioned, I think patching Git for Windows's iconv is the smallest,
most achievable solution to this, because it means we don't have to
handle any of this edge case ourselves. Windows and WSL users can both
write "UTF-16" and get a BOM and little-endian behavior, while we can
delegate all the rest of the encoding stuff to libiconv.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

next prev parent reply	other threads:[~2018-12-27 23:45 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-12-27  2:17 [PATCH 0/2] Improve documentation on UTF-16 brian m. carlson
2018-12-27  2:17 ` [PATCH 1/2] Documentation: document UTF-16-related behavior brian m. carlson
2018-12-27  2:17 ` [PATCH 2/2] utf8: add comment explaining why BOMs are rejected brian m. carlson
2018-12-27 10:06 ` [PATCH 0/2] Improve documentation on UTF-16 Johannes Sixt
2018-12-27 16:43   ` brian m. carlson
2018-12-27 19:55     ` Johannes Sixt
2018-12-27 23:45       ` brian m. carlson [this message]
2018-12-28  8:59         ` Johannes Sixt
2018-12-28 20:31           ` Philip Oakley
2018-12-28  8:46 ` Ævar Arnfjörð Bjarmason
2018-12-28 20:35   ` Philip Oakley
2018-12-29 23:17   ` brian m. carlson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181227234535.GD423984@genre.crustytoothpaste.net \
    --to=sandals@crustytoothpaste.net \
    --cc=git@vger.kernel.org \
    --cc=j6t@kdbg.org \
    --cc=larsxschneider@gmail.com \
    --cc=tboegi@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox