Re: [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Torsten Bögershausen" <tboegi@web.de>
To: Jason Pyeron <jpyeron@pdinc.us>
Cc: git@vger.kernel.org, adrigibal@gmail.com
Subject: Re: [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"
Date: Wed, 30 Jan 2019 17:49:32 +0000	[thread overview]
Message-ID: <20190130174932.lhs3npztu5tusy3e@tb-raspi4> (raw)
In-Reply-To: <000901d4b8af$edaccf20$c9066d60$@pdinc.us>

On Wed, Jan 30, 2019 at 10:24:44AM -0500, Jason Pyeron wrote:
> > -----Original Message-----
> > From: git-owner@vger.kernel.org <git-owner@vger.kernel.org> On Behalf Of
> > tboegi@web.de
> > Sent: Wednesday, January 30, 2019 10:02 AM
> > To: git@vger.kernel.org; adrigibal@gmail.com
> > Cc: Torsten Bögershausen <tboegi@web.de>
> > Subject: [PATCH v3 1/1] Support working-tree-encoding "UTF-16LE-BOM"
> >
> > From: Torsten Bögershausen <tboegi@web.de>
> >
> > Users who want UTF-16 files in the working tree set the .gitattributes
> > like this:
> > test.txt working-tree-encoding=UTF-16
> >
> > The unicode standard itself defines 3 allowed ways how to encode UTF-16.
> > The following 3 versions convert all back to 'g' 'i' 't' in UTF-8:
> >
> > a) UTF-16, without BOM, big endian:
> > $ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
> > 0000000    g   i   t
> >
> > b) UTF-16, with BOM, little endian:
> > $ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c
> > 0000000    g   i   t
> >
> > c) UTF-16, with BOM, big endian:
> > $ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c
> > 0000000    g   i   t
> >
> > Git uses libiconv to convert from UTF-8 in the index into ITF-16 in the
> > working tree.
> > After a checkout, the resulting file has a BOM and is encoded in "UTF-16",
> > in the version (c) above.
> > This is what iconv generates, more details follow below.
> >
> > iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE:
> >
> > d) UTF-16
> > $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c
> > 0000000  376 377  \0   g  \0   i  \0   t
> >
> > e) UTF-16LE
> > $ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c
> > 0000000    g  \0   i  \0   t  \0
> >
> > f)  UTF-16BE
> > $ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c
> > 0000000   \0   g  \0   i  \0   t
> >
> > There is no way to generate version (b) from above in a Git working tree,
> > but that is what some applications need.
> > (All fully unicode aware applications should be able to read all 3
> > variants,
> > but in practise we are not there yet).
> >
> > When producing UTF-16 as an output, iconv generates the big endian version
> > with a BOM. (big endian is probably chosen for historical reasons).
> >
> > iconv can produce UTF-16 files with little endianess by using "UTF-16LE"
> > as encoding, and that file does not have a BOM.
> >
> > Not all users (especially under Windows) are happy with this.
> > Some tools are not fully unicode aware and can only handle version (b).
> >
> > Today there is no way to produce version (b) with iconv (or libiconv).
> > Looking into the history of iconv, it seems as if version (c) will
> > be used in all future iconv versions (for compatibility reasons).
>
>
> Reading the RFC 2781 section 3.3:
>
>    Text in the "UTF-16BE" charset MUST be serialized with the octets
>    which make up a single 16-bit UTF-16 value in big-endian order.
>    Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.
>
>    Text in the "UTF-16LE" charset MUST be serialized with the octets
>    which make up a single 16-bit UTF-16 value in little-endian order.
>    Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.
>
> I opened a bug with libiconv... https://savannah.gnu.org/bugs/index.php?55609
>

UTF-16 may be a), b) or c) from above.
Every unicode compliant system should be able to read all 3 of them.

When writing, the system/application/converter is free to choose one of those.
Probably out of historical reason, big endian is preferred (in iconv),
and to be helpful to systems/applications a BOM is written in the beginning.
This is according to the RFC, why do you think that this is a bug ?

next prev parent reply	other threads:[~2019-01-30 17:49 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-02  2:30 git-rebase is ignoring working-tree-encoding Adrián Gimeno Balaguer
2018-11-04 15:47 ` brian m. carlson
2018-11-04 16:37   ` Adrián Gimeno Balaguer
2018-11-04 18:38     ` brian m. carlson
2018-11-04 17:07 ` Torsten Bögershausen
2018-11-05  4:24   ` Adrián Gimeno Balaguer
2018-11-05 18:10     ` Torsten Bögershausen
2018-11-06 20:16       ` Torsten Bögershausen
2018-11-07  4:38         ` Adrián Gimeno Balaguer
2018-11-08 17:02           ` Torsten Bögershausen
2018-12-26  0:56             ` Alexandre Grigoriev
2018-12-26 19:25               ` brian m. carlson
2018-12-27  2:52                 ` Alexandre Grigoriev
2018-12-27 14:45                   ` Torsten Bögershausen
2018-12-23 14:46   ` Alexandre Grigoriev
2018-12-29 11:09 ` [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM" tboegi
     [not found]   ` <CADN+U_OccLuLN7_0rjikDgLT+Zvt8hka-=xsnVVLJORjYzP78Q@mail.gmail.com>
2018-12-29 15:48     ` Adrián Gimeno Balaguer
2018-12-29 17:54       ` Philip Oakley
2019-01-20 16:43 ` [PATCH v2 " tboegi
2019-01-22 20:13   ` Junio C Hamano
2019-01-30 15:01 ` [PATCH v3 " tboegi
2019-01-30 15:24   ` Jason Pyeron
2019-01-30 17:49     ` Torsten Bögershausen [this message]
2019-03-06  5:23 ` [PATCH v1 1/1] gitattributes.txt: fix typo tboegi
2019-03-07  0:24   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190130174932.lhs3npztu5tusy3e@tb-raspi4 \
    --to=tboegi@web.de \
    --cc=adrigibal@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jpyeron@pdinc.us \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).