From: "Torsten Bögershausen" <tboegi@web.de>
To: Alexandre Grigoriev <alegrigoriev@gmail.com>
Cc: "'brian m. carlson'" <sandals@crustytoothpaste.net>,
"'Adrián Gimeno Balaguer'" <adrigibal@gmail.com>,
git@vger.kernel.org
Subject: Re: git-rebase is ignoring working-tree-encoding
Date: Thu, 27 Dec 2018 15:45:25 +0100
Message-ID: <20181227144525.GA2467@tor.lan>
In-Reply-To: <005601d49d8f$45c109b0$d1431d10$@gmail.com>
On Wed, Dec 26, 2018 at 06:52:56PM -0800, Alexandre Grigoriev wrote:
>
> > -----Original Message-----
> > From: brian m. carlson [mailto:sandals@crustytoothpaste.net]
> > Sent: Wednesday, December 26, 2018 11:25 AM
> > To: Alexandre Grigoriev
> > Cc: 'Torsten Bögershausen'; 'Adrián Gimeno Balaguer'; git@vger.kernel.org
> > Subject: Re: git-rebase is ignoring working-tree-encoding
> >
> > On Tue, Dec 25, 2018 at 04:56:11PM -0800, Alexandre Grigoriev wrote:
> > > Many tools in Windows still do not understand UTF-8, although it's
> > > getting better. I think Windows is about the only OS where tools still
> > > require UTF-16 for full internationalization.
> > > Many tools written in C use MSVC RTL, where fopen(), unfortunately,
> > > doesn't understand UTF-16BE (though such a rudimentary program as
> > > Notepad does).
> > >
> > > For this reason, it's very reasonable to ask that the programming
> > > tools produce UTF-16 files with particular endianness, natural for the
> > > platform they're running on.
> > >
> > > The iconv programmers' boneheaded decision to always produce UTF-16BE
> > > with BOM for UTF-16 output doesn't make sense.
> > > Again, git and iconv/libiconv in Centos on x86 do the right thing and
> > > produce UTF-16LE with BOM in this case.
> >
> > A program which claims to support "UTF-16" must support both
> > endiannesses, according to RFC 2781. A program writing UTF-16-LE must not
> > write a BOM at the beginning. I realize this is inconvenient, but the bad
> > behavior of some Windows programs doesn't mean that Git should ignore
> > interoperability with non-Windows systems using UTF-16 correctly in favor of
> > Windows.
>
> OK, we have a choice either:
> a) to live in that corner of the real world where you have to use available tools, some of which have historical reasons
> to only support UTF-16LE with BOM, because nobody ever throws a different flavor of UTF-16 at them;
> Or b) to live in an ivory tower where you don't really need to use UTF-16 LE or BE or any other flavor,
> because everything is just UTF-8, and tell all those other people using that lame OS to shut up and wait until their tools start to support
> the formats you don't really have to care about;
>
> > behavior of some Windows programs doesn't mean that Git should ignore
> > interoperability with non-Windows systems using UTF-16 correctly in favor of
> > Windows.
>
> Yes, Git (actually libiconv) should not ignore interoperability.
> This means it should check out files on a *Windows* system in a format which *Windows* tools
> can understand.
> And, by the way, Centos (or RedHat?) developers understood that.
> There, on an x86 installation, when you ask for UTF-16, it produces UTF-16LE with BOM.
> Just as every user there would want.
>
>
Sorry if I am confused here: does the problem still exist?
If so, does the following patch help?
diff --git a/utf8.c b/utf8.c
index eb78587504..2facef84d4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -9,6 +9,23 @@ struct interval {
 	ucs_char_t last;
 };
 
+static int has_bom_prefix(const char *data, size_t len,
+			  const char *bom, size_t bom_len)
+{
+	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+
+static inline uint16_t default_swab16(uint16_t val)
+{
+	return (((val & 0xff00) >> 8) |
+		((val & 0x00ff) << 8));
+}
+
 size_t display_mode_esc_sequence_len(const char *s)
 {
 	const char *p = s;
@@ -556,21 +573,19 @@ char *reencode_string_len(const char *in, size_t insz,
 
 	out = reencode_string_iconv(in, insz, conv, outsz);
 	iconv_close(conv);
+	if (has_bom_prefix(out, *outsz, utf16_be_bom, sizeof(utf16_be_bom))) {
+		/* UTF-16 should be little endian under Git */
+		size_t num_points = *outsz / sizeof(uint16_t);
+		uint16_t *point = (uint16_t*) out;
+		while (num_points--) {
+			*point = default_swab16(*point);
+			point++;
+		}
+	}
 	return out;
 }
 #endif
 
-static int has_bom_prefix(const char *data, size_t len,
-			  const char *bom, size_t bom_len)
-{
-	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
-}
-
-static const char utf16_be_bom[] = {'\xFE', '\xFF'};
-static const char utf16_le_bom[] = {'\xFF', '\xFE'};
-static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
-static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
-
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (
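
For anyone who wants to try the idea outside of Git, here is a minimal
standalone sketch of the same step: detect a UTF-16BE BOM at the start of
a buffer and, if present, swap every 16-bit unit so the result is UTF-16LE
with BOM. It is not part of the patch above. The function names
starts_with_utf16_be_bom() and utf16_be_to_le_in_place() are made up for
illustration, and the sketch works on plain bytes instead of uint16_t so
that it does not depend on the alignment of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in utf8.c): report whether buf starts with
 * the two-byte UTF-16 big-endian BOM, FE FF.
 */
int starts_with_utf16_be_bom(const unsigned char *buf, size_t len)
{
	return buf && len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF;
}

/*
 * Hypothetical helper: if buf holds UTF-16BE with a BOM, swap the two
 * bytes of every 16-bit unit in place, which turns it into UTF-16LE
 * with a BOM. Returns 1 if the buffer was converted and 0 if it was
 * left untouched. A trailing odd byte (malformed input) is not changed.
 */
int utf16_be_to_le_in_place(unsigned char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL || len == 0);
	if (!starts_with_utf16_be_bom(buf, len))
		return 0;
	for (i = 0; i + 1 < len; i += 2) {
		unsigned char tmp = buf[i];
		buf[i] = buf[i + 1];
		buf[i + 1] = tmp;
	}
	return 1;
}
```

Calling utf16_be_to_le_in_place() on the bytes FE FF 00 41 gives
FF FE 41 00, while a buffer that already starts with FF FE is returned
unchanged.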